[RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

Posted by Lei Rao 2 years, 9 months ago
The new function nvme_submit_vf_cmd() helps the host VF driver issue
VF admin commands. It is useful in cases where the host NVMe driver
does not control the VF's admin queue, for example in the
virtualization device pass-through case, where the VF controller's
admin queue is governed by the guest NVMe driver. The host VF driver
then relies on the PF device's admin queue to control VF devices,
e.g. with vendor-specific live migration commands.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/nvme/host/pci.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 488ad7dabeb8..3d9c54d8e7fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3585,6 +3585,24 @@ static struct pci_driver nvme_driver = {
 	.err_handler	= &nvme_err_handler,
 };
 
+int nvme_submit_vf_cmd(struct pci_dev *dev, struct nvme_command *cmd,
+			      size_t *result, void *buffer, unsigned int bufflen)
+{
+	struct nvme_dev *ndev = NULL;
+	union nvme_result res = { };
+	int ret;
+
+	ndev = pci_iov_get_pf_drvdata(dev, &nvme_driver);
+	if (IS_ERR(ndev))
+		return PTR_ERR(ndev);
+	ret = __nvme_submit_sync_cmd(ndev->ctrl.admin_q, cmd, &res, buffer, bufflen,
+			      NVME_QID_ANY, 0, 0);
+	if (ret >= 0 && result)
+		*result = le32_to_cpu(res.u32);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_submit_vf_cmd);
+
 static int __init nvme_init(void)
 {
 	BUILD_BUG_ON(sizeof(struct nvme_create_cq) != 64);
-- 
2.34.1
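
For illustration, a rough sketch of how a VF VFIO variant driver (added
later in this series) might use the new export. The function name
nvmevf_suspend_device(), the 0xC0 opcode, and the cdw10 convention are
made up for the example; only nvme_submit_vf_cmd() and pci_iov_vf_id()
come from the kernel/this patch, and a prototype for
nvme_submit_vf_cmd() is assumed to be exposed to the VF driver
elsewhere in the series:

/*
 * Hypothetical caller in a VF VFIO variant driver.  0xC0 and the use
 * of cdw10 to select the target are invented for this example.
 */
static int nvmevf_suspend_device(struct pci_dev *vf_dev)
{
        struct nvme_command c = { };
        size_t result = 0;      /* completion result, unused here */
        int vf_id = pci_iov_vf_id(vf_dev);

        if (vf_id < 0)
                return vf_id;

        c.common.opcode = 0xC0;         /* vendor-specific admin opcode */
        c.common.cdw10 = cpu_to_le32(vf_id + 1);

        /* Issued on the PF's admin queue, targeting this VF. */
        return nvme_submit_vf_cmd(vf_dev, &c, &result, NULL, 0);
}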
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> The new function nvme_submit_vf_cmd() helps the host VF driver issue
> VF admin commands. It is useful in cases where the host NVMe driver
> does not control the VF's admin queue, for example in the
> virtualization device pass-through case, where the VF controller's
> admin queue is governed by the guest NVMe driver. The host VF driver
> then relies on the PF device's admin queue to control VF devices,
> e.g. with vendor-specific live migration commands.

WTF are you even smoking when you think this would be acceptable?
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Tue, Dec 06, 2022 at 07:19:40AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> > The new function nvme_submit_vf_cmd() helps the host VF driver issue
> > VF admin commands. It is useful in cases where the host NVMe driver
> > does not control the VF's admin queue, for example in the
> > virtualization device pass-through case, where the VF controller's
> > admin queue is governed by the guest NVMe driver. The host VF driver
> > then relies on the PF device's admin queue to control VF devices,
> > e.g. with vendor-specific live migration commands.
> 
> WTF are you even smoking when you think this would be acceptable?

Not speaking to NVMe - but this driver is clearly copying mlx5's live
migration driver, almost completely - including this basic function.

So, to explain why mlx5 works this way..

The VFIO approach is to fully assign an entire VF to the guest OS. Full
VF assignment means every MMIO register *and all the DMA* of the VF
are owned by the guest operating system.

mlx5 needs to transfer hundreds of megabytes to gigabytes of in-device
state to perform a migration.

So, we must be able to use DMA to transfer the data. However, the VM
exclusively controls the DMA of the VF. The iommu_domain of the VF
belongs to the guest VM through VFIO, and we simply cannot mutate
it. Not only should we not, we physically cannot, e.g. when IOMMU nested
translation is in use and the IO page tables are in guest VM memory.

So the VF cannot be used to control the migration, or transfer the
migration data. This leaves only the PF.

Thus, mlx5 has the same sort of design where the VF VFIO driver
reaches into the PF kernel driver and asks the PF driver to perform
some commands targeting the PF's own VFs. The DMA is then done using
the RID of the PF, and reaches the kernel owned iommu_domain of the
PF. This way the entire operation is secure against meddling by the
guest.
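
To make the DMA point concrete, a minimal sketch of the allocation
pattern this implies (migration_buf_alloc() is a made-up name; mlx5's
real code is more involved): the buffer is mapped against the PF's
struct device, so the transfer carries the PF's RID and goes through
the kernel-owned iommu_domain rather than the guest-controlled one of
the VF.

static void *migration_buf_alloc(struct pci_dev *vf_dev, size_t size,
                                 dma_addr_t *dma)
{
        /* pci_physfn() returns the PF that owns this VF. */
        struct pci_dev *pf_dev = pci_physfn(vf_dev);

        /*
         * Mapping against the PF's device means the transfer is
         * translated by the kernel-owned IOMMU domain of the PF,
         * not by the guest-controlled domain of the VF.
         */
        return dma_alloc_coherent(&pf_dev->dev, size, dma, GFP_KERNEL);
}

The returned DMA address is then handed to admin commands on the PF
that save or load the VF's device state.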

We can contrast this with the hisilicon live migration driver that
does not use the PF for control. Instead it has a very small state
that the migration driver simply reads out of registers. The VF has a
page of registers that control pause/go of the queues and the VFIO
variant driver denies access to this page from the guest VM so that
the kernel VFIO driver has reliable control over the VF.
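
A hedged sketch of what that looks like in a variant driver's read
path; the BAR index and MIG_CTRL_PAGE_OFFSET below are made up, and the
real hisi_acc driver differs in detail:

static ssize_t myvf_pci_read(struct vfio_device *core_vdev,
                             char __user *buf, size_t count, loff_t *ppos)
{
        unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
        loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;

        /*
         * Pretend the pause/go registers live at the end of BAR 2:
         * refuse guest access so only the kernel can drive migration.
         */
        if (index == VFIO_PCI_BAR2_REGION_INDEX &&
            pos + count > MIG_CTRL_PAGE_OFFSET)
                return -EPERM;

        return vfio_pci_core_read(core_vdev, buf, count, ppos);
}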

Without involving PASID these are broadly the only two choices for doing
SRIOV live migration, AFAIK.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> Not speaking to NVMe - but this driver is clearly copying mlx5's live
> migration driver, almost completely - including this basic function.

Maybe that's not a good idea in an NVMe environment, and maybe they
should have talked to the standards committee before spending their
time on cargo cult engineering.

Most importantly NVMe is very quiet on the relationship between
VFs and PFs, and there is no way to guarantee that a PF is, at the
NVMe level, much in control of a VF at all.  In other words this
concept really badly breaks NVMe abstractions.

> Thus, mlx5 has the same sort of design where the VF VFIO driver
> reaches into the PF kernel driver and asks the PF driver to perform
> some commands targeting the PF's own VFs. The DMA is then done using
> the RID of the PF, and reaches the kernel owned iommu_domain of the
> PF. This way the entire operation is secure against meddling by the
> guest.

And that works for you because you have a clearly defined relationship.
In NVMe we do not have this.  We'd either need to define a way
to query that relationship or find another way to deal with the
problem.  If in doubt, the best design would be to drive VF
live migration entirely from the PF and turn the lookup from controlled
function to controlling function upside down, that is, a list of
controlled functions (which could very well be, and in some designs
are, additional PFs and not VFs) by controlling function.  In fact
NVMe already has that list in its architecture with the
"Secondary Controller List".
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Tue, Dec 06, 2022 at 02:58:10PM +0100, Christoph Hellwig wrote:

> Most importantly NVMe is very quiet on the relationship between
> VFs and PFs, and there is no way to guarantee that a PF is, at the
> NVMe level, much in control of a VF at all.  In other words this
> concept really badly breaks NVMe abstractions.

Yeah, I think the spec effort is going to be interesting for sure.

From a pure Linux and implementation perspective a decision must be
made early on how to label the DMAs for kernel/qemu vs VM controlled
items at the PCI TLP level.

> controlled functions (which could very well be, and in some designs
> are, additional PFs and not VFs) by controlling function.  

In principle PF vs VF doesn't matter much - the question is really TLP
labeling. If the spec says RID A is the controlling RID and RID B is
the guest RID, then it doesn't matter if they have a PF/VF
relationship or PF/PF relationship.

We have locking issues in Linux SW connecting different SW drivers for
things that are not a PF/VF relationship, but perhaps that can be
solved.

Using VF RID / VF PASID is appealing at first glance, but there is
a list of PCI emulation details that have to be worked out for that to
be good, e.g. what do you do with a guest-triggered FLR? Or a guest-
triggered memory disable? How do you handle PCIe AER? Also the lack of
PASID support in CPUs is problematic.

Lots of trade offs..

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Tue, Dec 06, 2022 at 11:22:33AM -0400, Jason Gunthorpe wrote:
> > controlled functions (which could very well be, and in some designs
> > are, additional PFs and not VFs) by controlling function.  
> 
> In principle PF vs VF doesn't matter much - the question is really TLP
> labeling. If the spec says RID A is the controlling RID and RID B is
> the guest RID, then it doesn't matter if they have a PF/VF
> relationship or PF/PF relationship.

Yes.  Or in fact if you use PASIDs inside a single function.

> We have locking issues in Linux SW connecting different SW drivers for
> things that are not a PF/VF relationship, but perhaps that can be
> solved.

And I think the only reasonable answer is that the entire workflow
must be 100% managed from the controlling function, and the controlled
function is just along for the ride, with the controlling function
enabling/disabling it as needed without ever interacting with software
that directly deals with the controlled function.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Tue, Dec 06, 2022 at 04:38:11PM +0100, Christoph Hellwig wrote:

> > We have locking issues in Linux SW connecting different SW drivers for
> > things that are not a PF/VF relationship, but perhaps that can be
> > solved.
> 
> And I think the only reasonable answer is that the entire workflow
> must be 100% managed from the controlling function, and the controlled
> function is just along for the ride, with the controlling function
> enabling/disabling it as needed without ever interacting with software
> that directly deals with the controlled function.

That is a big deviation from where VFIO is right now, the controlled
function is the one with the VFIO driver, it should be the one that
drives the migration uAPI components.

Adding another uAPI that can manipulate the same VFIO device from some
unrelated chardev feels wrong.

There are certain things that need to be co-ordinated for everything to
work. Like you can't suspend the VFIO device unless you promise to
also stop MMIO operations. Stuff like FLR interferes with the migration
operation and has to be co-ordinated. Some migration operation
failures, like load failure, have to be healed through FLR.

It really does not want to be two different uAPIs even if that is
convenient for the kernel.

I'd much rather try to fix the problems PASID brings than try to make
this work :\

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
> That is a big deviation from where VFIO is right now, the controlled
> function is the one with the VFIO driver, it should be the one that
> drives the migration uAPI components.

Well, that is one way to see it, but I think the more natural
way to deal with it is to drive everything from the controlling
function, because that is by definition much more in control.

More importantly any sane design will have easy ways to list and
manipulate all the controlled functions from the controlling
functions, while getting from the controlled function to the
controlling one is extremely awkward, as anything that can be
used for that is by definition an information leak.  It seems
mlx5 just gets away with that by saying controlled functions are
always VFs, and the controlling function is a PF, but that will
break down very easily, especially once you want to nest the
controlling scheme (and yes, I'm not making this up, as the
nesting scheme is being proposed for nvme privately).
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
> > That is a big deviation from where VFIO is right now, the controlled
> > function is the one with the VFIO driver, it should be the one that
> > drives the migration uAPI components.
> 
> Well, that is one way to see it, but I think the more natural
> way to deal with it is to drive everything from the controlling
> function, because that is by definition much more in control.

Sure, the controlling function should (and does in mlx5) drive
everything here.

What the kernel is doing is providing the abstraction to link the
controlling function to the VFIO device in a general way.

We don't want to just punt this problem to user space and say 'good
luck finding the right cdev for migration control'. If the kernel
struggles to link them then userspace will not fare better on its own.

Especially, we do not want every VFIO device to have its own crazy way
for userspace to link the controlling/controlled functions
together. This is something the kernel has to abstract away.

So, IMHO, we must assume the kernel is aware of the relationship,
whatever algorithm it uses to become aware.

It just means the issue is doing the necessary cross-subsystem
locking.

That combined with the fact they really are two halves of the same
thing - operations on the controlling function have to be sequenced
with operations on the VFIO device - makes me prefer the single uAPI.

> More importantly any sane design will have easy ways to list and
> manipulate all the controlled functions from the controlling
> functions, while getting from the controlled function to the
> controlling one is extremely awkward, as anything that can be
> used for that is by definition an information leak.

We have spent some time looking at this for mlx5. It is a hard
problem. The VFIO driver cannot operate the device, eg it cannot do
MMIO and things, so it is limited to items in the PCI config space to
figure out the device identity.

> It seems mlx5 just gets away with that by saying controlled
> functions are always VFs, and the controlling function is a PF, but
> that will break down very easily, 

Yes, that is part of the current mlx5 model. It is not inherent to the
device design, but the problems with arbitrary nesting were hard enough
they were not tackled at this point.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
> 
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.

Yes.  But the right interface for that is to issue the userspace
commands for anything that is not normal PCIe function level
to the controlling function, and to discover the controlled functions
based on the controlling functions.

In other words:  there should be absolutely no need to have any
special kernel support for the controlled function.  Instead the
controlling function enumerates all the functions it controls, exports
that to userspace, and exposes the functionality to save state from
and restore state to the controlled functions.
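
In NVMe terms there is already a natural way for a controlling
controller to enumerate the functions it controls: the Secondary
Controller List (Identify CNS 15h). A minimal sketch of fetching it
over the controlling function's own admin queue; the function name is
made up and parsing of the returned structure is elided:

static int nvme_read_secondary_ctrl_list(struct nvme_ctrl *ctrl)
{
        struct nvme_command c = { };
        void *list;
        int ret;

        list = kzalloc(4096, GFP_KERNEL);
        if (!list)
                return -ENOMEM;

        c.identify.opcode = nvme_admin_identify;
        c.identify.cns = 0x15;  /* Secondary Controller List, NVMe 1.3+ */

        ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, list, 4096);
        /* ... walk the returned secondary controller entries here ... */
        kfree(list);
        return ret;
}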

> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.

Yes.  But the direction must go controlling to controlled, not the
other way around.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Wed, Dec 07, 2022 at 08:54:15AM +0100, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
> > What the kernel is doing is providing the abstraction to link the
> > controlling function to the VFIO device in a general way.
> > 
> > We don't want to just punt this problem to user space and say 'good
> > luck finding the right cdev for migration control'. If the kernel
> > struggles to link them then userspace will not fare better on its own.
> 
> Yes.  But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling function, and to discover the controlled functions
> based on the controlling functions.

I don't think we should mix up how the HW works, what PCI function the
commands are executed at, with how the uAPI is designed.

The VFIO design assumes that the "vfio migration driver" will talk to
both functions under the hood, and I don't see a fundamental problem
with this beyond it being awkward with the driver core.

It is done this way deliberately. When we did the work on it we found
there are problems with some device models, like when you suspend them
and then wrongly touch their MMIO they can trigger AER and maybe even a
machine check crash. One of the roles of the VFIO driver is to fix
these HW bugs and make the uAPI safe. Eg by revoking mmaps or
whatever.

Even more importantly, we do not want migration to ever be operated
unless VFIO is in control of the device. In general, migration resume
basically allows userspace to command the device to do effectively any
DMA. This is a kernel security problem if the device is not
constrained by a user iommu_domain - for security we must not allow
userspace to resume a VF that is under kernel control and potentially
linked to a passthrough iommu_domain.

VFIO provides the security model to make all of this safe - the two
concepts cannot become disconnected. Even if we create a new migration
uAPI it just means that the nvme driver has to be awkwardly aware of
VFIO VF drivers and interlock with their state, and the uAPI is
useless unless you already have a VFIO FD open.

Even the basic assumption that there would be a controlling/controlled
relationship is not universally true. The mdev type drivers, and 
SIOV-like devices are unlikely to have that. Once you can use PASID
the reasons to split things at the HW level go away, and a VF could
certainly self-migrate.

VFIO's goal is to abstract all of the above stuff. You get one char
device that inherently provides the security guarantees required to
operate migration. It provides all the necessary hooks to fix up HW
issues, which so far every merged device has:

 - mlx5 has a weird issue where FLR on the VF resets the migration
   context, we fix that in the VFIO driver (in mlx5 even though the
   commands are issued via the PF they are logically executed inside
   the VF context)

 - hi-silicon has security problems because it doesn't have
   controlling/controlled, so it needs to carve out BAR MMIO maps and
   other stuff

So while I agree that, in principle, migration and the VFIO VF are
separate concerns, our broken reality makes them linked.

Even the idea that started this thread - that PF/PF could be a problem
- seems to have been explored by the Pensando RFC which is using the
aux devices to connect arbitrary PF/PF for their model.

The logical model we have been using on complex multi-function devices
(like every DPU driver) has been to split the driver into subsystem
local code and thread all the pieces together with aux devices.

The model has a PCI driver that operates the lowest level of the
device, eg the 'admin queue' and then aux devices create
subsystem-local drivers (netdev, rdma, vdpa, iscsi, vfio, etc, etc)
that ride on the common API exported by the PCI driver.

So, when you see both Intel and Pensando proposing this kind of
layered model for NVMe where migration is subsystem-local to VFIO, I
think this is where the inspiration is coming from. Their native DPU
drivers already work this way.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> The VFIO design assumes that the "vfio migration driver" will talk to
> both functions under the hood, and I don't see a fundamental problem
> with this beyond it being awkward with the driver core.

And while that is a fine concept per se, the current incarnation of
that is fundamentally broken as it is centered around the controlled
VM.  Which really can't work.

> Even the basic assumption that there would be a controlling/controlled
> relationship is not universally true. The mdev type drivers, and 
> SIOV-like devices are unlikely to have that. Once you can use PASID
> the reasons to split things at the HW level go away, and a VF could
> certainly self-migrate.

Even then you need a controlling and a controlled entity.  The
controlling entity even in SIOV remains a PCIe function.  The
controlled entity might just be a bunch of hardware resources and
a PASID.  Making it important again that all migration is driven
by the controlling entity.

Also the whole concept that only VFIO can do live migration is
a little bogus.  With checkpoint and restart it absolutely
does make sense to live migrate a container, and with that
the hardware interface (e.g. nvme controller) assigned to it.

> So, when you see both Intel and Pensando proposing this kind of
> layered model for NVMe where migration is subsystem-local to VFIO, I
> think this is where the inspiration is coming from. Their native DPU
> drivers already work this way.

Maybe they should have talked to someone not high on their own
supply before designing this.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Wed, Dec 07, 2022 at 02:52:03PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> > The VFIO design assumes that the "vfio migration driver" will talk to
> > both functions under the hood, and I don't see a fundamental problem
> > with this beyond it being awkward with the driver core.
> 
> And while that is a fine concept per se, the current incarnation of
> that is fundamentally broken as it is centered around the controlled
> VM.  Which really can't work.

I don't see why you keep saying this. It is centered around the struct
vfio_device object in the kernel, which is definitely NOT the VM.

The struct vfio_device is the handle for the hypervisor to control
the physical assigned device - and it is the hypervisor that controls
the migration.

We do not need the hypervisor userspace to have a handle to the hidden
controlling function. It provides no additional functionality,
security or insight to what qemu needs to do. Keeping that
relationship abstracted inside the kernel is a reasonable choice and
is not "fundamentally broken".

> > Even the basic assumption that there would be a controlling/controlled
> > relationship is not universally true. The mdev type drivers, and 
> > SIOV-like devices are unlikely to have that. Once you can use PASID
> > the reasons to split things at the HW level go away, and a VF could
> > certainly self-migrate.
> 
> Even then you need a controlling and a controlled entity.  The
> controlling entity even in SIOV remains a PCIe function.  The
> controlled entity might just be a bunch of hardware resources and
> a PASID.  Making it important again that all migration is driven
> by the controlling entity.

If they are the same driver implementing vfio_device you may be able
to claim they conceptually exist, but it is pretty artificial to draw
this kind of distinction inside a single driver.

> Also the whole concept that only VFIO can do live migration is
> a little bogus.  With checkpoint and restart it absolutely
> does make sense to live migrate a container, and with that
> the hardware interface (e.g. nvme controller) assigned to it.

I agree people may want to do this, but it is very unclear how SRIOV
live migration can help do this.

SRIOV live migration is all about not disturbing the kernel driver,
assuming it is the same kernel driver on both sides. If you have two
different kernels there is nothing worth migrating. There isn't even
an assurance the dma API will have IOMMU mapped the same objects to
the same IOVAs, e.g. you have to re-establish your admin queue, IO
queues, etc after migration anyhow.

Let alone how to solve the security problems of allowing userspace to
load arbitrary FW blobs into a device with potentially insecure DMA
access..

At that point it isn't really the same kind of migration.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 11:07:11AM -0400, Jason Gunthorpe wrote:
> > And while that is a fine concept per se, the current incarnation of
> > that is fundamentally broken as it is centered around the controlled
> > VM.  Which really can't work.
> 
> I don't see why you keep saying this. It is centered around the struct
> vfio_device object in the kernel, which is definitely NOT the VM.

Sorry, I meant VF.  Your continued use of SR-IOV terminology now keeps
confusing my mind so much that I start mistyping things.

> > Even then you need a controlling and a controlled entity.  The
> > controlling entity even in SIOV remains a PCIe function.  The
> > controlled entity might just be a bunch of hardware resources and
> > a PASID.  Making it important again that all migration is driven
> > by the controlling entity.
> 
> If they are the same driver implementing vfio_device you may be able
> to claim they conceptually exist, but it is pretty artificial to draw
> this kind of distinction inside a single driver.

How are they in this reply?  I can't parse how this even relates to
what I wrote.

> > Also the whole concept that only VFIO can do live migration is
> > a little bogus.  With checkpoint and restart it absolutely
> > does make sense to live migrate a container, and with that
> > the hardware interface (e.g. nvme controller) assigned to it.
> 
> I agree people may want to do this, but it is very unclear how SRIOV
> live migration can help do this.

SRIOV live migration doesn't, because honestly there is no such
thing as "SRIOV" live migration to start with.

> Let alone how to solve the security problems of allowing userspace to
> load arbitrary FW blobs into a device with potentially insecure DMA
> access..

Any time you assign a PCI device to userspace you might get into that.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Wed, Dec 07, 2022 at 05:38:57PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 11:07:11AM -0400, Jason Gunthorpe wrote:
> > > And while that is a fine concept per se, the current incarnation of
> > > that is fundamentally broken as it is centered around the controlled
> > > VM.  Which really can't work.
> > 
> > I don't see why you keep saying this. It is centered around the struct
> > vfio_device object in the kernel, which is definitely NOT the VM.
> 
> Sorry, I meant VF.  Your continued use of SR-IOV terminology now keeps
> confusing my mind so much that I start mistyping things.

Well, what words do you want to use?

Regardless of VF/VM, it doesn't matter - my point is that the
vfio_device is the hypervisor control for *whatever* is under the
vfio_device and it is not desirable to break it up along arbitrary HW
boundaries.

I've given lots of reasons why not to do this now.

I strongly suspect it can work technically - as ugly as it is Pensando
shows an approach.

So I don't think I've learned anything more about your objection.

"fundamentally broken" doesn't help

It is a major point, because if we are going to have to rip apart all
the uAPI we built here and are completing the qemu work for, we need
to come to an understanding very soon.

And given how difficult it has been to get to this point, I want a
*really* good reason why we have to start again, and rewrite all the
drivers that exist and are coming.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
> > Sorry, I meant VF.  Your continued use of SR-IOV terminology now keeps
> > confusing my mind so much that I start mistyping things.
> 
> Well, what words do you want to use?

The same I've used through this whole thread:  controlling and
controlled function.

> So I don't think I've learned anything more about your objection.
> 
> "fundamentally broken" doesn't help

The objection is that:

 - in hardware fundamentally only the controlling function can
   control live migration features on the controlled function,
   because the controlled function is assigned to a VM which has
   control over it.
 - for the same reason there is no portable way to even find
   the controlling function from a controlled function, unless
   you want to equate PF = controlling and VF = controlled,
   and even that breaks down in some corner cases
 - if you want to control live migration from the controlled
   VM you need a new vfio subdriver for a function that has
   absolutely no new functionality itself related to live
   migration (quirks for bugs excepted).

So by this architecture you build a convoluted mess where you
need tons of vfio subdrivers that mostly just talk to the
driver for the controlling function, which they can't even
sanely discover.  That's what I keep calling fundamentally
broken.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Wed, Dec 07, 2022 at 07:33:33PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote:
> > > Sorry, I meant VF.  Your continued use of SR-IOV terminology now keeps
> > > confusing my mind so much that I start mistyping things.
> > 
> > Well, what words do you want to use?
> 
> The same I've used through this whole thread:  controlling and
> controlled function.
> 
> > So I don't think I've learned anything more about your objection.
> > 
> > "fundamentally broken" doesn't help
> 
> The objection is that:
> 
>  - in hardware fundamentally only the controlling function can
>    control live migration features on the controlled function,
>    because the controlled function is assigned to a VM which has
>    control over it.

Yes

However hisilicon managed to do their implementation without this, or
rather you could say their "controlling function" is a single MMIO BAR
page in their PCI VF and their "controlled function" is the rest of
the PCI VF.

>  - for the same reason there is no portable way to even find
>    the controlling function from a controlled function, unless
>    you want to equate PF = controlling and VF = controlled,
>    and even that breaks down in some corner cases

As you say, the kernel must know the relationship between
controlling->controlled. Nothing else is sane.

If the kernel knows this information then we can find a way for the
vfio_device to have pointers to both controlling and controlled
objects. I have a suggestion below.

>  - if you want to control live migration from the controlled
>    VM you need a new vfio subdriver for a function that has
>    absolutely no new functionality itself related to live
>    migration (quirks for bugs excepted).

I see it differently, the VFIO driver *is* the live migration
driver. Look at all the drivers that have come through and they are
99% live migration code. They have, IMHO, properly split the live
migration concern out of their controlling/PF driver and placed it in
the "VFIO live migration driver".

We've done a pretty good job of allowing the VFIO live migration
driver to pretty much just focus on live migration stuff and delegate
the VFIO part to library code.

Excepting quirks and bugs sounds nice, except we actually can't ignore
them. Having all the VFIO capabilities is exactly how we are fixing
the quirks and bugs in the first place, and I don't see your vision
how we can continue to do that if we split all the live migration code
into yet another subsystem.

For instance how do I trap FLR like mlx5 must do if the
drivers/live_migration code cannot intercept the FLR VFIO ioctl?

How do I mangle and share the BAR like hisilicon does?

Which is really why this is in VFIO in the first place. It actually is
coupled in practice, if not in theory.
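
For context, the way existing variant drivers get to see FLR is by
hooking the controlled function's pci_error_handlers and resetting
their migration state machine from .reset_done. A rough sketch of that
pattern; the myvf_* names and struct layout are made up, and this is
not mlx5's actual code:

struct myvf_pci_core_device {
        struct vfio_pci_core_device core_device;
        struct mutex state_mutex;
        enum vfio_device_mig_state mig_state;
};

static void myvf_pci_aer_reset_done(struct pci_dev *pdev)
{
        struct myvf_pci_core_device *myvf = dev_get_drvdata(&pdev->dev);

        /*
         * FLR wiped the device's migration context; put the driver's
         * state machine back to RUNNING so userspace sees a sane state.
         */
        mutex_lock(&myvf->state_mutex);
        myvf->mig_state = VFIO_DEVICE_STATE_RUNNING;
        mutex_unlock(&myvf->state_mutex);
}

static const struct pci_error_handlers myvf_err_handlers = {
        .reset_done = myvf_pci_aer_reset_done,
        .error_detected = vfio_pci_core_aer_err_detected,
};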

> So by this architecture you build a convoluted mess where you need
> tons of vfio subdrivers that mostly just talk to the driver for the
> controlling function, which they can't even sanely discover.  That's
> what I keep calling fundamentally broken.

The VFIO live migration drivers will look basically the same if you
put them under drivers/live_migration. This cannot be considered a
"convoluted mess" as splitting things by subsystem is best-practice,
AFAIK.

If we accept that drivers/vfio can be the "live migration subsystem"
then lets go all the way and have the controlling driver to call
vfio_device_group_register() to create the VFIO char device for the
controlled function.

This solves the "sanely discover" problem because of course the
controlling function driver knows what the controlled function is and
it can acquire both functions before it calls
vfio_device_group_register().

This is actually what I want to do anyhow for SIOV-like functions and
VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
really dislike that our current SRIOV model in Linux forces the VF to
instantly exist without a chance for the controlling driver to
provision it.

We have some challenges on how to do this in the kernel, but I think
we can overcome them. VFIO is ready for this thanks to all the
restructuring work we already did.

I'd really like to get away from VFIO having to do all this crazy
sysfs crap to activate its driver. I think there is a lot of appeal to
having, say, a nvmecli command that just commands the controlling
driver to provision a function, enable live migration, configure it
and then make it visible via VFIO. The same API regardless if the
underlying controlled function technology is PF/VF/SIOV.

At least we have been very interested in doing that for networking
devices.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> However hisilicon managed to do their implementation without this, or
> rather you could say their "controlling function" is a single MMIO BAR
> page in their PCI VF and their "controlled function" is the rest of
> the PCI VF.

Eww.  So you need to carefully filter the BAR and can't actually do
any DMA at all?  I'm not sure that is actually practical, especially
not for something with a lot of state.

> If the kernel knows this information then we can find a way for the
> vfio_device to have pointers to both controlling and controlled
> objects. I have a suggestion below.

So now we need to write a vfio shim for every function even if there
is absolutely nothing special about that function?  Migrating really
is the controlling functions behavior, and writing a new vfio bit
for every controlled thing just does not scale.

> I see it differently, the VFIO driver *is* the live migration
> driver. Look at all the drivers that have come through and they are
> 99% live migration code.

Yes, and that's the problem, because they are associated with the
controlled function, and now we have a communication problem between
that vfio driver binding to the controlled function and the driver
that actually controls live migration that is associated with the
controlling function.  In other words: you've created a giant mess.

> Excepting quirks and bugs sounds nice, except we actually can't ignore
> them.

I'm not proposing to ignore them.  But they should not be needed most
of the time.

> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
> 
> How do I mangle and share the BAR like hisilicon does?
> 
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.

So you've created a long term userspace API around working around
buggy and/or misdesigned early designs and now want to force
it down everyone's throat?

Can we please take a step back and think about how things should work,
and only then think how to work around weirdo devices that do strange
things as a second step? 

> If we accept that drivers/vfio can be the "live migration subsystem"
> then let's go all the way and have the controlling driver call
> vfio_device_group_register() to create the VFIO char device for the
> controlled function.

While creating the VFs from the PF driver makes a lot more sense,
remember that vfio is absolutely not the only use case for VFs.
There are plenty of use cases where you want to use them with the normal
kernel driver as well.  So the interface to create VFs needs a way
to decide if it should be vfio exported, or use the normal kernel
binding.

> This solves the "sanely discover" problem because of course the
> controlling function driver knows what the controlled function is and
> it can acquire both functions before it calls
> vfio_device_group_register().

Yes.

> This is actually what I want to do anyhow for SIOV-like functions and
> VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
> really dislike that our current SRIOV model in Linux forces the VF to
> instantly exist without a chance for the controlling driver to
> provision it.

For SIOV you have no other choice anyway.  But I agree that it is
the right thing to do for VFIO.  Now the next step is to control
live migration from the controlling function, so that for most sane
devices the controlled device does not need all the pointless
boilerplate of its own vfio driver.

> I'd really like to get away from VFIO having to do all this crazy
> sysfs crap to activate its driver. I think there is a lot of appeal to
> having, say, a nvmecli command that just commands the controlling
> driver to provision a function, enable live migration, configure it
> and then make it visible via VFIO. The same API regardless if the
> underlying controlled function technology is PF/VF/SIOV.

Yes.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Mon, Dec 12, 2022 at 08:50:46AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> > However hisilicon managed to do their implementation without this, or
> > rather you could say their "controlling function" is a single MMIO BAR
> > page in their PCI VF and their "controlled function" is the rest of
> > the PCI VF.
> 
> Eww.  So you need to carefully filter the BAR and can't actually do
> any DMA at all?  I'm not sure that is actually practical, especially
> not for something with a lot of state.

Indeed, but it is what they did and the HW should be supported by the
kernel, IMO.

> > If the kernel knows this information then we can find a way for the
> > vfio_device to have pointers to both controlling and controlled
> > objects. I have a suggestion below.
> 
> So now we need to write a vfio shim for every function even if there
> is absolutely nothing special about that function?  Migrating really
> is the controlling functions behavior, and writing a new vfio bit
> for every controlled thing just does not scale.

Huh? "does not scale?" We are looking at boilerplates of around 20-30
lines to make a VFIO driver for a real PCI device. Why is that even
something we should worry about optimizing?

And when you get into exciting future devices like SIOV you already
need to make a special VFIO driver anyhow.

> Yes, and that's the problem, because they are associated with the
> controlled function, and now we have a communication problem between
> that vfio driver binding to the controlled function and the driver
> that actually controls live migration that is associated with the
> controlling function.  In other words: you've created a giant mess.

So far 100% of the drivers that have been presented, including the two
RFC ones, have entanglements between live migration and vfio. Shifting
things to dev/live_migration doesn't make the "communication problem"
go away, it just shifts it into another subsystem.

This is my point, I've yet to even see a driver that meets your
theoretical standard that it can exist without vfio entanglement. So
I'd be much more interested in this idea if we had a stable of drivers
that obviously were harmed by VFIO. We don't have that, and I don't
even think that we ever will, considering both RFC drivers have
devices that also stepped on the mlx5 FLR problem too.

The FLR thing actually makes sense, because it is not actually the
controlling function that is doing the live migration inside the
devices. The controlling function is just a *proxy* to deliver
commands to the controlled function. So FLR on the controlled device
affects commands being executed on the controlling function. It is a
pain.

As it is, live migration is only useful with VFIO, so they are put
together to keep things simpler. The only complexity is the
controlled/controlling issue and for all existing SRIOV PF/VF
relationships we have an OK solution (at least it isn't buggy).

nvme's higher flexibility needs more work, but that doesn't mean the
idea is so wrong. I think you are really overstating the "mess".

> I'm not proposing to ignore them.  But they should not be needed most
> of the time.

I'm not seeing that in the drivers I've looked at.

> > Which is really why this is in VFIO in the first place. It actually is
> > coupled in practice, if not in theory.
> 
> So you've created a long term userspace API around working around
> buggy and/or misdesigned early designs and now want to force
> it down everyone's throat?

No, we coupled live migration and VFIO because they are never useful
apart, and we accept that we must find reasonable solutions to linking
the controlling/controlled device because it is necessary in all cases
we've seen.

> > If we accept that drivers/vfio can be the "live migration subsystem"
> > then let's go all the way and have the controlling driver call
> > vfio_device_group_register() to create the VFIO char device for the
> > controlled function.
> 
> While creating the VFs from the PF driver makes a lot more sense,
> remember that vfio is absolutely not the only use case for VFs.
> > There are plenty of use cases where you want to use them with the normal
> > kernel driver as well.  So the interface to create VFs needs a way
> to decide if it should be vfio exported, or use the normal kernel
> binding.

Yes, that is why this problem has been open for so long. Fixing it
well requires some reconsideration of how the driver core works :(

It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
etc.

When we create the function we really want to tell the device what
kind of function it is, and that also tells the kernel what driver
should be bound to it.

mlx5 even has weird limitations, like a controlled function that is
live migration capable has fewer features than a function that is
not. So the user must specify what parameters it wants the controlled
function to have..

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Tue, Dec 13, 2022 at 10:01:03AM -0400, Jason Gunthorpe wrote:
> > So now we need to write a vfio shim for every function even if there
> > is absolutely nothing special about that function?  Migrating really
> > is the controlling functions behavior, and writing a new vfio bit
> > for every controlled thing just does not scale.
> 
> Huh? "does not scale?" We are looking at boilerplates of around 20-30
> lines to make a VFIO driver for a real PCI device. Why is that even
> something we should worry about optimizing?

But we need a new driver for every controlled function now, which
is very different from the classic VFIO model where we had one
vfio_pci.

> And when you get into exciting future devices like SIOV you already
> need to make a special VFIO driver anyhow.

You need special support for it.  It's probably not another
Linux driver but part of the parent one, though.

> So far 100% of the drivers that have been presented, including the two
> RFC ones, have entanglements between live migration and vfio. Shifting
> things to dev/live_migration doesn't make the "communication problem"
> go away, it just shifts it into another subsystem.

The main entanglement seems to be that it needs to support a vfio
interface for live migration while the actual commands go to the
parent device.

> This is my point, I've yet to even see a driver that meets your
> theoretical standard that it can exist without vfio entanglement.

It can't right now due to the VFIO design.

> > While creating the VFs from the PF driver makes a lot more sense,
> > remember that vfio is absolutely not the only use case for VFs.
> > There are plenty of use cases where you want to use them with the normal
> > kernel driver as well.  So the interface to create VFs needs a way
> > to decide if it should be vfio exported, or use the normal kernel
> > binding.
> 
> Yes, that is why this problem has been open for so long. Fixing it
> well requires some reconsideration of how the driver core works :(
> 
> It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
> a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
> etc.

This seems to violate the PCIe spec, which says:

"All VFs associated with a PF must be the same device type as the PF,
(e.g., the same network device type or the same storage device type.)",

which is also enforced by not allowing the vendor/device/class
fields to be read from VFs.

(not that I'm arguing that this is a good limit, but that's how
PCIe does it).

> When we create the function we really want to tell the device what
> kind of function it is, and that also tells the kernel what driver
> should be bound to it.

I'd rather have different ways to probe by passing a "kind" or "type"
argument along with the device IDs during probing.  E.g. "driver"
and "vfio", and then only match for the kind the creator of the device
added them to the device model for. 

> mlx5 even has weird limitations, like a controlled function that is
> live migration capable has fewer features than a function that is
> not. So the user must specify what parameters it wants the controlled
> function to have..

I don't think that is weird.  If you want to live migrate, you need to

 a) make sure the feature set is compatible with the other side
 b) there is only state that actually is migratable

so I'd expect that for any other sufficiently complex device.  NVMe
for sure will have limits like this.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Tue, Dec 13, 2022 at 05:08:07PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 13, 2022 at 10:01:03AM -0400, Jason Gunthorpe wrote:
> > > So now we need to write a vfio shim for every function even if there
> > > is absolutely nothing special about that function?  Migrating really
> > > is the controlling functions behavior, and writing a new vfio bit
> > > for every controlled thing just does not scale.
> > 
> > Huh? "does not scale?" We are looking at boilerplates of around 20-30
> > lines to make a VFIO driver for a real PCI device. Why is that even
> > something we should worry about optimizing?
> 
> But we need a new driver for every controlled function now, which
> is very different from the classic VFIO model where we had one
> vfio_pci.

To be fair, mainly vfio_pci had that model. Other uses of VFIO have
device specific drivers already. We have the reset drivers in vfio
platform, and the mdevs already. SIOV drivers are coming and they will
not be general either. I know of a few upcoming non-migration VFIO PCI
variant drivers as well to deal with HW issues.

Remember, we did a bunch of work to make this reasonable. Userspace
can properly probe the correct VFIO driver for the HW it wants to use,
just like normal devices. If we spawn the VFIO from the controlling
function then it obviously will bring the correct driver along too.

The mental model I have for VFIO is that every vfio_device has a
driver, and we have three "universal" drivers that wildcard match to
many devices (pci, fsl, and platform acpi reset). Otherwise VFIO is
like every other driver subsystem out there, with physical devices and
matching drivers that support them.

Creating drivers for HW is not a problem, that is what a driver
subsystem is for. We already invested effort in VFIO to make this
scalable.

> > And when you get into exciting future devices like SIOV you already
> > need to make a special VFIO driver anyhow.
> 
> You need special support for it.  It's probably not another
> Linux driver but part of the parent one, though.

The designs we have done in mlx5 are split. The "parent" has just
enough shim to describe what the SIOV is in terms of a 'slice of the
parents resources' and then we layer another driver, located in the
proper subsystem, to operate that slice. VDPA makes a
/dev/virtio-whatever, VFIO would make a fake PCI function, mlx5 makes
a netdev, etc.

It is not so different from how a PF/VF relationship works, just that
the SIOV is described by a struct auxiliary_device not a struct
pci_dev.
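
For illustration, the skeleton of such a subsystem-local driver on the
auxiliary bus; the "mlx5_core.vfio_pci" device name, the myvfio_*
names, and the empty probe body are made up for the example:

static int myvfio_aux_probe(struct auxiliary_device *adev,
                            const struct auxiliary_device_id *id)
{
        /*
         * 'adev' is the parent PCI driver's description of the slice;
         * a real driver would allocate and register a vfio_device on
         * top of it here.
         */
        return 0;
}

static void myvfio_aux_remove(struct auxiliary_device *adev)
{
}

static const struct auxiliary_device_id myvfio_aux_ids[] = {
        { .name = "mlx5_core.vfio_pci" },       /* made-up aux device name */
        { }
};
MODULE_DEVICE_TABLE(auxiliary, myvfio_aux_ids);

static struct auxiliary_driver myvfio_aux_driver = {
        .name = "myvfio",
        .probe = myvfio_aux_probe,
        .remove = myvfio_aux_remove,
        .id_table = myvfio_aux_ids,
};
module_auxiliary_driver(myvfio_aux_driver);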

I don't really like implementing VFIO drivers outside drivers/vfio, I
think that has historically had bad outcomes in other subsystems.

> > So far 100% of the drivers that have been presented, including the two
> > RFC ones, have entanglements between live migration and vfio. Shifting
> > things to dev/live_migration doesn't make the "communication problem"
> > away, it just shifted it into another subsystem.
> 
> The main entanglement seems to be that it needs to support a vfio
> interface for live migration while the actual commands go to the
> parent device.

Not at all, that is only a couple function calls in 4 of the drivers
so far.

The entanglement is that the live migration FSM and the VFIO device
operation are not isolated. I keep repeating this - mlx5 and the two
RFC drivers must trap VFIO operations and relay them to their
migration logic. hns has to mangle its BARs. These are things that
only exist on the VFIO side.

So, you are viewing live migration as orthogonal and separable to
VFIO, and I don't agree with this because I haven't yet seen any proof
in implementations.

Let's go through the nvme spec process and see how it works out. If
NVMe can address things which are tripping up other implementations,
like FLR of the controlled function, then we may have the first
example. If not, then it is just how things are.

FLR is tricky, and it is not obvious to me that you want a definition of
migration that isolates controlled function FLR from the migration
FSM..

There are advantages to having a reliable, universal, way to bring a
function back to a clean slate, including restoring it to full
operation (ie canceling any migration operation). The current
definition of FLR provides this.

> > It is worse than just VFIO vs one kernel driver, like mlx5 could spawn
> > a controlled function that is NVMe, VDPA, mlx5, virtio-net, VFIO,
> > etc.
> 
> This seems to violate the PCIe spec, which says:
> 
> "All VFs associated with a PF must be the same device type as the PF,
> (e.g., the same network device type or the same storage device type.)",

For VFs there are multiple PFs to follow the above, and for SIOV this
language doesn't apply.

It seems the PDS RFC driver does violate this spec requirement though..

> > When we create the function we really want to tell the device what
> > kind of function it is, and that also tells the kernel what driver
> > should be bound to it.
> 
> I'd rather have different ways to probe by passing a "kind" or "type"
> argument along with the device IDs during probing.  E.g. "driver"
> and "vfio", and then only match for the kind the creator of the device
> added them to the device model for. 

Not everything can be done during driver probing. There are certainly
steps at SIOV instantiation time or VF provisioning that impact what
exactly is available on the controlled function. Eg on mlx5 when we
create a VDPA device it actually is different from a full-function
mlx5 device and that customization was done before any driver was
probed.

In fact, not only is it done before driver binding, but it can be
enforced as a security property from the DPU side when the DPU is the
thing creating the function.

I like the general idea of type to help specify the driver to probe,
we tried to work on something like that once and it didn't go far, but
I did like the concept of it.

> > mlx5 even has weird limitations, like a controlled function that is
> > live migration capable has fewer features than a function that is
> > not. So the user must specify what parameters it wants the controlled
> > function to have..
> 
> I don't think that is weird.  If you want to live migrate, you need to
> 
>  a) make sure the feature set is compatible with the other side
>  b) there is only state that actually is migratable
> 
> so I'd expect that for any other sufficiently complex device.  NVMe
> for sure will have limits like this.

Oy, this has been pretty hard to define in mlx5 already :( Hopefully
nvme-cli can sort it out for NVMe configurables.

Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Max Gurtovoy 2 years, 9 months ago
Jason/Christoph,

As I mentioned earlier, we have 2 orthogonal paths here: implementation 
and SPEC. They are for some reason mixed in this discussion.

I've tried to understand some SPEC-related issues that were raised here; 
if we fix them, the implementation will be possible and all the 
VFIO efforts we did can be re-used.

At a high level, I think that for the SPEC:

1. Need to define a concept of a "virtual subsystem". A primary 
controller will be able to create a virtual subsystem. Inside this 
subsystem the primary controller will be the master ("the controlling") 
of the migration process. It will also be able to add secondary 
controllers to this virtual subsystem and assign "virtual controller ID" 
to it.
something like:
- nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
--dev_vcid = 1
- nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
--secondary_dev=/dev/nvme2 --secondary_dev_vcid=20

2. Now the primary controller has a list of ctrls inside its virtual 
subsystem for migration. It has a handle to them that doesn't go away 
after binding the controlled function to VFIO.

3. The same virtual subsystem should be created in the destination hypervisor.

4. Now the migration process starts using the VFIO uAPI - we will get to a 
point where the VFIO driver of the controlled function needs to ask the 
controlling function to send admin commands to manage the migration process.
     How to do it? Implementation detail. We can set a pointer in the 
pci_dev/dev structures, or we can ask 
nvme_migration_handle_get(controlled_function), or NVMe can expose APIs 
dedicated to migration like nvme_state_save(controlled_function).
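
Just to show the shape of such a linkage, something like this (pure
sketch, none of these helpers exist today and the names are made up):

#include <linux/pci.h>

/* Opaque handle to the controlling (primary) controller - hypothetical. */
struct nvme_mig_handle;

/*
 * Called by the VFIO variant driver bound to the controlled function.
 * Resolves and pins the controlling function that was associated with
 * this function when it was added to the virtual subsystem, so it
 * cannot go away while a migration is in progress.
 */
struct nvme_mig_handle *nvme_migration_handle_get(struct pci_dev *controlled);
void nvme_migration_handle_put(struct nvme_mig_handle *handle);

/*
 * Migration actions, implemented by the controlling function driver by
 * sending admin commands on its own admin queue on behalf of the
 * controlled function.
 */
int nvme_state_save(struct nvme_mig_handle *handle, void *buf, size_t len);
int nvme_state_restore(struct nvme_mig_handle *handle, const void *buf, size_t len);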


When creating a virtual subsystem and adding controllers to it, we can 
control any leakage or narrow some functionality that makes migration 
impossible. This can be done using more admin commands, for example.
After the migration process is over, one can remove the secondary 
controller from the virtual subsystem and re-use it.

WDYT ?

Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Sun, Dec 11, 2022 at 01:39:37PM +0200, Max Gurtovoy wrote:
> 1. Need to define a concept of a "virtual subsystem". A primary controller 
> will be able to create a virtual subsystem. Inside this subsystem the 
> primary controller will be the master ("the controlling") of the migration 
> process. It will also be able to add secondary controllers to this virtual 
> subsystem and assign "virtual controller ID" to it.
> something like:
> - nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
> --dev_vcid = 1
> - nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1" 
> --secondary_dev=/dev/nvme2 --secondary_dev_vcid=20

Yes. Note that there is a bit more state than just the NQN.  You also
need at least a serial number, and also probably a different vendor
ID (the PCI vendor ID that is also mirrored in Identify Controller, and
the IEEE OUI), and new unique namespace identifiers.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Max Gurtovoy 2 years, 9 months ago
On 12/12/2022 9:55 AM, Christoph Hellwig wrote:
> On Sun, Dec 11, 2022 at 01:39:37PM +0200, Max Gurtovoy wrote:
>> 1. Need to define a concept of a "virtual subsystem". A primary controller
>> will be able to create a virtual subsystem. Inside this subsystem the
>> primary controller will be the master ("the controlling") of the migration
>> process. It will also be able to add secondary controllers to this virtual
>> subsystem and assign "virtual controller ID" to it.
>> something like:
>> - nvme virtual_subsys_create --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
>> --dev_vcid = 1
>> - nvme virtual_subsys_add --dev=/dev/nvme1 --virtual_nqn="my_v_nqn_1"
>> --secondary_dev=/dev/nvme2 --secondary_dev_vcid=20
> Yes. Note that there is a bit more state than just the NQN.  You also
> need at least a serial number, and also probably a different vendor
> ID (the PCI vendor ID that is also mirrored in Identify Controller and
> the IEEE OUI), and new unique namespace identifiers.

Yes, for sure there are more bits we should consider.

I wanted to describe the high level.

I think that we can maybe say that when a secondary function is moved to 
a virtual subsystem, the feature set of the virtual ctrl is narrowed to 
the mandatory NVMe feature set.
And we'll provide an API to set/extend its feature set up to a maximum of 
the feature set that the original ctrl of the secondary function had.
Then the sysadmin will configure the virtual ctrl in both src/dst hosts.

The high level idea is to have a programmable way to set the features of 
a virtual controller inside a virtual subsystem that is also programmable.
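
As a strawman, the programmable part could be a new admin command along
these lines - the opcode and field layout below are completely made up,
nothing here is in the spec, it's only meant to show the idea of setting
a virtual controller's feature set from the primary controller:

#include <linux/types.h>

/* Hypothetical 64-byte SQE, issued on the primary controller's admin queue. */
struct nvme_virt_ctrl_set_features_cmd {
        __u8    opcode;         /* placeholder in the vendor-specific range, e.g. 0xd0 */
        __u8    flags;
        __u16   command_id;
        __le16  vcid;           /* virtual controller ID inside the virtual subsystem */
        __le16  rsvd1;
        __le32  feat_bitmap[8]; /* subset of the parent ctrl's feature set to enable */
        __u8    rsvd2[24];
};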
RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Tian, Kevin 2 years, 9 months ago
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Thursday, December 8, 2022 4:08 AM
> 
> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
> 
> How do I mangle and share the BAR like hisilicon does?
> 
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.
> 

Those are good questions, which also make me buy into staying with the
current VFIO framework.

Actually the same controlling vs. controlled design choice also exists
on the hardware side. There are plenty of SR-IOV devices supporting
doorbells for the VF (controlled function) to call services on the PF
(controlling function), while the doorbell interface is implemented on
the VF side.

If we follow the direction of having the controlling function explicitly
provide services, then all those doorbells should be deprecated and
instead we want explicit communication between the VF driver and the
PF driver.

From the userspace driver's p.o.v. the VFIO uAPI is kind of a device
programming interface. Here we just present everything related
to the controlled device itself (including the state management)
via a centralized portal, though in the kernel there might be a linkage
out of VFIO to reach the controlling driver. Kind of a sw doorbell. 😊
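
To make the "sw doorbell" concrete: with the helper added by this patch,
the in-kernel call path from the VFIO variant driver of the controlled
function to the PF's admin queue would look roughly like below. The
vendor-specific opcode and the use of cdw10 are invented here just for
illustration; nvme_submit_vf_cmd() is the helper from this patch and
pci_iov_vf_id() is an existing PCI core helper.

#include <linux/nvme.h>
#include <linux/pci.h>

static int nvmevf_save_state(struct pci_dev *vf_dev, void *buf, unsigned int len)
{
        struct nvme_command cmd = { };

        cmd.common.opcode = 0xc1;       /* hypothetical vendor-specific opcode */
        /* Tell the PF which of its VFs to operate on. */
        cmd.common.cdw10 = cpu_to_le32(pci_iov_vf_id(vf_dev));

        /* The command is issued on the PF's admin queue, not on the VF. */
        return nvme_submit_vf_cmd(vf_dev, &cmd, NULL, buf, len);
}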

Btw, just to add more background to this work: half a year ago Lei
actually did a flavor of this [1] the other way around.

The controlling function of the Intel IPU card also supports a network gRPC
protocol to manage the state of the controlled NVMe function.

That series attempted to introduce an out-of-band migration
framework in Qemu, so instead of doing in-line state management
via the kernel VFIO uAPI, Qemu can turn to an external plugin which
forwards the state cmd via gRPC to the controlling function.

Just that the controlling driver is not in the kernel.

It was dropped as the inline way was preferred.

Thanks
Kevin

[1] https://lore.kernel.org/all/20220524061848.1615706-14-lei.rao@intel.com/T/
RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Dong, Eddie 2 years, 9 months ago
> 
> If following the direction having controlling function to explicitly provide
> services then all those doorbells should be deprecated and instead we want
> explicit communications between VF driver and PF driver.

A hardware mechanism of communication (the doorbell here) can be used to
support a broader scenario where the VF driver and PF driver run in 2 different
OS kernels (when the VF is assigned to a VM).

> 
> From userspace driver p.o.v. the VFIO uAPI is kind of a device programming
> interface. Here we just present everything related to the controlled device
> itself (including the state management) via a centralized portal though in the
> kernel there might be linkage out of VFIO to reach the controlling driver. kind
> of a sw doorbell. 😊

Yes.
And a sw doorbell is more efficient than a hardware doorbell 😊.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Max Gurtovoy 2 years, 9 months ago
On 12/7/2022 9:54 AM, Christoph Hellwig wrote:
> On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote:
>> What the kernel is doing is providing the abstraction to link the
>> controlling function to the VFIO device in a general way.
>>
>> We don't want to just punt this problem to user space and say 'good
>> luck finding the right cdev for migration control'. If the kernel
>> struggles to link them then userspace will not fare better on its own.
> Yes.  But the right interface for that is to issue the userspace
> commands for anything that is not normal PCIe function level
> to the controlling funtion, and to discover the controlled functions
> based on the controlling functions.
>
> In other words:  there should be absolutely no need to have any
> special kernel support for the controlled function.  Instead the
> controlling function enumerates all the function it controls exports
> that to userspace and exposes the functionality to save state from
> and restore state to the controlled functions.

Why is it preferred that the migration SW will talk directly to the PF 
and not via VFIO interface ?

It's just an implementation detail.

I feel like it even sounds more reasonable to have a common API like 
we have today to save_state/resume_state/quiesce_device/freeze_device, 
and each device implementation will translate this functionality to its 
own SPEC.
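
That is more or less what the existing VFIO migration uAPI already gives
us. A rough sketch of how the states map onto those operations - the vfio
types are real, the nvmevf_* helpers are hypothetical (declarations not
shown), error handling is omitted, and real drivers also look at the
current state, not just the new one:

#include <linux/err.h>
#include <linux/vfio.h>

static struct file *
nvmevf_set_state(struct vfio_device *vdev, enum vfio_device_mig_state new_state)
{
        switch (new_state) {
        case VFIO_DEVICE_STATE_STOP:
                nvmevf_quiesce_and_freeze(vdev);        /* quiesce_device/freeze_device */
                return NULL;
        case VFIO_DEVICE_STATE_STOP_COPY:
                return nvmevf_save_state_file(vdev);    /* save_state: fd the VMM reads */
        case VFIO_DEVICE_STATE_RESUMING:
                return nvmevf_resume_state_file(vdev);  /* resume_state: fd the VMM writes */
        case VFIO_DEVICE_STATE_RUNNING:
                nvmevf_unfreeze(vdev);
                return NULL;
        default:
                return ERR_PTR(-EINVAL);
        }
}

The variant driver would hook this up via its vfio_migration_ops and
translate each helper into whatever its own SPEC defines.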

If I understand correctly, your direction is to have QEMU code talk to 
nvmecli/new_mlx5cli/my_device_cli to do that, and I'm not sure it's needed.

The controlled device is not aware of any of the migration process. Only 
the migration SW, system admin and controlling device.

I see 2 orthogonal discussions here: NVMe standardization for LM and 
Linux implementation for LM.

For the NVMe standardization: I think we all agree, at a high level, that 
the primary controller manages the LM of the secondary controllers. The 
primary controller can list the secondary controllers. The primary 
controller exposes APIs using its admin_queue to manage the LM process of 
its secondary controllers. LM capabilities will be exposed using the 
identify_ctrl admin cmd of the primary controller.
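
For the listing part a spec building block already exists: Identify with
CNS 15h returns the Secondary Controller List of the primary controller.
Roughly like this - the helper and struct names here are mine, the entry
layout is abbreviated, and the returned 4K page starts with a count byte
followed by 32-byte entries:

struct nvme_scndry_ctrl_entry {
        __le16  scid;           /* secondary controller identifier */
        __le16  pcid;           /* primary controller identifier */
        __u8    scs;            /* secondary controller state (online/offline) */
        __u8    rsvd1[3];
        __le16  vfn;            /* virtual function number, links the entry to a VF */
        __u8    rsvd2[22];      /* VQ/VI resource counts and reserved - abbreviated */
};

static int nvme_list_secondary_ctrls(struct nvme_ctrl *primary, void *buf)
{
        struct nvme_command c = { };

        c.identify.opcode = nvme_admin_identify;
        c.identify.cns = 0x15;  /* Secondary Controller List */

        return nvme_submit_sync_cmd(primary->admin_q, &c, buf,
                                    NVME_IDENTIFY_DATA_SIZE);
}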

For the Linux implementation: the direction we started last year is to 
have vendor-specific (mlx5/hisi/..) or protocol-specific 
(nvme/virtio/..) vfio drivers. We built an infrastructure to do that by 
dividing the vfio_pci driver into vfio_pci and vfio_pci_core, and updated 
the uAPIs as well to support the P2P case for live migration. Dirty page 
tracking is also progressing. More work is still to be done to improve 
this infrastructure for sure.
I hope that all the above efforts are going to be used also with the NVMe 
LM implementation, unless there is something NVMe-specific in the way of 
migrating PCI functions that I can't see now.
If there is something that is NVMe-specific for LM then the migration SW 
and QEMU will need to be aware of that, and in this awareness we lose 
the benefit of the generic VFIO interface.

>
>> Especially, we do not want every VFIO device to have its own crazy way
>> for userspace to link the controlling/controlled functions
>> together. This is something the kernel has to abstract away.
> Yes.  But the direction must go controlling to controlled, not the
> other way around.

So in the source:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable 
(controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start the migration process.


And in the destination:

1. We enable SRIOV on the NVMe driver

2. We list all the secondary controllers: nvme1, nvme2, nvme3

3. We allow migration resume to nvme1, nvme2, nvme3 - now these VFs are 
resumable (controlling to controlled).

4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver

5. We pass these functions to VM

6. We start the migration resume process.


Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 12:59:00PM +0200, Max Gurtovoy wrote:
> Why is it preferred that the migration SW will talk directly to the PF and 
> not via VFIO interface ?

It should never talk directly to any hardware, but through a kernel
interface, and that's probably vfio.  But that interface needs to be
centered around the controlling function for all the reasons I've
written down multiple times now.

> It's just an implementation detail.

No, it's not.  While you could come up with awkward ways to map how
the hardware interface must work to a completely contrary kernel
interface, that's just going to create the need for lots of boilerplate
code _and_ confuse users.  The function that is being migrated can
fundamentally not be in control of itself.  Any interface that pretends
otherwise is broken and a long term nightmare for users and implementers.

> I feel like it even sounds more reasonable to have a common API like we 
> have today to save_state/resume_state/quiesce_device/freeze_device and each 
> device implementation will translate this functionality to its own SPEC.

Absolutely.

> If I understand your direction is to have QEMU code to talk to 
> nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.

No.

> The controlled device is not aware of any of the migration process. Only 
> the migration SW, system admin and controlling device.

Exactly.

> So in the source:
>
> 1. We enable SRIOV on the NVMe driver

Again.  Nothing in live migration is tied to SR-IOV at all.  SR-IOV
is just one way to get multiple functions.

> 2. We list all the secondary controllers: nvme1, nvme2, nvme3
>
> 3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable 
> (controlling to controlled).
>
> 4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver
>
> 5. We pass these functions to VM

And you need to pass the controlling function (or rather a handle for
it), because there is absolutely no sane way to discover that from
the controlled function as it can't have that information by the
fact that it is being passed to unprivileged VMs.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Max Gurtovoy 2 years, 9 months ago
On 12/7/2022 3:46 PM, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 12:59:00PM +0200, Max Gurtovoy wrote:
>> Why is it preferred that the migration SW will talk directly to the PF and
>> not via VFIO interface ?
> It should never talk directly to any hardware, but through a kernel
> interface, and that's probably vfio.  But that interface needs to
> be centered around the controlling function for all the reasons I've
> written down multiple times now.
>
>> It's just an implementation detail.
> No, it's not.  While you could come up with awkward ways to map how
> the hardware interface must work to a completely contrary kernel
> interface that's just going to create the need for lots of boilerplate
> code _and_ confuse users.  The function that is being migrated can
> fundamentally not be in control of itself.  Any interface that pretends
> otherwise is broken and a long term nightmare for users and implementers.

We're defining the SPEC and interfaces now :)

Below is some possible direction I can think of.

>> I feel like it even sounds more reasonable to have a common API like we
>> have today to save_state/resume_state/quiesce_device/freeze_device and each
>> device implementation will translate this functionality to its own SPEC.
> Absolutely.
>
>> If I understand your direction is to have QEMU code to talk to
>> nvmecli/new_mlx5cli/my_device_cli to do that and I'm not sure it's needed.
> No.
great.
>
>> The controlled device is not aware of any of the migration process. Only
>> the migration SW, system admin and controlling device.
> Exactly.
>
>> So in the source:
>>
>> 1. We enable SRIOV on the NVMe driver
> Again.  Nothing in live migration is tied to SR-IOV at all.  SR-IOV
> is just one way to get multiple functions.

Sure.

It's just an example. It can be some mdev.

>
>> 2. We list all the secondary controllers: nvme1, nvme2, nvme3
>>
>> 3. We allow migrating nvme1, nvme2, nvme3 - now these VFs are migratable
>> (controlling to controlled).
>>
>> 4. We bind nvme1, nvme2, nvme3 to VFIO NVMe driver
>>
>> 5. We pass these functions to VM
> And you need to pass the controlling function (or rather a handle for
> it), because there is absolutely no sane way to discover that from
> the controlled function as it can't have that information by the
> fact that it is being passed to unprivileged VMs.

Just thinking out loud:

When we perform step #3 we are narrowing its scope and maybe some caps 
that you're concerned about. After this setting, the controlled function is 
in LM mode (we should define what that means in order to be able to 
migrate it correctly) and the controlling function is the migration 
master of it. Both can be aware of that. The only one that can master 
the controlled function is the controlling function in LM mode. Thus, it 
will be easy to keep that handle inside the kernel for VFs and for MDEVs 
as well.
Although I'm not against passing this handle to the migration SW somehow 
on the QEMU command line, I still can't completely agree it's 
necessary.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 04:50:00PM +0200, Max Gurtovoy wrote:
> When we perform step #3 we are narrowing its scope and maybe some caps 
> that you're concerned about. After this setting, the controlled function is in 
> LM mode (we should define what that means in order to be able to 
> migrate it correctly) and the controlling function is the migration master 
> of it. Both can be aware of that. The only one that can master the 
> controlled function is the controlling function in LM mode. Thus, it will 
> be easy to keep that handle inside the kernel for VFs and for MDEVs as 
> well.

Maybe.  So you'd introduce a kernel linkage that both sides would have
to be part of?
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Max Gurtovoy 2 years, 9 months ago
On 12/6/2022 9:15 PM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
>>> That is a big deviation from where VFIO is right now, the controlled
>>> function is the one with the VFIO driver, it should be the one that
>>> drives the migration uAPI components.
>> Well, that is one way to see it, but I think the more natural
>> way to deal with it is to drive everyting from the controlling
>> function, because that is by definition much more in control.
> Sure, the controlling function should (and does in mlx5) drive
> everything here.
>
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
>
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.
>
> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.
>
> So, IMHO, we must assume the kernel is aware of the relationship,
> whatever algorithm it uses to become aware.
>
> It just means the issue is doing the necessary cross-subsystem
> locking.

Agree. This is not the first time we have cross-subsystem interactions 
in the kernel, right ?

>
> That combined with the fact they really are two halfs of the same
> thing - operations on the controlling function have to be sequenced
> with operations on the VFIO device - makes me prefer the single uAPI.
>
>> More importantly any sane design will have easy ways to list and
>> manipulate all the controlled functions from the controlling
>> functions, while getting from the controlled function to the
>> controlling one is extremely awkward, as anything that can be
>> used for that is by definition an information leak.
> We have spent some time looking at this for mlx5. It is a hard
> problem. The VFIO driver cannot operate the device, eg it cannot do
> MMIO and things, so it is limited to items in the PCI config space to
> figure out the device identity.

Christoph,

I'm not sure how awkward it is for the migration driver to ask the controlling 
device driver to operate a migration action.

The controlling device driver can expose a limited API for that matter.

I don't see why it is wrong to submit some admin commands between 
subsystems - we already have a way to send admin commands from the Linux 
cli or from the nvmet drivers to an NVMe device.

Also the concept of a primary controller that controls its secondary 
controllers is already in the SPEC so we can use it. It's not introduced 
in this RFC but we're here to fix that.

In our case the primary controller is the PF and the secondary 
controllers are the VFs.

If we think of a concept of new optional "Migration Management" 
admin commands that will be supported by primary controllers to manage 
the migration process of their secondary controllers - we can at least 
solve the initial step of migrating VFs via their parent PF.
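
As a very rough strawman, the new admin command set could start as small
as this - the opcodes below are placeholders in the vendor-specific
range, nothing here exists in the spec today:

/*
 * All of these are issued on the primary controller's admin queue and
 * carry the secondary controller identifier (SCID) of the function
 * being migrated, e.g. in cdw10.
 */
enum nvme_mig_admin_opcode {
        nvme_admin_mig_suspend          = 0xc0, /* quiesce the secondary controller */
        nvme_admin_mig_save_state       = 0xc1, /* read the device state back */
        nvme_admin_mig_load_state       = 0xc2, /* write the device state on the target */
        nvme_admin_mig_resume           = 0xc3, /* let the secondary controller run again */
};

Dirty page tracking would likely need its own commands on top of this.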

>
>> It seems mlx5 just gets away with that by saying controlled
>> functions are always VFs, and the controlling function is a PF, but
>> that will break down very easily,
> Yes, that is part of the current mlx5 model. It is not inherent to the
> device design, but the problems with arbitary nesting were hard enough
> they were not tackled at this point.

Agree. There are a lot of challenges in non-nested migration alone, so we 
can just limit the NVMe SPEC to it and extend it later on - like we did 
for other features such as Copy Offload.

Most of the infrastructure work for LM was done in the VFIO subsystem over 
the last year, so we can now focus on the architecture aspects of NVMe.

>
> Jason
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Wed, Dec 07, 2022 at 04:30:20AM +0200, Max Gurtovoy wrote:
> I'm not sure how awkward it is for the migration driver to ask the controlling 
> device driver to operate a migration action.

It can't.  That's the whole point.  The controlled function that is
being migrated must be absolutely unaware of that (except for things
like quiescing access or FLRs that could happen anyway), because
otherwise you have a fundamental information leak.

> The controlling device driver can expose limited API for that matter.

No, it can't.  It must be in charge.

> Also the concept of a primary controller that controls its secondary 
> controllers is already in the SPEC so we can use it. It's not introduced in 
> this RFC but we're here to fix it.

Yes, it is as I've pointed out multiple times.  But, that relationship
is only visible to the primary controller.  It also doesn't help with
the general problem where the secondary controller must be able to
completely change its identity and thus the subsystem.

> In our case the primary controller is the PF and the secondary controllers 
> are the VFs.

Yes, that's your case, and probably a very common one.  But also far from
the only one, so there is no way Linux or the specification can rely
on that dumb fact.  Never mind that there are virtualization schemes
(look at the s390 code) where the PF to VF relationship gets lost.
RE: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Tian, Kevin 2 years, 9 months ago
> From: Christoph Hellwig <hch@lst.de>
> Sent: Wednesday, December 7, 2022 3:59 PM
> 
> On Wed, Dec 07, 2022 at 04:30:20AM +0200, Max Gurtovoy wrote:
> > I'm not sure how awkward it is for the migration driver to ask the controlling
> > device driver to operate a migration action.
> 
> It can't.  That's the whole point.  The controlled function that is
> being migrated must be absolutely unaware of that (except for things
> like quiescing access or FLRs that could happen anyway), because
> otherwise you have a fundamental information leak.
> 

Can you elaborate which information is leaked?

No matter who provides the uAPI (controlling or controlled), at the end of
the day it's the same cmd invoked on the controlling device to save/
restore the state of the controlled device itself.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Christoph Hellwig 2 years, 9 months ago
On Fri, Dec 09, 2022 at 02:11:21AM +0000, Tian, Kevin wrote:
> > It can't.  That's the whole point.  The controlled function that is
> > being migrated must be absolutely unaware of that (except for things
> > like quiescing access or FLRs that could happen anyway), because
> > otherwise you have a fundamental information leak.
> > 
> 
> Can you elaborate which information is leaked?

Information about what controllers exist, what namespaces exist, etc.
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Keith Busch 2 years, 9 months ago
On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 07:19:40AM +0100, Christoph Hellwig wrote:
> > On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> > > The new function nvme_submit_vf_cmd() helps the host VF driver to issue
> > > VF admin commands. It's helpful in some cases that the host NVMe driver
> > > does not control VF's admin queue. For example, in the virtualization
> > > device pass-through case, the VF controller's admin queue is governed
> > > by the Guest NVMe driver. Host VF driver relies on PF device's admin
> > > queue to control VF devices like vendor-specific live migration commands.
> > 
> > WTF are you even smoking when you think this would be acceptable?
> 
> Not speaking to NVMe - but this driver is clearly copying mlx5's live
> migration driver, almost completely - including this basic function.
> 
> So, to explain why mlx5 works this way..
> 
> The VFIO approach is to fully assign an entire VF to the guest OS. The
> entire VF assignment means every MMIO register *and all the DMA* of
> the VF is owned by the guest operating system.
> 
> mlx5 needs to transfer hundreds of megabytes to gigabytes of in-device
> state to perform a migration.

For storage, though, you can't just transfer the controller state. You have to
transfer all the namespace user data, too. So potentially many terabytes?
Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Posted by Jason Gunthorpe 2 years, 9 months ago
On Tue, Dec 06, 2022 at 01:51:32PM +0000, Keith Busch wrote:
> On Tue, Dec 06, 2022 at 09:44:08AM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 06, 2022 at 07:19:40AM +0100, Christoph Hellwig wrote:
> > > On Tue, Dec 06, 2022 at 01:58:12PM +0800, Lei Rao wrote:
> > > > The new function nvme_submit_vf_cmd() helps the host VF driver to issue
> > > > VF admin commands. It's helpful in some cases that the host NVMe driver
> > > > does not control VF's admin queue. For example, in the virtualization
> > > > device pass-through case, the VF controller's admin queue is governed
> > > > by the Guest NVMe driver. Host VF driver relies on PF device's admin
> > > > queue to control VF devices like vendor-specific live migration commands.
> > > 
> > > WTF are you even smoking when you think this would be acceptable?
> > 
> > Not speaking to NVMe - but this driver is clearly copying mlx5's live
> > migration driver, almost completely - including this basic function.
> > 
> > So, to explain why mlx5 works this way..
> > 
> > The VFIO approach is to fully assign an entire VF to the guest OS. The
> > entire VF assignment means every MMIO register *and all the DMA* of
> > the VF is owned by the guest operating system.
> > 
> > mlx5 needs to transfer hundreds of megabytes to gigabytes of in-device
> > state to perform a migration.
> 
> For storage, though, you can't just transfer the controller state. You have to
> transfer all the namespace user data, too. So potentially many terabytes?

There are two different scenarios - let's call them shared medium and
local medium.

If the medium is shared then only the controller state needs to be
transferred. The controller state would include enough information to
locate and identify the shared medium.

This would apply to cases like DPU/smart NIC, multi-port physical
drives, etc.

Local medium will require a medium copy. Within Linux I don't have a
clear sense of whether that should be done within the VFIO migration
framework, or if it would be better to have its own operations.

For the NVMe spec I'd strongly suggest keeping medium copy as its own
set of commands.

It will be interesting to see how to standardize all these scenarios :)

Jason