VFIO live update support

[RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by Vipin Sharma 3 months, 3 weeks ago

Return true in can_preserve() callback of live update file handler, if
VFIO can preserve the passed VFIO cdev file. Return -EOPNOTSUPP from
prepare() callback for now to fail any attempt to preserve VFIO cdev in
live update.

The VFIO cdev opened check ensures that the file is actually used for
VFIO cdev and not for VFIO device FD which can be obtained from the VFIO
group.

Returning true from can_preserve() tells Live Update Orchestrator that
VFIO can try to preserve the given file during live update. Actual
preservation logic will be added in future patches, therefore, for now,
prepare call will fail.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 drivers/vfio/pci/vfio_pci_liveupdate.c | 16 +++++++++++++++-
 drivers/vfio/vfio_main.c               |  3 ++-
 include/linux/vfio.h                   |  2 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
index 088f7698a72c..2ce2c11cb51c 100644
--- a/drivers/vfio/pci/vfio_pci_liveupdate.c
+++ b/drivers/vfio/pci/vfio_pci_liveupdate.c
@@ -8,10 +8,17 @@
  */
 
 #include <linux/liveupdate.h>
+#include <linux/vfio.h>
 #include <linux/errno.h>
 
 #include "vfio_pci_priv.h"
 
+static int vfio_pci_liveupdate_prepare(struct liveupdate_file_handler *handler,
+				       struct file *file, u64 *data)
+{
+	return -EOPNOTSUPP;
+}
+
 static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
 					u64 data, struct file **file)
 {
@@ -21,10 +28,17 @@ static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
 static bool vfio_pci_liveupdate_can_preserve(struct liveupdate_file_handler *handler,
 					     struct file *file)
 {
-	return -EOPNOTSUPP;
+	struct vfio_device *device = vfio_device_from_file(file);
+
+	if (!device)
+		return false;
+
+	guard(mutex)(&device->dev_set->lock);
+	return vfio_device_cdev_opened(device);
 }
 
 static const struct liveupdate_file_ops vfio_pci_luo_fops = {
+	.prepare = vfio_pci_liveupdate_prepare,
 	.retrieve = vfio_pci_liveupdate_retrieve,
 	.can_preserve = vfio_pci_liveupdate_can_preserve,
 	.owner = THIS_MODULE,
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 38c8e9350a60..4cb47c1564f4 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1386,7 +1386,7 @@ const struct file_operations vfio_device_fops = {
 #endif
 };
 
-static struct vfio_device *vfio_device_from_file(struct file *file)
+struct vfio_device *vfio_device_from_file(struct file *file)
 {
 	struct vfio_device_file *df = file->private_data;
 
@@ -1394,6 +1394,7 @@ static struct vfio_device *vfio_device_from_file(struct file *file)
 		return NULL;
 	return df->device;
 }
+EXPORT_SYMBOL_GPL(vfio_device_from_file);
 
 /**
  * vfio_file_is_valid - True if the file is valid vfio file
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index eb563f538dee..2443d24aa237 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -385,4 +385,6 @@ int vfio_virqfd_enable(void *opaque, int (*handler)(void *, void *),
 void vfio_virqfd_disable(struct virqfd **pvirqfd);
 void vfio_virqfd_flush_thread(struct virqfd **pvirqfd);
 
+struct vfio_device *vfio_device_from_file(struct file *file);
+
 #endif /* VFIO_H */
-- 
2.51.0.858.gf9c4a03a3a-goog

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by Jacob Pan 3 months, 2 weeks ago

On Fri, 17 Oct 2025 17:06:58 -0700
Vipin Sharma <vipinsh@google.com> wrote:

> Return true in can_preserve() callback of live update file handler, if
> VFIO can preserve the passed VFIO cdev file. Return -EOPNOTSUPP from
> prepare() callback for now to fail any attempt to preserve VFIO cdev
> in live update.
> 
> The VFIO cdev opened check ensures that the file is actually used for
> VFIO cdev and not for VFIO device FD which can be obtained from the
> VFIO group.
> 
> Returning true from can_preserve() tells Live Update Orchestrator that
> VFIO can try to preserve the given file during live update. Actual
> preservation logic will be added in future patches, therefore, for
> now, prepare call will fail.
> 
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
>  drivers/vfio/pci/vfio_pci_liveupdate.c | 16 +++++++++++++++-
>  drivers/vfio/vfio_main.c               |  3 ++-
>  include/linux/vfio.h                   |  2 ++
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_liveupdate.c b/drivers/vfio/pci/vfio_pci_liveupdate.c
> index 088f7698a72c..2ce2c11cb51c 100644
> --- a/drivers/vfio/pci/vfio_pci_liveupdate.c
> +++ b/drivers/vfio/pci/vfio_pci_liveupdate.c
> @@ -8,10 +8,17 @@
>   */
>  
>  #include <linux/liveupdate.h>
> +#include <linux/vfio.h>
>  #include <linux/errno.h>
>  
>  #include "vfio_pci_priv.h"
>  
> +static int vfio_pci_liveupdate_prepare(struct liveupdate_file_handler *handler,
> +				       struct file *file, u64 *data)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
>  static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
>  					u64 data, struct file **file)
>  {
> @@ -21,10 +28,17 @@ static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
>  static bool vfio_pci_liveupdate_can_preserve(struct liveupdate_file_handler *handler,
>  					     struct file *file)
>  {
> -	return -EOPNOTSUPP;
> +	struct vfio_device *device = vfio_device_from_file(file);
> +
> +	if (!device)
> +		return false;
> +
> +	guard(mutex)(&device->dev_set->lock);
> +	return vfio_device_cdev_opened(device);

IIUC, vfio_device_cdev_opened(device) will only return true after
vfio_df_ioctl_bind_iommufd(), where it does:
	device->cdev_opened = true;

Does this imply that devices not bound to an iommufd cannot be
preserved?

If so, I am confused about step #15 in your cover letter:
> 15. It makes usual bind iommufd and attach page table calls.

Does it mean after restoration, we have to bind iommufd again?

I have a separate question regarding noiommu devices. I’m currently
working on adding noiommu mode support for VFIO cdev under iommufd.
From my understanding, these devices should naturally be included in
your patchset, provided that I ensure the noiommu cdev follows the same
open/bind process. Is that correct?

>  }
>  
>  static const struct liveupdate_file_ops vfio_pci_luo_fops = {
> +	.prepare = vfio_pci_liveupdate_prepare,
>  	.retrieve = vfio_pci_liveupdate_retrieve,
>  	.can_preserve = vfio_pci_liveupdate_can_preserve,
>  	.owner = THIS_MODULE,
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 38c8e9350a60..4cb47c1564f4 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1386,7 +1386,7 @@ const struct file_operations vfio_device_fops = {
>  #endif
>  };
>  
> -static struct vfio_device *vfio_device_from_file(struct file *file)
> +struct vfio_device *vfio_device_from_file(struct file *file)
>  {
>  	struct vfio_device_file *df = file->private_data;
>  
> @@ -1394,6 +1394,7 @@ static struct vfio_device *vfio_device_from_file(struct file *file)
>  		return NULL;
>  	return df->device;
>  }
> +EXPORT_SYMBOL_GPL(vfio_device_from_file);
>  
>  /**
>   * vfio_file_is_valid - True if the file is valid vfio file
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index eb563f538dee..2443d24aa237 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -385,4 +385,6 @@ int vfio_virqfd_enable(void *opaque, int (*handler)(void *, void *),
>  void vfio_virqfd_disable(struct virqfd **pvirqfd);
>  void vfio_virqfd_flush_thread(struct virqfd **pvirqfd);
>  
> +struct vfio_device *vfio_device_from_file(struct file *file);
> +
>  #endif /* VFIO_H */

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by David Matlack 3 months, 1 week ago

On 2025-10-27 01:44 PM, Jacob Pan wrote:
> On Fri, 17 Oct 2025 17:06:58 -0700 Vipin Sharma <vipinsh@google.com> wrote:
> >  static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
> >  					u64 data, struct file **file)
> >  {
> > @@ -21,10 +28,17 @@ static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
> >  static bool vfio_pci_liveupdate_can_preserve(struct liveupdate_file_handler *handler,
> >  					     struct file *file)
> >  {
> > -	return -EOPNOTSUPP;
> > +	struct vfio_device *device = vfio_device_from_file(file);
> > +
> > +	if (!device)
> > +		return false;
> > +
> > +	guard(mutex)(&device->dev_set->lock);
> > +	return vfio_device_cdev_opened(device);
>
> IIUC, vfio_device_cdev_opened(device) will only return true after
> vfio_df_ioctl_bind_iommufd(). Where it does:
> 	device->cdev_opened = true;
> 
> Does this imply that devices not bound to an iommufd cannot be
> preserved?

Even if being bound to an iommufd is required, it seems wrong to check
it in can_preserve(), as the device can just be unbound from the iommufd
before preserve().

I think can_preserve() just needs to check if this is a VFIO cdev file,
i.e. vfio_device_from_file() returns non-NULL.

> 
> If so, I am confused about your cover letter step #15
> > 15. It makes usual bind iommufd and attach page table calls.
> 
> Does it mean after restoration, we have to bind iommufd again?

This is still being discussed. These are the two options currently:

 - When userspace retrieves the iommufd from LUO after kexec, the kernel
   will internally restore all VFIO cdevs and bind them to the iommufd
   in a single step.

 - Userspace will retrieve the iommufd and cdevs from LUO separately,
   and then bind each cdev to the iommufd like they were before kexec.

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by Pasha Tatashin 3 months, 1 week ago

On Thu, Oct 30, 2025 at 7:10 PM David Matlack <dmatlack@google.com> wrote:
>
> On 2025-10-27 01:44 PM, Jacob Pan wrote:
> > On Fri, 17 Oct 2025 17:06:58 -0700 Vipin Sharma <vipinsh@google.com> wrote:
> > > >  static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
> > > >  					u64 data, struct file **file)
> > > >  {
> > > > @@ -21,10 +28,17 @@ static int vfio_pci_liveupdate_retrieve(struct liveupdate_file_handler *handler,
> > > >  static bool vfio_pci_liveupdate_can_preserve(struct liveupdate_file_handler *handler,
> > > >  					     struct file *file)
> > > >  {
> > > -   return -EOPNOTSUPP;
> > > +   struct vfio_device *device = vfio_device_from_file(file);
> > > +
> > > +   if (!device)
> > > +           return false;
> > > +
> > > +   guard(mutex)(&device->dev_set->lock);
> > > +   return vfio_device_cdev_opened(device);
> >
> > IIUC, vfio_device_cdev_opened(device) will only return true after
> > vfio_df_ioctl_bind_iommufd(). Where it does:
> >       device->cdev_opened = true;
> >
> > Does this imply that devices not bound to an iommufd cannot be
> > preserved?
>
> Even if being bound to an iommufd is required, it seems wrong to check
> it in can_preserve(), as the device can just be unbound from the iommufd
> before preserve().
>
> I think can_preserve() just needs to check if this is a VFIO cdev file,
> i.e. vfio_device_from_file() returns non-NULL.

+1, can_preserve() must be fast, as it might be called on every single
FD that is being preserved, to check whether the type is correct.
So, simply check whether the "struct file" is a cdev via an ops check,
and that's it. It should be a very simple operation.

>
> >
> > If so, I am confused about your cover letter step #15
> > > 15. It makes usual bind iommufd and attach page table calls.
> >
> > Does it mean after restoration, we have to bind iommufd again?
>
> This is still being discussed. These are the two options currently:
>
>  - When userspace retrieves the iommufd from LUO after kexec, the kernel
>    will internally restore all VFIO cdevs and bind them to the iommufd
>    in a single step.
>
>  - Userspace will retrieve the iommufd and cdevs from LUO separately,
>    and then bind each cdev to the iommufd like they were before kexec.

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by David Matlack 3 months, 1 week ago

On Thu, Oct 30, 2025 at 5:19 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> On Thu, Oct 30, 2025 at 7:10 PM David Matlack <dmatlack@google.com> wrote:
> > On 2025-10-27 01:44 PM, Jacob Pan wrote:
> > > On Fri, 17 Oct 2025 17:06:58 -0700 Vipin Sharma <vipinsh@google.com> wrote:
> > > > +   guard(mutex)(&device->dev_set->lock);
> > > > +   return vfio_device_cdev_opened(device);
> > >
> > > IIUC, vfio_device_cdev_opened(device) will only return true after
> > > vfio_df_ioctl_bind_iommufd(). Where it does:
> > >       device->cdev_opened = true;
> > >
> > > Does this imply that devices not bound to an iommufd cannot be
> > > preserved?
> >
> > Even if being bound to an iommufd is required, it seems wrong to check
> > it in can_preserve(), as the device can just be unbound from the iommufd
> > before preserve().
> >
> > I think can_preserve() just needs to check if this is a VFIO cdev file,
> > i.e. vfio_device_from_file() returns non-NULL.
>
> +1, can_preserve() must be fast, as it might be called on every single
> FD that is being preserved, to check if type is correct.
> So, simply check whether the "struct file" is a cdev via an ops check,
> and that's it. It should be a very simple operation.

Small correction: vfio_device_from_file() checks whether file->f_op is
&vfio_device_fops, but device files acquired via group FDs use the
same ops. So I think we actually need to check "device &&
!device->group" here to identify VFIO cdev files, and then check
device->ops == &vfio_pci_ops to make sure this is a vfio-pci device.
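
For illustration, the check proposed above can be modelled in plain
userspace C. This is only a sketch under stated assumptions: every struct
below is a stripped-down stand-in for the kernel type of the same name, a
NULL group pointer is taken to mean "opened via cdev" exactly as worded in
this reply, and sketch_can_preserve() is a hypothetical name, not a
function from the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; only the fields the check needs. */
struct vfio_device_ops { const char *name; };
struct vfio_group { int id; };
struct vfio_device {
	const struct vfio_device_ops *ops;
	struct vfio_group *group;	/* assumption: NULL means cdev */
};
struct file_operations { int unused; };
struct vfio_device_file { struct vfio_device *device; };
struct file {
	const struct file_operations *f_op;
	void *private_data;
};

static const struct file_operations vfio_device_fops = { 0 };
static const struct vfio_device_ops vfio_pci_ops = { "vfio-pci" };

/* Mirrors the helper exported by the patch: NULL unless a VFIO device file. */
static struct vfio_device *vfio_device_from_file(struct file *file)
{
	struct vfio_device_file *df = file->private_data;

	if (file->f_op != &vfio_device_fops)
		return NULL;
	return df->device;
}

/* Hypothetical can_preserve(): type checks only, no locks, no bind state. */
static bool sketch_can_preserve(struct file *file)
{
	struct vfio_device *device = vfio_device_from_file(file);

	if (!device || device->group)
		return false;	/* not a VFIO file, or obtained via a group */
	return device->ops == &vfio_pci_ops;	/* vfio-pci devices only */
}
```

The model takes no mutex and does not look at cdev_opened, which matches
the point made earlier in the thread that can_preserve() should stay a
cheap type check.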

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by Jason Gunthorpe 3 months, 1 week ago

On Mon, Oct 27, 2025 at 01:44:30PM -0700, Jacob Pan wrote:
> I have a separate question regarding noiommu devices. I’m currently
> working on adding noiommu mode support for VFIO cdev under iommufd.

Oh how is that going? I was just thinking about that again..

After writing the generic pt self test it occurred to me we now have
enough infrastructure for iommufd to internally create its own
iommu_domain with an AMDv1 page table for the noiommu devices. It would
then be so easy to feed that through the existing machinery and have
all the pinning/etc work.

Then only an ioctl to read back the physical addresses from this
special domain would be needed

It actually sort of feels pretty easy..

Jason

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by Jacob Pan 3 months, 1 week ago

On Tue, 28 Oct 2025 10:28:55 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Oct 27, 2025 at 01:44:30PM -0700, Jacob Pan wrote:
> > I have a separate question regarding noiommu devices. I’m currently
> > working on adding noiommu mode support for VFIO cdev under iommufd.
> >  
> 
> Oh how is that going? I was just thinking about that again..
> 
I initially tried to create a special VFIO no-iommu iommu_domain
without an iommu driver, but I found it difficult without iommu_group
and other machinery. I also had a special vfio_device_ops,
vfio_pci_noiommu_ops, with a special vfio_iommufd_noiommu_bind to create
an iommufd_access object as in Yi's original patch.

My current approach is that I have a special noiommu driver that handles
the special iommu_domain. It seems much cleaner, though with some extra
code overhead. I have a working prototype that has:
# tree /dev/vfio/    
/dev/vfio/           
|-- 7                
|-- devices          
|   `-- noiommu-vfio0
`-- vfio             

And the typical:
/sys/class/iommu/noiommu/
|-- devices
|   |-- 0000:00:00.0 -> ../../../../pci0000:00/0000:00:00.0
|   |-- 0000:00:01.0 -> ../../../../pci0000:00/0000:00:01.0
|   |-- 0000:00:02.0 -> ../../../../pci0000:00/0000:00:02.0
|   |-- 0000:00:03.0 -> ../../../../pci0000:00/0000:00:03.0
|   |-- 0000:00:04.0 -> ../../../../pci0000:00/0000:00:04.0
|   |-- 0000:00:05.0 -> ../../../../pci0000:00/0000:00:05.0
|   |-- 0000:01:00.0 -> ../../../../pci0000:00/0000:00:04.0/0000:0

The following user test can pass:
1. __iommufd = open("/dev/iommu", O_RDWR);
2. devfd = open a noiommu cdev
3. ioas_id = ioas_alloc(__iommufd)
4. iommufd_bind(__iommufd, devfd)
5. successfully do an ioas map, e.g.
   ioctl(__iommufd, IOMMU_IOAS_MAP, &map)
   This will call pfn_reader_user_pin(), but the noiommu driver does
   nothing for mapping.

I am still debugging some cases, and would like to have a direction
check before going too far.

> After writing the generic pt self test it occurred to me we now have
> enough infrastructure for iommufd to internally create its own
> iommu_domain with an AMDv1 page table for the noiommu devices. It would
> then be so easy to feed that through the existing machinery and have
> all the pinning/etc work.
> 
Could you elaborate a little more? noiommu devices don't have page
tables. Are you saying iommufd can create its own iommu_domain w/o a
vendor iommu driver? Let me catch up with your v7 :)

> Then only an ioctl to read back the physical addresses from this
> special domain would be needed
> 
Yes, that was part of your original suggestion to avoid /proc pagemap.
I have not added that yet. Do you think this warrants a new ioctl, or
should it just be returned in
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE,
		.ioas_id = ioas_id,
		.iova = iova,
		.user_va = uvaddr,
		.length = size,
	};

> It actually sort of feels pretty easy..
> 
> Jason

Re: [RFC PATCH 06/21] vfio/pci: Accept live update preservation request for VFIO cdev

Posted by Jason Gunthorpe 3 months, 1 week ago

On Tue, Oct 28, 2025 at 10:39:45AM -0700, Jacob Pan wrote:

> My current approach is that I have a special noiommu driver that handles
> the special iommu_domain. It seems much cleaner though some extra code
> overhead. I have a working prototype that has:

Oh interesting, maybe that is OK and reasonable.. My first worry is
that we don't well support iommu driver hot unplug, but if it is very
carefully controlled I think we can make it safe. iommufd selftests is
already doing this and I've been trying to make sure it stays safe
without races or memory leaks..

Binding is going to also need some fiddling because we don't want to
mess with the fwspec on a real struct device..

But maybe we can have some kind of direct 'bind iommu driver to struct
device' call?

> The following user test can pass:
> 1. __iommufd = open("/dev/iommu", O_RDWR);
> 2. devfd = open a noiommu cdev
> 3. ioas_id = ioas_alloc(__iommufd)
> 4. iommufd_bind(__iommufd, devfd)
> 5. successfully do an ioas map, e.g.
> ioctl(iommufd, IOMMU_IOAS_MAP, &map) 
> This will call pfn_reader_user_pin() but the noiommu driver does
> nothing for mapping.

Makes sense.

So you can't have a paging iommu_domain that doesn't have a map
function - that just won't work for iommufd. What you should do is use
the iommu pt stuff and have the noiommu driver implement its paging
domain using the amdv1 format.

That will give you map/unmap/iova_to_phys and then iommufd will
immediately fully work.

Look at how that series handles the selftest; the simple selftest
iommu_domain is very close to what you need. It is pretty small
code-wise.
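
To make the map/unmap/iova_to_phys contract concrete, here is a toy
userspace model of a paging domain. It is a sketch only: it uses a flat,
single-level table rather than the AMDv1 format or the generic pt code,
and every name in it (toy_domain, toy_map, toy_unmap, toy_iova_to_phys)
is hypothetical and not taken from the selftest or any kernel API.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1ULL << TOY_PAGE_SHIFT)
#define TOY_NR_PTES	512
#define TOY_PTE_PRESENT	1ULL

/* One flat table: pte[i] holds (paddr | TOY_PTE_PRESENT), 0 means empty. */
struct toy_domain {
	uint64_t pte[TOY_NR_PTES];
};

/* Map [iova, iova + size) to [paddr, paddr + size); 0 on success, -1 on error. */
static int toy_map(struct toy_domain *d, uint64_t iova, uint64_t paddr,
		   uint64_t size)
{
	uint64_t first = iova >> TOY_PAGE_SHIFT;
	uint64_t n = size >> TOY_PAGE_SHIFT;
	uint64_t i;

	if ((iova | paddr | size) & (TOY_PAGE_SIZE - 1))
		return -1;	/* everything must be page aligned */
	if (first >= TOY_NR_PTES || n > TOY_NR_PTES - first)
		return -1;	/* outside the toy table */
	for (i = 0; i < n; i++)
		if (d->pte[first + i] & TOY_PTE_PRESENT)
			return -1;	/* refuse to overwrite a mapping */
	for (i = 0; i < n; i++)
		d->pte[first + i] = (paddr + i * TOY_PAGE_SIZE) | TOY_PTE_PRESENT;
	return 0;
}

/* Clear PTEs from iova until a hole or the end; returns bytes unmapped. */
static uint64_t toy_unmap(struct toy_domain *d, uint64_t iova, uint64_t size)
{
	uint64_t first = iova >> TOY_PAGE_SHIFT;
	uint64_t n = size >> TOY_PAGE_SHIFT;
	uint64_t i, unmapped = 0;

	if ((iova | size) & (TOY_PAGE_SIZE - 1))
		return 0;
	if (first >= TOY_NR_PTES || n > TOY_NR_PTES - first)
		return 0;
	for (i = 0; i < n; i++) {
		if (!(d->pte[first + i] & TOY_PTE_PRESENT))
			break;
		d->pte[first + i] = 0;
		unmapped += TOY_PAGE_SIZE;
	}
	return unmapped;
}

/* Translate an iova to a physical address; 0 if nothing is mapped there. */
static uint64_t toy_iova_to_phys(const struct toy_domain *d, uint64_t iova)
{
	uint64_t idx = iova >> TOY_PAGE_SHIFT;

	if (idx >= TOY_NR_PTES || !(d->pte[idx] & TOY_PTE_PRESENT))
		return 0;
	return (d->pte[idx] & ~(TOY_PAGE_SIZE - 1)) |
	       (iova & (TOY_PAGE_SIZE - 1));
}
```

The point of the model is only that a domain which records translations
can answer iova_to_phys later, which a map callback that does nothing
cannot.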

> > After writing the generic pt self test it occurred to me we now have
> > enough infrastructure for iommufd to internally create its own
> > iommu_domain with an AMDv1 page table for the noiommu devices. It would
> > then be so easy to feed that through the existing machinery and have
> > all the pinning/etc work.
>
> Could you elaborate a little more? noiommu devices don't have page
> tables. Are you saying iommufd can create its own iommu_domain w/o a
> vendor iommu driver? Let me catch up with your v7 :)

That was my suggestion, but it seems you tried that and decided it was
too hard with groups/etc. OK.

Adding a dummy iommu driver solves that and you still get to the same
place where there is a paging iommu domain that implements an actual
page table with map/unmap/iova_to_phys. From this perspective iommufd
will be entirely happy and will do all the required pinning and
unpinning.

> > Then only an ioctl to read back the physical addresses from this
> > special domain would be needed
>
> Yes, that was part of your original suggestion to avoid /proc pagemap.
> I have not added that yet. Do you think this warrant a new ioctl or
> just return it in

I think a new ioctl is probably the right idea..

Jason