[RFC PATCH 13/15] iommufd: Persist iommu domains for live update

Samiullah Khawaja posted 15 patches 4 months, 2 weeks ago
There is a newer version of this series
Posted by Samiullah Khawaja 4 months, 2 weeks ago
From: YiFei Zhu <zhuyifei@google.com>

Iterate through all the IOAS objects and their underlying hwpt_paging
objects. Persist each iommu domain using the iommu_domain_preserve() API.

This is temporary, as only the domains attached to the persisted devices
need to be preserved.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Samiullah Khawaja <skhawaja@google.com>
---
 drivers/iommu/iommufd/liveupdate.c | 47 ++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/drivers/iommu/iommufd/liveupdate.c b/drivers/iommu/iommufd/liveupdate.c
index 1bdd5a82af90..0af0c6fadff1 100644
--- a/drivers/iommu/iommufd/liveupdate.c
+++ b/drivers/iommu/iommufd/liveupdate.c
@@ -8,9 +8,52 @@
 #include <linux/kexec_handover.h>
 #include <linux/liveupdate.h>
 #include <linux/mm.h>
+#include <linux/pci.h>
 
 #include "iommufd_private.h"
 
+static int iommufd_save_ioas(struct iommufd_ctx *ictx,
+			     struct iommufd_lu *iommufd_lu)
+{
+	struct iommufd_hwpt_paging *hwpt_paging;
+	struct iommufd_ioas *ioas = NULL;
+	struct iommufd_object *obj;
+	unsigned long index;
+	int rc;
+
+	/* Iterate each ioas. */
+	xa_for_each(&ictx->objects, index, obj) {
+		if (obj->type != IOMMUFD_OBJ_IOAS)
+			continue;
+
+		ioas = (struct iommufd_ioas *)obj;
+		mutex_lock(&ioas->mutex);
+
+		/*
+		 * TODO: Iterate over each device of this iommufd and only save
+		 * hwpt/domain if the device is persisted.
+		 */
+		list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
+			if (!hwpt_paging->common.domain)
+				continue;
+
+			rc = iommu_domain_preserve(hwpt_paging->common.domain);
+			if (rc)
+				goto err;
+		}
+
+		mutex_unlock(&ioas->mutex);
+		ioas = NULL;
+	}
+
+	return 0;
+
+err:
+	if (ioas)
+		mutex_unlock(&ioas->mutex);
+	return rc;
+}
+
 static int iommufd_liveupdate_prepare(struct liveupdate_file_handler *handler,
 				      struct file *file, u64 *data)
 {
@@ -33,6 +76,10 @@ static int iommufd_liveupdate_prepare(struct liveupdate_file_handler *handler,
 
 	iommufd_lu = folio_address(folio_lu);
 
+	rc = iommufd_save_ioas(ictx, iommufd_lu);
+	if (rc)
+		goto err_folio_put;
+
 	rc = kho_preserve_folio(folio_lu);
 	if (rc)
 		goto err_folio_put;
-- 
2.51.0.536.g15c5d4f767-goog
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> +			     struct iommufd_lu *iommufd_lu)
> +{
> +	struct iommufd_hwpt_paging *hwpt_paging;
> +	struct iommufd_ioas *ioas = NULL;
> +	struct iommufd_object *obj;
> +	unsigned long index;
> +	int rc;
> +
> +	/* Iterate each ioas. */
> +	xa_for_each(&ictx->objects, index, obj) {
> +		if (obj->type != IOMMUFD_OBJ_IOAS)
> +			continue;

Wrong locking

> +
> +		ioas = (struct iommufd_ioas *)obj;
> +		mutex_lock(&ioas->mutex);
> +
> +		/*
> +		 * TODO: Iterate over each device of this iommufd and only save
> +		 * hwpt/domain if the device is persisted.
> +		 */
> +		list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> +			if (!hwpt_paging->common.domain)
> +				continue;

I don't think this should be automatic. The user should directly
serialize/unserialize HWPTs by ID.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
On Mon, Sep 29, 2025 at 12:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > +                          struct iommufd_lu *iommufd_lu)
> > +{
> > +     struct iommufd_hwpt_paging *hwpt_paging;
> > +     struct iommufd_ioas *ioas = NULL;
> > +     struct iommufd_object *obj;
> > +     unsigned long index;
> > +     int rc;
> > +
> > +     /* Iterate each ioas. */
> > +     xa_for_each(&ictx->objects, index, obj) {
> > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > +                     continue;
>
> Wrong locking
>
> > +
> > +             ioas = (struct iommufd_ioas *)obj;
> > +             mutex_lock(&ioas->mutex);
> > +
> > +             /*
> > +              * TODO: Iterate over each device of this iommufd and only save
> > +              * hwpt/domain if the device is persisted.
> > +              */
> > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > +                     if (!hwpt_paging->common.domain)
> > +                             continue;
>
> I don't think this should be automatic. The user should directly
> serialize/unserialize HWPTs by ID.

Why not?  The Live Update uAPI is handled through FDs, and both iommufd
and vfiofd have to be preserved; I assume we can automatically
determine the hwpt to be preserved through dependencies. Why would we
delegate this to the user?

Pasha
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Tue, Sep 30, 2025 at 09:07:48AM -0400, Pasha Tatashin wrote:
> On Mon, Sep 29, 2025 at 12:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > > +                          struct iommufd_lu *iommufd_lu)
> > > +{
> > > +     struct iommufd_hwpt_paging *hwpt_paging;
> > > +     struct iommufd_ioas *ioas = NULL;
> > > +     struct iommufd_object *obj;
> > > +     unsigned long index;
> > > +     int rc;
> > > +
> > > +     /* Iterate each ioas. */
> > > +     xa_for_each(&ictx->objects, index, obj) {
> > > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > > +                     continue;
> >
> > Wrong locking
> >
> > > +
> > > +             ioas = (struct iommufd_ioas *)obj;
> > > +             mutex_lock(&ioas->mutex);
> > > +
> > > +             /*
> > > +              * TODO: Iterate over each device of this iommufd and only save
> > > +              * hwpt/domain if the device is persisted.
> > > +              */
> > > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > > +                     if (!hwpt_paging->common.domain)
> > > +                             continue;
> >
> > I don't think this should be automatic. The user should directly
> > serialize/unserialize HWPTs by ID.
> 
> Why not?  The Live Update uAPI is handled through FDs, and both iommufd
> and vfiofd have to be preserved; I assume we can automatically
> determine the hwpt to be preserved through dependencies. Why would we
> delegate this to the user?

There are HWPTs outside the IOAS so it is inconsistent.

We are not going to reconstruct the IOAS.

The IDR ids of the HWPT may not be available on restore (we cannot
make this ABI), so without userspace expressly labeling them and
recovering the new IDR ids it doesn't work.

Finally we expect to discard the preserved HWPTs and replace them with
rebuilt ones at least as a first step. Userspace needs to sequence all
of this..
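
To make the ID point concrete, here is a rough user-space C model, not
iommufd code. Every name in it (lu_preserve(), lu_restore(), the token
field) is hypothetical and only illustrates the idea: userspace labels an
HWPT when preserving it, and after the update presents the same label to
recover whatever new object ID was assigned, so the old IDR id never
becomes ABI.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PRESERVED 8

/* One preserved HWPT: a userspace-chosen label plus an opaque cookie. */
struct preserved_hwpt {
	uint64_t token;		/* label picked by userspace (assumption) */
	uint64_t domain;	/* stand-in for the preserved iommu_domain */
	int in_use;
};

struct lu_state {
	struct preserved_hwpt slots[MAX_PRESERVED];
};

/* Old kernel side: record a domain under a label. Returns 0 or -1. */
static int lu_preserve(struct lu_state *st, uint64_t token, uint64_t domain)
{
	int free_slot = -1;

	for (int i = 0; i < MAX_PRESERVED; i++) {
		if (st->slots[i].in_use && st->slots[i].token == token)
			return -1;	/* label already taken */
		if (!st->slots[i].in_use && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1;	/* table full */
	st->slots[free_slot] = (struct preserved_hwpt){ token, domain, 1 };
	return 0;
}

/* New kernel side: look up by label and hand out a fresh object ID. */
static int lu_restore(struct lu_state *st, uint64_t token,
		      uint32_t *next_id, uint32_t *new_id, uint64_t *domain)
{
	assert(next_id && new_id && domain);
	for (int i = 0; i < MAX_PRESERVED; i++) {
		struct preserved_hwpt *p = &st->slots[i];

		if (!p->in_use || p->token != token)
			continue;
		*new_id = (*next_id)++;	/* new ID is unrelated to the old one */
		*domain = p->domain;
		p->in_use = 0;		/* each label can be restored once */
		return 0;
	}
	return -1;	/* nothing was preserved under this label */
}
```

In this model the caller never learns or depends on the pre-update ID;
the label is the only thing that has to survive the kexec.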

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Tue, Sep 30, 2025 at 6:59 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 30, 2025 at 09:07:48AM -0400, Pasha Tatashin wrote:
> > On Mon, Sep 29, 2025 at 12:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > > > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > > > +                          struct iommufd_lu *iommufd_lu)
> > > > +{
> > > > +     struct iommufd_hwpt_paging *hwpt_paging;
> > > > +     struct iommufd_ioas *ioas = NULL;
> > > > +     struct iommufd_object *obj;
> > > > +     unsigned long index;
> > > > +     int rc;
> > > > +
> > > > +     /* Iterate each ioas. */
> > > > +     xa_for_each(&ictx->objects, index, obj) {
> > > > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > > > +                     continue;
> > >
> > > Wrong locking
> > >
> > > > +
> > > > +             ioas = (struct iommufd_ioas *)obj;
> > > > +             mutex_lock(&ioas->mutex);
> > > > +
> > > > +             /*
> > > > +              * TODO: Iterate over each device of this iommufd and only save
> > > > +              * hwpt/domain if the device is persisted.
> > > > +              */
> > > > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > > > +                     if (!hwpt_paging->common.domain)
> > > > +                             continue;
> > >
> > > I don't think this should be automatic. The user should directly
> > > serialize/unserialize HWPTs by ID.
> >
> > Why not?  The Live Update uAPI is handled through FDs, and both iommufd
> > and vfiofd have to be preserved; I assume we can automatically
> > determine the hwpt to be preserved through dependencies. Why would we
> > delegate this to the user?
>
> There are HWPTs outside the IOAS so it is inconsistent.

This makes sense. But if I understand correctly a HWPT should be
associated one way or another to a preserved device or IOAS. Also the
nested ones will have parent HWPT. Can we not look at the dependencies
here and find the HWPTs that need to be preserved?
>
> We are not going to reconstruct the IOAS.
>
> The IDR ids of the HWPT may not be available on restore (we cannot
> make this ABI), so without userspace expressly labeling them and
> recovering the new IDR ids it doesn't work.
>
> Finally we expect to discard the preserved HWPTs and replace them with
> rebuilt ones at least as a first step. Userspace needs to sequence all
> of this..

But if we discard the old HWPTs and replace them with the new ones, we
shouldn't need labeling of the old HWPTs? We would definitely need to
sequence the replacement and discard of the old ones, but that can
also be inferred through the dependencies between the new HWPTs?
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Tue, Sep 30, 2025 at 01:02:31PM -0700, Samiullah Khawaja wrote:
> > There are HWPTs outside the IOAS so it is inconsistent.
> 
> This makes sense. But if I understand correctly a HWPT should be
> associated one way or another to a preserved device or IOAS. Also the
> nested ones will have parent HWPT. Can we not look at the dependencies
> here and find the HWPTs that need to be preserved?

Maybe in some capacity, but I would rather say: don't allow preserving
things that depend on things not already preserved somehow.

> > Finally we expect to discard the preserved HWPTs and replace them with
> > rebuilt ones at least as a first step. Userspace needs to sequence all
> > of this..
> 
> But if we discard the old HWPTs and replace them with the new ones, we
> shouldn't need labeling of the old HWPTs? We would definitely need to
> sequence the replacement and discard of the old ones, but that can
> also be inferred through the dependencies between the new HWPTs?

It depends how this ends up being designed and who is responsible to
free the restored iommu_domain.

The iommu core code should be restoring the iommu_domain as soon as
the attached device is plugged in and attaching the preserved domain
instead of something else during the device probe sequence

This logic should not be in drivers.

From there you either put the hwpt back into iommufd and have it free
the iommu_domain when it destroys the hwpt

Or you have the iommu core code free the iommu_domain at some point
after iommufd has replaced the attachment with a new iommu_domain?

I'm not sure which is a better option..

Also there is an interesting behavior to note that if the iommu driver
restores a domain then it will also prevent a non-vfio driver from
binding to that device.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Tue, Sep 30, 2025 at 2:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 30, 2025 at 01:02:31PM -0700, Samiullah Khawaja wrote:
> > > There are HWPTs outside the IOAS so it is inconsistent.
> >
> > This makes sense. But if I understand correctly a HWPT should be
> > associated one way or another to a preserved device or IOAS. Also the
> > nested ones will have parent HWPT. Can we not look at the dependencies
> > here and find the HWPTs that need to be preserved?
>
> Maybe in some capacity, but I would rather say: don't allow preserving
> things that depend on things not already preserved somehow.

I agree. I think this makes sense. Users can explicitly indicate that
they want to preserve HWPTs and iommufd can enforce the dependencies.
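
A minimal sketch of that enforcement, as a user-space C model rather than
real iommufd code (struct obj and preserve_obj() are made-up names, and
-EINVAL is just an assumed error code): an explicit preserve request is
refused while the object it depends on, e.g. the parent of a nested HWPT,
has not been preserved yet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an iommufd object with one dependency. */
struct obj {
	const struct obj *parent;	/* NULL if it depends on nothing */
	int preserved;
};

/*
 * Explicit, user-requested preserve. Nothing is pulled in automatically:
 * the request simply fails until the dependency has been preserved.
 */
static int preserve_obj(struct obj *o)
{
	assert(o);
	if (o->parent && !o->parent->preserved)
		return -EINVAL;
	o->preserved = 1;
	return 0;
}
```

With this shape userspace has to preserve the parent paging HWPT first and
the nested HWPT second; the kernel only checks the order.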
>
> > > Finally we expect to discard the preserved HWPTs and replace them with
> > > rebuilt ones at least as a first step. Userspace needs to sequence all
> > > of this..
> >
> > But if we discard the old HWPTs and replace them with the new ones, we
> > shouldn't need labeling of the old HWPTs? We would definitely need to
> > sequence the replacement and discard of the old ones, but that can
> > also be inferred through the dependencies between the new HWPTs?
>
> It depends how this ends up being designed and who is responsible to
> free the restored iommu_domain.

Agreed. I think it depends on how much is restored from the previous
kernel. Discussed further below inline.
>
> The iommu core code should be restoring the iommu_domain as soon as
> the attached device is plugged in and attaching the preserved domain
> instead of something else during the device probe sequence
>
> This logic should not be in drivers.
>
> From there you either put the hwpt back into iommufd and have it free
> the iommu_domain when it destroys the hwpt
>
> Or you have the iommu core code free the iommu_domain at some point
> after iommufd has replaced the attachment with a new iommu_domain?

But we cannot do the replacement during domain attachment because
userspace might not have fully prepared the new domain with all the
required DMA mappings. Replace during LUO finish?

This is actually very close to what I had in mind for the "Hotswap"
model. My thought was:

1. During boot, the IOMMU core sets up a default domain but doesn't
program the context entries for the preserved device. The hardware
keeps on using the old preserved tables.
2. Userspace restores the iommufd, creates a new HWPT/domain and
populates mappings.
3. On FINISH, the IOMMU core updates the context entries of preserved
devices to point to the new domain.

I have a sequence diagram for this in the cover letter also.

I understand the desire to have the preserved iommu domain be restored
during boot so the device has a default domain and there is an owner
of the attached restored domain, but that would prevent the iommufd
from cooking a clean new domain.

Maybe we can refine the "Hotswap" model I had in mind. Basically on
boot the core restores the preserved iommu domain, but core lets
iommufd attach a new domain with preserved devices without replacing
the underlying context entries? The core replaces the context entries
when the iommufd indicates that the domain is fully prepared (during
luo finish).
>
> I'm not sure which is a better option..
>
> Also there is an interesting behavior to note that if the iommu driver
> restores a domain then it will also prevent a non-vfio driver from
> binding to that device.

Agreed. I think in the "Hotswap" approach I discussed above, if we
don't restore the domain, the core can just commit the context entries
of the new default domain if a non-vfio driver is bound to the device.
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Tue, Sep 30, 2025 at 04:15:43PM -0700, Samiullah Khawaja wrote:

> > The iommu core code should be restoring the iommu_domain as soon as
> > the attached device is plugged in and attaching the preserved domain
> > instead of something else during the device probe sequence
> >
> > This logic should not be in drivers.
> >
> > From there you either put the hwpt back into iommufd and have it free
> > the iommu_domain when it destroys the hwpt
> >
> > Or you have the iommu core code free the iommu_domain at some point
> > after iommufd has replaced the attachment with a new iommu_domain?
> 
> But we cannot do the replacement during domain attachment because
> userspace might not have fully prepared the new domain with all the
> required DMA mappings. Replace during LUO finish?

The idea is the kernel will restore the iommu_domain during early boot
in the iommu_core and then attach it. This should "rewrite" the IOMMU
HW context for that device with identical content. Drivers must be
enhanced to support this hitless rewrite (AMD and ARM are already
done).

At this point the kernel is operating normally with a normal domain
and a normal driver, no special luo stuff.

Later iommufd will come along and establish a HWPT that has an
identical translation. Then we replace the luo domain with the new
HWPT and free the luo domain.
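
Roughly, and only as a user-space C model of the sequence above (the
types and both function names are invented for illustration, nothing here
is an existing kernel interface):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dom_owner { OWNER_LUO, OWNER_IOMMUFD };

/* Stand-in for an iommu_domain; "freed" replaces a real free. */
struct domain {
	enum dom_owner owner;
	bool freed;
};

/* Stand-in for a device and its currently attached domain. */
struct device_model {
	struct domain *attached;
};

/* Early boot: the core attaches the restored domain during probe. */
static void core_attach_restored(struct device_model *dev, struct domain *luo)
{
	assert(luo->owner == OWNER_LUO && !luo->freed);
	dev->attached = luo;
}

/*
 * Later: userspace has rebuilt an HWPT with an identical translation
 * (not verified in this model) and attaches it. The luo domain is
 * released once nothing is attached to it any more.
 */
static int iommufd_replace(struct device_model *dev, struct domain *hwpt)
{
	struct domain *old = dev->attached;

	if (!old || hwpt->owner != OWNER_IOMMUFD || hwpt->freed)
		return -1;
	dev->attached = hwpt;
	if (old->owner == OWNER_LUO)
		old->freed = true;
	return 0;
}
```

Between the two calls the device simply has a normal attached domain, so
no special luo state is needed in that window.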

> 1. During boot, the IOMMU core sets up a default domain but doesn't
> program the context entries for the preserved device. The hardware
> keeps on using the old preserved tables.

When the iommu driver first starts up it can take over the context
memory from the predecessor kernel. But it has to go through it and
clear out most of the context entries.

Only context entries belonging to devices marked for preservation
should be kept unchanged.
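
As a sketch of that clean-up pass, in user-space C with an invented table
layout (a flat array of 64-bit entries and a parallel "preserved" bitmap
are assumptions, real context tables are driver specific):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CTX 8

/*
 * Walk the context table inherited from the previous kernel and clear
 * every entry whose device was not marked for preservation. Entries of
 * preserved devices are left untouched so their DMA keeps translating.
 * Returns the number of entries cleared.
 */
static size_t sanitize_ctx_table(uint64_t table[NR_CTX],
				 const bool preserved[NR_CTX])
{
	size_t cleared = 0;

	assert(table && preserved);
	for (size_t i = 0; i < NR_CTX; i++) {
		if (preserved[i] || table[i] == 0)
			continue;
		table[i] = 0;
		cleared++;
	}
	return cleared;
}
```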

Later we probe the struct device to the iommu and do as I said above
to restore consistency.

> 2. Userspace restores the iommufd, creates a new HWPT/domain and
> populates mappings.

Yes

> 3. On FINISH, the IOMMU core updates the context entries of preserved
> devices to point to the new domain.

No, finish should never do anything on the restore path, IMHO. User
should directly attach the newly created HWPT when it is ready.

> I understand the desire to have the preserved iommu domain be restored
> during boot so the device has a default domain and there is an owner
> of the attached restored domain, but that would prevent the iommufd
> from cooking a clean new domain.

The "default domain" is the "DMA API domain" and it has to be created
and setup always. The change here is instead of attaching the default
domain we attach the luo restored domain at early boot.

This sets the device into an "owned" mode but vfio can still attach
and nothing prevents iommufd from building a new hwpt and attaching
it.

> Maybe we can refine the "Hotswap" model I had in mind. Basically on
> boot the core restores the preserved iommu domain, but core lets
> iommufd attach a new domain with preserved devices without replacing
> the underlying context entries? 

Replace the context entries. If everything is working properly the
preserved domain should compute an identical context entry, so no
reason to not just "replace" it which should be a NOP.
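
To illustrate why that replace is harmless, a toy user-space C model (the
two-word entry format and the "present" bit are made up, and a real
hitless update in a driver is considerably more involved):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented context entry layout, for illustration only. */
struct ctx_entry {
	uint64_t lo;	/* page table root | present bit */
	uint64_t hi;	/* domain id */
};

/* Compute the entry a domain would program for a device. */
static struct ctx_entry make_entry(uint64_t pgtable_root, uint16_t domain_id)
{
	struct ctx_entry e = { .lo = pgtable_root | 1, .hi = domain_id };

	return e;
}

/*
 * Install an entry. If the restored domain computes the same bits the
 * previous kernel left behind, nothing is written and in-flight DMA is
 * not disturbed. Returns true only if the slot actually changed.
 */
static bool install_entry(struct ctx_entry *slot, struct ctx_entry want)
{
	if (slot->lo == want.lo && slot->hi == want.hi)
		return false;
	*slot = want;
	return true;
}
```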

> > Also there is an interesting behavior to note that if the iommu driver
> > restores a domain then it will also prevent a non-vfio driver from
> > binding to that device.
> 
> Agreed. I think in the "Hotswap" approach I discussed above, if we
> don't restore the domain, the core can just commit the context entries
> of the new default domain if a non-vfio driver is bound to the device.

As I said, the owned nature of the device will prevent attaching a
non-vfio driver in the first place.

So the only path forward for userspace is to attach vfio, and then
iommufd should take over that luo restored iommu_domain and eventually
free it.

You might consider that finish should de-own the device if vfio didn't
claim it. But that is a bit tricky since it needs a FLR before the
domains can be switched around.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Wed, Oct 1, 2025 at 4:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 30, 2025 at 04:15:43PM -0700, Samiullah Khawaja wrote:
>
> > > The iommu core code should be restoring the iommu_domain as soon as
> > > the attached device is plugged in and attaching the preserved domain
> > > instead of something else during the device probe sequence
> > >
> > > This logic should not be in drivers.
> > >
> > > From there you either put the hwpt back into iommufd and have it free
> > > the iommu_domain when it destroys the hwpt
> > >
> > > Or you have the iommu core code free the iommu_domain at some point
> > > after iommufd has replaced the attachment with a new iommu_domain?
> >
> > But we cannot do the replacement during domain attachment because
> > userspace might not have fully prepared the new domain with all the
> > required DMA mappings. Replace during LUO finish?
>
> The idea is the kernel will restore the iommu_domain during early boot
> in the iommu_core and then attach it. This should "rewrite" the IOMMU
> HW context for that device with identical content. Drivers must be
> enhanced to support this hitless rewrite (AMD and ARM are already
> done).
>
> At this point the kernel is operating normally with a normal domain
> and a normal driver, no special luo stuff.
>
> Later iommufd will come along and establish a HWPT that has an
> identical translation. Then we replace the luo domain with the new
> HWPT and free the luo domain.
>
> > 1. During boot, the IOMMU core sets up a default domain but doesn't
> > program the context entries for the preserved device. The hardware
> > keeps on using the old preserved tables.
>
> When the iommu driver first starts up it can take over the context
> memory from the predecessor kernel. But it has to go through it and
> clear out most of the context entries.
>
> Only context entries belonging to devices marked for preservation
> should be kept unchanged.

Agreed. We have to sanitize these and remove unused entries. I think
the same goes for any PASID tables.
>
> Later we probe the struct device to the iommu and do as I said above
> to restore consistency.
>
> > 2. Userspace restores the iommufd, creates a new HWPT/domain and
> > populates mappings.
>
> Yes
>
> > 3. On FINISH, the IOMMU core updates the context entries of preserved
> > devices to point to the new domain.
>
> No, finish should never do anything on the restore path, IMHO. User
> should directly attach the newly created HWPT when it is ready.

Makes sense. But if the user never replaces the restored iommu_domain
with a new HWPT, we will have to discard the old (restored) domain on
finish since it doesn't have any associated HWPT. I see you already
hinted at this below. This needs to be handled carefully considering
the vfio cdev FD state also. Discussed further below.
>
> > I understand the desire to have the preserved iommu domain be restored
> > during boot so the device has a default domain and there is an owner
> > of the attached restored domain, but that would prevent the iommufd
> > from cooking a clean new domain.
>
> The "default domain" is the "DMA API domain" and it has to be created
> and setup always. The change here is instead of attaching the default
> domain we attach the luo restored domain at early boot.

Oh... I meant the group->domain instead of group->default_domain.
Should have written active domain instead of default domain.
>
> This sets the device into an "owned" mode but vfio can still attach
> and nothing prevents iommufd from building a new hwpt and attaching
> it.

This is the part that I was concerned about since I was looking into
the auto_domain. Users that attach to ioas directly and use
auto_domain would not be able to restore the mappings before attaching
to the device. But users that use HWPT directly should be able to
prepare a new domain and hotswap when ready. But I think a new
interface can be built to support IOAS only use cases also. We can
revisit this later.
>
> > Maybe we can refine the "Hotswap" model I had in mind. Basically on
> > boot the core restores the preserved iommu domain, but core lets
> > iommufd attach a new domain with preserved devices without replacing
> > the underlying context entries?
>
> Replace the context entries. If everything is working properly the
> preserved domain should compute an identical context entry, so no
> reason to not just "replace" it which should be a NOP.
>
> > > Also there is an interesting behavior to note that if the iommu driver
> > > restores a domain then it will also prevent a non-vfio driver from
> > > binding to that device.
> >
> > Agreed. I think in the "Hotswap" approach I discussed above, if we
> > don't restore the domain, the core can just commit the context entries
> > of the new default domain if a non-vfio driver is bound to the device.
>
> As I said, the owned nature of the device will prevent attaching a
> non-vfio driver in the first place.
>
> So the only path forward for userspace is to attach vfio, and then
> iommufd should take over that luo restored iommu_domain and eventually
> free it.
>
> You might consider that finish should de-own the device if vfio didn't
> claim it. But that is a bit tricky since it needs a FLR before the
> domains can be switched around.

That's a good point. But it might be tricky since the ownership of the
device is with the vfio cdev FD. So if vfio cdev FD is never
restored/reclaimed the device can be FLR'd. iommufd will follow along
and discard the domain.

The more interesting case might be where cdev is restored and bound to
iommufd but the user never recreates and hotswaps a new HWPT. In this
case we can discard the restored iommu_domain and replace it with the
blocking domain as it should have been if the device was not
preserved.
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Wed, Oct 01, 2025 at 06:00:58PM -0700, Samiullah Khawaja wrote:
> > No, finish should never do anything on the restore path, IMHO. User
> > should directly attach the newly created HWPT when it is ready.
> 
> Makes sense. But if the user never replaces the restored iommu_domain
> with a new HWPT, we will have to discard the old (restored) domain on
> finish since it doesn't have any associated HWPT. I see you already
> hinted at this below. This needs to be handled carefully considering
> the vfio cdev FD state also. Discussed further below.

I think the simplest thing is the domain exists forever until
userspace attaches an iommufd, takes ownership of it and frees it.
Nothing to do with finish.

While the domain is attached iommu_device_use_default_domain() will
fail.

> This is the part that I was concerned about since I was looking into
> the auto_domain. Users that attach to ioas directly and use
> auto_domain would not be able to restore the mappings before attaching
> to the device.

IMHO luo users need to be sophisticated enough to avoid auto_domain.

> That's a good point. But it might be tricky since the ownership of the
> device is with the vfio cdev FD. So if vfio cdev FD is never
> restored/reclaimed the device can be FLR'd. iommufd will follow along
> and discard the domain.

Honestly, I keep wanting things to be kept as simple as possible with
as few exception flows as necessary.

If we make it so that iommu_device_claim_dma_owner() is aware of luo
and the only way vfio can get ownership is if it is also restoring the
luo session then that sounds perfect.

Attaching a non-luo VFIO would be blocked by the kernel so we never
get these inconsistencies.
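
Something along these lines, sketched as a user-space C model. The struct,
the restoring_luo_session flag and the -EPERM return are all assumptions
made for the example; the functions only borrow the spirit of
iommu_device_claim_dma_owner() and iommu_device_use_default_domain() and
are not their real signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for an iommu group that may carry a luo-restored domain. */
struct group_model {
	bool luo_restored;	/* a preserved domain is attached */
	const void *owner;	/* DMA owner cookie, NULL if unowned */
};

/* Only a claimer that is restoring the luo session may take ownership. */
static int claim_dma_owner(struct group_model *g, const void *owner,
			   bool restoring_luo_session)
{
	assert(g && owner);
	if (g->owner)
		return -EBUSY;
	if (g->luo_restored && !restoring_luo_session)
		return -EPERM;	/* non-luo VFIO is blocked */
	g->owner = owner;
	return 0;
}

/* A regular kernel driver cannot bind while the device is owned. */
static int use_default_domain(const struct group_model *g)
{
	return (g->luo_restored || g->owner) ? -EBUSY : 0;
}
```

Once the claim succeeds in this model, the luo special case is over and
the owner is responsible for the domain like for any other device.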

> The more interesting case might be where cdev is restored and bound to
> iommufd but the user never recreates and hotswaps a new HWPT. In this
> case we can discard the restored iommu_domain and replace it with the
> blocking domain as it should have been if the device was not
> preserved.

Maybe the HWPT has to be auto-created inside the iommufd as soon as it
is attached. The "restore" ioctl would just return back the ID of this
already created HWPT.

Again, this seems to avoid special cases as once we exit the special
luo mode of iommu_device_claim_dma_owner() iommufd is always
responsible for the iommu_domain.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Thu, Oct 2, 2025 at 6:41 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 01, 2025 at 06:00:58PM -0700, Samiullah Khawaja wrote:
> > > No, finish should never do anything on the restore path, IMHO. User
> > > should directly attach the newly created HWPT when it is ready.
> >
> > Makes sense. But if the user never replaces the restored iommu_domain
> > with a new HWPT, we will have to discard the old (restored) domain on
> > finish since it doesn't have any associated HWPT. I see you already
> > hinted at this below. This needs to be handled carefully considering
> > the vfio cdev FD state also. Discussed further below.
>
> I think the simplest thing is the domain exists forever until
> userspace attaches an iommufd, takes ownership of it and frees it.
> Nothing to do with finish.

Hmm.. I think this is tricky. There needs to be a way to clean up and
discard the old state if the userspace doesn't need it. And I think
the LUO (session) FINISH event is that trigger. Basically if the LUO
session manager (VMM or LUOD) decides that the finish needs to happen
and the iommufd (or the underlying HWPTs) are not restored, it means
that LUOD has decided that the VM is not going to come up and the
preserved state and resources (domain, device, memory) need to be
freed/released. If we don't do this in "FINISH" then the system will
be in a stuck state and the VM scheduler cannot schedule another VM
using the same device and resources.
>
> While the domain is attached iommu_device_use_default_domain() will
> fail.

Yes this makes sense.
>
> > This is the part that I was concerned about since I was looking into
> > the auto_domain. Users that attach to ioas directly and use
> > auto_domain would not be able to restore the mappings before attaching
> > to the device.
>
> IMHO luo users need to be sophisticated enough to avoid auto_domain.

Agreed.
>
> > That's a good point. But it might be tricky since the ownership of the
> > device is with the vfio cdev FD. So if vfio cdev FD is never
> > restored/reclaimed the device can be FLR'd. iommufd will follow along
> > and discard the domain.
>
> Honestly, I keep wanting things to be kept as simple as possible with
> as few exception flows as necessary.
>
> If we make it so that iommu_device_claim_dma_owner() is aware of luo
> and the only way vfio can get ownership is if it is also restoring the
> luo session then that sounds perfect.
>
> Attaching a non-luo VFIO would be blocked by the kernel so we never
> get these inconsistencies.
>
> > The more interesting case might be where cdev is restored and bound to
> > iommufd but the user never recreates and hotswaps a new HWPT. In this
> > case we can discard the restored iommu_domain and replace it with the
> > blocking domain as it should have been if the device was not
> > preserved.
>
> Maybe the HWPT has to be auto-created inside the iommufd as soon as it
> is attached. The "restore" ioctl would just return back the ID of this
> already created HWPT.

Once we return the ID, do we make this HWPT mutable? Or is this
re-created HWPT just a handle to keep the domain ownership?

I think if we make it mutable, this will really complicate the design
and we will get into the sanity checking about attach/detach and
map/unmap calls on this HWPT. I think keeping the restored domain
attached to the preserved device until it is hotswapped with a new
HWPT is cleaner and simpler as you desire it to be.

I think if we consider FINISH a point where everything is supposed to
be reclaimed or discarded then this problem is solved. This should
also allow LUOD to clean up the resources and create new VMs using the
same device and resources. I see you suggested in the other thread
with Pasha that we can make FINISH fail if things are not reclaimed, I
think that also means that the system would be stuck in this state
indefinitely. Maybe this is correct since the domain is owned by VFIO
and needs to be released by it.

>
> Again, this seems to avoid special cases as once we exit the special
> luo mode of iommu_device_claim_dma_owner() iommufd is always
> responsible for the iommu_domain.
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Thu, Oct 02, 2025 at 10:03:05AM -0700, Samiullah Khawaja wrote:
> > I think the simplest thing is the domain exists forever until
> > userspace attaches an iommufd, takes ownership of it and frees it.
> > Nothing to do with finish.
> 
> Hmm.. I think this is tricky. There needs to be a way to clean up and
> discard the old state if the userspace doesn't need it.

Why?

Isn't "userspace doesn't need it" some extremely weird unused corner
case?

This should not be automatic or divorced from userspace, if the
operator would like to switch something out of LUO then they should
have userspace that co-ordinates this. Receive the iommufd, close it,
install a normal kernel driver.

Why make special code in the kernel to sequence this automatically?

> session manager (VMM or LUOD) decides that the finish needs to happen
> and the iommufd (or the underlying HWPTs) are not restored, it means
> that LUOD has decided that the VM is not going to come up and the
> preserved state and resources (domain, device, memory) need to be
> freed/released. 

I've been assuming if luo fails so catastrophically the whole node
would reboot to recover.

Is there really a case where you might say a kexec happens and a
single VM out of many doesn't survive? Seems weird..

So to repeat above, if this is something people want then the
userspace should complete luo restoring the failed vm and then turn
around and free up all the resources. Why should the kernel
automatically do the same operations?

Maybe userspace needs some contingency flow where there is a dedicated
reaper program for a luo session. The VMM crashes during restore, OK,
we pass the luo FD to a reaper and it cleans up the objects in the
session and closes it.

> > Maybe the HWPT has to be auto-created inside the iommufd as soon as it
> > is attached. The "restore" ioctl would just return back the ID of this
> > already created HWPT.
> 
> Once we return the ID, do we make this HWPT mutable? Or is this
> re-created HWPT just a handle to keep the domain ownership?

That's a bigger question..

To start with, I was imagining that the restored iommu_domain was
immutable, e.g. it does not have map and unmap operations. It never
becomes mutable.

As I outlined, this special luo immutable domain is then attached
during early boot, which should be a NOP, and gets turned into a HWPT
during iommufd restoration. The only thing userspace should be able to
do with that HWPT handle is destroy it after replacing it.

Jason
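
The immutable-handle rule described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: every name here (model_hwpt, model_device, model_hwpt_map and so on) is hypothetical and none of it is iommufd API. It assumes a restored HWPT rejects map/unmap, cannot be attached again, and can only be destroyed once a newly built HWPT has replaced it on the device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are iommufd API. */
struct model_hwpt {
	bool restored;	/* true if it wraps a domain preserved across kexec */
};

struct model_device {
	struct model_hwpt *attached;	/* HWPT currently in use by the device */
};

/* A restored HWPT is immutable: map (and likewise unmap) is refused. */
static int model_hwpt_map(struct model_hwpt *hwpt)
{
	assert(hwpt != NULL);
	return hwpt->restored ? -EPERM : 0;
}

/* Hot-swap: replace whatever is attached with a newly built HWPT. */
static int model_device_replace(struct model_device *dev,
				struct model_hwpt *new_hwpt)
{
	assert(dev != NULL && new_hwpt != NULL);
	if (new_hwpt->restored)
		return -EINVAL;	/* assumption: a restored HWPT is never re-attached */
	dev->attached = new_hwpt;
	return 0;
}

/* Destroy is refused while the device still points at this HWPT. */
static int model_hwpt_destroy(struct model_device *dev,
			      struct model_hwpt *hwpt)
{
	assert(dev != NULL && hwpt != NULL);
	if (dev->attached == hwpt)
		return -EBUSY;
	return 0;
}
```

In this model the only successful sequence for a restored HWPT is replace, then destroy, which matches the "just a handle" reading of the restored domain.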
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 2, 2025 at 1:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 10:03:05AM -0700, Samiullah Khawaja wrote:
> > > I think the simplest thing is the domain exists forever until
> > > userspace attaches an iommufd, takes ownership of it and frees it.
> > > Nothing to do with finish.
> >
> > Hmm.. I think this is tricky. There needs to be a way to clean up and
> > discard the old state if the userspace doesn't need it.
>
> Why?
>
> Isn't "userspace doesn't need it" some extremely weird unused corner
> case?

It might be a corner case, but at cloud scale, even rare cases happen.
For example, if four VMs are resumed and one crashes while retrieving
half of its resources, we can't simply reboot the machine because of
that. We must have a way to recover the machine to a normal state,
even if some resources are not reclaimed. I would say that finish must
be properly backward-ordered, but we still should release resources
that are not reclaimed during finish, as well as those that were
reclaimed but later closed.

Pasha
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months ago
On Thu, Oct 09, 2025 at 09:28:44PM -0400, Pasha Tatashin wrote:
> On Thu, Oct 2, 2025 at 1:37 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Oct 02, 2025 at 10:03:05AM -0700, Samiullah Khawaja wrote:
> > > > I think the simplest thing is the domain exists forever until
> > > > userspace attaches an iommufd, takes ownership of it and frees it.
> > > > Nothing to do with finish.
> > >
> > > Hmm.. I think this is tricky. There needs to be a way to clean up and
> > > discard the old state if the userspace doesn't need it.
> >
> > Why?
> >
> > Isn't "userspace doesn't need it" some extremely weird unused corner
> > case?
> 
> It might be a corner case, but at cloud scale, even rare cases happen.
> For example, if four VMs are resumed and one crashes while retrieving
> half of its resources, we can't simply reboot the machine because of
> that. We must have a way to recover the machine to a normal state,
> even if some resources are not reclaimed. I would say that finish must
> be properly backward-ordered, but we still should release resources
> that are not reclaimed during finish, as well as those that were
> reclaimed but later closed.

Sure, but as I said, userspace should deal with most of this, and I
think we should lean into letting the worst error flows end up
"leaking" resources. They are not actually leaked: the luo still holds
them and userspace could still try again later to restore and free
them. They will get cleaned up on the next kexec, and kexec to recover
from a partially failed kexec is not an unreasonable plan...

This means thinking carefully about the userspace restore sequence so it
is more reliable. Like don't restore the memfd as the first thing :)

Only if there are real measurements showing that this is not sufficient
would I think about teaching the kernel to do a non-restore flow where
it directly destroys the object in a way that cannot fail. E.g. the
memfd can directly free the page list instead of allocating an xarray.
This is a lot more complex error path code to add to the kernel, so
let's not do it without a strong justification.

You also can't do it until something sequences the vfio and iommufd
parts to unfreeze the memfd; these are very complicated error flows as
well.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Thu, Oct 2, 2025 at 10:37 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 10:03:05AM -0700, Samiullah Khawaja wrote:
> > > I think the simplest thing is the domain exists forever until
> > > userspace attaches an iommufd, takes ownership of it and frees it.
> > > Nothing to do with finish.
> >
> > Hmm.. I think this is tricky. There needs to be a way to clean up and
> > discard the old state if the userspace doesn't need it.
>
> Why?
>
> Isn't "userspace doesn't need it" some extremely weird unused corner
> case?
>
> This should not be automatic or divorced from userspace, if the
> operator would like to switch something out of LUO then they should
> have userspace that co-ordinates this. Receive the iommufd, close it,
> install a normal kernel driver.
>
> Why make special code in the kernel to sequence this automatically?
>
> > session manager (VMM or LUOD) decides that the finish needs to happen
> > and the iommufd (or the underlying HWPTs) are not restored, it means
> > that LUOD has decided that the VM is not going to come up and the
> > preserved state and resources (domain, device, memory) need to be
> > freed/released.
>
> I've been assuming if luo fails so catastrophically the whole node
> would reboot to recover.
>
> Is there really a case where you might say a kexec happens and a
> single VM out of many doesn't survive? Seems weird..
>
> So to repeat above, if this is something people want then the
> userspace should complete luo restoring the failed vm and then turn
> around and free up all the resources. Why should the kernel
> automatically do the same operations?
>
> Maybe userspace needs some contingency flow where there is a dedicated
> reaper program for a luo session. The VMM crashes during restore, OK,
> we pass the luo FD to a reaper and it cleans up the objects in the
> session and closes it.

These are all great points. I agree, it makes sense. It keeps the
FINISH lightweight and makes the domain ownership model very clean. I
will further discuss the memfd dependency scenario in the other
thread.
>
> > > Maybe the HWPT has to be auto-created inside the iommufd as soon as it
> > > is attached. The "restore" ioctl would just return back the ID of this
> > > already created HWPT.
> >
> > Once we return the ID, do we make this HWPT mutable? Or is this
> > re-created HWPT just a handle to keep the domain ownership?
>
> That's a bigger question..
>
> To start with, I was imagining that the restored iommu_domain was
> immutable, e.g. it does not have map and unmap operations. It never
> becomes mutable.
>
> As I outlined, this special luo immutable domain is then attached
> during early boot, which should be a NOP, and gets turned into a HWPT
> during iommufd restoration. The only thing userspace should be able to
> do with that HWPT handle is destroy it after replacing it.

Okay, this is great. An immutable HWPT associated with the restored
iommu_domain confirms my intuition that this is just a handle to the
underlying domain. The user can destroy it when it is replaced, or
when iommufd is closed without HWPT replacement.
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
On Thu, Oct 2, 2025 at 9:41 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 01, 2025 at 06:00:58PM -0700, Samiullah Khawaja wrote:
> > > No, finish should never do anything on the restore path, IMHO. User
> > > should directly attach the newly created HWPT when it is ready.
> >
> > Makes sense. But if the user never replaces the restored iommu_domain
> > with a new HWPT, we will have to discard the old (restored) domain on
> > finish since it doesn't have any associated HWPT. I see you already
> > hinted at this below. This needs to be handled carefully considering
> > the vfio cdev FD state also. Discussed further below.
>
> I think the simplest thing is the domain exists forever until
> userspace attaches an iommufd, takes ownership of it and frees it.
> Nothing to do with finish.
>
> While the domain is attached iommu_device_use_default_domain() will
> fail.

Ah you answered my question from my previous email, let me talk to Sami.
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
> > 3. On FINISH, the IOMMU core updates the context entries of preserved
> > devices to point to the new domain.
>
> No, finish should never do anything on the restore path, IMHO. User
> should directly attach the newly created HWPT when it is ready.

But, finish is our indicator that a particular session (VM) is out of
blackout, and now we are free to do slow things, such as
re-allocating/recreating page tables. Why start it before a VM is out
of blackout?
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Wed, Oct 01, 2025 at 03:28:56PM -0400, Pasha Tatashin wrote:
> > > 3. On FINISH, the IOMMU core updates the context entries of preserved
> > > devices to point to the new domain.
> >
> > No, finish should never do anything on the restore path, IMHO. User
> > should directly attach the newly created HWPT when it is ready.
> 
> But, finish is our indicator that a particular session (VM) is out of
> blackout, and now we are free to do slow things, such as
> re-allocating/recreating page tables. Why start it before a VM is out
> of blackout?

Things should be paired.. The suspend side is

 start luo - "brown out" - kernel does basically nothing as the luo is empty
 add all sorts of things to sessions
 finish - kernel does last minute things

While the resume is the symmetric opposite:

 kexec boot - kernel restores the critical stuff it needs to boot to
               userspace
 userspace does all sorts of stuff and gets things out of the sessions
 finish - luo should be empty now as everything was taken out by
          userspace

I think when things come out of luo they should be fully operational
immediately.

Finish on resume shouldn't indicate anything specific beyond the luo
should be empty and everything should have been restored. It isn't
like finish on pre-kexec.

Userspace decides how it sequences things and what steps it takes
before ending blackout and resuming the VM.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
On Thu, Oct 2, 2025 at 7:57 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Oct 01, 2025 at 03:28:56PM -0400, Pasha Tatashin wrote:
> > > > 3. On FINISH, the IOMMU core updates the context entries of preserved
> > > > devices to point to the new domain.
> > >
> > > No, finish should never do anything on the restore path, IMHO. User
> > > should directly attach the newly created HWPT when it is ready.
> >
> > But, finish is our indicator that a particular session (VM) is out of
> > blackout, and now we are free to do slow things, such as
> > re-allocating/recreating page tables. Why start it before a VM is out
> > of blackout?
>
> Things should be paired.. The suspend side is
>
>  start luo - "brown out" - kernel does basically nothing as the luo is empty
>  add all sorts of things to sessions
>  finish - kernel does last minute things
>
> While the resume is the symmetric opposite:
>
>  kexec boot - kernel restores the critical stuff it needs to boot to
>                userspace
>  userspace does all sorts of stuff and gets things out of the sessions
>  finish - luo should be empty now as everything was taken out by
>           userspace

I see, so you are proposing that finish() is basically a no-op for
IOMMU as long as everything was properly reclaimed by userspace.

> I think when things come out of luo they should be fully operational
> immediately.

I agree. Once we are in "normal" mode, we should be done with all
live-update specifics. In this state, the kernel must be fully
operational without limitations or pending background work that could
reduce VM performance. Also, if any session was not reclaimed before
finish(), it and all resources associated with it should be terminated
during finish().

> Finish on resume shouldn't indicate anything specific beyond the luo
> should be empty and everything should have been restored. It isn't
> like finish on pre-kexec.
>
> Userspace decides how it sequences things and what steps it takes
> before ending blackout and resuming the VM.

This is a fair statement: userspace knows when vCPUs are resumed and
can decide when to do the HWPT swap. Following that logic, what if we
provide a specific ioctl() to perform the swap? Userspace could then
call that ioctl() prior to finish(), and during the finish() callback,
we would only need to do a quick sanity check that everything is in
order (i.e., resources were retrieved and the HWPTs were swapped).

What do we do if the user reclaimed iommufd but did not swap HWPT or
did not perform some other ioctl() before finish(), simply print a
kernel warning and let it be, or force swapping during finish before
going into normal mode?

Pasha
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Thu, Oct 02, 2025 at 10:43:45AM -0400, Pasha Tatashin wrote:
> > Finish on resume shouldn't indicate anything specific beyond the luo
> > should be empty and everything should have been restored. It isn't
> > like finish on pre-kexec.
> >
> > Userspace decides how it sequences things and what steps it takes
> > before ending blackout and resuming the VM.
> 
> This is a fair statement: userspace knows when vCPUs are resumed and
> can decide when to do the HWPT swap. Following that logic, what if we
> provide a specific ioctl() to perform the swap?

Yeah, that is what I've been talking about. The ioctl already exists
in iommufd..

> What do we do if the user reclaimed iommufd but did not swap HWPT or
> did not perform some other ioctl() before finish(), simply print a
> kernel warning and let it be, or force swapping during finish before
> going into normal mode?

The problem we haven't discussed how to solve is the linkage between
the iommu_domain and the memfd.

The preserved iommu_domain refers to memory owned by the memfd, and
the pins don't get restored until the iommufd starts and generates new
pins. Thus we need to keep the memfd in a frozen state.

Maybe that is the real use case for finish - things like memfd remain
frozen until finish concludes.

However, keeping with the keep it simple theme, finish can just not
succeed if there are stray objects that userspace has not cleaned up
floating around. E.g. a simple refcount that the iommu_domain
decrements when it is destroyed.

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Thu, Oct 2, 2025 at 8:10 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 10:43:45AM -0400, Pasha Tatashin wrote:
> > > Finish on resume shouldn't indicate anything specific beyond the luo
> > > should be empty and everything should have been restored. It isn't
> > > like finish on pre-kexec.
> > >
> > > Userspace decides how it sequences things and what steps it takes
> > > before ending blackout and resuming the VM.
> >
> > This is a fair statement: userspace knows when vCPUs are resumed and
> > can decide when to do the HWPT swap. Following that logic, what if we
> > provide a specific ioctl() to perform the swap?
>
> Yeah, that is what I've been talking about. The ioctl already exists
> in iommufd..

Yes, I agree. We can use the existing ioctl and the hotswap happens
when userspace attaches the new HWPT to the device. That has been my
understanding as well.

Userspace should indeed have full autonomy to perform the hotswap
whenever the VMM (and HWPT) is ready.
>
> > What do we do if the user reclaimed iommufd but did not swap HWPT or
> > did not perform some other ioctl() before finish(), simply print a
> > kernel warning and let it be, or force swapping during finish before
> > going into normal mode?
>
> The problem we haven't discussed how to solve is the linkage between
> the iommu_domain and the memfd.
>
> The preserved iommu_domain refers to memory owned by the memfd, and
> the pins don't get restored until the iommufd starts and generates new
> pins. Thus we need to keep the memfd in a frozen state.

Yes, there are dependencies between preserved FDs, and we need to
consider them during LUO PREPARE (preservation). We can use an LUO
helper in the can_preserve callback to check if a dependency is also
going to be preserved. I discuss the restore part below.
>
> Maybe that is the real use case for finish - things like memfd remain
> frozen until finish concludes.

Yes, for memfd LUO file_handler, maybe that is the purpose of FINISH.

But that gets us into the discussion of whether a dependency is
ready/allowed to mutate and FINISH. How would LUO file_handler of a
dependency know that it is safe to mutate/finish? Maybe LUO calls the
iommufd FINISH first and if it fails the dependencies don't get a
FINISH call.

I had a quick discussion with Pasha to see how LUO can help with FD
dependencies and FINISH order. Perhaps we need a new LUO API that
iommufd can call before live update, explicitly telling LUO that it
depends on an FD that is going to be preserved.

>
> However, keeping with the keep it simple theme, finish can just not
> succeed if there are stray objects that userspace has not cleaned up
> floating around. E.g. a simple refcount that the iommu_domain
> decrements when it is destroyed.
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Thu, Oct 02, 2025 at 12:29:25PM -0700, Samiullah Khawaja wrote:
> I had a quick discussion with Pasha to see how LUO can help with FD
> dependencies and FINISH order. Perhaps we need a new LUO API that
> iommufd can call before live update, explicitly telling LUO that it
> depends on an FD that is going to be preserved.

Keeping track of a dependency graph is possible.

But I wonder if it really needs to be fine grained.

If a memfd remains frozen until finish, and finish can't happen until
all luo objects that are internally referring to outside memory
indicate they are done, don't we get the same outcome?

Is there a reason a specific memfd should be unfrozen before finish?

Maybe finish is too broad grained? What if each session had a finish?
All the objects in the session are cleaned up, invoke the session
finish and the memfd's in the session unfreeze?

Otherwise to build a dependency graph we'd need things like
iommu_domain to record all the memfds/etc stored within it and
preserve that and so on. This information has to come from the IOAS in
iommufd so it is quite a bit more weirdness to inject.

Whereas if we have the preserving iommufd do a sequence where it
pushes all the ioas pages (memfd/etc) to luo, and only then permits
the hwpt to be preserved to the same session, we get the same basic
tracking without needing to store a graph.

Donno...

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
> Maybe finish is too broad grained? What if each session had a finish?
> All the objects in the session are cleaned up, invoke the session
> finish and the memfd's in the session unfreeze?

All sessions have their own finish:
https://lore.kernel.org/all/20250929010321.3462457-15-pasha.tatashin@soleen.com
LIVEUPDATE_SESSION_SET_EVENT

Each session can go into a "finished" state independently. However, I
am still thinking about whether a dependency graph is needed. I feel
that if we require FDs to be added to a session in a specific order
(i.e., dependencies must be added first), and every subsequent FD
checks that all prerequisites are already in the session via the
existing can_preserve() callback, we should be okay, as long as we
finish() them in reverse order.

There are two issues:
1. What do we do with LIVEUPDATE_SESSION_UNPRESERVE_FD ?
We can simply remove this IOCTL altogether. Stuff can be unpreserved
by simply closing session FD.

2. Remembering this order on the way back, and since we are using the
token as an iterator, that is not going to work, unless the graph is
also preserved. However, now that we have sessions and the token
values are independent for each session, I am thinking we can go back
to the model where the kernel issues tokens when FDs are preserved, as
each session will always start from token=0. This way FD preservation
order and token order will always match.

Pasha
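
The ordering idea above (the kernel issues tokens starting from 0 in each session, a dependency must already be in the session, and finish runs in reverse order) can be sketched as a small userspace model. Every name here (ordered_session, model_preserve, model_finish_order, MODEL_NO_DEP) is hypothetical and not part of LUO; the dependency check merely stands in for what a can_preserve() callback might verify.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FDS 8
#define MODEL_NO_DEP (-1)

/* Hypothetical model of one session; not the real LUO data structures. */
struct ordered_session {
	int dep[MODEL_MAX_FDS];	/* dep[token] = token it depends on, or MODEL_NO_DEP */
	int count;		/* next token to hand out; each session starts at 0 */
};

/*
 * Preserve one entry. The model issues the token itself, so token order
 * always equals preservation order. A dependency must already be in the
 * session, otherwise the preserve is refused.
 * Returns the new token (>= 0) or a negative errno value.
 */
static int model_preserve(struct ordered_session *s, int dep_token)
{
	assert(s != NULL);
	if (s->count >= MODEL_MAX_FDS)
		return -ENOSPC;
	if (dep_token != MODEL_NO_DEP &&
	    (dep_token < 0 || dep_token >= s->count))
		return -ENOENT;
	s->dep[s->count] = dep_token;
	return s->count++;
}

/*
 * Fill order[] with tokens in finish order, i.e. the reverse of the
 * preservation order, so dependents finish before their dependencies.
 * Returns the number of entries written.
 */
static int model_finish_order(const struct ordered_session *s, int *order)
{
	int i;

	assert(s != NULL && order != NULL);
	for (i = 0; i < s->count; i++)
		order[i] = s->count - 1 - i;
	return s->count;
}
```

For example, preserving a memfd first (token 0) and then an iommufd that depends on it (token 1) yields the finish order 1, 0 in this model, so the iommufd is finished before the memfd it relies on.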
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Thu, Oct 02, 2025 at 05:30:53PM -0400, Pasha Tatashin wrote:
> > Maybe finish is too broad grained? What if each session had a finish?
> > All the objects in the session are cleaned up, invoke the session
> > finish and the memfd's in the session unfreeze?
> 
> All sessions have their own finish:
> https://lore.kernel.org/all/20250929010321.3462457-15-pasha.tatashin@soleen.com
> LIVEUPDATE_SESSION_SET_EVENT
> 
> Each session can go into a "finished" state independently. However, I
> am still thinking about whether a dependency graph is needed. I feel
> that if we require FDs to be added to a session in a specific order
> (i.e., dependencies must be added first), and every subsequent FD
> checks that all prerequisites are already in the session via the
> existing can_preserve() callback, we should be okay, as long as we
> finish() them in reverse order.

I don't think it is quite that simple, like "finishing" an
iommu_domain cannot reconnect it back to the memfd. The only way to
finish it in the current sketch is to delete it.

So if you have a notion that finish is disallowed and when it is
actually finished maybe the order doesn't matter?

eg it doesn't matter what order we unfreeze memfds in.

This sort of assumes that something outside luo is still ensuring that
no disallowed operations are happening to the objects. eg nobody is
trying to ftruncate a memfd.

But I don't quite know what other objects besides memfd are going to
have this special frozen state??

> There are two issues:
> 1. What do we do with LIVEUPDATE_SESSION_UNPRESERVE_FD ?
> We can simply remove this IOCTL altogether. Stuff can be unpreserved
> by simply closing session FD.

This is for serialize error handling? It does make sense if some sub
component of a session fails to serialize you'd just give up and close
the whole session.

> 2. Remembering this order on the way back, and since we are using the
> token as an iterator, that is not going to work, unless the graph is
> also preserved. However, now that we have sessions and the token
> values are independent for each session, I am thinking we can go back
> to the model where the kernel issues tokens when FDs are preserved, as
> each session will always start from token=0. This way FD preservation
> order and token order will always match.

You could just encode a preservation order number in a separate field?

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Thu, Oct 2, 2025 at 3:58 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Oct 02, 2025 at 05:30:53PM -0400, Pasha Tatashin wrote:
> > > Maybe finish is too broad grained? What if each session had a finish?
> > > All the objects in the session are cleaned up, invoke the session
> > > finish and the memfd's in the session unfreeze?
> >
> > All sessions have their own finish:
> > https://lore.kernel.org/all/20250929010321.3462457-15-pasha.tatashin@soleen.com
> > LIVEUPDATE_SESSION_SET_EVENT
> >
> > Each session can go into a "finished" state independently. However, I
> > am still thinking about whether a dependency graph is needed. I feel
> > that if we require FDs to be added to a session in a specific order
> > (i.e., dependencies must be added first), and every subsequent FD
> > checks that all prerequisites are already in the session via the
> > existing can_preserve() callback, we should be okay, as long as we
> > finish() them in reverse order.
>
> I don't think it is quite that simple, like "finishing" an
> iommu_domain cannot reconnect it back to the memfd. The only way to
> finish it in the current sketch is to delete it.

Agreed. But I think we don't need to reconnect the iommu_domain back
to the memfd it depended on. All we need to ensure is that the memfd
remains immutable until the new HWPT replaces the old one that is
pointing to the restored iommu_domain. Until that replacement is done,
iommufd's FINISH callback would keep failing, which would prevent its
dependencies (like memfd) from receiving their FINISH calls and so it
keeps them immutable.
>
> So if you have a notion that finish is disallowed and when it is
> actually finished maybe the order doesn't matter?

I think FINISH for FDs in a SESSION is not atomic. If a dependency
memfd gets its FINISH call first, it might make itself mutable before
the iommufd FINISH callback fails because the old HWPT is not replaced
yet. By then, it would be too late; the memfd has already become
mutable. That is why order would be needed.
>
> eg it doesn't matter what order we unfreeze memfds in.
>
> This sort of assumes that something outside luo is still ensuring that
> no disallowed operations are happening to the objects. eg nobody is
> trying to ftruncate a memfd.
>
> But I don't quite know what other objects besides memfd are going to
> have this special frozen state??
>
> > There are two issues:
> > 1. What do we do with LIVEUPDATE_SESSION_UNPRESERVE_FD ?
> > We can simply remove this IOCTL altogether. Stuff can be unpreserved
> > by simply closing session FD.
>
> This is for serialize error handling? It does make sense if some sub
> component of a session fails to serialize you'd just give up and close
> the whole session.
>
> > 2. Remembering this order on the way back, and since we are using the
> > token as an iterator, that is not going to work, unless the graph is
> > also preserved. However, now that we have sessions and the token
> > values are independent for each session, I am thinking we can go back
> > to the model where the kernel issues tokens when FDs are preserved, as
> > each session will always start from token=0. This way FD preservation
> > order and token order will always match.
>
> You could just encode a preservation order number in a separate field?
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Thu, Oct 02, 2025 at 04:56:57PM -0700, Samiullah Khawaja wrote:
> > So if you have a notion that finish is disallowed and when it is
> > actually finished maybe the order doesn't matter?
> 
> I think FINISH for FDs in a SESSION is not atomic. If a dependency
> memfd gets its FINISH call first, it might make itself mutable before
> the iommufd FINISH callback fails because the old HWPT is not replaced
> yet. By then, it would be too late; the memfd has already become
> mutable. That is why order would be needed.

I'm thinking of having a counter in the session and the iommu_domain
holds it elevated until it is destroyed. Finish can't even start until
the counter is 0.

If the counter is 0 then it is fine to unfreeze all the remaining
objects in any order.

Jason
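
The session counter described above can be sketched as a small userspace model. All names here (counted_session, model_domain_restored, model_domain_destroyed, model_session_finish) are hypothetical; a kernel implementation would presumably use a proper refcount or atomic type with locking, which this single-threaded sketch leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a session gated by a counter; not LUO code. */
struct counted_session {
	unsigned int restored_domains;	/* one count per live restored iommu_domain */
	bool finished;
};

/* Each restored iommu_domain holds the session counter elevated. */
static void model_domain_restored(struct counted_session *s)
{
	assert(s != NULL);
	s->restored_domains++;
}

/* Destroying the restored domain drops its count. */
static void model_domain_destroyed(struct counted_session *s)
{
	assert(s != NULL && s->restored_domains > 0);
	s->restored_domains--;
}

/*
 * Finish cannot even start while a restored domain still exists. Once
 * the counter is 0, the remaining objects (e.g. memfds) can be unfrozen
 * in any order.
 */
static int model_session_finish(struct counted_session *s)
{
	assert(s != NULL);
	if (s->restored_domains)
		return -EBUSY;
	s->finished = true;
	return 0;
}
```

Because finish is refused as a whole while the counter is nonzero, no memfd in this model can become mutable before the restored domain is gone, which is why no per-object ordering is needed.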
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
On Tue, Sep 30, 2025 at 9:59 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Sep 30, 2025 at 09:07:48AM -0400, Pasha Tatashin wrote:
> > On Mon, Sep 29, 2025 at 12:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > > > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > > > +                          struct iommufd_lu *iommufd_lu)
> > > > +{
> > > > +     struct iommufd_hwpt_paging *hwpt_paging;
> > > > +     struct iommufd_ioas *ioas = NULL;
> > > > +     struct iommufd_object *obj;
> > > > +     unsigned long index;
> > > > +     int rc;
> > > > +
> > > > +     /* Iterate each ioas. */
> > > > +     xa_for_each(&ictx->objects, index, obj) {
> > > > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > > > +                     continue;
> > >
> > > Wrong locking
> > >
> > > > +
> > > > +             ioas = (struct iommufd_ioas *)obj;
> > > > +             mutex_lock(&ioas->mutex);
> > > > +
> > > > +             /*
> > > > +              * TODO: Iterate over each device of this iommufd and only save
> > > > +              * hwpt/domain if the device is persisted.
> > > > +              */
> > > > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > > > +                     if (!hwpt_paging->common.domain)
> > > > +                             continue;
> > >
> > > I don't think this should be automatic. The user should directly
> > > serialize/unserialize HWPTs by ID.
> >
> > Why not?  Live Update uAPI is handled through FDs, and both iommufd
> > and vfiofd have to be preserved; I assume we can automatically
> > determine the hwpt to be preserved through dependencies. Why would we
> > delegate this to the user?
>
> There are HWPTs outside the IOAS so it is inconsistent.
>
> We are not going to reconstruct the IOAS.
>
> The IDR ids of the HWPT may not be available on restore (we cannot
> make this ABI), so without userspace expressly labeling them and
> recovering the new IDR ids it doesn't work.
>
> Finally we expect to discard the preserved HWPTs and replace them with
> rebuilt ones, at least as a first step. Userspace needs to sequence all
> of this.

The way LUOv4 is implemented, "LUO sessions" always participate in
LU. Once a user adds file descriptors to a session, that session and
its contents are automatically carried across multiple consecutive
live updates. The user only needs to act if they explicitly want to
remove an FD and opt out of preservation, or close the session. This is
consistent and convenient for a long-running VM that should survive by
default.

I was hoping for a similar "preserve by default" or "opt-in-once"
model for iommufd objects that are put into the LUO session to avoid a
flurry of IOCTLs to re-register before every single live update.

On the other hand, userspace still has to issue IOCTLs after retrieval
to bring the restored FDs and associated objects back to a workable
state. Perhaps, we could do something like "Yes, I'm actively using
this object again, so please preserve it if another live update
happens." ?

Pasha
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Tue, Sep 30, 2025 at 11:09:59AM -0400, Pasha Tatashin wrote:
>
> The way LUOv4 is implemented, "LUO sessions" always participate in
> LU. Once a user adds file descriptors to a session, that session and
> its contents are automatically carried across multiple consecutive
> live updates. The user only needs to act if they explicitly want to
> remove an FD and opt out of preservation, or close the session. This is
> consistent and convenient for a long-running VM that should survive by
> default.

I don't think this is a good idea. Each kernel should decide on its
own what and how things get included and manage the labels, from
scratch.

If you do this then a lot more stuff becomes ABI and I think it will
turn into a huge PITA.

The userspace already has to have the code to set up the LUO if it is
on a clean reboot - what is the point of not running that every time?

Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Samiullah Khawaja 4 months, 1 week ago
On Mon, Sep 29, 2025 at 9:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > +                          struct iommufd_lu *iommufd_lu)
> > +{
> > +     struct iommufd_hwpt_paging *hwpt_paging;
> > +     struct iommufd_ioas *ioas = NULL;
> > +     struct iommufd_object *obj;
> > +     unsigned long index;
> > +     int rc;
> > +
> > +     /* Iterate each ioas. */
> > +     xa_for_each(&ictx->objects, index, obj) {
> > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > +                     continue;
>
> Wrong locking
>
> > +
> > +             ioas = (struct iommufd_ioas *)obj;
> > +             mutex_lock(&ioas->mutex);
> > +
> > +             /*
> > +              * TODO: Iterate over each device of this iommufd and only save
> > +              * hwpt/domain if the device is persisted.
> > +              */
> > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > +                     if (!hwpt_paging->common.domain)
> > +                             continue;
>
> I don't think this should be automatic. The user should directly
> serialize/unserialize HWPTs by ID.
Interesting. So the user should be able to serialize/unserialize HWPTs
before the Live Update PREPARE event? But what if a device was marked
for preservation but the user never serialized the attached HWPT,
would that be considered an error during LUO PREPARE or should iommufd
serialize the remaining HWPTs here?
>
> Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Pasha Tatashin 4 months, 1 week ago
On Mon, Sep 29, 2025 at 1:32 PM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Mon, Sep 29, 2025 at 9:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > > +                          struct iommufd_lu *iommufd_lu)
> > > +{
> > > +     struct iommufd_hwpt_paging *hwpt_paging;
> > > +     struct iommufd_ioas *ioas = NULL;
> > > +     struct iommufd_object *obj;
> > > +     unsigned long index;
> > > +     int rc;
> > > +
> > > +     /* Iterate each ioas. */
> > > +     xa_for_each(&ictx->objects, index, obj) {
> > > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > > +                     continue;
> >
> > Wrong locking
> >
> > > +
> > > +             ioas = (struct iommufd_ioas *)obj;
> > > +             mutex_lock(&ioas->mutex);
> > > +
> > > +             /*
> > > +              * TODO: Iterate over each device of this iommufd and only save
> > > +              * hwpt/domain if the device is persisted.
> > > +              */
> > > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > > +                     if (!hwpt_paging->common.domain)
> > > +                             continue;
> >
> > I don't think this should be automatic. The user should directly
> > serialize/unserialize HWPTs by ID.
> Interesting. So the user should be able to serialize/unserialize HWPTs
> before the Live Update PREPARE event? But what if a device was marked
> for preservation but the user never serialized the attached HWPT,
> would that be considered an error during LUO PREPARE or should iommufd
> serialize the remaining HWPTs here?

Users ~can~ serialize their sessions before the system-wide prepare
event. During the prepare event, all unserialized sessions and their
FDs are going to be serialized anyway.
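
A rough userspace model of that behavior, assuming each session only
tracks whether it has been serialized. The lu_sess type and both
helpers are hypothetical names for illustration, not the LUO
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical session state; the real LUO session carries far more. */
struct lu_sess {
	bool serialized;
};

/* Optional early serialization requested by the user. */
static void lu_sess_serialize(struct lu_sess *s)
{
	s->serialized = true;
}

/*
 * System-wide prepare: serialize every session the user did not
 * serialize earlier. Returns how many sessions were picked up here.
 */
static size_t lu_global_prepare(struct lu_sess *sessions, size_t n)
{
	size_t picked_up = 0;

	for (size_t i = 0; i < n; i++) {
		if (sessions[i].serialized)
			continue;
		lu_sess_serialize(&sessions[i]);
		picked_up++;
	}
	return picked_up;
}
```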

Pasha

> >
> > Jason
Re: [RFC PATCH 13/15] iommufd: Persist iommu domains for live update
Posted by Jason Gunthorpe 4 months, 1 week ago
On Mon, Sep 29, 2025 at 10:32:22AM -0700, Samiullah Khawaja wrote:
> On Mon, Sep 29, 2025 at 9:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Sun, Sep 28, 2025 at 07:06:21PM +0000, Samiullah Khawaja wrote:
> > > +static int iommufd_save_ioas(struct iommufd_ctx *ictx,
> > > +                          struct iommufd_lu *iommufd_lu)
> > > +{
> > > +     struct iommufd_hwpt_paging *hwpt_paging;
> > > +     struct iommufd_ioas *ioas = NULL;
> > > +     struct iommufd_object *obj;
> > > +     unsigned long index;
> > > +     int rc;
> > > +
> > > +     /* Iterate each ioas. */
> > > +     xa_for_each(&ictx->objects, index, obj) {
> > > +             if (obj->type != IOMMUFD_OBJ_IOAS)
> > > +                     continue;
> >
> > Wrong locking
> >
> > > +
> > > +             ioas = (struct iommufd_ioas *)obj;
> > > +             mutex_lock(&ioas->mutex);
> > > +
> > > +             /*
> > > +              * TODO: Iterate over each device of this iommufd and only save
> > > +              * hwpt/domain if the device is persisted.
> > > +              */
> > > +             list_for_each_entry(hwpt_paging, &ioas->hwpt_list, hwpt_item) {
> > > +                     if (!hwpt_paging->common.domain)
> > > +                             continue;
> >
> > I don't think this should be automatic. The user should directly
> > serialize/unserialize HWPTs by ID.
> Interesting. So the user should be able to serialize/unserialize HWPTs
> before the Live Update PREPARE event? But what if a device was marked
> for preservation but the user never serialized the attached HWPT,
> would that be considered an error during LUO PREPARE or should iommufd
> serialize the remaining HWPTs here?

Yes, that would be an error.

I also think your patch series is a bit upside down, you should
present the iommufd and core pieces first, then come with a driver
implementation last.

It will be easier to understand the context than having a driver
implementation appear out of nowhere with no callers.

And everything should be driven by iommufd in this step, the iommu
driver should not be magically auto-preserving itself. Just preserve
the drivers linked to devices being preserved by iommufd.
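
To illustrate the error case above, here is a small userspace sketch,
assuming userspace serializes HWPTs explicitly by ID and prepare only
validates. The lu_hwpt and lu_device types, the helpers, and the
-EINVAL / -ENOENT return codes are all assumptions made for the
example, not the iommufd uAPI.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical HWPT record; not the iommufd structure. */
struct lu_hwpt {
	unsigned int id;
	bool serialized;	/* set when the user serializes this HWPT by ID */
};

/* Hypothetical device record. */
struct lu_device {
	bool preserved;			/* device marked for preservation */
	const struct lu_hwpt *hwpt;	/* attached HWPT, may be NULL */
};

/* Explicit, user-driven serialization of one HWPT by ID. */
static int lu_hwpt_serialize(struct lu_hwpt *hwpts, size_t n, unsigned int id)
{
	for (size_t i = 0; i < n; i++) {
		if (hwpts[i].id == id) {
			hwpts[i].serialized = true;
			return 0;
		}
	}
	return -ENOENT;
}

/*
 * Prepare does not serialize anything on its own. It fails if a
 * preserved device is attached to an HWPT the user never serialized.
 */
static int lu_prepare(const struct lu_device *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].preserved || !devs[i].hwpt)
			continue;
		if (!devs[i].hwpt->serialized)
			return -EINVAL;
	}
	return 0;
}
```

HWPTs attached only to devices that are not preserved are ignored here,
which matches the point that nothing should be preserved automatically.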

Jason