[RFC 00/18] Pkernfs: Support persistence for live update

James Gowans posted 18 patches 2 years ago
[RFC 00/18] Pkernfs: Support persistence for live update
Posted by James Gowans 2 years ago
This RFC is to solicit feedback on the approach of implementing support for live
update via an in-memory filesystem responsible for storing all live update state
as files in the filesystem.

Hypervisor live update is a mechanism to support updating a hypervisor via kexec
in a way that has limited impact to running virtual machines. This is done by
pausing/serialising running VMs, kexec-ing into a new kernel, starting new VMM
processes and then deserialising/resuming the VMs so that they continue running
from where they were. Virtual machines can have PCI devices passed through and
in order to support live update it’s necessary to persist the IOMMU page tables
so that the devices can continue to do DMA to guest RAM during kexec.

This RFC is a follow-on from a discussion held during LPC 2023 KVM MC
which explored ways in which the live update problem could be tackled;
this was one of them:
https://lpc.events/event/17/contributions/1485/

The approach sketched out in this RFC introduces a new in-memory filesystem,
pkernfs. Pkernfs takes ownership of system RAM which is carved out from the
normal MM allocator and donated to pkernfs, keeping it separate from the Linux
memory management system. There are a few things that need to be preserved and
re-hydrated after kexec to support live update, and files are created in
pkernfs for each of them:

* Guest memory: to be able to restore the VM its memory must be
preserved.  This is achieved by using a regular file in pkernfs for guest RAM.
As this guest RAM is not part of the normal Linux core mm allocator and
has no struct pages, it can be removed from the direct map, which
improves the security posture for guest RAM, similar to memfd_secret.

* IOMMU root page tables: for the IOMMU to have any ability to do DMA
during kexec it needs root page tables to look up per-domain page
tables. IOMMU root page tables are stored in a special path in pkernfs:
iommu/root-pgtables.  The Intel IOMMU driver is modified to hook into
pkernfs to get the chunk of memory that it can use to allocate root
pgtables.

* IOMMU domain page tables: in order for VM-initiated DMA operations to
continue running while kexec is happening, the IOVA to PA address
translations for persisted devices need to continue to work. Similar to
root pgtables, the per-domain page tables for persisted devices are
allocated from a pkernfs file so that they are also persisted across
kexec. Not all devices are persistent, so VFIO is updated to support
defining persistent page tables on passed through devices.

* Updates to IOMMU and PCI are needed to make device handover across
kexec work properly. Although not fully complete, some of the changes
needed around avoiding device resetting and re-probing are sketched
in this RFC.

Guest RAM and IOMMU state are just the first two things needed for live update.
Pkernfs opens the door for other kernel state which can improve kexec or add
more capabilities to live update to also be persisted as new files.

The main aspect we’re looking for feedback/opinions on here is the concept of
putting all persistent state in a single filesystem: combining guest RAM and
IOMMU pgtables in one store. Also, the question of a hard separation between
persistent memory and ephemeral memory, compared to allowing arbitrary pages to
be persisted. Pkernfs does it via a hard separation defined at boot time, other
approaches could make the carving out of persistent pages dynamic.

Sign-offs are intentionally omitted to make it clear that this is a
concept sketch RFC and not intended for merging.

On CC are folks who have sent RFCs around this problem space before, as
well as filesystem, kexec, IOMMU, MM and KVM lists and maintainers.

== Alternatives ==

There have been other RFCs which cover some aspect of the live update problem
space. So far, all public approaches with KVM neglected device assignment which
introduces a new dimension of problems. Prior art in this space includes:

1) Kexec Hand Over (KHO) [0]: This is a generic mechanism to pass kernel state
across kexec. It also supports specifying persisted memory pages which could be
used to carve out IOMMU pgtable pages from the new kernel’s buddy allocator.

2) PKRAM [1]: Tmpfs-style filesystem which dynamically allocates memory which can
be used for guest RAM and is preserved across kexec by passing a pointer to the
root page.

3) DMEMFS [2]: Similar to pkernfs, DMEMFS is a filesystem on top of a reserved
chunk of memory specified via kernel cmdline parameter. It is not persistent but
aims to remove the need for struct page overhead.

4) Kernel memory pools [3, 4]: These provide a mechanism for kernel modules/drivers
to allocate persistent memory, and restore that memory after kexec. They do
not attempt to provide the ability to store userspace accessible state or have a
filesystem interface.

== How to use ==

Use the memmap and pkernfs cmd line args to carve memory out of system RAM and
donate it to pkernfs. For example, to carve out 1 GiB of RAM starting at physical
offset 1 GiB:
  memmap=1G%1G nopat pkernfs=1G!1G
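
To double-check a region before putting it on the command line, the arithmetic
can be sketched in plain sh. This is only an illustration: it assumes the
pkernfs= argument follows the <size>!<start> order used by memmap=, and the
helper names to_bytes and region_bounds are made up for this example.

```shell
# Sketch only; to_bytes and region_bounds are hypothetical helper names.

# to_bytes SIZE: convert "1G", "2M", "512K" or a plain number to bytes.
to_bytes() {
    num=${1%[KMGkmg]}          # strip an optional binary suffix
    case $1 in
        *[Kk]) echo $(($num * 1024)) ;;
        *[Mm]) echo $(($num * 1024 * 1024)) ;;
        *[Gg]) echo $(($num * 1024 * 1024 * 1024)) ;;
        *)     echo $(($num)) ;;
    esac
}

# region_bounds SPEC: split "<size>!<start>" (assumed order) and print
# the physical start, the exclusive end and the size in bytes.
region_bounds() {
    size=$(to_bytes "${1%%!*}")    # text before the '!'
    start=$(to_bytes "${1#*!}")    # text after the '!'
    printf 'start=0x%x end=0x%x size=%d\n' "$start" $(($start + $size)) "$size"
}
```

For the example above, region_bounds '1G!1G' reports a region covering
0x40000000 up to (but not including) 0x80000000.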

Mount pkernfs somewhere, for example:
  mount -t pkernfs /mnt/pkernfs

Allocate a file for guest RAM:
  touch /mnt/pkernfs/guest-ram
  truncate -s 100M /mnt/pkernfs/guest-ram

Add QEMU cmdline option to use this as guest RAM:
  -object memory-backend-file,id=pc.ram,size=100M,mem-path=/mnt/pkernfs/guest-ram,share=yes
  -M q35,memory-backend=pc.ram

Allocate a file for IOMMU domain page tables:
  touch /mnt/pkernfs/iommu/dom-0
  truncate -s 2M /mnt/pkernfs/iommu/dom-0
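
The touch + truncate pairs above could be wrapped in a small helper that also
reports the resulting size. This is a rough sketch, not part of the series: the
function name alloc_pkernfs_file is invented here, and it relies on GNU
coreutils truncate and stat. Since it takes a full path, it can be tried on any
scratch directory before pointing it at a real pkernfs mount.

```shell
# Sketch only; alloc_pkernfs_file is a hypothetical helper name.

# alloc_pkernfs_file PATH SIZE: create PATH, size it the same way as the
# touch + truncate steps, then print the resulting size in bytes.
alloc_pkernfs_file() {
    touch "$1" || return 1
    truncate -s "$2" "$1" || return 1
    stat -c %s "$1"
}
```

For example, alloc_pkernfs_file /mnt/pkernfs/guest-ram 100M would be expected
to print 104857600 if the filesystem accepts the truncate.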

That file must be supplied to VFIO when creating the IOMMU container, via the
VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. Example: [5]

After kexec, re-mount pkernfs and re-use those files for guest RAM and IOMMU
state. When doing DMA mapping, specify the additional flag
VFIO_DMA_MAP_FLAG_LIVE_UPDATE to indicate that IOVAs are set up already.
Example: [6].
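
Before handing the files back to the VMM after kexec, a launcher script might
first confirm that they survived with the sizes it expects. The following is a
hedged sketch of such a check, not something the RFC provides: check_persisted
is a made-up name, and the check only looks at existence and size, not at
contents.

```shell
# Sketch only; check_persisted is a hypothetical helper name.

# check_persisted PATH EXPECTED_BYTES: succeed only if PATH is a regular
# file whose size matches what was allocated before kexec.
check_persisted() {
    [ -f "$1" ] || { echo "missing: $1"; return 1; }
    actual=$(stat -c %s "$1")
    [ "$actual" -eq "$2" ] || {
        echo "size mismatch: $1 ($actual != $2)"
        return 1
    }
    echo "ok: $1"
}
```

With the sizes used above this would be called as
check_persisted /mnt/pkernfs/guest-ram 104857600 and
check_persisted /mnt/pkernfs/iommu/dom-0 2097152.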

== Limitations ==

This is an RFC design to sketch out the concept so that there can be a discussion
about the general approach. There are many gaps and hacks; the idea is to keep
this RFC as simple as possible. Limitations include:

* Needing to supply the physical memory range for pkernfs as a kernel cmdline
parameter. Better would be to allocate memory for pkernfs dynamically on first
boot and pass that across kexec. Doing so would require additional integration
with memblocks and some ability to pass the dynamically allocated ranges
across. KHO [0] could support this.

* A single filesystem with no support for NUMA awareness. Better would be to
support multiple named pkernfs mounts which can cover different NUMA nodes.

* Skeletal filesystem code. There’s just enough functionality to make it usable to
demonstrate the concept of using files for guest RAM and IOMMU state.

* Use-after-frees for IOMMU mappings. Currently nothing stops the pkernfs guest
RAM files being deleted or resized while IOMMU mappings are set up which would
allow DMA to freed memory. Better integration with guest RAM files and
IOMMU/VFIO is necessary.

* Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
Really we should move the abstraction one level up and make the whole VFIO
container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
container file and all of the DMA mappings inside VFIO would already be set up.

* Inefficient use of filesystem space. Every mappings block is 2 MiB, which is both
wasteful and a hard upper limit on file size.

[0] https://lore.kernel.org/kexec/20231213000452.88295-1-graf@amazon.com/
[1] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/
[2] https://lkml.org/lkml/2020/12/7/342
[3] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[4] https://lore.kernel.org/all/2023082506-enchanted-tripping-d1d5@gregkh/#t
[5] https://github.com/jgowans/qemu/commit/e84cfb8186d71f797ef1f72d57d873222a9b479e
[6] https://github.com/jgowans/qemu/commit/6e4f17f703eaf2a6f1e4cb2576d61683eaee02b0


James Gowans (18):
  pkernfs: Introduce filesystem skeleton
  pkernfs: Add persistent inodes hooked into directies
  pkernfs: Define an allocator for persistent pages
  pkernfs: support file truncation
  pkernfs: add file mmap callback
  init: Add liveupdate cmdline param
  pkernfs: Add file type for IOMMU root pgtables
  iommu: Add allocator for pgtables from persistent region
  intel-iommu: Use pkernfs for root/context pgtable pages
  iommu/intel: zap context table entries on kexec
  dma-iommu: Always enable deferred attaches for liveupdate
  pkernfs: Add IOMMU domain pgtables file
  vfio: add ioctl to define persistent pgtables on container
  intel-iommu: Allocate domain pgtable pages from pkernfs
  pkernfs: register device memory for IOMMU domain pgtables
  vfio: support not mapping IOMMU pgtables on live-update
  pci: Don't clear bus master is persistence enabled
  vfio-pci: Assume device working after liveupdate

 drivers/iommu/Makefile           |   1 +
 drivers/iommu/dma-iommu.c        |   2 +-
 drivers/iommu/intel/dmar.c       |   1 +
 drivers/iommu/intel/iommu.c      |  93 +++++++++++++---
 drivers/iommu/intel/iommu.h      |   5 +
 drivers/iommu/iommu.c            |  22 ++--
 drivers/iommu/pgtable_alloc.c    |  43 +++++++
 drivers/iommu/pgtable_alloc.h    |  10 ++
 drivers/pci/pci-driver.c         |   4 +-
 drivers/vfio/container.c         |  27 +++++
 drivers/vfio/pci/vfio_pci_core.c |  20 ++--
 drivers/vfio/vfio.h              |   2 +
 drivers/vfio/vfio_iommu_type1.c  |  51 ++++++---
 fs/Kconfig                       |   1 +
 fs/Makefile                      |   3 +
 fs/pkernfs/Kconfig               |   9 ++
 fs/pkernfs/Makefile              |   6 +
 fs/pkernfs/allocator.c           |  51 +++++++++
 fs/pkernfs/dir.c                 |  43 +++++++
 fs/pkernfs/file.c                |  93 ++++++++++++++++
 fs/pkernfs/inode.c               | 185 +++++++++++++++++++++++++++++++
 fs/pkernfs/iommu.c               | 163 +++++++++++++++++++++++++++
 fs/pkernfs/pkernfs.c             | 115 +++++++++++++++++++
 fs/pkernfs/pkernfs.h             |  61 ++++++++++
 include/linux/init.h             |   1 +
 include/linux/iommu.h            |   6 +-
 include/linux/pkernfs.h          |  38 +++++++
 include/uapi/linux/vfio.h        |  10 ++
 init/main.c                      |  10 ++
 29 files changed, 1029 insertions(+), 47 deletions(-)
 create mode 100644 drivers/iommu/pgtable_alloc.c
 create mode 100644 drivers/iommu/pgtable_alloc.h
 create mode 100644 fs/pkernfs/Kconfig
 create mode 100644 fs/pkernfs/Makefile
 create mode 100644 fs/pkernfs/allocator.c
 create mode 100644 fs/pkernfs/dir.c
 create mode 100644 fs/pkernfs/file.c
 create mode 100644 fs/pkernfs/inode.c
 create mode 100644 fs/pkernfs/iommu.c
 create mode 100644 fs/pkernfs/pkernfs.c
 create mode 100644 fs/pkernfs/pkernfs.h
 create mode 100644 include/linux/pkernfs.h

-- 
2.40.1

Re: [RFC 00/18] Pkernfs: Support persistence for live update
Posted by Jason Gunthorpe 2 years ago
On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:

> The main aspect we’re looking for feedback/opinions on here is the concept of
> putting all persistent state in a single filesystem: combining guest RAM and
> IOMMU pgtables in one store. Also, the question of a hard separation between
> persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> be persisted. Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

I think if you are going to attempt something like this then the end
result must bring things back to having the same data structures fully
restored.

It is fine that the pkernfs holds some persistent memory that
guarantees the IOMMU can remain programmed and the VM pages can become
fixed across the kexec.

But once the VMM starts to restore itself we need to get back to the
original configuration:
 - A mmap that points to the VM's physical pages
 - An iommufd IOAS that points to the above mmap
 - An iommufd HWPT that represents that same mapping
 - An iommu_domain programmed into HW that the HWPT

Ie you can't just reboot and leave the IOMMU hanging out in some
undefined land - especially in latest kernels!

For vt-d you need to retain the entire root table and all the required
context entries too. The restarting iommu needs to understand that it
has to "restore" a temporary iommu_domain from the pkernfs.

You can later reconstitute a proper iommu_domain from the VMM and
atomically switch.

So, I'm surprised to see this approach where things just live forever
in the pkernfs; I don't see how "restore" is going to work very well
like this.

I would think that a save/restore mentality would make more
sense. For instance you could make a special iommu_domain that is fixed
and lives in the pkernfs. The operation would be to copy from the live
iommu_domain to the fixed one and then replace the iommu HW to the
fixed one.

In the post-kexec world the iommu would recreate that special domain
and point the iommu at it. (copying the root and context descriptions
out of the pkernfs). Then somehow that would get into iommufd and VFIO
so that it could take over that special mapping during its startup.

Then you'd build the normal operating ioas and hwpt (with all the
right page refcounts/etc) then switch to it and free the pkernfs
memory.

It seems a lot less invasive to me. The special case is clearly a
special case and doesn't mess up the normal operation of the drivers.

It becomes more like kdump where the iommu driver is running in a
fairly normal mode, just with some stuff copied from the prior kernel.

Your text spent a lot of time talking about the design of how the pages
persist, which is interesting, but it seems like only a small part of
the problem. Actually using that mechanism in a sane way and covering all
the functional issues in the HW drivers is going to be really
challenging.

> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

I doubt this.. It probably needs to be much finer grained actually,
otherwise you are going to be serializing everything. Somehow I think
you are better to serialize a minimum and try to reconstruct
everything else in userspace. Like conserving iommufd IDs would be a
huge PITA.

There are also going to be lots of security questions here, like we
can't just let userspace feed in any garbage and violate vfio and
iommu invariants.

Jason
Re: [RFC 00/18] Pkernfs: Support persistence for live update
Posted by Alex Williamson 2 years ago
On Mon, 5 Feb 2024 12:01:45 +0000
James Gowans <jgowans@amazon.com> wrote:

> * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> Really we should move the abstraction one level up and make the whole VFIO
> container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> container file and all of the DMA mappings inside VFIO would already be set up.

Note that the vfio container is on a path towards deprecation, this
should be refocused on vfio relative to iommufd.  There would need to
be a strong argument for a container/type1 extension to support this,
iommufd would need to be the first class implementation.  Thanks,

Alex
 
Re: [RFC 00/18] Pkernfs: Support persistence for live update
Posted by Luca Boccassi 2 years ago
> Also, the question of a hard separation between persistent memory and
> ephemeral memory, compared to allowing arbitrary pages to be persisted.
> Pkernfs does it via a hard separation defined at boot time, other
> approaches could make the carving out of persistent pages dynamic.

Speaking from experience here - in Azure (Boost) we have been using
hard-carved out memory areas (DAX devices with ranges configured via
DTB) for persisting state across kexec for ~5 years or so. In a
nutshell: don't, it's a mistake.

It's a constant and consistent source of problems, headaches, issues
and workarounds piled upon workarounds, held together with duct tape
and prayers. It's just not flexible enough for any modern system. For
example, unless _all_ the machines are ridiculously overprovisioned in
terms of memory capacity (and guaranteed to remain so, forever), you
end up wasting enormous amounts of memory.

In Azure we are very much interested in a nice, well-abstracted, first-
class replacement for that setup that allows persisting data across
kexec, and in systemd userspace we'd very much want to use it as well,
but it really, really needs to be dynamic, and avoid the pitfall of
a hard-configured carved-out chunk.

-- 
Kind regards,
Luca Boccassi