[PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months, 1 week ago
This series introduces the Live Update Orchestrator (LUO), a kernel
subsystem designed to facilitate live kernel updates. LUO enables
kexec-based reboots with minimal downtime, a critical capability for
cloud environments where hypervisors must be updated without disrupting
running virtual machines. By preserving the state of selected resources,
such as file descriptors and memory, LUO allows workloads to resume
seamlessly in the new kernel.

The git branch for this series can be found at:
https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4

The patch series applies against linux-next tag: next-20250926

While this series is showcased using memfd preservation, there is
ongoing work to preserve devices:
1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org

=======================================================================
Changelog since v3:
(https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):

- The main architectural change in this version is the introduction of
  "sessions" to manage the lifecycle of preserved file descriptors.
  In v3, session management was left to a single userspace agent. This
  approach has been revised to improve robustness. Now, each session is
  represented by a file descriptor (/dev/liveupdate). The lifecycle of
  all preserved resources within a session is tied to this FD, ensuring
  automatic cleanup by the kernel if the controlling userspace agent
  crashes or exits unexpectedly.

- The first three KHO fixes from the previous series have been merged
  into Linus' tree.

- Various bug fixes and refactorings, including correcting memory
  unpreservation logic during a kho_abort() sequence.

- Addressing all comments from reviewers.

- Removing sysfs interface (/sys/kernel/liveupdate/state), the state
  can now be queried only via the ioctl() API.

=======================================================================

What is Live Update?

Live Update is a kexec-based reboot process where selected kernel
resources (memory, file descriptors, and eventually devices) are kept
operational or their state is preserved across a kernel transition. For
certain resources, DMA and interrupt activity might continue with
minimal interruption during the kernel reboot.

LUO provides a framework for coordinating live updates. It features:

State Machine
=============
Manages the live update process through states: NORMAL, PREPARED,
FROZEN, UPDATED.

Session Management
==================
Userspace creates named sessions (driven by LUOD: Live Update
Orchestrator Daemon, see: https://tinyurl.com/luoddesign), each
represented by a file descriptor. Preserved resources are tied to a
session, and their lifecycle is managed by the session's FD, ensuring 
automatic cleanup if the controlling process exits unexpectedly.
Furthermore, sessions can be finished, prepared, and frozen
independently of the global LUO states. This granular control allows a
VMM to serialize and resume specific VMs as soon as their resources are
ready, without having to wait for all VMs to be prepared.

After a reboot, a central live update agent can retrieve a session
handle and pass it to the VMM process, which then restores its own file
descriptors. This ensures that resource allocations, such as cgroup
memory charges, are correctly accounted against the workload's cgroup
instead of the administrative agent's.

KHO Integration
===============
LUO programmatically drives KHO's finalization and abort sequences.
(KHO may soon become completely stateless, which will make KHO's
interaction with LUO even simpler:
https://lore.kernel.org/all/20250917025019.1585041-1-jasonmiu@google.com)

KHO's debugfs interface is now optional, configured via
CONFIG_KEXEC_HANDOVER_DEBUG. LUO preserves its own metadata via KHO's
kho_add_subtree() and kho_preserve_phys() mechanisms.

Subsystem Participation
=======================
A callback API, liveupdate_register_subsystem(), allows kernel
subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register handlers for LUO
events (PREPARE, FREEZE, FINISH, CANCEL) and persist a u64 payload via
the LUO FDT.

File Descriptor Preservation
============================
An infrastructure (liveupdate_register_file_handler, luo_preserve_file,
luo_retrieve_file) allows specific types of file descriptors (e.g.,
memfd, vfio) to be preserved and restored within a session. Handlers for
specific file types can be registered to manage their preservation,
storing a u64 payload in the LUO FDT.

Userspace Interface
===================
ioctl (/dev/liveupdate): The primary control interface for creating and
retrieving sessions, triggering global LUO state transitions (prepare,
finish, cancel), querying the current LUO state, and managing preserved
file descriptors within a session.

Selftests
=========
Includes kernel-side hooks and an extensive userspace selftest suite to
verify core LUO functionality, including subsystem registration, state
transitions, and complex multi-kexec session lifecycles.

LUO State Machine and Events
============================
NORMAL:   Default operational state.
PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE event.
          Subsystems have saved initial state.
FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE event,
          just before kexec. Workloads must be suspended.
UPDATED:  Next kernel has booted via live update, awaiting restoration
          and LIVEUPDATE_FINISH.

Events
LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
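
For illustration, the event handling in a participating subsystem might
look roughly like the sketch below; the actual callback prototype and
registration structures are defined in include/linux/liveupdate.h and
may differ from this hypothetical fragment:

/*
 * Hypothetical fragment: a handler registered via
 * liveupdate_register_subsystem(); the real callback prototype may
 * differ from this sketch.
 */
static int example_subsys_event(enum liveupdate_event event, u64 *payload)
{
	switch (event) {
	case LIVEUPDATE_PREPARE:
		/* Serialize state; the u64 payload is persisted in the LUO FDT. */
		return 0;
	case LIVEUPDATE_FREEZE:
		/* Blackout window: last chance to save state before kexec. */
		return 0;
	case LIVEUPDATE_FINISH:
		/* Post-reboot cleanup in the next kernel. */
		return 0;
	case LIVEUPDATE_CANCEL:
		/* Revert anything done during PREPARE or FREEZE. */
		return 0;
	}
	return -EINVAL;
}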

Mike Rapoport (Microsoft) (1):
  kho: drop notifiers

Pasha Tatashin (24):
  kho: allow to drive kho from within kernel
  kho: make debugfs interface optional
  kho: add interfaces to unpreserve folios and page ranges
  kho: don't unpreserve memory during abort
  liveupdate: kho: move to kernel/liveupdate
  liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
  liveupdate: luo_core: integrate with KHO
  liveupdate: luo_subsystems: add subsystem registration
  liveupdate: luo_subsystems: implement subsystem callbacks
  liveupdate: luo_session: Add sessions support
  liveupdate: luo_ioctl: add user interface
  liveupdate: luo_file: implement file systems callbacks
  liveupdate: luo_session: Add ioctls for file preservation and state
    management
  reboot: call liveupdate_reboot() before kexec
  kho: move kho debugfs directory to liveupdate
  liveupdate: add selftests for subsystems un/registration
  selftests/liveupdate: add subsystem/state tests
  docs: add luo documentation
  MAINTAINERS: add liveupdate entry
  selftests/liveupdate: Add multi-kexec session lifecycle test
  selftests/liveupdate: Add multi-file and unreclaimed file test
  selftests/liveupdate: Add multi-session workflow and state interaction
    test
  selftests/liveupdate: Add test for unreclaimed resource cleanup
  selftests/liveupdate: Add tests for per-session state and cancel
    cycles

Pratyush Yadav (5):
  mm: shmem: use SHMEM_F_* flags instead of VM_* flags
  mm: shmem: allow freezing inode mapping
  mm: shmem: export some functions to internal.h
  luo: allow preserving memfd
  docs: add documentation for memfd preservation via LUO

 Documentation/core-api/index.rst              |   1 +
 Documentation/core-api/kho/concepts.rst       |   2 +-
 Documentation/core-api/liveupdate.rst         |  64 ++
 Documentation/mm/index.rst                    |   1 +
 Documentation/mm/memfd_preservation.rst       | 138 +++
 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/ioctl/ioctl-number.rst      |   2 +
 Documentation/userspace-api/liveupdate.rst    |  25 +
 MAINTAINERS                                   |  18 +-
 include/linux/kexec_handover.h                |  53 +-
 include/linux/liveupdate.h                    | 209 +++++
 include/linux/shmem_fs.h                      |  23 +
 include/uapi/linux/liveupdate.h               | 460 +++++++++
 init/Kconfig                                  |   2 +
 kernel/Kconfig.kexec                          |  15 -
 kernel/Makefile                               |   2 +-
 kernel/liveupdate/Kconfig                     |  72 ++
 kernel/liveupdate/Makefile                    |  14 +
 kernel/{ => liveupdate}/kexec_handover.c      | 507 ++++------
 kernel/liveupdate/kexec_handover_debug.c      | 222 +++++
 kernel/liveupdate/kexec_handover_internal.h   |  45 +
 kernel/liveupdate/luo_core.c                  | 588 ++++++++++++
 kernel/liveupdate/luo_file.c                  | 599 ++++++++++++
 kernel/liveupdate/luo_internal.h              | 114 +++
 kernel/liveupdate/luo_ioctl.c                 | 255 +++++
 kernel/liveupdate/luo_selftests.c             | 345 +++++++
 kernel/liveupdate/luo_selftests.h             |  84 ++
 kernel/liveupdate/luo_session.c               | 887 ++++++++++++++++++
 kernel/liveupdate/luo_subsystems.c            | 452 +++++++++
 kernel/reboot.c                               |   4 +
 mm/Makefile                                   |   1 +
 mm/internal.h                                 |   6 +
 mm/memblock.c                                 |  60 +-
 mm/memfd_luo.c                                | 523 +++++++++++
 mm/shmem.c                                    |  51 +-
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/liveupdate/.gitignore |   2 +
 tools/testing/selftests/liveupdate/Makefile   |  48 +
 tools/testing/selftests/liveupdate/config     |   6 +
 .../testing/selftests/liveupdate/do_kexec.sh  |   6 +
 .../testing/selftests/liveupdate/liveupdate.c | 404 ++++++++
 .../selftests/liveupdate/luo_multi_file.c     | 119 +++
 .../selftests/liveupdate/luo_multi_kexec.c    | 182 ++++
 .../selftests/liveupdate/luo_multi_session.c  | 155 +++
 .../selftests/liveupdate/luo_test_utils.c     | 241 +++++
 .../selftests/liveupdate/luo_test_utils.h     |  51 +
 .../selftests/liveupdate/luo_unreclaimed.c    | 107 +++
 47 files changed, 6757 insertions(+), 410 deletions(-)
 create mode 100644 Documentation/core-api/liveupdate.rst
 create mode 100644 Documentation/mm/memfd_preservation.rst
 create mode 100644 Documentation/userspace-api/liveupdate.rst
 create mode 100644 include/linux/liveupdate.h
 create mode 100644 include/uapi/linux/liveupdate.h
 create mode 100644 kernel/liveupdate/Kconfig
 create mode 100644 kernel/liveupdate/Makefile
 rename kernel/{ => liveupdate}/kexec_handover.c (80%)
 create mode 100644 kernel/liveupdate/kexec_handover_debug.c
 create mode 100644 kernel/liveupdate/kexec_handover_internal.h
 create mode 100644 kernel/liveupdate/luo_core.c
 create mode 100644 kernel/liveupdate/luo_file.c
 create mode 100644 kernel/liveupdate/luo_internal.h
 create mode 100644 kernel/liveupdate/luo_ioctl.c
 create mode 100644 kernel/liveupdate/luo_selftests.c
 create mode 100644 kernel/liveupdate/luo_selftests.h
 create mode 100644 kernel/liveupdate/luo_session.c
 create mode 100644 kernel/liveupdate/luo_subsystems.c
 create mode 100644 mm/memfd_luo.c
 create mode 100644 tools/testing/selftests/liveupdate/.gitignore
 create mode 100644 tools/testing/selftests/liveupdate/Makefile
 create mode 100644 tools/testing/selftests/liveupdate/config
 create mode 100755 tools/testing/selftests/liveupdate/do_kexec.sh
 create mode 100644 tools/testing/selftests/liveupdate/liveupdate.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_multi_file.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_multi_kexec.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_multi_session.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_test_utils.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_test_utils.h
 create mode 100644 tools/testing/selftests/liveupdate/luo_unreclaimed.c

-- 
2.51.0.536.g15c5d4f767-goog
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> This series introduces the Live Update Orchestrator (LUO), a kernel
> subsystem designed to facilitate live kernel updates. LUO enables
> kexec-based reboots with minimal downtime, a critical capability for
> cloud environments where hypervisors must be updated without disrupting
> running virtual machines. By preserving the state of selected resources,
> such as file descriptors and memory, LUO allows workloads to resume
> seamlessly in the new kernel.
>
> The git branch for this series can be found at:
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
>
> The patch series applies against linux-next tag: next-20250926
>
> While this series is showed cased using memfd preservation. There are
> works to preserve devices:
> 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
>
> =======================================================================
> Changelog since v3:
> (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
>
> - The main architectural change in this version is introduction of
>   "sessions" to manage the lifecycle of preserved file descriptors.
>   In v3, session management was left to a single userspace agent. This
>   approach has been revised to improve robustness. Now, each session is
>   represented by a file descriptor (/dev/liveupdate). The lifecycle of
>   all preserved resources within a session is tied to this FD, ensuring
>   automatic cleanup by the kernel if the controlling userspace agent
>   crashes or exits unexpectedly.
>
> - The first three KHO fixes from the previous series have been merged
>   into Linus' tree.
>
> - Various bug fixes and refactorings, including correcting memory
>   unpreservation logic during a kho_abort() sequence.
>
> - Addressing all comments from reviewers.
>
> - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
>   can now be queried  only via ioctl() API.
>
> =======================================================================

Hi all,

Following up on yesterday's Hypervisor Live Update meeting, we
discussed the requirements for the LUO to track dependencies,
particularly for IOMMU preservation and other stateful file
descriptors. This email summarizes the main design decisions and
outcomes from that discussion.

For context, the notes from the previous meeting can be found here:
https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
The notes for yesterday's meeting are not yet available.

The key outcomes are as follows:

1. User-Enforced Ordering
-------------------------
The responsibility for enforcing the correct order of operations will
lie with the userspace agent. If fd_A is a dependency for fd_B,
userspace must ensure that fd_A is preserved before fd_B. This same
ordering must be honored during the restoration phase after the reboot
(fd_A must be restored before fd_B). The kernel preserves this ordering.

2. Serialization in PRESERVE_FD
-------------------------------
To keep the global prepare() phase lightweight and predictable, the
consensus was to shift the heavy serialization work into the
PRESERVE_FD ioctl handler. This means that when userspace requests to
preserve a file, the file handler should perform the bulk of the
state-saving work immediately.

The proposed sequence of operations reflects this shift:

Shutdown Flow:
fd_preserve() (heavy serialization) -> prepare() (lightweight final
checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)

Boot & Restore Flow:
fd_restore() (lightweight object creation) -> Resume VM -> Heavy
post-restore IOCTLs (e.g., hardware page table re-creation) ->
finish() (lightweight cleanup)

This decision primarily serves as a guideline for file handler
implementations. For the LUO core, this implies minor API changes,
such as renaming can_preserve() to a more active preserve() and adding
a corresponding unpreserve() callback to be called during
UNPRESERVE_FD.
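
For the core this could translate into a hypothetical ops fragment like
the one below; only the callback names come from the discussion above,
the argument lists are purely illustrative:

struct liveupdate_file_ops {
	/* Heavy serialization happens here, at PRESERVE_FD time. */
	int (*preserve)(struct liveupdate_file_handler *h,
			struct file *file, u64 *data);
	/* Undo the above when userspace issues UNPRESERVE_FD. */
	void (*unpreserve)(struct liveupdate_file_handler *h,
			   struct file *file, u64 data);
	/* ... remaining callbacks unchanged ... */
};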

3. FD Data Query API
--------------------
We identified the need for a kernel API to allow subsystems to query
preserved FD data during the boot process, before userspace has
initiated the restore.

The proposed API would allow a file handler to retrieve a list of all
its preserved FDs, including their session names, tokens, and the
private data payload.

Proposed Data Structure:

struct liveupdate_fd {
        char *session; /* session name */
        u64 token; /* Preserved FD token */
        u64 data; /* Private preserved data */
};

Proposed Function:
liveupdate_fd_data_query(struct liveupdate_file_handler *h,
                         struct liveupdate_fd *fds, long *count);
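
For example, assuming the function returns 0 on success and treats
*count as an in/out length, a file handler could walk its preserved FDs
early in boot roughly like this (buffer size and printout are
illustrative only):

static void example_walk_preserved_fds(struct liveupdate_file_handler *h)
{
	struct liveupdate_fd fds[16];
	long count = ARRAY_SIZE(fds);
	long i;

	if (liveupdate_fd_data_query(h, fds, &count))
		return;

	for (i = 0; i < count; i++)
		pr_info("session=%s token=%llu data=%#llx\n",
			fds[i].session, fds[i].token, fds[i].data);
}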

4. New File-Lifecycle-Bound Global State
----------------------------------------
A new mechanism for managing global state was proposed, designed to be
tied to the lifecycle of the preserved files themselves. This would
allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
global state that is only relevant when one or more of its FDs are
being managed by LUO.

The key characteristics of this new mechanism are:
- The global state is optionally created on the first preserve() call
  for a given file handler.
- The state can be updated on subsequent preserve() calls.
- The state is destroyed when the last corresponding file is
  unpreserved or finished.
- The data can be accessed during boot.

I am thinking of an API like this.

1. Add three more callbacks to liveupdate_file_ops:
/*
 * Optional. Called by LUO on the first call to get the global state.
 * The handler should allocate/KHO preserve its global state object and return a
 * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
 * address of preserved memory) via 'data_handle' that LUO will save.
 * Return: 0 on success.
 */
int (*global_state_create)(struct liveupdate_file_handler *h,
                           void **obj, u64 *data_handle);

/*
 * Optional. Called by LUO in the new kernel
 * before the first access to the global state. The handler receives
 * the preserved u64 data_handle and should use it to reconstruct its
 * global state object, returning a pointer to it via 'obj'.
 * Return: 0 on success.
 */
int (*global_state_restore)(struct liveupdate_file_handler *h,
                            u64 data_handle, void **obj);

/*
 * Optional. Called by LUO after the last
 * file for this handler is unpreserved or finished. The handler
 * must free its global state object and any associated resources.
 */
void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);

2. Add get/put accessors for the global state data:

/* Get and lock the data with file_handler scoped lock */
int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
                                   void **obj);

/* Unlock the data */
void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);

Execution Flow:

Outgoing Kernel (First preserve() call):
1. Handler's preserve() is called. It needs the global state, so it calls
   liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
   It sees h->global_state_obj is NULL.
   LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
   The handler allocates its state, preserves it with KHO, and returns its live
   pointer and a u64 handle.
2. LUO stores the handle internally for later serialization.
3. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
4. The preserve() callback does its work using the obj.
5. It calls liveupdate_fh_global_state_put(h), which releases the lock.

Global PREPARE:
1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
   the LUO FDT.

Incoming Kernel (First access):
1. When liveupdate_fh_global_state_get(&h, &obj) is called for the first time,
   LUO acquires h->global_state_lock.
2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
   handle from the FDT, so LUO calls h->ops->global_state_restore().
3. The handler reconstructs its state object and returns the live pointer.
4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
5. The caller does its work.
6. It calls liveupdate_fh_global_state_put(h) to release the lock.

Last File Cleanup (in unpreserve or finish):
1. LUO decrements h->count to 0.
2. This triggers the cleanup logic.
3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
4. The handler frees its memory and resources.
5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
   cycle.
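
Tying the outgoing-kernel flow above together, a handler's preserve()
path would then follow a pattern roughly like this (the preserve()
prototype is the hypothetical one sketched earlier; error handling
trimmed):

static int example_fh_preserve(struct liveupdate_file_handler *h,
			       struct file *file, u64 *data)
{
	struct example_state *state;
	int err;

	/* First caller triggers global_state_create() under the handler lock. */
	err = liveupdate_fh_global_state_get(h, (void **)&state);
	if (err)
		return err;

	/* ... serialize this file using the shared global state ... */

	/* Releases h->global_state_lock. */
	liveupdate_fh_global_state_put(h);
	return 0;
}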

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pratyush Yadav 4 months ago
On Tue, Oct 07 2025, Pasha Tatashin wrote:

> On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
[...]
> 4. New File-Lifecycle-Bound Global State
> ----------------------------------------
> A new mechanism for managing global state was proposed, designed to be
> tied to the lifecycle of the preserved files themselves. This would
> allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> global state that is only relevant when one or more of its FDs are
> being managed by LUO.

Is this going to replace LUO subsystems? If yes, then why? The global
state will likely need to have its own lifecycle just like the FDs, and
subsystems are a simple and clean abstraction to control that. I get the
idea of only "activating" a subsystem when one or more of its FDs are
participating in LUO, but we can do that while keeping subsystems
around.

>
> The key characteristics of this new mechanism are:
> The global state is optionally created on the first preserve() call
> for a given file handler.
> The state can be updated on subsequent preserve() calls.
> The state is destroyed when the last corresponding file is unpreserved
> or finished.
> The data can be accessed during boot.
>
> I am thinking of an API like this.
>
> 1. Add three more callbacks to liveupdate_file_ops:
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
>
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);
>
> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> The get/put global state data:
>
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
>
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);

IMHO this looks clunky and overcomplicated. Each LUO FD type knows what
its subsystem is. It should talk to it directly. I don't get why we are
adding this intermediate step.

Here is how I imagine the proposed API would compare against subsystems
with hugetlb as an example (hugetlb support is still WIP, so I'm still
not clear on specifics, but this is how I imagine it will work):

- Hugetlb subsystem needs to track its huge page pools and which pages
  are allocated and free. This is its global state. The pools get
  reconstructed after kexec. Post-kexec, the free pages are ready for
  allocation from other "regular" files and the pages used in LUO files
  are reserved.

- Pre-kexec, when a hugetlb FD is preserved, it marks that as preserved
  in hugetlb's global data structure tracking this. This is runtime data
  (say xarray), and _not_ serialized data. Reason being, there are
  likely more FDs to come so no point in wasting time serializing just
  yet.

  This can look something like:

  hugetlb_luo_preserve_folio(folio, ...);

  Nice and simple.

  Compare this with the new proposed API:

  liveupdate_fh_global_state_get(h, &hugetlb_data);
  // This will have update serialized state now.
  hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
  liveupdate_fh_global_state_put(h);

  We do the same thing but in a very complicated way.

- When the system-wide preserve happens, the hugetlb subsystem gets a
  callback to serialize. It converts its runtime global state to
  serialized state since now it knows no more FDs will be added.

  With the new API, this doesn't need to be done since each FD prepare
  already updates serialized state.

- If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
  anything in LUO. This is same as new API.

- If some hugetlb FDs are not restored after liveupdate and the finish
  event is triggered, the subsystem gets its finish() handler called and
  it can free things up.

  I don't get how that would work with the new API.

My point is, I see subsystems working perfectly fine here and I don't
get how the proposed API is any better.

Am I missing something?

>
> Execution Flow:
> 1. Outgoing Kernel (First preserve() call):
> 2. Handler's preserve() is called. It needs the global state, so it calls
>    liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
>    It sees h->global_state_obj is NULL.
>    LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
>    The handler allocates its state, preserves it with KHO, and returns its live
>    pointer and a u64 handle.
> 3. LUO stores the handle internally for later serialization.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
> 5. The preserve() callback does its work using the obj.
> 6. It calls liveupdate_fh_global_state_put(h), which releases the lock.
>
> Global PREPARE:
> 1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
>    the LUO FDT.
>
> Incoming Kernel (First access):
> 1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time. LUO
>    acquires h->global_state_lock.

The huge page pools are allocated early-ish in boot. On x86, the 1 GiB
pages are allocated from setup_arch(). Other sizes are allocated later
in boot from a subsys_initcall. This is way before the first FD gets
restored, and in 1 GiB case even before LUO gets initialized.

At that point, it would be great if the hugetlb preserved data can be
retrieved. If not, then there needs to at least be some indication that
LUO brings huge pages with it, so that the kernel can trust that it will
be able to successfully get the pages later in boot.

This flow is tricky to implement in the proposed model. With subsystems,
it might just end up working with some early boot tricks to fetch LUO
data.

> 2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
>    handle from the FDT. LUO calls h->ops->global_state_restore()
> 3. Reconstructs its state object, and returns the live pointer.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
> 5. The caller does its work.
> 6. It calls liveupdate_fh_global_state_put(h) to release the lock.
>
> Last File Cleanup (in unpreserve or finish):
> 1. LUO decrements h->count to 0.
> 2. This triggers the cleanup logic.
> 3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
> 4. The handler frees its memory and resources.
> 5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
>    cycle.

-- 
Regards,
Pratyush Yadav
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> >>
> [...]
> > 4. New File-Lifecycle-Bound Global State
> > ----------------------------------------
> > A new mechanism for managing global state was proposed, designed to be
> > tied to the lifecycle of the preserved files themselves. This would
> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> > global state that is only relevant when one or more of its FDs are
> > being managed by LUO.
>
> Is this going to replace LUO subsystems? If yes, then why? The global
> state will likely need to have its own lifecycle just like the FDs, and
> subsystems are a simple and clean abstraction to control that. I get the
> idea of only "activating" a subsystem when one or more of its FDs are
> participating in LUO, but we can do that while keeping subsystems
> around.
>
> >
> > The key characteristics of this new mechanism are:
> > The global state is optionally created on the first preserve() call
> > for a given file handler.
> > The state can be updated on subsequent preserve() calls.
> > The state is destroyed when the last corresponding file is unpreserved
> > or finished.
> > The data can be accessed during boot.
> >
> > I am thinking of an API like this.
> >
> > 1. Add three more callbacks to liveupdate_file_ops:
> > /*
> >  * Optional. Called by LUO during first get global state call.
> >  * The handler should allocate/KHO preserve its global state object and return a
> >  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
> >  * address of preserved memory) via 'data_handle' that LUO will save.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_create)(struct liveupdate_file_handler *h,
> >                            void **obj, u64 *data_handle);
> >
> > /*
> >  * Optional. Called by LUO in the new kernel
> >  * before the first access to the global state. The handler receives
> >  * the preserved u64 data_handle and should use it to reconstruct its
> >  * global state object, returning a pointer to it via 'obj'.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_restore)(struct liveupdate_file_handler *h,
> >                             u64 data_handle, void **obj);
> >
> > /*
> >  * Optional. Called by LUO after the last
> >  * file for this handler is unpreserved or finished. The handler
> >  * must free its global state object and any associated resources.
> >  */
> > void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
> >
> > The get/put global state data:
> >
> > /* Get and lock the data with file_handler scoped lock */
> > int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
> >                                    void **obj);
> >
> > /* Unlock the data */
> > void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> IMHO this looks clunky and overcomplicated. Each LUO FD type knows what
> its subsystem is. It should talk to it directly. I don't get why we are
> adding this intermediate step.
>
> Here is how I imagine the proposed API would compare against subsystems
> with hugetlb as an example (hugetlb support is still WIP, so I'm still
> not clear on specifics, but this is how I imagine it will work):
>
> - Hugetlb subsystem needs to track its huge page pools and which pages
>   are allocated and free. This is its global state. The pools get
>   reconstructed after kexec. Post-kexec, the free pages are ready for
>   allocation from other "regular" files and the pages used in LUO files
>   are reserved.

Thinking more about this, HugeTLB is different from iommufd/iommu-core
vfiofd/pci because it supports many types of FDs, such as memfd and
guest_memfd (1G support is coming soon!). Also, since not all memfds
or guest_memfd instances require HugeTLB, binding their lifecycles to
HugeTLB doesn't make sense here. I agree that a subsystem is more
appropriate for this use case.

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> >>
> [...]
> > 4. New File-Lifecycle-Bound Global State
> > ----------------------------------------
> > A new mechanism for managing global state was proposed, designed to be
> > tied to the lifecycle of the preserved files themselves. This would
> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> > global state that is only relevant when one or more of its FDs are
> > being managed by LUO.
>
> Is this going to replace LUO subsystems? If yes, then why? The global
> state will likely need to have its own lifecycle just like the FDs, and
> subsystems are a simple and clean abstraction to control that. I get the
> idea of only "activating" a subsystem when one or more of its FDs are
> participating in LUO, but we can do that while keeping subsystems
> around.

Thanks for the feedback. The FLB Global State is not replacing the LUO
subsystems. On the contrary, it's a higher-level abstraction that is
itself implemented as a LUO subsystem. The goal is to provide a
solution for a pattern that emerged during the PCI and IOMMU
discussions.

You can see the WIP implementation here, which shows it registering as
a subsystem named "luo-fh-states-v1-struct":
https://github.com/soleen/linux/commit/94e191aab6b355d83633718bc4a1d27dda390001

The existing subsystem API is a low-level tool that provides for the
preservation of a raw 8-byte handle. It doesn't provide locking, nor
is it explicitly tied to the lifecycle of any higher-level object like
a file handler. The new API is designed to solve a more specific
problem: allowing global components (like IOMMU or PCI) to
automatically track when resources relevant to them are added to or
removed from preservation. If HugeTLB requires a subsystem, it can
still use it, but I suspect it might benefit from FLB Global State as
well.

> Here is how I imagine the proposed API would compare against subsystems
> with hugetlb as an example (hugetlb support is still WIP, so I'm still
> not clear on specifics, but this is how I imagine it will work):
>
> - Hugetlb subsystem needs to track its huge page pools and which pages
>   are allocated and free. This is its global state. The pools get
>   reconstructed after kexec. Post-kexec, the free pages are ready for
>   allocation from other "regular" files and the pages used in LUO files
>   are reserved.
>
> - Pre-kexec, when a hugetlb FD is preserved, it marks that as preserved
>   in hugetlb's global data structure tracking this. This is runtime data
>   (say xarray), and _not_ serialized data. Reason being, there are
>   likely more FDs to come so no point in wasting time serializing just
>   yet.
>
>   This can look something like:
>
>   hugetlb_luo_preserve_folio(folio, ...);
>
>   Nice and simple.
>
>   Compare this with the new proposed API:
>
>   liveupdate_fh_global_state_get(h, &hugetlb_data);
>   // This will have update serialized state now.
>   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>   liveupdate_fh_global_state_put(h);
>
>   We do the same thing but in a very complicated way.
>
> - When the system-wide preserve happens, the hugetlb subsystem gets a
>   callback to serialize. It converts its runtime global state to
>   serialized state since now it knows no more FDs will be added.
>
>   With the new API, this doesn't need to be done since each FD prepare
>   already updates serialized state.
>
> - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>   anything in LUO. This is same as new API.
>
> - If some hugetlb FDs are not restored after liveupdate and the finish
>   event is triggered, the subsystem gets its finish() handler called and
>   it can free things up.
>
>   I don't get how that would work with the new API.

The new API isn't more complicated; it codifies the common pattern of
"create on first use, destroy on last use" into a reusable helper,
saving each file handler from having to reinvent the same reference
counting and locking scheme. But, as you point out, subsystems provide
more control: specifically, they handle full creation/freeing instead of
relying on file handlers for that.

> My point is, I see subsystems working perfectly fine here and I don't
> get how the proposed API is any better.
>
> Am I missing something?

No, I don't think you are. Your analysis is correct that this is
achievable with subsystems. The goal of the new API is to make that
specific, common use case simpler.

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pratyush Yadav 3 months, 4 weeks ago
On Thu, Oct 09 2025, Pasha Tatashin wrote:

> On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>>
>> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
>> > <pasha.tatashin@soleen.com> wrote:
>> >>
>> [...]
>> > 4. New File-Lifecycle-Bound Global State
>> > ----------------------------------------
>> > A new mechanism for managing global state was proposed, designed to be
>> > tied to the lifecycle of the preserved files themselves. This would
>> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
>> > global state that is only relevant when one or more of its FDs are
>> > being managed by LUO.
>>
>> Is this going to replace LUO subsystems? If yes, then why? The global
>> state will likely need to have its own lifecycle just like the FDs, and
>> subsystems are a simple and clean abstraction to control that. I get the
>> idea of only "activating" a subsystem when one or more of its FDs are
>> participating in LUO, but we can do that while keeping subsystems
>> around.
>
> Thanks for the feedback. The FLB Global State is not replacing the LUO
> subsystems. On the contrary, it's a higher-level abstraction that is
> itself implemented as a LUO subsystem. The goal is to provide a
> solution for a pattern that emerged during the PCI and IOMMU
> discussions.

Okay, makes sense then. I thought we were removing the subsystems idea.
I didn't follow the PCI and IOMMU discussions that closely.

Side note: I see a dependency between subsystems forming. For example,
the FLB subsystem probably wants to make sure all its dependent
subsystems (like LUO files) go through their callbacks before getting
its callback. Maybe in the current implementation doing it in any order
works, but in general, if it manages data of other subsystems, it should
be serialized after them.

Same with the hugetlb subsystem for example. On prepare or freeze time,
it would probably be a good idea if the files callbacks finish first. I
would imagine most subsystems would want to go after files.

With the current registration mechanism, the order depends on when the
subsystem is registered, which is hard to control. Maybe we should have
a global list of subsystems and can manually specify the order? Not sure
if that is a good idea, just throwing it out there off the top of my
head.

>
> You can see the WIP implementation here, which shows it registering as
> a subsystem named "luo-fh-states-v1-struct":
> https://github.com/soleen/linux/commit/94e191aab6b355d83633718bc4a1d27dda390001
>
> The existing subsystem API is a low-level tool that provides for the
> preservation of a raw 8-byte handle. It doesn't provide locking, nor
> is it explicitly tied to the lifecycle of any higher-level object like
> a file handler. The new API is designed to solve a more specific
> problem: allowing global components (like IOMMU or PCI) to
> automatically track when resources relevant to them are added to or
> removed from preservation. If HugeTLB requires a subsystem, it can
> still use it, but I suspect it might benefit from FLB Global State as
> well.

Hmm, right. Let me see how I can make use of it.

>
>> Here is how I imagine the proposed API would compare against subsystems
>> with hugetlb as an example (hugetlb support is still WIP, so I'm still
>> not clear on specifics, but this is how I imagine it will work):
>>
>> - Hugetlb subsystem needs to track its huge page pools and which pages
>>   are allocated and free. This is its global state. The pools get
>>   reconstructed after kexec. Post-kexec, the free pages are ready for
>>   allocation from other "regular" files and the pages used in LUO files
>>   are reserved.
>>
>> - Pre-kexec, when a hugetlb FD is preserved, it marks that as preserved
>>   in hugetlb's global data structure tracking this. This is runtime data
>>   (say xarray), and _not_ serialized data. Reason being, there are
>>   likely more FDs to come so no point in wasting time serializing just
>>   yet.
>>
>>   This can look something like:
>>
>>   hugetlb_luo_preserve_folio(folio, ...);
>>
>>   Nice and simple.
>>
>>   Compare this with the new proposed API:
>>
>>   liveupdate_fh_global_state_get(h, &hugetlb_data);
>>   // This will have update serialized state now.
>>   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>>   liveupdate_fh_global_state_put(h);
>>
>>   We do the same thing but in a very complicated way.
>>
>> - When the system-wide preserve happens, the hugetlb subsystem gets a
>>   callback to serialize. It converts its runtime global state to
>>   serialized state since now it knows no more FDs will be added.
>>
>>   With the new API, this doesn't need to be done since each FD prepare
>>   already updates serialized state.
>>
>> - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>>   anything in LUO. This is same as new API.
>>
>> - If some hugetlb FDs are not restored after liveupdate and the finish
>>   event is triggered, the subsystem gets its finish() handler called and
>>   it can free things up.
>>
>>   I don't get how that would work with the new API.
>
> The new API isn't more complicated; It codifies the common pattern of
> "create on first use, destroy on last use" into a reusable helper,
> saving each file handler from having to reinvent the same reference
> counting and locking scheme. But, as you point out, subsystems provide
> more control, specifically they handle full creation/free instead of
> relying on file-handlers for that.
>
>> My point is, I see subsystems working perfectly fine here and I don't
>> get how the proposed API is any better.
>>
>> Am I missing something?
>
> No, I don't think you are. Your analysis is correct that this is
> achievable with subsystems. The goal of the new API is to make that
> specific, common use case simpler.

Right. Thanks for clarifying.

>
> Pasha

-- 
Regards,
Pratyush Yadav
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Thu, Oct 09, 2025 at 07:50:12PM -0400, Pasha Tatashin wrote:
> >   This can look something like:
> >
> >   hugetlb_luo_preserve_folio(folio, ...);
> >
> >   Nice and simple.
> >
> >   Compare this with the new proposed API:
> >
> >   liveupdate_fh_global_state_get(h, &hugetlb_data);
> >   // This will have update serialized state now.
> >   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
> >   liveupdate_fh_global_state_put(h);
> >
> >   We do the same thing but in a very complicated way.
> >
> > - When the system-wide preserve happens, the hugetlb subsystem gets a
> >   callback to serialize. It converts its runtime global state to
> >   serialized state since now it knows no more FDs will be added.
> >
> >   With the new API, this doesn't need to be done since each FD prepare
> >   already updates serialized state.
> >
> > - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
> >   anything in LUO. This is same as new API.
> >
> > - If some hugetlb FDs are not restored after liveupdate and the finish
> >   event is triggered, the subsystem gets its finish() handler called and
> >   it can free things up.
> >
> >   I don't get how that would work with the new API.
> 
> The new API isn't more complicated; It codifies the common pattern of
> "create on first use, destroy on last use" into a reusable helper,
> saving each file handler from having to reinvent the same reference
> counting and locking scheme. But, as you point out, subsystems provide
> more control, specifically they handle full creation/free instead of
> relying on file-handlers for that.

I'd say hugetlb *should* be doing the more complicated thing. We
should not have global static data for luo floating around the kernel,
this is too easily abused in bad ways.

The above "complicated" sequence forces the caller to have a fd
session handle, and "hides" the global state inside luo so the
subsystem can't just randomly reach into it whenever it likes.

This is a deliberate and violent way to force clean coding practices
and good layering.

Not sure why hugetlb pools would need another xarray??

1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
   frozen, can't add/remove PFNs.
2) Require the users of hugetlb memory, like memfd, to
   preserve/restore the folios they are using (using their hugetlb order)
3) Just before kexec run over the PFN list and mark a bit if the folio
   was preserved by KHO or not. Make sure everything gets KHO
   preserved.

Restore puts the PFNs that were not preserved directly into the free
pool; the end user of the folio, such as the memfd, restores the other
folios and eventually frees them normally.

It is simple and fits nicely into the infrastructure here, where the
first time you trigger a global state it does the pfn list and
freezing, and the lifecycle and locking for this operation is directly
managed by luo.

The memfd, when it knows it has hugetlb folios inside it, would
trigger this.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pratyush Yadav 3 months, 3 weeks ago
On Fri, Oct 10 2025, Jason Gunthorpe wrote:

> On Thu, Oct 09, 2025 at 07:50:12PM -0400, Pasha Tatashin wrote:
>> >   This can look something like:
>> >
>> >   hugetlb_luo_preserve_folio(folio, ...);
>> >
>> >   Nice and simple.
>> >
>> >   Compare this with the new proposed API:
>> >
>> >   liveupdate_fh_global_state_get(h, &hugetlb_data);
>> >   // This will have update serialized state now.
>> >   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>> >   liveupdate_fh_global_state_put(h);
>> >
>> >   We do the same thing but in a very complicated way.
>> >
>> > - When the system-wide preserve happens, the hugetlb subsystem gets a
>> >   callback to serialize. It converts its runtime global state to
>> >   serialized state since now it knows no more FDs will be added.
>> >
>> >   With the new API, this doesn't need to be done since each FD prepare
>> >   already updates serialized state.
>> >
>> > - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>> >   anything in LUO. This is same as new API.
>> >
>> > - If some hugetlb FDs are not restored after liveupdate and the finish
>> >   event is triggered, the subsystem gets its finish() handler called and
>> >   it can free things up.
>> >
>> >   I don't get how that would work with the new API.
>> 
>> The new API isn't more complicated; It codifies the common pattern of
>> "create on first use, destroy on last use" into a reusable helper,
>> saving each file handler from having to reinvent the same reference
>> counting and locking scheme. But, as you point out, subsystems provide
>> more control, specifically they handle full creation/free instead of
>> relying on file-handlers for that.
>
> I'd say hugetlb *should* be doing the more complicated thing. We
> should not have global static data for luo floating around the kernel,
> this is too easily abused in bad ways.

Not sure how much difference this makes in practice, but I get your
point.

>
> The above "complicated" sequence forces the caller to have a fd
> session handle, and "hides" the global state inside luo so the
> subsystem can't just randomly reach into it whenever it likes.
>
> This is a deliberate and violent way to force clean coding practices
> and good layering.
>
> Not sure why hugetlb pools would need another xarray??

Not sure myself either. I used it to demonstrate my point of having
runtime state and serialized state separate from each other.

>
> 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
>    frozen, can't add/remove PFNs.

Doesn't that circumvent LUO's state machine? The idea with the state
machine was to have clear points in time when the system goes into the
"limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
event. With what you propose, the first FD being preserved implicitly
triggers the prepare event. Same thing for unprepare/cancel operations.

I am wondering if it is better to do it the other way round: prepare all
files first, and then prepare the hugetlb subsystem at
LIVEUPDATE_PREPARE event. At that point it already knows which pages to
mark preserved so the serialization can be done in one go.

> 2) Require the users of hugetlb memory, like memfd, to
>    preserve/restore the folios they are using (using their hugetlb order)
> 3) Just before kexec run over the PFN list and mark a bit if the folio
>    was preserved by KHO or not. Make sure everything gets KHO
>    preserved.

"just before kexec" would need a callback from LUO. I suppose a
subsystem is the place for that callback. I wrote my email under the
(wrong) impression that we were replacing subsystems.

That makes me wonder: how is the subsystem-level callback supposed to
access the global data? I suppose it can use the liveupdate_file_handler
directly, but it is kind of strange since technically the subsystem and
file handler are two different entities.

Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
I'm not sure how that would map with this shared global data. memfd and
guest_memfd will likely have different liveupdate_file_handler but would
share data from the same subsystem. Maybe that's a problem to solve for
later...

>
> Restore puts the PFNs that were not preserved directly in the free
> pool, the end user of the folio like the memfd restores and eventually
> normally frees the other folios.

Yeah, on the restore side this idea works fine I think.

>
> It is simple and fits nicely into the infrastructure here, where the
> first time you trigger a global state it does the pfn list and
> freezing, and the lifecycle and locking for this operation is directly
> managed by luo.
>
> The memfd, when it knows it has hugetlb folios inside it, would
> trigger this.
>
> Jason

-- 
Regards,
Pratyush Yadav
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 3 months, 3 weeks ago
On Tue, Oct 14, 2025 at 03:29:59PM +0200, Pratyush Yadav wrote:
> > 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
> >    frozen, can't add/remove PFNs.
> 
> Doesn't that circumvent LUO's state machine? The idea with the state
> machine was to have clear points in time when the system goes into the
> "limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
> event. 

I wouldn't get too invested in the FSM, it is there but it doesn't
mean every luo client has to be focused on it.

> With what you propose, the first FD being preserved implicitly
> triggers the prepare event. Same thing for unprepare/cancel operations.

Yes, this is easy to write and simple to manage.

> I am wondering if it is better to do it the other way round: prepare all
> files first, and then prepare the hugetlb subsystem at
> LIVEUPDATE_PREPARE event. At that point it already knows which pages to
> mark preserved so the serialization can be done in one go.

I think this would be slower and more complex?

> > 2) Require the users of hugetlb memory, like memfd, to
> >    preserve/restore the folios they are using (using their hugetlb order)
> > 3) Just before kexec run over the PFN list and mark a bit if the folio
> >    was preserved by KHO or not. Make sure everything gets KHO
> >    preserved.
> 
> "just before kexec" would need a callback from LUO. I suppose a
> subsystem is the place for that callback. I wrote my email under the
> (wrong) impression that we were replacing subsystems.

The file descriptors path should have luo client ops that have all
the required callbacks. This is probably an existing op.

> That makes me wonder: how is the subsystem-level callback supposed to
> access the global data? I suppose it can use the liveupdate_file_handler
> directly, but it is kind of strange since technically the subsystem and
> file handler are two different entities.

If we need such things we would need a way to link these together, but
I wonder if we really don't..

> Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
> I'm not sure how that would map with this shared global data. memfd and
> guest_memfd will likely have different liveupdate_file_handler but would
> share data from the same subsystem. Maybe that's a problem to solve for
> later...

On preserve memfd should call into hugetlb to activate it as a hugetlb
page provider and preserve it too.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pratyush Yadav 3 months, 2 weeks ago
On Mon, Oct 20 2025, Jason Gunthorpe wrote:

> On Tue, Oct 14, 2025 at 03:29:59PM +0200, Pratyush Yadav wrote:
>> > 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
>> >    frozen, can't add/remove PFNs.
>> 
>> Doesn't that circumvent LUO's state machine? The idea with the state
>> machine was to have clear points in time when the system goes into the
>> "limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
>> event. 
>
> I wouldn't get too invested in the FSM, it is there but it doesn't
> mean every luo client has to be focused on it.

Having each subsystem have its own state machine sounds like a bad idea
to me. It can get tricky to manage both for us and our users.

>
>> With what you propose, the first FD being preserved implicitly
>> triggers the prepare event. Same thing for unprepare/cancel operations.
>
> Yes, this is easy to write and simple to manage.
>
>> I am wondering if it is better to do it the other way round: prepare all
>> files first, and then prepare the hugetlb subsystem at
>> LIVEUPDATE_PREPARE event. At that point it already knows which pages to
>> mark preserved so the serialization can be done in one go.
>
> I think this would be slower and more complex?
>
>> > 2) Require the users of hugetlb memory, like memfd, to
>> >    preserve/restore the folios they are using (using their hugetlb order)
>> > 3) Just before kexec run over the PFN list and mark a bit if the folio
>> >    was preserved by KHO or not. Make sure everything gets KHO
>> >    preserved.
>> 
>> "just before kexec" would need a callback from LUO. I suppose a
>> subsystem is the place for that callback. I wrote my email under the
>> (wrong) impression that we were replacing subsystems.
>
> The file descriptors path should have luo client ops that have all
> the required callbacks. This is probably an existing op.
>
>> That makes me wonder: how is the subsystem-level callback supposed to
>> access the global data? I suppose it can use the liveupdate_file_handler
>> directly, but it is kind of strange since technically the subsystem and
>> file handler are two different entities.
>
> If we need such things we would need a way to link these together, but
> I'm wonder if we really don't..
>
>> Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
>> I'm not sure how that would map with this shared global data. memfd and
>> guest_memfd will likely have different liveupdate_file_handler but would
>> share data from the same subsystem. Maybe that's a problem to solve for
>> later...
>
> On preserve memfd should call into hugetlb to activate it as a hugetlb
> page provider and preserve it too.

From what I understand, the main problem you want to solve is that the
life cycle of the global data should be tied to the file descriptors.
And since everything should have a FD anyway, can't we directly tie the
subsystems to file handlers? The subsystem gets a "preserve" callback
when the first FD that uses it gets preserved. It gets a "unpreserve"
callback when the last FD goes away. And the rest of the state machine
like prepare, cancel, etc. stay the same.

I think this gives us a clean abstraction that has LUO-managed lifetime.

It also works with the guest_memfd and memfd case since both can have
hugetlb as their underlying subsystem. For example,

static const struct liveupdate_file_ops memfd_luo_file_ops = {
	.preserve = memfd_luo_preserve,
	.unpreserve = memfd_luo_unpreserve,
	[...]
	.subsystem = &luo_hugetlb_subsys,
};

And then luo_{un,}preserve_file() can keep a refcount for the subsystem
and preserve or unpreserve the subsystem as needed. LUO can manage the
locking for these callbacks too.
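
A minimal sketch of that, with the lock and refcount living in LUO
(luo_find_handler(), the subsys->users counter and the subsystem ops
names are illustrative, not the real API):

int luo_preserve_file(struct liveupdate_session *session, struct file *file,
		      u64 token)
{
	struct liveupdate_file_handler *fh = luo_find_handler(file);
	struct liveupdate_subsystem *subsys = fh->ops->subsystem;
	int err;

	mutex_lock(&subsys->lock);
	if (subsys->users == 0) {
		/* First FD of this handler type: bring the subsystem in. */
		err = subsys->ops->preserve(subsys);
		if (err)
			goto out_unlock;
	}
	subsys->users++;

	err = fh->ops->preserve(fh, file, token);
	if (err && --subsys->users == 0)
		subsys->ops->unpreserve(subsys);

out_unlock:
	mutex_unlock(&subsys->lock);
	return err;
}

luo_unpreserve_file() would do the mirror image: call fh->ops->unpreserve()
and drop the subsystem when the count hits zero.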

-- 
Regards,
Pratyush Yadav
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Samiullah Khawaja 4 months ago
On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > [...]
>
> Hi all,
>
> Following up on yesterday's Hypervisor Live Update meeting, we
> discussed the requirements for the LUO to track dependencies,
> particularly for IOMMU preservation and other stateful file
> descriptors. This email summarizes the main design decisions and
> outcomes from that discussion.
>
> For context, the notes from the previous meeting can be found here:
> https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
> The notes for yesterday's meeting are not yet available.
>
> The key outcomes are as follows:
>
> 1. User-Enforced Ordering
> -------------------------
> The responsibility for enforcing the correct order of operations will
> lie with the userspace agent. If fd_A is a dependency for fd_B,
> userspace must ensure that fd_A is preserved before fd_B. This same
> ordering must be honored during the restoration phase after the reboot
> (fd_A must be restored before fd_B). The kernel preserves the ordering.
>
> 2. Serialization in PRESERVE_FD
> -------------------------------
> To keep the global prepare() phase lightweight and predictable, the
> consensus was to shift the heavy serialization work into the
> PRESERVE_FD ioctl handler. This means that when userspace requests to
> preserve a file, the file handler should perform the bulk of the
> state-saving work immediately.
>
> The proposed sequence of operations reflects this shift:
>
> Shutdown Flow:
> fd_preserve() (heavy serialization) -> prepare() (lightweight final
> checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)
>
> Boot & Restore Flow:
> fd_restore() (lightweight object creation) -> Resume VM -> Heavy
> post-restore IOCTLs (e.g., hardware page table re-creation) ->
> finish() (lightweight cleanup)
>
> This decision primarily serves as a guideline for file handler
> implementations. For the LUO core, this implies minor API changes,
> such as renaming can_preserve() to a more active preserve() and adding
> a corresponding unpreserve() callback to be called during
> UNPRESERVE_FD.
>
> 3. FD Data Query API
> --------------------
> We identified the need for a kernel API to allow subsystems to query
> preserved FD data during the boot process, before userspace has
> initiated the restore.
>
> The proposed API would allow a file handler to retrieve a list of all
> its preserved FDs, including their session names, tokens, and the
> private data payload.
>
> Proposed Data Structure:
>
> struct liveupdate_fd {
>         char *session; /* session name */
>         u64 token; /* Preserved FD token */
>         u64 data; /* Private preserved data */
> };
>
> Proposed Function:
> liveupdate_fd_data_query(struct liveupdate_file_handler *h,
>                          struct liveupdate_fd *fds, long *count);

Now that you are adding the "File-Lifecycle-Bound Global State", I was
wondering if this session data query mechanism is still necessary. It
seems that any preserved state a file handler needs to restore during
boot could be fetched using the Global data support instead. For
example, I don't think session information will be needed to restore
iommu domains during boot (iommu init), but even if some other file
handler needs it then it can keep this info in global data. I
discussed this briefly with Pasha today, but wanted to raise it here
as well.
>
> 4. New File-Lifecycle-Bound Global State
> ----------------------------------------
> A new mechanism for managing global state was proposed, designed to be
> tied to the lifecycle of the preserved files themselves. This would
> allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> global state that is only relevant when one or more of its FDs are
> being managed by LUO.
>
> The key characteristics of this new mechanism are:
> The global state is optionally created on the first preserve() call
> for a given file handler.
> The state can be updated on subsequent preserve() calls.
> The state is destroyed when the last corresponding file is unpreserved
> or finished.
> The data can be accessed during boot.
>
> I am thinking of an API like this.
>
> 1. Add three more callbacks to liveupdate_file_ops:
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
>
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);
>
> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> The get/put global state data:
>
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
>
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> Execution Flow:
> 1. Outgoing Kernel (First preserve() call):
> 2. Handler's preserve() is called. It needs the global state, so it calls
>    liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
>    It sees h->global_state_obj is NULL.
>    LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
>    The handler allocates its state, preserves it with KHO, and returns its live
>    pointer and a u64 handle.
> 3. LUO stores the handle internally for later serialization.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
> 5. The preserve() callback does its work using the obj.
> 6. It calls liveupdate_fh_global_state_put(h), which releases the lock.
>
> Global PREPARE:
> 1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
>    the LUO FDT.
>
> Incoming Kernel (First access):
> 1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time. LUO
>    acquires h->global_state_lock.
> 2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
>    handle from the FDT. LUO calls h->ops->global_state_restore()
> 3. Reconstructs its state object, and returns the live pointer.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
> 5. The caller does its work.
> 6. It calls liveupdate_fh_global_state_put(h) to release the lock.
>
> Last File Cleanup (in unpreserve or finish):
> 1. LUO decrements h->count to 0.
> 2. This triggers the cleanup logic.
> 3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
> 4. The handler frees its memory and resources.
> 5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
>    cycle.
>
> Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 9, 2025 at 5:58 PM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > [...]
> >
> > Hi all,
> >
> > Following up on yesterday's Hypervisor Live Update meeting, we
> > discussed the requirements for the LUO to track dependencies,
> > particularly for IOMMU preservation and other stateful file
> > descriptors. This email summarizes the main design decisions and
> > outcomes from that discussion.
> >
> > [...]
> >
> > 3. FD Data Query API
> > --------------------
> > We identified the need for a kernel API to allow subsystems to query
> > preserved FD data during the boot process, before userspace has
> > initiated the restore.
> >
> > The proposed API would allow a file handler to retrieve a list of all
> > its preserved FDs, including their session names, tokens, and the
> > private data payload.
> >
> > Proposed Data Structure:
> >
> > struct liveupdate_fd {
> >         char *session; /* session name */
> >         u64 token; /* Preserved FD token */
> >         u64 data; /* Private preserved data */
> > };
> >
> > Proposed Function:
> > liveupdate_fd_data_query(struct liveupdate_file_handler *h,
> >                          struct liveupdate_fd *fds, long *count);
>
> Now that you are adding the "File-Lifecycle-Bound Global State", I was
> wondering if this session data query mechanism is still necessary. It
> seems that any preserved state a file handler needs to restore during
> boot could be fetched using the Global data support instead. For
> example, I don't think session information will be needed to restore
> iommu domains during boot (iommu init), but even if some other file
> handler needs it then it can keep this info in global data. I
> discussed this briefly with Pasha today, but wanted to raise it here
> as well.

I agree, the query API is ugly and indeed not needed with the FLB
Global State. The biggest problem with the query API is that the
caller must somehow know how to interpret the preserved file-handler
data before the struct file is reconstructed. This is problematic;
there should only be one place that knows how to store and interpret
the data, not multiple.

It looks like the combination of an enforced ordering:
Preservation: A->B->C->D
Un-preservation: D->C->B->A
Retrieval: A->B->C->D

and the FLB Global State (where data is automatically created and
destroyed when a particular file type participates in a live update)
removes the need for this query mechanism. For example, the IOMMU
driver/core can add its data only when an iommufd is preserved and add
more data as more iommufds are added. The preserved data is also
automatically removed once the live update is finished or canceled.

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Thu, Oct 09, 2025 at 06:42:09PM -0400, Pasha Tatashin wrote:
> 
> It looks like the combination of an enforced ordering:
> Preservation: A->B->C->D
> Un-preservation: D->C->B->A
> Retrieval: A->B->C->D
> 
> and the FLB Global State (where data is automatically created and
> destroyed when a particular file type participates in a live update)
> solves the need for this query mechanism. For example, the IOMMU
> driver/core can add its data only when an iommufd is preserved and add
> more data as more iommufds are added. The preserved data is also
> automatically removed once the live update is finished or canceled.

IDK I think we should try to be flexible on the restoration order.

Eg, if we project ahead to when we might need to preserve kvm and
iommufd FDs as well, the order would likely be:

Preservation: memfd -> kvm -> iommufd -> vfio
Retrieval: iommu_domain (early boot) -> kvm -> iommufd -> vfio -> memfd

Just because of how the dependencies work, and the desire to push the
memfd as late as possible.

I don't see an issue with this, the kernel enforcing the ordering
should fall out naturally based on the sanity checks each step will
do.

ie I can't get back the KVM fd if luo says it is out of order.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Fri, Oct 10, 2025 at 10:42 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 09, 2025 at 06:42:09PM -0400, Pasha Tatashin wrote:
> >
> > It looks like the combination of an enforced ordering:
> > Preservation: A->B->C->D
> > Un-preservation: D->C->B->A
> > Retrieval: A->B->C->D
> >
> > and the FLB Global State (where data is automatically created and
> > destroyed when a particular file type participates in a live update)
> > solves the need for this query mechanism. For example, the IOMMU
> > driver/core can add its data only when an iommufd is preserved and add
> > more data as more iommufds are added. The preserved data is also
> > automatically removed once the live update is finished or canceled.
>
> IDK I think we should try to be flexible on the restoration order.

It is easier to be inflexible at first and then relax the requirement
than the other way around. I think it is alright to enforce the order
for now, as it is driven only by userspace.

> Eg, if we project ahead to when we might need to preserve kvm and
> iommufd FDs as well, the order would likely be:
>
> Preservation: memfd -> kvm -> iommufd -> vfio
> Retrieval: iommu_domain (early boot) -> kvm -> iommufd -> vfio -> memfd

At some point, we will implement orphaned VMs, where a VM can run
without a VMM during the live-update period. This would allow us to
reduce the blackout time and later enable vCPUs to keep running even
during kexec.

With that, I would assume KVM itself would drive the live update and
would make LUO calls to preserve the resources in an orderly fashion
and then restore them in the same order during boot.

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Fri, Oct 10, 2025 at 10:58:00AM -0400, Pasha Tatashin wrote:

> With that, I would assume KVM itself would drive the live update and
> would make LUO calls to preserve the resources in an orderly fashion
> and then restore them in the same order during boot.

I don't think so, it should always be sequenced by userspace, and KVM
is not the thing linked to VFIO or IOMMUFD, that's backwards.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Samiullah Khawaja 4 months ago
On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > > [...]
>
> Hi all,
>
> Following up on yesterday's Hypervisor Live Update meeting, we
> discussed the requirements for the LUO to track dependencies,
> particularly for IOMMU preservation and other stateful file
> descriptors. This email summarizes the main design decisions and
> outcomes from that discussion.
>
> > [...]
>
> 4. New File-Lifecycle-Bound Global State
> ----------------------------------------
> A new mechanism for managing global state was proposed, designed to be
> tied to the lifecycle of the preserved files themselves. This would
> allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> global state that is only relevant when one or more of its FDs are
> being managed by LUO.
>
> The key characteristics of this new mechanism are:
> The global state is optionally created on the first preserve() call
> for a given file handler.
> The state can be updated on subsequent preserve() calls.
> The state is destroyed when the last corresponding file is unpreserved
> or finished.
> The data can be accessed during boot.
>
> I am thinking of an API like this.
>
> 1. Add three more callbacks to liveupdate_file_ops:

This part is a little tricky: the file handler might be in a
completely different subsystem than the global state. While the FD is
supposed to own and control the lifecycle of the preserved state, the
global state might be needed in a completely different layer during
boot or on some other event. Maybe the user can put some APIs in place
to move this state across layers?

Subsystems actually do provide this flexibility. But if I see
correctly, this approach is "global" like a subsystem, but synchronous:
users can decide when to preserve/create the global state as they need
it, without having to stage it and preserve it when the subsystem's
PREPARE is called.
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
>
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);

If I understand correctly, is this only for unpacking? Once unpacked,
the user can call the _get function and get the global state that was
just unpacked. This should be fine.
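
For instance, the incoming-kernel side could be as simple as this (a
sketch; iommu_luo_restore_domains(), iommufd_luo_fh and struct
iommufd_global_luo_state are invented names):

static int __init iommu_luo_early_restore(void)
{
	struct iommufd_global_luo_state *state;
	int err;

	/* First get in the new kernel triggers ops->global_state_restore(). */
	err = liveupdate_fh_global_state_get(&iommufd_luo_fh, (void **)&state);
	if (err)
		return err;

	/* Recreate iommu domains before userspace retrieves any FD. */
	err = iommu_luo_restore_domains(state);

	liveupdate_fh_global_state_put(&iommufd_luo_fh);
	return err;
}
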
>
> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> The get/put global state data:
>
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
>
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> Execution Flow:
> 1. Outgoing Kernel (First preserve() call):
> 2. Handler's preserve() is called. It needs the global state, so it calls
>    liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
>    It sees h->global_state_obj is NULL.
>    LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
>    The handler allocates its state, preserves it with KHO, and returns its live
>    pointer and a u64 handle.
> 3. LUO stores the handle internally for later serialization.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
> 5. The preserve() callback does its work using the obj.
> 6. It calls liveupdate_fh_global_state_put(h), which releases the lock.
>
> Global PREPARE:
> 1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
>    the LUO FDT.
>
> Incoming Kernel (First access):
> 1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time. LUO
>    acquires h->global_state_lock.
> 2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
>    handle from the FDT. LUO calls h->ops->global_state_restore()
> 3. Reconstructs its state object, and returns the live pointer.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
> 5. The caller does its work.
> 6. It calls liveupdate_fh_global_state_put(h) to release the lock.
>
> Last File Cleanup (in unpreserve or finish):
> 1. LUO decrements h->count to 0.
> 2. This triggers the cleanup logic.
> 3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
> 4. The handler frees its memory and resources.
> 5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
>    cycle.
>
> Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Wed, Oct 8, 2025 at 3:04 AM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > [...]
> >
> > Hi all,
> >
> > Following up on yesterday's Hypervisor Live Update meeting, we
> > discussed the requirements for the LUO to track dependencies,
> > particularly for IOMMU preservation and other stateful file
> > descriptors. This email summarizes the main design decisions and
> > outcomes from that discussion.
> >
> > [...]
> >
> > 4. New File-Lifecycle-Bound Global State
> > ----------------------------------------
> > A new mechanism for managing global state was proposed, designed to be
> > tied to the lifecycle of the preserved files themselves. This would
> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> > global state that is only relevant when one or more of its FDs are
> > being managed by LUO.
> >
> > The key characteristics of this new mechanism are:
> > The global state is optionally created on the first preserve() call
> > for a given file handler.
> > The state can be updated on subsequent preserve() calls.
> > The state is destroyed when the last corresponding file is unpreserved
> > or finished.
> > The data can be accessed during boot.
> >
> > I am thinking of an API like this.

Sami and I discussed this further, and we agree that the proposed API
will work. We also identified two additional requirements that were
not mentioned in my previous email:

1. Ordered Un-preservation
The un-preservation of file descriptors must also be ordered and must
occur in the reverse order of preservation. For example, if a user
preserves a memfd first and then an iommufd that depends on it, the
iommufd must be un-preserved before the memfd when the session is
closed or the FDs are explicitly un-preserved.

2. New API to Check Preservation Status
A new LUO API will be needed to check if a struct file is already
preserved within a session. This is needed for dependency validation.
The proposed function would look like this:

bool liveupdate_is_file_preserved(struct liveupdate_session *session,
                                  struct file *file);

This will allow the file handler for one FD (e.g., iommufd) to verify
during its preserve() callback that its dependencies (e.g., the
backing memfd) have already been preserved in the same session.

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> 1. Ordered Un-preservation
> The un-preservation of file descriptors must also be ordered and must
> occur in the reverse order of preservation. For example, if a user
> preserves a memfd first and then an iommufd that depends on it, the
> iommufd must be un-preserved before the memfd when the session is
> closed or the FDs are explicitly un-preserved.

Why?

I imagined the first to unpreserve would restore the struct file * -
that would satisfy the order.

The ioctl version that is to get back a FD would recover the struct
file and fd_install it.

Meaning preserve side is retaining a database of labels to restored
struct file *'s.

As discussed, unpreserving a FD does not imply unfreeze, which is the
opposite of how preserve works.

> 2. New API to Check Preservation Status
> A new LUO API will be needed to check if a struct file is already
> preserved within a session. This is needed for dependency validation.
> The proposed function would look like this:

This doesn't seem right, the API should be more like 'luo get
serialization handle for this file *'

If it hasn't been preserved then there won't be a handle, otherwise it
should return something to allow the unpreserving side to recover this
struct file *.

That's the general use case at least, there may be some narrower use
cases where the preserver throws away the handle.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Wed, Oct 8, 2025 at 3:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> > 1. Ordered Un-preservation
> > The un-preservation of file descriptors must also be ordered and must
> > occur in the reverse order of preservation. For example, if a user
> > preserves a memfd first and then an iommufd that depends on it, the
> > iommufd must be un-preserved before the memfd when the session is
> > closed or the FDs are explicitly un-preserved.
>
> Why?
>
> I imagined the first to unpreserve would restore the struct file * -
> that would satisfy the order.

In my description, "un-preserve" refers to the action of canceling a
preservation request in the outgoing kernel, before kexec ever
happens. It's the pre-reboot counterpart to the PRESERVE_FD ioctl,
used when a user decides not to go through with the live update for a
specific FD.

The terminology I am using:
preserve: Put FD into LUO in the outgoing kernel
unpreserve: Remove FD from LUO from the outgoing kernel
retrieve: Restore FD and return it to user in the next kernel

For the retrieval part, we are going to be using FIFO order, the same
as preserve.

> The ioctl version that is to get back a FD would recover the struct
> file and fd_install it.
>
> Meaning preserve side is retaining a database of labels to restored
> struct file *'s.
>
> As discussed unpreserve a FD does not imply unfreeze, which is the
> opposite of how preserver works.
>
> > 2. New API to Check Preservation Status
> > A new LUO API will be needed to check if a struct file is already
> > preserved within a session. This is needed for dependency validation.
> > The proposed function would look like this:
>
> This doesn't seem right, the API should be more like 'luo get
> serialization handle for this file *'

How about:

int liveupdate_find_token(struct liveupdate_session *session,
                          struct file *file, u64 *token);

And if needed:
int liveupdate_find_file(struct liveupdate_session *session,
                         u64 token, struct file **file);

Return: 0 on success, or -ENOENT if the file is not preserved.
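
For example, a dependent handler could use it like this during its own
preserve() (iommufd_luo_preserve_deps() and the blob layout are
illustrative only):

static int iommufd_luo_preserve_deps(struct liveupdate_session *session,
				     struct file *memfd_file,
				     struct iommufd_luo_blob *blob)
{
	u64 memfd_token;
	int err;

	/* Fails with -ENOENT if userspace did not preserve the memfd first. */
	err = liveupdate_find_token(session, memfd_file, &memfd_token);
	if (err)
		return err;

	/* Record the dependency by token; the token is stable across kexec. */
	blob->memfd_token = memfd_token;
	return 0;
}
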

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Wed, Oct 08, 2025 at 04:26:39PM -0400, Pasha Tatashin wrote:
> On Wed, Oct 8, 2025 at 3:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> > > 1. Ordered Un-preservation
> > > The un-preservation of file descriptors must also be ordered and must
> > > occur in the reverse order of preservation. For example, if a user
> > > preserves a memfd first and then an iommufd that depends on it, the
> > > iommufd must be un-preserved before the memfd when the session is
> > > closed or the FDs are explicitly un-preserved.
> >
> > Why?
> >
> > I imagined the first to unpreserve would restore the struct file * -
> > that would satisfy the order.
> 
> In my description, "un-preserve" refers to the action of canceling a
> preservation request in the outgoing kernel, before kexec ever
> happens. It's the pre-reboot counterpart to the PRESERVE_FD ioctl,
> used when a user decides not to go through with the live update for a
> specific FD.
> 
> The terminology I am using:
> preserve: Put FD into LUO in the outgoing kernel
> unpreserve: Remove FD from LUO from the outgoing kernel
> retrieve: Restore FD and return it to user in the next kernel

Ok

> For the retrieval part, we are going to be using FIFO order, the same
> as preserve.

This won't work. Retrieval is driven by early boot discovery ordering
and then by userspace. It will be in whatever order it wants. We need
to be able to do things like make the struct file * at the moment
something requests it..

> > This doesn't seem right, the API should be more like 'luo get
> > serialization handle for this file *'
> 
> How about:
> 
> int liveupdate_find_token(struct liveupdate_session *session,
>                           struct file *file, u64 *token);

This sort of thing should not be used on the preserve side..

> And if needed:
> int liveupdate_find_file(struct liveupdate_session *session,
>                          u64 token, struct file **file);
> 
> Return: 0 on success, or -ENOENT if the file is not preserved.

I would argue it should always cause a preservation...

But this is still backwards, what we need is something like

liveupdate_preserve_file(session, file, &token);
my_preserve_blob.file_token = token

[..]

file = liveupdate_retrieve_file(session, my_preserve_blob.file_token);

And these can run in any order, and be called multiple times.
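
Spelled out a bit more (everything except the two liveupdate_*_file()
calls proposed above is an invented example):

/* Outgoing kernel, wherever the dependency is serialized: */
static int foo_serialize(struct liveupdate_session *session,
			 struct foo_state *foo, struct foo_abi_blob *blob)
{
	u64 token;
	int err;

	/* Preserve (or reuse the existing preservation of) the backing file. */
	err = liveupdate_preserve_file(session, foo->memfd, &token);
	if (err)
		return err;

	blob->file_token = token;	/* kernel-namespace token, part of the ABI */
	return 0;
}

/* Incoming kernel, whenever foo is restored: */
static int foo_deserialize(struct liveupdate_session *session,
			   struct foo_abi_blob *blob, struct foo_state *foo)
{
	foo->memfd = liveupdate_retrieve_file(session, blob->file_token);
	return PTR_ERR_OR_ZERO(foo->memfd);
}
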

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 9, 2025 at 10:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Oct 08, 2025 at 04:26:39PM -0400, Pasha Tatashin wrote:
> > On Wed, Oct 8, 2025 at 3:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> > > > 1. Ordered Un-preservation
> > > > The un-preservation of file descriptors must also be ordered and must
> > > > occur in the reverse order of preservation. For example, if a user
> > > > preserves a memfd first and then an iommufd that depends on it, the
> > > > iommufd must be un-preserved before the memfd when the session is
> > > > closed or the FDs are explicitly un-preserved.
> > >
> > > Why?
> > >
> > > I imagined the first to unpreserve would restore the struct file * -
> > > that would satisfy the order.
> >
> > In my description, "un-preserve" refers to the action of canceling a
> > preservation request in the outgoing kernel, before kexec ever
> > happens. It's the pre-reboot counterpart to the PRESERVE_FD ioctl,
> > used when a user decides not to go through with the live update for a
> > specific FD.
> >
> > The terminology I am using:
> > preserve: Put FD into LUO in the outgoing kernel
> > unpreserve: Remove FD from LUO from the outgoing kernel
> > retrieve: Restore FD and return it to user in the next kernel
>
> Ok
>
> > For the retrieval part, we are going to be using FIFO order, the same
> > as preserve.
>
> This won't work. retrieval is driven by early boot discovery ordering
> and then by userspace. It will be in whatever order it wants. We need
> to be able to do things like make the struct file * at the moment
> something requests it..

I thought we wanted only the user to do "struct file" creation when
the user retrieves FD back. In this case we can enforce strict
ordering during retrieval. If "struct file" can be retrieved by
anything within the kernel, then that could be any kernel process
during boot, meaning that charging is not going to be properly applied
when kernel allocations are performed.

We specifically decided that while "struct file"s are going to be
created only by the user, the other subsystems can have early access
to the preserved file data, if they know how to parse it.

> > > This doesn't seem right, the API should be more like 'luo get
> > > serialization handle for this file *'
> >
> > How about:
> >
> > int liveupdate_find_token(struct liveupdate_session *session,
> >                           struct file *file, u64 *token);
>
> This sort of thing should not be used on the preserve side..
>
> > And if needed:
> > int liveupdate_find_file(struct liveupdate_session *session,
> >                          u64 token, struct file **file);
> >
> > Return: 0 on success, or -ENOENT if the file is not preserved.
>
> I would argue it should always cause a preservation...
>
> But this is still backwards, what we need is something like
>
> liveupdate_preserve_file(session, file, &token);
> my_preserve_blob.file_token = token

We cannot do that, the user should have already preserved that file
and provided us with a token to use, if that file was not preserved by
the user it is a bug. With this proposal, we would have to generate a
token, and it was argued that the kernel should not do that.

> file = liveupdate_retrieve_file(session, my_preserve_blob.file_token);
>
> And these can run in any order, and be called multiple times.
>
> Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Thu, Oct 09, 2025 at 11:01:25AM -0400, Pasha Tatashin wrote:
> In this case we can enforce strict
> ordering during retrieval. If "struct file" can be retrieved by
> anything within the kernel, then that could be any kernel process
> during boot, meaning that charging is not going to be properly applied
> when kernel allocations are performed.

Ugh, yeah, OK that's irritating and might burn us, but we did decide
on that strategy.

> > I would argue it should always cause a preservation...
> >
> > But this is still backwards, what we need is something like
> >
> > liveupdate_preserve_file(session, file, &token);
> > my_preserve_blob.file_token = token
> 
> We cannot do that, the user should have already preserved that file
> and provided us with a token to use, if that file was not preserved by
> the user it is a bug. With this proposal, we would have to generate a
> token, and it was argued that the kernel should not do that.

The token is the label used as ABI across the kexec. Each entity doing
a serialization can operate its labels however it needs.

Here I am suggesting that when a kernel entity goes to record a struct
file in a kernel ABI structure it can get a kernel generated token for
it.

This is a different token name space than the user provided tokens
through the ioctl. A single struct file may have many entities
serializing it and possibly many tokens.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 9, 2025 at 1:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 09, 2025 at 11:01:25AM -0400, Pasha Tatashin wrote:
> > In this case we can enforce strict
> > ordering during retrieval. If "struct file" can be retrieved by
> > anything within the kernel, then that could be any kernel process
> > during boot, meaning that charging is not going to be properly applied
> > when kernel allocations are performed.
>
> Ugh, yeah, OK that's irritating and might burn us, but we did decide
> on that strategy.
>
> > > I would argue it should always cause a preservation...
> > >
> > > But this is still backwards, what we need is something like
> > >
> > > liveupdate_preserve_file(session, file, &token);
> > > my_preserve_blob.file_token = token
> >
> > We cannot do that, the user should have already preserved that file
> > and provided us with a token to use, if that file was not preserved by
> > the user it is a bug. With this proposal, we would have to generate a
> > token, and it was argued that the kernel should not do that.
>
> The token is the label used as ABI across the kexec. Each entity doing
> a serialization can operate it's labels however it needs.
>
> Here I am suggeting that when a kernel entity goes to record a struct
> file in a kernel ABI structure it can get a kernel generated token for
> it.

Sure, we can consider allowing the kernel to preserve dependent FDs
automatically in the future, but is there a compelling use case that
requires it right now?

For the initial implementation, I think we should stick to the
simpler, agreed-upon plan: preservation order is explicitly defined by
userspace. If a preserve() call fails due to an unmet dependency, the
error is returned to the user, who is then responsible for correcting
the order. This keeps the kernel logic straightforward and places the
preservation responsibility squarely in userspace, where it belongs.

Pasha
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Thu, Oct 09, 2025 at 02:37:44PM -0400, Pasha Tatashin wrote:
> On Thu, Oct 9, 2025 at 1:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Thu, Oct 09, 2025 at 11:01:25AM -0400, Pasha Tatashin wrote:
> > > In this case we can enforce strict
> > > ordering during retrieval. If "struct file" can be retrieved by
> > > anything within the kernel, then that could be any kernel process
> > > during boot, meaning that charging is not going to be properly applied
> > > when kernel allocations are performed.
> >
> > Ugh, yeah, OK that's irritating and might burn us, but we did decide
> > on that strategy.
> >
> > > > I would argue it should always cause a preservation...
> > > >
> > > > But this is still backwards, what we need is something like
> > > >
> > > > liveupdate_preserve_file(session, file, &token);
> > > > my_preserve_blob.file_token = token
> > >
> > > We cannot do that: the user should have already preserved that file
> > > and provided us with a token to use; if that file was not preserved by
> > > the user, it is a bug. With this proposal, we would have to generate a
> > > token, and it was argued that the kernel should not do that.
> >
> > The token is the label used as ABI across the kexec. Each entity doing
> > a serialization can manage its labels however it needs.
> >
> > Here I am suggesting that when a kernel entity goes to record a struct
> > file in a kernel ABI structure, it can get a kernel-generated token for
> > it.
> 
> Sure, we can consider allowing the kernel to preserve dependent FDs
> automatically in the future, but is there a compelling use case that
> requires it right now?

Right now, for the three prototype series... Hmm, yes, I think we can
avoid implementing this.

In the future I suspect iommufd will need to restore the KVM fd, since
KVM state sometimes becomes entangled with the IOMMU on some arches.

The issue here is not order; it is straight up 'what value does
iommufd write to its kexec ABI struct to refer to the KVM fd'.
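
In other words, something along these lines (the struct and field names
are invented only to illustrate the question):

struct iommufd_kexec_blob {
	__aligned_u64 kvm_file_token;	/* which preserved file is the KVM fd? */
	__aligned_u64 ioas_state;
	/* ... the rest of iommufd's serialized state ... */
};

Whatever goes into kvm_file_token has to be a value iommufd can obtain
at serialization time and resolve back to the KVM file after kexec.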

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Samiullah Khawaja 4 months ago
On Thu, Oct 9, 2025 at 8:02 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Thu, Oct 9, 2025 at 10:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 04:26:39PM -0400, Pasha Tatashin wrote:
> > > On Wed, Oct 8, 2025 at 3:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> > > > > 1. Ordered Un-preservation
> > > > > The un-preservation of file descriptors must also be ordered and must
> > > > > occur in the reverse order of preservation. For example, if a user
> > > > > preserves a memfd first and then an iommufd that depends on it, the
> > > > > iommufd must be un-preserved before the memfd when the session is
> > > > > closed or the FDs are explicitly un-preserved.
> > > >
> > > > Why?
> > > >
> > > > I imagined the first to unpreserve would restore the struct file * -
> > > > that would satisfy the order.
> > >
> > > In my description, "un-preserve" refers to the action of canceling a
> > > preservation request in the outgoing kernel, before kexec ever
> > > happens. It's the pre-reboot counterpart to the PRESERVE_FD ioctl,
> > > used when a user decides not to go through with the live update for a
> > > specific FD.
> > >
> > > The terminology I am using:
> > > preserve: Put FD into LUO in the outgoing kernel
> > > unpreserve: Remove FD from LUO from the outgoing kernel
> > > retrieve: Restore FD and return it to user in the next kernel
> >
> > Ok
> >
> > > For the retrieval part, we are going to be using FIFO order, the same
> > > as preserve.
> >
> > This won't work. Retrieval is driven by early boot discovery ordering
> > and then by userspace. It will be in whatever order it wants. We need
> > to be able to do things like make the struct file * at the moment
> > something requests it.
>
> I thought we wanted only the user to do "struct file" creation when
> the user retrieves the FD back. In this case we can enforce strict
> ordering during retrieval. If "struct file" can be retrieved by
> anything within the kernel, then that could be any kernel process
> during boot, meaning that charging is not going to be properly applied
> when kernel allocations are performed.
>
> We specifically decided that while "struct file"s are going to be
> created only by the user, the other subsystems can have early access
> to the preserved file data, if they know how to parse it.
>
> > > > This doesn't seem right, the API should be more like 'luo get
> > > > serialization handle for this file *'
> > >
> > > How about:
> > >
> > > int liveupdate_find_token(struct liveupdate_session *session,
> > >                           struct file *file, u64 *token);
> >
> > This sort of thing should not be used on the preserve side..
> >
> > > And if needed:
> > > int liveupdate_find_file(struct liveupdate_session *session,
> > >                          u64 token, struct file **file);
> > >
> > > Return: 0 on success, or -ENOENT if the file is not preserved.
> >
> > I would argue it should always cause a preservation...
> >
> > But this is still backwards, what we need is something like
> >
> > liveupdate_preserve_file(session, file, &token);
> > my_preserve_blob.file_token = token

Please clarify: do you still expect the user to register the dependency
FDs explicitly, with this API only triggering the "prepare()" or
"preserve()" callback so that the preservation order is
enforced/synchronized?
>
> We cannot do that: the user should have already preserved that file
> and provided us with a token to use; if that file was not preserved by
> the user, it is a bug. With this proposal, we would have to generate a
> token, and it was argued that the kernel should not do that.

Agreed. Another thing I was wondering about: how does userspace know
that its FD was preserved as a dependency?

>
> > file = liveupdate_retrieve_file(session, my_preserve_blob.file_token);
> >
> > And these can run in any order, and be called multiple times.
> >
> > Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Thu, Oct 9, 2025 at 11:01 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Thu, Oct 9, 2025 at 10:48 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 04:26:39PM -0400, Pasha Tatashin wrote:
> > > On Wed, Oct 8, 2025 at 3:36 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >
> > > > On Wed, Oct 08, 2025 at 12:40:34PM -0400, Pasha Tatashin wrote:
> > > > > 1. Ordered Un-preservation
> > > > > The un-preservation of file descriptors must also be ordered and must
> > > > > occur in the reverse order of preservation. For example, if a user
> > > > > preserves a memfd first and then an iommufd that depends on it, the
> > > > > iommufd must be un-preserved before the memfd when the session is
> > > > > closed or the FDs are explicitly un-preserved.
> > > >
> > > > Why?
> > > >
> > > > I imagined the first to unpreserve would restore the struct file * -
> > > > that would satisfy the order.
> > >
> > > In my description, "un-preserve" refers to the action of canceling a
> > > preservation request in the outgoing kernel, before kexec ever
> > > happens. It's the pre-reboot counterpart to the PRESERVE_FD ioctl,
> > > used when a user decides not to go through with the live update for a
> > > specific FD.
> > >
> > > The terminology I am using:
> > > preserve: Put FD into LUO in the outgoing kernel
> > > unpreserve: Remove FD from LUO from the outgoing kernel
> > > retrieve: Restore FD and return it to user in the next kernel
> >
> > Ok
> >
> > > For the retrieval part, we are going to be using FIFO order, the same
> > > as preserve.
> >
> > This won't work. Retrieval is driven by early boot discovery ordering
> > and then by userspace. It will be in whatever order it wants. We need
> > to be able to do things like make the struct file * at the moment
> > something requests it.
>
> I thought we wanted only the user to do "struct file" creation when
> the user retrieves the FD back. In this case we can enforce strict
> ordering during retrieval. If "struct file" can be retrieved by
> anything within the kernel, then that could be any kernel process
> during boot, meaning that charging is not going to be properly applied
> when kernel allocations are performed.

There is a second reason: by the time we enter userspace and are ready
to retrieve FDs, we know that every file handler that is going to
register has already registered. If we did the retrieval during boot,
within the kernel, we could run into the problem of trying to retrieve
the data of a file handler that has not yet registered.
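
Roughly speaking (the lookup helper below is assumed for illustration,
not the actual LUO internals):

static int luo_retrieve_during_boot(const char *compatible)
{
	struct liveupdate_file_handler *fh;

	/* Assumed lookup of a registered handler by compatible string. */
	fh = luo_find_file_handler(compatible);
	if (!fh)
		return -ENOENT;	/* e.g. the handler lives in a module loaded later */

	/* ... hand the preserved data to fh ... */
	return 0;
}

Deferring retrieval to userspace avoids this window entirely.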

>
> We specifically decided that while "struct file"s are going to be
> created only by the user, the other subsystems can have early access
> to the preserved file data, if they know how to parse it.
>
> > > > This doesn't seem right, the API should be more like 'luo get
> > > > serialization handle for this file *'
> > >
> > > How about:
> > >
> > > int liveupdate_find_token(struct liveupdate_session *session,
> > >                           struct file *file, u64 *token);
> >
> > This sort of thing should not be used on the preserve side..
> >
> > > And if needed:
> > > int liveupdate_find_file(struct liveupdate_session *session,
> > >                          u64 token, struct file **file);
> > >
> > > Return: 0 on success, or -ENOENT if the file is not preserved.
> >
> > I would argue it should always cause a preservation...
> >
> > But this is still backwards, what we need is something like
> >
> > liveupdate_preserve_file(session, file, &token);
> > my_preserve_blob.file_token = token
>
> We cannot do that: the user should have already preserved that file
> and provided us with a token to use; if that file was not preserved by
> the user, it is a bug. With this proposal, we would have to generate a
> token, and it was argued that the kernel should not do that.
>
> > file = liveupdate_retrieve_file(session, my_preserve_blob.file_token);
> >
> > And these can run in any order, and be called multiple times.
> >
> > Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Jason Gunthorpe 4 months ago
On Tue, Oct 07, 2025 at 01:10:30PM -0400, Pasha Tatashin wrote:
> 
> 1. Add three more callbacks to liveupdate_file_ops:
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
> 
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);

It shouldn't be a "push" like this. Everything has a certain logical
point when it will need the LUO data; it should be coded to 'pull' the
data right at that point.


> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);

I'm not sure a callback here is a good idea: the users are synchronous
at early boot, and they should get their data and immediately process
it within the context of the caller. An 'unpack' callback does not seem
so useful to me.

> The get/put global state data:
> 
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
> 
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);

Maybe lock/unlock if it is locking.

It seems like a good direction overall. Really need to see how it
works with some examples.

Jason
Re: [PATCH v4 00/30] Live Update Orchestrator
Posted by Pasha Tatashin 4 months ago
On Tue, Oct 7, 2025 at 1:50 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Oct 07, 2025 at 01:10:30PM -0400, Pasha Tatashin wrote:
> >
> > 1. Add three more callbacks to liveupdate_file_ops:
> > /*
> >  * Optional. Called by LUO during first get global state call.
> >  * The handler should allocate/KHO preserve its global state object and return a
> >  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
> >  * address of preserved memory) via 'data_handle' that LUO will save.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_create)(struct liveupdate_file_handler *h,
> >                            void **obj, u64 *data_handle);
> >
> > /*
> >  * Optional. Called by LUO in the new kernel
> >  * before the first access to the global state. The handler receives
> >  * the preserved u64 data_handle and should use it to reconstruct its
> >  * global state object, returning a pointer to it via 'obj'.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_restore)(struct liveupdate_file_handler *h,
> >                             u64 data_handle, void **obj);
>
> It shouldn't be a "push" like this. Everything has a certain logical
> point when it will need the LUO data; it should be coded to 'pull' the
> data right at that point.

It is not pushed: this callback is invoked automatically on the first
call to liveupdate_fh_global_state_lock() in the new kernel. So exactly
when a user tries to access the global data, it is restored from KHO,
and the user can then access it via a normal pointer.
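
Roughly (the handler fields and locking below are assumptions for the
sketch, not the actual implementation):

int liveupdate_fh_global_state_lock(struct liveupdate_file_handler *fh,
				    void **obj)
{
	int ret = 0;

	mutex_lock(&fh->global_lock);
	if (!fh->global_obj) {
		if (fh->incoming && fh->ops->global_state_restore)
			/* First access in the new kernel: pull from KHO. */
			ret = fh->ops->global_state_restore(fh,
					fh->global_data_handle,
					&fh->global_obj);
		else if (fh->ops->global_state_create)
			/* First access in the old kernel: create and preserve. */
			ret = fh->ops->global_state_create(fh,
					&fh->global_obj,
					&fh->global_data_handle);
	}
	if (ret) {
		mutex_unlock(&fh->global_lock);
		return ret;
	}

	*obj = fh->global_obj;
	return 0;	/* returns with the handler-scoped lock held */
}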

> > /*
> >  * Optional. Called by LUO after the last
> >  * file for this handler is unpreserved or finished. The handler
> >  * must free its global state object and any associated resources.
> >  */
> > void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> I'm not sure a callback here is a good idea: the users are synchronous
> at early boot, and they should get their data and immediately process
> it within the context of the caller. An 'unpack' callback does not seem
> so useful to me.

This callback is also automatic: it is called only once the last FD is
finished and LUO has no FDs left for this file handler, so the global
state can be properly freed. There is no unpack here.
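
Something along these lines (again, the fields are assumptions for the
sketch):

static void luo_fh_file_finished(struct liveupdate_file_handler *fh)
{
	mutex_lock(&fh->global_lock);
	/* Drop the shared state once no preserved files remain. */
	if (!--fh->nr_files && fh->global_obj && fh->ops->global_state_destroy) {
		fh->ops->global_state_destroy(fh, fh->global_obj);
		fh->global_obj = NULL;
	}
	mutex_unlock(&fh->global_lock);
}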

>
> > The get/put global state data:
> >
> > /* Get and lock the data with file_handler scoped lock */
> > int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
> >                                    void **obj);
> >
> > /* Unlock the data */
> > void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> Maybe lock/unlock if it is locking.

Sure, will name them:
liveupdate_fh_global_state_lock()
liveupdate_fh_global_state_unlock()

>
> It seems like a good direction overall. Really need to see how it
> works with some examples.
>
> Jason
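
As a rough example of how a handler could wire these callbacks up (all
names below are invented for the sketch, and the actual KHO
preservation call is elided):

#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/liveupdate.h>
#include <linux/types.h>

struct foo_global {
	u64 nr_preserved_files;
	/* ... state shared by all of this handler's preserved files ... */
};

static int foo_global_create(struct liveupdate_file_handler *h,
			     void **obj, u64 *data_handle)
{
	struct foo_global *g = (void *)get_zeroed_page(GFP_KERNEL);

	if (!g)
		return -ENOMEM;
	/* A real handler would also KHO-preserve this page here. */
	*data_handle = virt_to_phys(g);
	*obj = g;
	return 0;
}

static int foo_global_restore(struct liveupdate_file_handler *h,
			      u64 data_handle, void **obj)
{
	/* The handle is the physical address saved by the old kernel. */
	*obj = phys_to_virt(data_handle);
	return 0;
}

static void foo_global_destroy(struct liveupdate_file_handler *h, void *obj)
{
	/* Last preserved file for this handler is gone; drop the state. */
	free_page((unsigned long)obj);
}

/* A preserve path would then use the shared state roughly like this: */
static int foo_note_preserved_file(struct liveupdate_file_handler *h)
{
	struct foo_global *g;
	int ret;

	/* First call triggers foo_global_create()/foo_global_restore(). */
	ret = liveupdate_fh_global_state_lock(h, (void **)&g);
	if (ret)
		return ret;
	g->nr_preserved_files++;
	liveupdate_fh_global_state_unlock(h);
	return 0;
}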