From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: pratyush@kernel.org, jasonmiu@google.com, graf@amazon.com,
    pasha.tatashin@soleen.com, rppt@kernel.org, dmatlack@google.com,
    rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
    ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com, ojeda@kernel.org,
    aliceryhl@google.com, masahiroy@kernel.org, akpm@linux-foundation.org,
    tj@kernel.org, yoann.congal@smile.fr, mmaurer@google.com,
    roman.gushchin@linux.dev, chenridong@huawei.com, axboe@kernel.dk,
    mark.rutland@arm.com, jannh@google.com, vincent.guittot@linaro.org,
    hannes@cmpxchg.org, dan.j.williams@intel.com, david@redhat.com,
    joel.granados@kernel.org, rostedt@goodmis.org, anna.schumaker@oracle.com,
    song@kernel.org, linux@weissschuh.net, linux-kernel@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-mm@kvack.org, gregkh@linuxfoundation.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
    dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
    rafael@kernel.org, dakr@kernel.org, bartosz.golaszewski@linaro.org,
    cw00.choi@samsung.com, myungjoo.ham@samsung.com, yesanishhere@gmail.com,
    Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
    aleksander.lobakin@intel.com, ira.weiny@intel.com,
    andriy.shevchenko@linux.intel.com, leon@kernel.org, lukas@wunner.de,
    bhelgaas@google.com, wagi@kernel.org, djeffery@redhat.com,
    stuart.w.hayes@gmail.com, ptyadav@amazon.de, lennart@poettering.net,
    brauner@kernel.org, linux-api@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, saeedm@nvidia.com, ajayachandra@nvidia.com,
    jgg@nvidia.com, parav@nvidia.com, leonro@nvidia.com, witu@nvidia.com,
    hughd@google.com, skhawaja@google.com, chrisl@kernel.org
Subject: [PATCH v8 14/18] mm: memfd_luo: allow preserving memfd
Date: Tue, 25 Nov 2025 11:58:44 -0500
Message-ID: <20251125165850.3389713-15-pasha.tatashin@soleen.com>
X-Mailer: git-send-email 2.52.0.460.gd25c4c69ec-goog
In-Reply-To: <20251125165850.3389713-1-pasha.tatashin@soleen.com>
References: <20251125165850.3389713-1-pasha.tatashin@soleen.com>

From: Pratyush Yadav

The ability to preserve a memfd allows userspace to use KHO and LUO to
transfer its memory contents to the next kernel. This is useful in many
ways. For one, it can be used with IOMMUFD as the backing store for IOMMU
page tables.
Preserving IOMMUFD is essential for performing a hypervisor live update
with passthrough devices. memfd support provides the first building block
for making that possible. For another, for applications that hold a large
amount of state which takes time to reconstruct, rebooting to consume a
kernel upgrade can be very expensive. memfd with LUO gives those
applications reboot-persistent memory that they can use to quickly save
and reconstruct that state.

While a memfd can be backed by either hugetlbfs or shmem, only shmem-backed
memfds are supported for now. To be more precise, support for anonymous
shmem files is added.

The handover to the next kernel is not transparent. Not all properties of
the file are preserved; only its memory contents, position, and size are.
The recreated file gets the UID and GID of the task doing the restore, and
that task's cgroup gets charged with the memory.

Once preserved, the file cannot grow or shrink, and all its pages are
pinned to avoid migrations and swapping. The file can still be read from
or written to.

Use vmalloc to get the buffer that holds the folio records, and preserve it
using kho_preserve_vmalloc(), which does not impose a size limit on the
array.

Signed-off-by: Pratyush Yadav
Co-developed-by: Pasha Tatashin
Signed-off-by: Pasha Tatashin
Reviewed-by: Mike Rapoport (Microsoft)
---
 MAINTAINERS                   |   2 +
 include/linux/kho/abi/memfd.h |  77 +++++
 mm/Makefile                   |   1 +
 mm/memfd_luo.c                | 516 ++++++++++++++++++++++++++++++++++
 4 files changed, 596 insertions(+)
 create mode 100644 include/linux/kho/abi/memfd.h
 create mode 100644 mm/memfd_luo.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 868d3d23fdea..425c46bba764 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14469,6 +14469,7 @@ F:	tools/testing/selftests/livepatch/
 LIVE UPDATE
 M:	Pasha Tatashin
 M:	Mike Rapoport
+R:	Pratyush Yadav
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
 F:	Documentation/core-api/liveupdate.rst
@@ -14477,6 +14478,7 @@ F:	include/linux/liveupdate.h
 F:	include/linux/liveupdate/
 F:	include/uapi/linux/liveupdate.h
 F:	kernel/liveupdate/
+F:	mm/memfd_luo.c
 
 LLC (802.2)
 L:	netdev@vger.kernel.org
diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h
new file mode 100644
index 000000000000..da7d063474a1
--- /dev/null
+++ b/include/linux/kho/abi/memfd.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin
+ *
+ * Copyright (C) 2025 Amazon.com Inc. or its affiliates.
+ * Pratyush Yadav
+ */
+
+#ifndef _LINUX_KHO_ABI_MEMFD_H
+#define _LINUX_KHO_ABI_MEMFD_H
+
+#include
+#include
+
+/**
+ * DOC: memfd Live Update ABI
+ *
+ * This header defines the ABI for preserving the state of a memfd across a
+ * kexec reboot using the LUO.
+ *
+ * The state is serialized into a packed structure `struct memfd_luo_ser`
+ * which is handed over to the next kernel via the KHO mechanism.
+ *
+ * This interface is a contract. Any modification to the structure layout
+ * constitutes a breaking change. Such changes require incrementing the
+ * version number in the MEMFD_LUO_FH_COMPATIBLE string.
+ */
+
+/**
+ * MEMFD_LUO_FOLIO_DIRTY - The folio is dirty.
+ *
+ * This flag indicates the folio contains data from the user. A non-dirty
+ * folio is one that was allocated (say using fallocate(2)) but not written
+ * to.
+ */
+#define MEMFD_LUO_FOLIO_DIRTY		BIT(0)
+
+/**
+ * MEMFD_LUO_FOLIO_UPTODATE - The folio is up-to-date.
+ *
+ * An up-to-date folio has been zeroed out. shmem zeroes out folios on first
+ * use. This flag tracks which folios need zeroing.
+ */
+#define MEMFD_LUO_FOLIO_UPTODATE	BIT(1)
+
+/**
+ * struct memfd_luo_folio_ser - Serialized state of a single folio.
+ * @pfn:	The page frame number of the folio.
+ * @flags:	Flags to describe the state of the folio.
+ * @index:	The page offset (pgoff_t) of the folio within the original file.
+ */
+struct memfd_luo_folio_ser {
+	u64 pfn:52;
+	u64 flags:12;
+	u64 index;
+} __packed;
+
+/**
+ * struct memfd_luo_ser - Main serialization structure for a memfd.
+ * @pos:	The file's current position (f_pos).
+ * @size:	The total size of the file in bytes (i_size).
+ * @nr_folios:	Number of folios in the folios array.
+ * @folios:	KHO vmalloc descriptor pointing to the array of
+ *		struct memfd_luo_folio_ser.
+ */
+struct memfd_luo_ser {
+	u64 pos;
+	u64 size;
+	u64 nr_folios;
+	struct kho_vmalloc folios;
+} __packed;
+
+/* The compatibility string for memfd file handler */
+#define MEMFD_LUO_FH_COMPATIBLE	"memfd-v1"
+
+#endif /* _LINUX_KHO_ABI_MEMFD_H */
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..7738ec416f00 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
+obj-$(CONFIG_LIVEUPDATE) += memfd_luo.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c
new file mode 100644
index 000000000000..4f6ba63b4310
--- /dev/null
+++ b/mm/memfd_luo.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin
+ *
+ * Copyright (C) 2025 Amazon.com Inc. or its affiliates.
+ * Pratyush Yadav
+ */
+
+/**
+ * DOC: Memfd Preservation via LUO
+ *
+ * Overview
+ * ========
+ *
+ * Memory file descriptors (memfd) can be preserved over a kexec using the
+ * Live Update Orchestrator (LUO) file preservation. This allows userspace to
+ * transfer its memory contents to the next kernel after a kexec.
+ *
+ * The preservation is not intended to be transparent. Only select properties
+ * of the file are preserved. All others are reset to default. The preserved
+ * properties are described below.
+ *
+ * .. note::
+ *    The LUO API is not stabilized yet, so the preserved properties of a
+ *    memfd are also not stable and are subject to backwards incompatible
+ *    changes.
+ *
+ * .. note::
+ *    Currently a memfd backed by Hugetlb is not supported. Memfds created
+ *    with ``MFD_HUGETLB`` will be rejected.
+ *
+ * Preserved Properties
+ * ====================
+ *
+ * The following properties of the memfd are preserved across kexec:
+ *
+ * File Contents
+ *   All data stored in the file is preserved.
+ *
+ * File Size
+ *   The size of the file is preserved. Holes in the file are filled by
+ *   allocating pages for them during preservation.
+ *
+ * File Position
+ *   The current file position is preserved, allowing applications to
+ *   continue reading/writing from their last position.
+ *
+ * File Status Flags
+ *   memfds are always opened with ``O_RDWR`` and ``O_LARGEFILE``. This
+ *   property is maintained.
+ *
+ * Non-Preserved Properties
+ * ========================
+ *
+ * All properties which are not preserved must be assumed to be reset to
+ * default.
+ * This section describes some of those properties which may be more of
+ * note.
+ *
+ * ``FD_CLOEXEC`` flag
+ *   A memfd can be created with the ``MFD_CLOEXEC`` flag that sets the
+ *   ``FD_CLOEXEC`` on the file. This flag is not preserved and must be set
+ *   again after restore via ``fcntl()``.
+ *
+ * Seals
+ *   File seals are not preserved. The file is unsealed on restore and if
+ *   needed, must be sealed again via ``fcntl()``.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "internal.h"
+
+static int memfd_luo_preserve_folios(struct file *file,
+				     struct kho_vmalloc *kho_vmalloc,
+				     struct memfd_luo_folio_ser **out_folios_ser,
+				     u64 *nr_foliosp)
+{
+	struct inode *inode = file_inode(file);
+	struct memfd_luo_folio_ser *folios_ser;
+	unsigned int max_folios;
+	long i, size, nr_pinned;
+	struct folio **folios;
+	int err = -EINVAL;
+	pgoff_t offset;
+	u64 nr_folios;
+
+	size = i_size_read(inode);
+	/*
+	 * If the file has zero size, then the folios and nr_folios properties
+	 * are not set.
+	 */
+	if (!size) {
+		*nr_foliosp = 0;
+		*out_folios_ser = NULL;
+		memset(kho_vmalloc, 0, sizeof(*kho_vmalloc));
+		return 0;
+	}
+
+	/*
+	 * Guess the number of folios based on inode size. Real number might
+	 * end up being smaller if there are higher order folios.
+	 */
+	max_folios = PAGE_ALIGN(size) / PAGE_SIZE;
+	folios = kvmalloc_array(max_folios, sizeof(*folios), GFP_KERNEL);
+	if (!folios)
+		return -ENOMEM;
+
+	/*
+	 * Pin the folios so they don't move around behind our back. This also
+	 * ensures none of the folios are in CMA -- which ensures they don't
+	 * fall in KHO scratch memory. It also moves swapped out folios back to
+	 * memory.
+	 *
+	 * A side effect of doing this is that it allocates a folio for all
+	 * indices in the file. This might waste memory on sparse memfds. If
+	 * that is really a problem in the future, we can have a
+	 * memfd_pin_folios() variant that does not allocate a page on empty
+	 * slots.
+	 */
+	nr_pinned = memfd_pin_folios(file, 0, size - 1, folios, max_folios,
+				     &offset);
+	if (nr_pinned < 0) {
+		err = nr_pinned;
+		pr_err("failed to pin folios: %d\n", err);
+		goto err_free_folios;
+	}
+	nr_folios = nr_pinned;
+
+	folios_ser = vcalloc(nr_folios, sizeof(*folios_ser));
+	if (!folios_ser) {
+		err = -ENOMEM;
+		goto err_unpin;
+	}
+
+	for (i = 0; i < nr_folios; i++) {
+		struct memfd_luo_folio_ser *pfolio = &folios_ser[i];
+		struct folio *folio = folios[i];
+		unsigned int flags = 0;
+
+		err = kho_preserve_folio(folio);
+		if (err)
+			goto err_unpreserve;
+
+		if (folio_test_dirty(folio))
+			flags |= MEMFD_LUO_FOLIO_DIRTY;
+		if (folio_test_uptodate(folio))
+			flags |= MEMFD_LUO_FOLIO_UPTODATE;
+
+		pfolio->pfn = folio_pfn(folio);
+		pfolio->flags = flags;
+		pfolio->index = folio->index;
+	}
+
+	err = kho_preserve_vmalloc(folios_ser, kho_vmalloc);
+	if (err)
+		goto err_unpreserve;
+
+	kvfree(folios);
+	*nr_foliosp = nr_folios;
+	*out_folios_ser = folios_ser;
+
+	/*
+	 * Note: folios_ser is purposely not freed here. It is preserved
+	 * memory (via KHO). In the 'unpreserve' path, we use the vmap pointer
+	 * that is passed via private_data.
+	 */
+	return 0;
+
+err_unpreserve:
+	for (i = i - 1; i >= 0; i--)
+		kho_unpreserve_folio(folios[i]);
+	vfree(folios_ser);
+err_unpin:
+	unpin_folios(folios, nr_folios);
+err_free_folios:
+	kvfree(folios);
+
+	return err;
+}
+
+static void memfd_luo_unpreserve_folios(struct kho_vmalloc *kho_vmalloc,
+					struct memfd_luo_folio_ser *folios_ser,
+					u64 nr_folios)
+{
+	long i;
+
+	if (!nr_folios)
+		return;
+
+	kho_unpreserve_vmalloc(kho_vmalloc);
+
+	for (i = 0; i < nr_folios; i++) {
+		const struct memfd_luo_folio_ser *pfolio = &folios_ser[i];
+		struct folio *folio;
+
+		if (!pfolio->pfn)
+			continue;
+
+		folio = pfn_folio(pfolio->pfn);
+
+		kho_unpreserve_folio(folio);
+		unpin_folio(folio);
+	}
+
+	vfree(folios_ser);
+}
+
+static int memfd_luo_preserve(struct liveupdate_file_op_args *args)
+{
+	struct inode *inode = file_inode(args->file);
+	struct memfd_luo_folio_ser *folios_ser;
+	struct memfd_luo_ser *ser;
+	u64 nr_folios;
+	int err = 0;
+
+	inode_lock(inode);
+	shmem_freeze(inode, true);
+
+	/* Allocate the main serialization structure in preserved memory */
+	ser = kho_alloc_preserve(sizeof(*ser));
+	if (IS_ERR(ser)) {
+		err = PTR_ERR(ser);
+		goto err_unlock;
+	}
+
+	ser->pos = args->file->f_pos;
+	ser->size = i_size_read(inode);
+
+	err = memfd_luo_preserve_folios(args->file, &ser->folios,
+					&folios_ser, &nr_folios);
+	if (err)
+		goto err_free_ser;
+
+	ser->nr_folios = nr_folios;
+	inode_unlock(inode);
+
+	args->private_data = folios_ser;
+	args->serialized_data = virt_to_phys(ser);
+
+	return 0;
+
+err_free_ser:
+	kho_unpreserve_free(ser);
+err_unlock:
+	shmem_freeze(inode, false);
+	inode_unlock(inode);
+	return err;
+}
+
+static int memfd_luo_freeze(struct liveupdate_file_op_args *args)
+{
+	struct memfd_luo_ser *ser;
+
+	if (WARN_ON_ONCE(!args->serialized_data))
+		return -EINVAL;
+
+	ser = phys_to_virt(args->serialized_data);
+
+	/*
+	 * The pos might have changed since prepare. Everything else stays the
+	 * same.
+	 */
+	ser->pos = args->file->f_pos;
+
+	return 0;
+}
+
+static void memfd_luo_unpreserve(struct liveupdate_file_op_args *args)
+{
+	struct inode *inode = file_inode(args->file);
+	struct memfd_luo_ser *ser;
+
+	if (WARN_ON_ONCE(!args->serialized_data))
+		return;
+
+	inode_lock(inode);
+	shmem_freeze(inode, false);
+
+	ser = phys_to_virt(args->serialized_data);
+
+	memfd_luo_unpreserve_folios(&ser->folios, args->private_data,
+				    ser->nr_folios);
+
+	kho_unpreserve_free(ser);
+	inode_unlock(inode);
+}
+
+static void memfd_luo_discard_folios(const struct memfd_luo_folio_ser *folios_ser,
+				     u64 nr_folios)
+{
+	u64 i;
+
+	for (i = 0; i < nr_folios; i++) {
+		const struct memfd_luo_folio_ser *pfolio = &folios_ser[i];
+		struct folio *folio;
+		phys_addr_t phys;
+
+		if (!pfolio->pfn)
+			continue;
+
+		phys = PFN_PHYS(pfolio->pfn);
+		folio = kho_restore_folio(phys);
+		if (!folio) {
+			pr_warn_ratelimited("Unable to restore folio at physical address: %llx\n",
+					    phys);
+			continue;
+		}
+
+		folio_put(folio);
+	}
+}
+
+static void memfd_luo_finish(struct liveupdate_file_op_args *args)
+{
+	struct memfd_luo_folio_ser *folios_ser;
+	struct memfd_luo_ser *ser;
+
+	if (args->retrieved)
+		return;
+
+	ser = phys_to_virt(args->serialized_data);
+	if (!ser)
+		return;
+
+	if (ser->nr_folios) {
+		folios_ser = kho_restore_vmalloc(&ser->folios);
+		if (!folios_ser)
+			goto out;
+
+		memfd_luo_discard_folios(folios_ser, ser->nr_folios);
+		vfree(folios_ser);
+	}
+
+out:
+	kho_restore_free(ser);
+}
+
+static int memfd_luo_retrieve_folios(struct file *file,
+				     struct memfd_luo_folio_ser *folios_ser,
+				     u64 nr_folios)
+{
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio;
+	int err = -EIO;
+	long i;
+
+	for (i = 0; i < nr_folios; i++) {
+		const struct memfd_luo_folio_ser *pfolio = &folios_ser[i];
+		phys_addr_t phys;
+		u64 index;
+		int flags;
+
+		if (!pfolio->pfn)
+			continue;
+
+		phys = PFN_PHYS(pfolio->pfn);
+		folio = kho_restore_folio(phys);
+		if (!folio) {
+			pr_err("Unable to restore folio at physical address: %llx\n",
+			       phys);
+			goto put_folios;
+		}
+		index = pfolio->index;
+		flags = pfolio->flags;
+
+		/* Set up the folio for insertion. */
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
+		err = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
+		if (err) {
+			pr_err("shmem: failed to charge folio index %ld: %d\n",
+			       i, err);
+			goto unlock_folio;
+		}
+
+		err = shmem_add_to_page_cache(folio, mapping, index, NULL,
+					      mapping_gfp_mask(mapping));
+		if (err) {
+			pr_err("shmem: failed to add to page cache folio index %ld: %d\n",
+			       i, err);
+			goto unlock_folio;
+		}
+
+		if (flags & MEMFD_LUO_FOLIO_UPTODATE)
+			folio_mark_uptodate(folio);
+		if (flags & MEMFD_LUO_FOLIO_DIRTY)
+			folio_mark_dirty(folio);
+
+		err = shmem_inode_acct_blocks(inode, 1);
+		if (err) {
+			pr_err("shmem: failed to account folio index %ld: %d\n",
+			       i, err);
+			goto unlock_folio;
+		}
+
+		shmem_recalc_inode(inode, 1, 0);
+		folio_add_lru(folio);
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	return 0;
+
+unlock_folio:
+	folio_unlock(folio);
+	folio_put(folio);
+put_folios:
+	/*
+	 * Note: don't free the folios already added to the file. They will be
+	 * freed when the file is freed. Free the ones not added yet here.
+	 */
+	for (long j = i + 1; j < nr_folios; j++) {
+		const struct memfd_luo_folio_ser *pfolio = &folios_ser[j];
+
+		folio = kho_restore_folio(PFN_PHYS(pfolio->pfn));
+		if (folio)
+			folio_put(folio);
+	}
+
+	return err;
+}
+
+static int memfd_luo_retrieve(struct liveupdate_file_op_args *args)
+{
+	struct memfd_luo_folio_ser *folios_ser;
+	struct memfd_luo_ser *ser;
+	struct file *file;
+	int err;
+
+	ser = phys_to_virt(args->serialized_data);
+	if (!ser)
+		return -EINVAL;
+
+	file = shmem_file_setup("", 0, VM_NORESERVE);
+	if (IS_ERR(file)) {
+		pr_err("failed to setup file: %pe\n", file);
+		return PTR_ERR(file);
+	}
+
+	vfs_setpos(file, ser->pos, MAX_LFS_FILESIZE);
+	file->f_inode->i_size = ser->size;
+
+	if (ser->nr_folios) {
+		folios_ser = kho_restore_vmalloc(&ser->folios);
+		if (!folios_ser) {
+			err = -EINVAL;
+			goto put_file;
+		}
+
+		err = memfd_luo_retrieve_folios(file, folios_ser,
+						ser->nr_folios);
+		vfree(folios_ser);
+		if (err)
+			goto put_file;
+	}
+
+	args->file = file;
+	kho_restore_free(ser);
+
+	return 0;
+
+put_file:
+	fput(file);
+
+	return err;
+}
+
+static bool memfd_luo_can_preserve(struct liveupdate_file_handler *handler,
+				   struct file *file)
+{
+	struct inode *inode = file_inode(file);
+
+	return shmem_file(file) && !inode->i_nlink;
+}
+
+static const struct liveupdate_file_ops memfd_luo_file_ops = {
+	.freeze		= memfd_luo_freeze,
+	.finish		= memfd_luo_finish,
+	.retrieve	= memfd_luo_retrieve,
+	.preserve	= memfd_luo_preserve,
+	.unpreserve	= memfd_luo_unpreserve,
+	.can_preserve	= memfd_luo_can_preserve,
+	.owner		= THIS_MODULE,
+};
+
+static struct liveupdate_file_handler memfd_luo_handler = {
+	.ops		= &memfd_luo_file_ops,
+	.compatible	= MEMFD_LUO_FH_COMPATIBLE,
+};
+
+static int __init memfd_luo_init(void)
+{
+	int err = liveupdate_register_file_handler(&memfd_luo_handler);
+
+	if (err && err != -EOPNOTSUPP) {
+		pr_err("Could not register luo filesystem handler: %pe\n",
+		       ERR_PTR(err));
+
+		return err;
+	}
+
+	return 0;
+}
+late_initcall(memfd_luo_init);
-- 
2.52.0.460.gd25c4c69ec-goog
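
A short illustrative sketch of the userspace side, for readers of the DOC
comment above: FD_CLOEXEC and file seals are documented as not preserved and
must be re-applied via fcntl() once the memfd has been retrieved in the new
kernel. The snippet below only shows that re-initialization step under the
assumption that the fd has already been handed back by LUO; the hand-off and
retrieval themselves go through the LUO uapi, which is not part of this patch
and is therefore not shown. The helper name and the particular seals chosen
here are illustrative assumptions, not part of the series.

/*
 * Illustrative only -- not part of the patch. Assumes "fd" was already
 * retrieved from LUO in the new kernel; that retrieval step is omitted.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int memfd_reinit_after_luo_restore(int fd)
{
	/* FD_CLOEXEC is not preserved across the live update; set it again. */
	if (fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
		perror("F_SETFD");
		return -1;
	}

	/* Seals are dropped on restore; re-apply whatever the app needs. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK) < 0) {
		perror("F_ADD_SEALS");
		return -1;
	}

	/* The file position is preserved, so reads continue where they left off. */
	printf("restored memfd offset: %lld\n",
	       (long long)lseek(fd, 0, SEEK_CUR));
	return 0;
}

The file contents, size, and position come back as described in the
"Preserved Properties" section, so only the non-preserved properties need
this kind of re-initialization.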