From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org,
    jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org,
    stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 01/10] mm/prmem: Allocate memory during boot for storing persistent data
Date: Mon, 16 Oct 2023 18:32:06 -0500
Message-Id: <20231016233215.13090-2-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

Introduce the "Persistent-Across-Kexec memory (prmem)" feature that allows
user and kernel data to be persisted across kexecs.

The first step is to set aside some memory for storing persistent data.
Introduce a new kernel command line parameter for this:

	prmem=size[KMG]

Allocate this memory from memblocks during boot. Make sure that the
allocation is done late enough that it does not interfere with any fixed
range allocations.

Define a "prmem_region" structure to store the range that is allocated.
The region structure will be used to manage the memory.

Define a "prmem" structure for storing persistence metadata. Allocate a
metadata page to contain the metadata structure. Initialize the metadata.
Add the initial region to a region list in the metadata.

Signed-off-by: Madhavan T.
Venkataraman <madvenka@linux.microsoft.com>
---
 arch/x86/kernel/setup.c      |  2 +
 include/linux/prmem.h        | 76 ++++++++++++++++++++++++++++++++++++
 kernel/Makefile              |  1 +
 kernel/prmem/Makefile        |  3 ++
 kernel/prmem/prmem_init.c    | 27 +++++++++++++
 kernel/prmem/prmem_parse.c   | 33 ++++++++++++++++
 kernel/prmem/prmem_region.c  | 21 ++++++++++
 kernel/prmem/prmem_reserve.c | 56 ++++++++++++++++++++++++++
 mm/mm_init.c                 |  2 +
 9 files changed, 221 insertions(+)
 create mode 100644 include/linux/prmem.h
 create mode 100644 kernel/prmem/Makefile
 create mode 100644 kernel/prmem/prmem_init.c
 create mode 100644 kernel/prmem/prmem_parse.c
 create mode 100644 kernel/prmem/prmem_region.c
 create mode 100644 kernel/prmem/prmem_reserve.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd975a4a5200..f2b13b3d3ead 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include <linux/prmem.h>

 #include

@@ -1231,6 +1232,7 @@ void __init setup_arch(char **cmdline_p)
	 * won't consume hotpluggable memory.
	 */
	reserve_crashkernel();
+	prmem_reserve();

	memblock_find_dma_reserve();

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
new file mode 100644
index 000000000000..7f22016c4ad2
--- /dev/null
+++ b/include/linux/prmem.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Persistent-Across-Kexec memory (prmem) - Definitions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#ifndef _LINUX_PRMEM_H
+#define _LINUX_PRMEM_H
+/*
+ * The prmem feature can be used to persist kernel and user data across kexec
+ * reboots in memory for various uses. E.g.,
+ *
+ *	- Saving cached data. E.g., database caches.
+ *	- Saving state. E.g., KVM guest states.
+ *	- Saving historical information since the last cold boot such as
+ *	  events, logs and journals.
+ *	- Saving measurements for integrity checks on the next boot.
+ *	- Saving driver data.
+ *	- Saving IOMMU mappings.
+ *	- Saving MMIO config information.
+ *
+ * This is useful on systems where there is no non-volatile storage or
+ * non-volatile storage is too slow.
+ */
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+/*
+ * A prmem region supplies the memory for storing persistent data.
+ *
+ * node		List node.
+ * pa		Physical address of the region.
+ * size		Size of the region in bytes.
+ */
+struct prmem_region {
+	struct list_head	node;
+	unsigned long		pa;
+	size_t			size;
+};
+
+/*
+ * PRMEM metadata.
+ *
+ * metadata	Physical address of the metadata page.
+ * size		Size of initial memory allocated to prmem.
+ *
+ * regions	List of memory regions.
+ */
+struct prmem {
+	unsigned long		metadata;
+	size_t			size;
+
+	/* Persistent Regions. */
+	struct list_head	regions;
+};
+
+extern struct prmem *prmem;
+extern unsigned long prmem_metadata;
+extern unsigned long prmem_pa;
+extern size_t prmem_size;
+
+/* Kernel API. */
+void prmem_reserve(void);
+void prmem_init(void);
+
+/* Internal functions. */
+struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+
+#endif /* _LINUX_PRMEM_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..43b485b0467a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += prmem/
 obj-$(CONFIG_MODULES) += module/

 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
new file mode 100644
index 000000000000..11a53d49312a
--- /dev/null
+++ b/kernel/prmem/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
new file mode 100644
index 000000000000..97b550252028
--- /dev/null
+++ b/kernel/prmem/prmem_init.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Initialization.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+bool prmem_inited;
+
+void __init prmem_init(void)
+{
+	if (!prmem)
+		return;
+
+	if (!prmem->metadata) {
+		/* Cold boot. */
+		prmem->metadata = prmem_metadata;
+		prmem->size = prmem_size;
+		INIT_LIST_HEAD(&prmem->regions);
+
+		if (!prmem_add_region(prmem_pa, prmem_size))
+			return;
+	}
+	prmem_inited = true;
+}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
new file mode 100644
index 000000000000..191655b53545
--- /dev/null
+++ b/kernel/prmem/prmem_parse.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Process prmem cmdline parameter.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+/*
+ * Syntax: prmem=size[KMG]
+ *
+ * Specifies the size of the initial memory to be allocated to prmem.
+ */
+static int __init prmem_size_parse(char *cmdline)
+{
+	char *tmp, *cur = cmdline;
+	unsigned long size;
+
+	if (!cur)
+		return -EINVAL;
+
+	/* Get initial size. */
+	size = memparse(cur, &tmp);
+	if (cur == tmp || !size || size & (PAGE_SIZE - 1)) {
+		pr_warn("%s: Incorrect size %lx\n", __func__, size);
+		return -EINVAL;
+	}
+
+	prmem_size = size;
+	return 0;
+}
+early_param("prmem", prmem_size_parse);
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
new file mode 100644
index 000000000000..8254dafcee13
--- /dev/null
+++ b/kernel/prmem/prmem_region.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Regions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
+{
+	struct prmem_region *region;
+
+	/* Allocate region structure from the base of the region itself. */
+	region = __va(pa);
+	region->pa = pa;
+	region->size = size;
+
+	list_add_tail(&region->node, &prmem->regions);
+	return region;
+}
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
new file mode 100644
index 000000000000..e20e31a61d12
--- /dev/null
+++ b/kernel/prmem/prmem_reserve.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Reserve memory.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+struct prmem *prmem;
+unsigned long prmem_metadata;
+unsigned long prmem_pa;
+size_t prmem_size;
+
+void __init prmem_reserve(void)
+{
+	BUILD_BUG_ON(sizeof(*prmem) > PAGE_SIZE);
+
+	if (!prmem_size)
+		return;
+
+	/*
+	 * prmem uses direct map addresses. If PAGE_OFFSET is randomized,
+	 * these addresses will change across kexecs. Persistence cannot
+	 * be supported.
+	 */
+	if (kaslr_memory_enabled()) {
+		pr_warn("%s: Cannot support persistence because of KASLR.\n",
+			__func__);
+		return;
+	}
+
+	/* Allocate a metadata page. */
+	prmem_metadata = memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+	if (!prmem_metadata) {
+		pr_warn("%s: Could not allocate metadata at %lx\n", __func__,
+			prmem_metadata);
+		return;
+	}
+
+	/* Allocate initial memory. */
+	prmem_pa = memblock_phys_alloc(prmem_size, PAGE_SIZE);
+	if (!prmem_pa) {
+		pr_warn("%s: Could not allocate initial memory\n", __func__);
+		goto free_metadata;
+	}
+
+	/* Clear metadata. */
+	prmem = __va(prmem_metadata);
+	memset(prmem, 0, sizeof(*prmem));
+	return;
+
+free_metadata:
+	memblock_phys_free(prmem_metadata, PAGE_SIZE);
+	prmem = NULL;
+}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index a1963c3322af..f12757829281 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <linux/prmem.h>
 #include
 #include
 #include "internal.h"
@@ -2804,4 +2805,5 @@ void __init mm_core_init(void)
	pti_init();
	kmsan_init_runtime();
	mm_cache_init();
+	prmem_init();
 }
--
2.25.1

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Subject: [RFC PATCH v1 02/10] mm/prmem: Reserve metadata and persistent regions in early boot after kexec
Date: Mon, 16 Oct 2023 18:32:07 -0500
Message-Id: <20231016233215.13090-3-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

Currently, only one memory region is given to prmem to store persistent
data. In the future, regions may be added dynamically.

The prmem metadata and the regions need to be reserved during early boot
after a kexec. For this to happen, the kernel must know where the metadata
is. To allow this, introduce a kernel command line parameter:

	prmem_meta=metadata_address

When a kexec image is loaded into the kernel, add this parameter to the
kexec cmdline.
Upon a kexec boot, get the metadata page from the cmdline and reserve it.
Then, walk the list of regions in the metadata and reserve the regions.

Note that the cmdline modification is done automatically within the
kernel. Userland does not have to do anything.

The metadata needs to be validated before it can be used. To allow this,
compute a checksum on the metadata and store it in the metadata at the end
of shutdown. During early boot, validate the metadata with the checksum.

If the validation fails, discard the metadata. Treat it as a cold boot.
That is, allocate a new metadata page and initial region and start over.
Similarly, if the reservation of the regions fails, treat it as a cold
boot and start over. This means that all persistent data will be lost on
any of these failures. Note that there will be no memory leak when this
happens.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/x86/kernel/kexec-bzimage64.c |  5 +-
 arch/x86/kernel/setup.c           |  2 +
 include/linux/memblock.h          |  2 +
 include/linux/prmem.h             | 11 ++++
 kernel/prmem/Makefile             |  2 +-
 kernel/prmem/prmem_init.c         |  9 ++++
 kernel/prmem/prmem_misc.c         | 85 +++++++++++++++++++++++++++++++
 kernel/prmem/prmem_parse.c        | 29 +++++++++++
 kernel/prmem/prmem_reserve.c      | 70 ++++++++++++++++++++++++-
 kernel/reboot.c                   |  2 +
 mm/memblock.c                     | 12 +++++
 11 files changed, 226 insertions(+), 3 deletions(-)
 create mode 100644 kernel/prmem/prmem_misc.c

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index a61c12c01270..a19f172be410 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include <linux/prmem.h>

 #include
 #include
@@ -82,6 +83,8 @@ static int setup_cmdline(struct kimage *image, struct boot_params *params,

	cmdline_ptr[cmdline_len - 1] = '\0';

+	prmem_cmdline(cmdline_ptr);
+
	pr_debug("Final command line is: %s\n", cmdline_ptr);
	cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
	cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
@@ -458,7 +461,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
	 */
	efi_map_sz = efi_get_runtime_map_size();
	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
-				MAX_ELFCOREHDR_STR_LEN;
+				MAX_ELFCOREHDR_STR_LEN + prmem_cmdline_size();
	params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
	kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
			sizeof(struct setup_data) +
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f2b13b3d3ead..22f5cd494291 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1137,6 +1137,8 @@ void __init setup_arch(char **cmdline_p)
	 */
	efi_reserve_boot_services();

+	prmem_reserve_early();
+
	/* preallocate 4k for mptable mpc */
	e820__memblock_alloc_reserved_mpc_new();

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f71ff9f0ec81..584bbb884c8e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -114,6 +114,8 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
 int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_phys_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
+void memblock_unreserve(phys_addr_t base, phys_addr_t size);
+
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
 #endif
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 7f22016c4ad2..bc8054a86f49 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -48,12 +48,16 @@ struct prmem_region {
 /*
  * PRMEM metadata.
  *
+ * checksum	Just before reboot, a checksum is computed on the metadata. On
+ *		the next kexec reboot, the metadata is validated with the
+ *		checksum to make sure that the metadata has not been corrupted.
  * metadata	Physical address of the metadata page.
  * size		Size of initial memory allocated to prmem.
  *
  * regions	List of memory regions.
  */
 struct prmem {
+	unsigned long		checksum;
	unsigned long		metadata;
	size_t			size;

@@ -65,12 +69,19 @@ extern struct prmem *prmem;
 extern unsigned long prmem_metadata;
 extern unsigned long prmem_pa;
 extern size_t prmem_size;
+extern bool prmem_inited;

 /* Kernel API. */
+void prmem_reserve_early(void);
 void prmem_reserve(void);
 void prmem_init(void);
+void prmem_fini(void);
+int prmem_cmdline_size(void);

 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
+unsigned long prmem_checksum(void *start, size_t size);
+bool __init prmem_validate(void);
+void prmem_cmdline(char *cmdline);

 #endif /* _LINUX_PRMEM_H */
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 11a53d49312a..9b0a693bfee1 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0

-obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o
+obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 97b550252028..9cea1cd3b6a5 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -25,3 +25,12 @@ void __init prmem_init(void)
	}
	prmem_inited = true;
 }
+
+void prmem_fini(void)
+{
+	if (!prmem_inited)
+		return;
+
+	/* Compute checksum over the metadata. */
+	prmem->checksum = prmem_checksum(prmem, sizeof(*prmem));
+}
diff --git a/kernel/prmem/prmem_misc.c b/kernel/prmem/prmem_misc.c
new file mode 100644
index 000000000000..49b6a7232c1a
--- /dev/null
+++ b/kernel/prmem/prmem_misc.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Persistent-Across-Kexec memory (prmem) - Miscellaneous functions.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include <linux/prmem.h>
+
+#define MAX_META_LENGTH	31
+
+/*
+ * On a kexec, modify the kernel command line to include the boot parameter
+ * "prmem_meta=" so that the metadata can be found on the next boot. If the
+ * parameter is already present in cmdline, overwrite it. Else, add it.
+ */
+void prmem_cmdline(char *cmdline)
+{
+	char meta[MAX_META_LENGTH], *str;
+	unsigned long metadata;
+
+	metadata = prmem_inited ? prmem->metadata : 0;
+	snprintf(meta, MAX_META_LENGTH, " prmem_meta=0x%.16lx", metadata);
+
+	str = strstr(cmdline, " prmem_meta");
+	if (str) {
+		/*
+		 * Boot parameter already exists. Overwrite it. We deliberately
+		 * use strncpy() and rely on the fact that it will not NULL
+		 * terminate the copy.
+		 */
+		strncpy(str, meta, MAX_META_LENGTH - 1);
+		return;
+	}
+	if (prmem_inited) {
+		/* Boot parameter does not exist. Add it. */
+		strcat(cmdline, meta);
+	}
+}
+
+/*
+ * Make sure that the kexec command line can accommodate the prmem_meta
+ * command line parameter.
+ */
+int prmem_cmdline_size(void)
+{
+	return MAX_META_LENGTH;
+}
+
+unsigned long prmem_checksum(void *start, size_t size)
+{
+	unsigned long checksum = 0;
+	unsigned long *ptr;
+	void *end;
+
+	end = start + size;
+	for (ptr = start; (void *)ptr < end; ptr++)
+		checksum += *ptr;
+	return checksum;
+}
+
+/*
+ * Check if the metadata is sane. It would not be sane on a cold boot or if the
+ * metadata has been corrupted. In the latter case, we treat it as a cold boot.
+ */
+bool __init prmem_validate(void)
+{
+	unsigned long checksum;
+
+	/* Sanity check the boot parameter. */
+	if (prmem_metadata != prmem->metadata || prmem_size != prmem->size) {
+		pr_warn("%s: Boot parameter mismatch\n", __func__);
+		return false;
+	}
+
+	/* Compute and check the checksum of the metadata. */
+	checksum = prmem->checksum;
+	prmem->checksum = 0;
+
+	if (checksum != prmem_checksum(prmem, sizeof(*prmem))) {
+		pr_warn("%s: Checksum mismatch\n", __func__);
+		return false;
+	}
+	return true;
+}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
index 191655b53545..6c1a23c6b84e 100644
--- a/kernel/prmem/prmem_parse.c
+++ b/kernel/prmem/prmem_parse.c
@@ -31,3 +31,32 @@ static int __init prmem_size_parse(char *cmdline)
	return 0;
 }
 early_param("prmem", prmem_size_parse);
+
+/*
+ * Syntax: prmem_meta=metadata_address
+ *
+ * Specifies the address of a single page where the prmem metadata resides.
+ *
+ * On a kexec, the following will be appended to the kernel command line -
+ * "prmem_meta=metadata_address". This is so that the metadata can be located
+ * easily on kexec reboots.
+ */
+static int __init prmem_meta_parse(char *cmdline)
+{
+	char *tmp, *cur = cmdline;
+	unsigned long addr;
+
+	if (!cur)
+		return -EINVAL;
+
+	/* Get metadata address. */
+	addr = memparse(cur, &tmp);
+	if (cur == tmp || addr & (PAGE_SIZE - 1)) {
+		pr_warn("%s: Incorrect address %lx\n", __func__, addr);
+		return -EINVAL;
+	}
+
+	prmem_metadata = addr;
+	return 0;
+}
+early_param("prmem_meta", prmem_meta_parse);
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
index e20e31a61d12..8000fff05402 100644
--- a/kernel/prmem/prmem_reserve.c
+++ b/kernel/prmem/prmem_reserve.c
@@ -12,11 +12,79 @@ unsigned long prmem_metadata;
 unsigned long prmem_pa;
 unsigned long prmem_size;

+void __init prmem_reserve_early(void)
+{
+	struct prmem_region *region;
+	unsigned long nregions;
+
+	/* Need to specify an initial size to enable prmem. */
+	if (!prmem_size)
+		return;
+
+	/* Nothing to be done if it is a cold boot. */
+	if (!prmem_metadata)
+		return;
+
+	/*
+	 * prmem uses direct map addresses. If PAGE_OFFSET is randomized,
+	 * these addresses will change across kexecs. Persistence cannot
+	 * be supported.
+	 */
+	if (kaslr_memory_enabled()) {
+		pr_warn("%s: Cannot support persistence because of KASLR.\n",
+			__func__);
+		return;
+	}
+
+	/*
+	 * This is a kexec reboot. If any step fails here, treat this like a
+	 * cold boot. That is, forget all persistent data and start over.
+	 */
+
+	/* Reserve metadata page. */
+	if (memblock_reserve(prmem_metadata, PAGE_SIZE)) {
+		pr_warn("%s: Unable to reserve metadata at %lx\n", __func__,
+			prmem_metadata);
+		return;
+	}
+	prmem = __va(prmem_metadata);
+
+	/* Make sure that the metadata is sane. */
+	if (!prmem_validate())
+		goto unreserve_metadata;
+
+	/* Reserve regions that were added to prmem. */
+	nregions = 0;
+	list_for_each_entry(region, &prmem->regions, node) {
+		if (memblock_reserve(region->pa, region->size)) {
+			pr_warn("%s: Unable to reserve %lx, %lx\n", __func__,
+				region->pa, region->size);
+			goto unreserve_regions;
+		}
+		nregions++;
+	}
+	return;
+
+unreserve_regions:
+	/* Unreserve regions. */
+	list_for_each_entry(region, &prmem->regions, node) {
+		if (!nregions)
+			break;
+		memblock_unreserve(region->pa, region->size);
+		nregions--;
+	}
+
+unreserve_metadata:
+	/* Unreserve the metadata page.
+	 */
+	memblock_unreserve(prmem_metadata, PAGE_SIZE);
+	prmem = NULL;
+}
+
 void __init prmem_reserve(void)
 {
	BUILD_BUG_ON(sizeof(*prmem) > PAGE_SIZE);

-	if (!prmem_size)
+	if (!prmem_size || prmem)
		return;

	/*
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 3bba88c7ffc6..b4595b7e77f3 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include <linux/prmem.h>
 #include
 #include
 #include
@@ -84,6 +85,7 @@ void kernel_restart_prepare(char *cmd)
	system_state = SYSTEM_RESTART;
	usermodehelper_disable();
	device_shutdown();
+	prmem_fini();
 }

 /**
diff --git a/mm/memblock.c b/mm/memblock.c
index f9e61e565a53..1f5070f7b5bc 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -873,6 +873,18 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
 }

+void __init_memblock memblock_unreserve(phys_addr_t base, phys_addr_t size)
+{
+	phys_addr_t end = base + size - 1;
+
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
+		     &base, &end, (void *)_RET_IP_);
+
+	if (memblock_remove_range(&memblock.reserved, base, size))
+		return;
+	memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
 {
--
2.25.1

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Subject: [RFC PATCH v1 03/10] mm/prmem: Manage persistent memory with the gen pool allocator
Date: Mon, 16 Oct 2023 18:32:08 -0500
Message-Id: <20231016233215.13090-4-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

The memory in a prmem region must be managed by an allocator. Use the Gen
Pool allocator (lib/genalloc.c) for that purpose. This is so we don't have
to write a new allocator.
Now, the Gen Pool allocator uses a "struct gen_pool_chunk" to manage a contiguous range of memory. The chunk is normally allocated using the kmem allocator. However, for prmem, the chunk must be persisted across a kexec reboot so that the allocations can be "remembered". To allow this, allocate the chunk from the region itself and initialize it. Then, pass the chunk to the Gen Pool allocator. In other words, persist the chunk. Inside the Gen Pool allocator, distinguish between a chunk that is allocated internally from kmem and a chunk that is passed by the caller and handle it properly when the pool is destroyed. Provide wrapper functions around the Gen Pool allocator functions so we can change the allocator in the future if we wanted to. prmem_create_pool() prmem_alloc_pool() prmem_free_pool() Signed-off-by: Madhavan T. Venkataraman --- include/linux/genalloc.h | 6 ++++ include/linux/prmem.h | 8 +++++ kernel/prmem/prmem_init.c | 8 +++++ kernel/prmem/prmem_region.c | 67 ++++++++++++++++++++++++++++++++++++- lib/genalloc.c | 45 ++++++++++++++++++------- 5 files changed, 121 insertions(+), 13 deletions(-) diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h index 0bd581003cd5..186757b0aec7 100644 --- a/include/linux/genalloc.h +++ b/include/linux/genalloc.h @@ -73,6 +73,7 @@ struct gen_pool_chunk { struct list_head next_chunk; /* next chunk in pool */ atomic_long_t avail; phys_addr_t phys_addr; /* physical starting address of memory chunk */ + bool external; /* Chunk is passed by caller. 
*/ void *owner; /* private data to retrieve at alloc time */ unsigned long start_addr; /* start address of memory chunk */ unsigned long end_addr; /* end address of memory chunk (inclusive) */ @@ -121,6 +122,11 @@ static inline int gen_pool_add(struct gen_pool *pool, = unsigned long addr, { return gen_pool_add_virt(pool, addr, -1, size, nid); } +extern unsigned long gen_pool_chunk_size(size_t size, int min_alloc_order); +extern void gen_pool_init_chunk(struct gen_pool_chunk *chunk, + unsigned long addr, phys_addr_t phys, + size_t size, bool external, void *owner); +void gen_pool_add_chunk(struct gen_pool *pool, struct gen_pool_chunk *chun= k); extern void gen_pool_destroy(struct gen_pool *); unsigned long gen_pool_alloc_algo_owner(struct gen_pool *pool, size_t size, genpool_algo_t algo, void *data, void **owner); diff --git a/include/linux/prmem.h b/include/linux/prmem.h index bc8054a86f49..f43f5b0d2b9c 100644 --- a/include/linux/prmem.h +++ b/include/linux/prmem.h @@ -24,6 +24,7 @@ * non-volatile storage is too slow. */ #include +#include #include #include #include @@ -38,11 +39,15 @@ * node List node. * pa Physical address of the region. * size Size of the region in bytes. + * pool Gen Pool to manage region memory. + * chunk Persistent Gen Pool chunk. */ struct prmem_region { struct list_head node; unsigned long pa; size_t size; + struct gen_pool *pool; + struct gen_pool_chunk *chunk; }; =20 /* @@ -80,6 +85,9 @@ int prmem_cmdline_size(void); =20 /* Internal functions. 
*/ struct prmem_region *prmem_add_region(unsigned long pa, size_t size); +bool prmem_create_pool(struct prmem_region *region, bool new_region); +void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align= ); +void prmem_free_pool(struct prmem_region *region, void *va, size_t size); unsigned long prmem_checksum(void *start, size_t size); bool __init prmem_validate(void); void prmem_cmdline(char *cmdline); diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c index 9cea1cd3b6a5..56df1e6d3ebc 100644 --- a/kernel/prmem/prmem_init.c +++ b/kernel/prmem/prmem_init.c @@ -22,6 +22,14 @@ void __init prmem_init(void) =20 if (!prmem_add_region(prmem_pa, prmem_size)) return; + } else { + /* Warm boot. */ + struct prmem_region *region; + + list_for_each_entry(region, &prmem->regions, node) { + if (!prmem_create_pool(region, false)) + return; + } } prmem_inited =3D true; } diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c index 8254dafcee13..6dc88c74d9c8 100644 --- a/kernel/prmem/prmem_region.c +++ b/kernel/prmem/prmem_region.c @@ -1,12 +1,74 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * Persistent-Across-Kexec memory (prmem) - Regions. + * Persistent-Across-Kexec memory (prmem) - Regions and Region Pools. * * Copyright (C) 2023 Microsoft Corporation * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com) */ #include =20 +bool prmem_create_pool(struct prmem_region *region, bool new_region) +{ + size_t chunk_size, total_size; + + chunk_size =3D gen_pool_chunk_size(region->size, PAGE_SHIFT); + total_size =3D sizeof(*region) + chunk_size; + total_size =3D ALIGN(total_size, PAGE_SIZE); + + if (new_region) { + /* + * We place the region structure at the base of the region + * itself. Part of the region is a genpool chunk that is used + * to manage the region memory. + * + * Normally, the chunk is allocated from regular memory by + * genpool. 
+		 * But in the case of prmem, the chunk must be
+		 * persisted across kexecs so allocations can be remembered.
+		 * That is why it is allocated from the region memory itself
+		 * and passed to genpool.
+		 *
+		 * Make sure there is enough space for the region and the chunk.
+		 */
+		if (total_size >= region->size) {
+			pr_warn("%s: region size too small\n", __func__);
+			return false;
+		}
+
+		/* Initialize the persistent genpool chunk. */
+		region->chunk = (void *)(region + 1);
+		memset(region->chunk, 0, chunk_size);
+		gen_pool_init_chunk(region->chunk, (unsigned long)region,
+				    region->pa, region->size, true, NULL);
+	}
+
+	region->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
+	if (!region->pool) {
+		pr_warn("%s: Could not create genpool\n", __func__);
+		return false;
+	}
+
+	gen_pool_add_chunk(region->pool, region->chunk);
+
+	if (new_region) {
+		/* Reserve the region and chunk. */
+		gen_pool_alloc(region->pool, total_size);
+	}
+	return true;
+}
+
+void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align)
+{
+	struct genpool_data_align data = { .align = align, };
+
+	return (void *)gen_pool_alloc_algo(region->pool, size,
+					   gen_pool_first_fit_align, &data);
+}
+
+void prmem_free_pool(struct prmem_region *region, void *va, size_t size)
+{
+	gen_pool_free(region->pool, (unsigned long)va, size);
+}
+
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
 {
	struct prmem_region *region;
@@ -16,6 +78,9 @@ struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
	region->pa = pa;
	region->size = size;

+	if (!prmem_create_pool(region, true))
+		return NULL;
+
	list_add_tail(&region->node, &prmem->regions);
	return region;
 }
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 6c644f954bc5..655db7b47ea9 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -165,6 +165,33 @@ struct gen_pool *gen_pool_create(int min_alloc_order, int nid)
 }
 EXPORT_SYMBOL(gen_pool_create);

+size_t gen_pool_chunk_size(size_t size, int
min_alloc_order)
+{
+	unsigned long nbits = size >> min_alloc_order;
+	unsigned long nbytes = sizeof(struct gen_pool_chunk) +
+			       BITS_TO_LONGS(nbits) * sizeof(long);
+	return nbytes;
+}
+
+void gen_pool_init_chunk(struct gen_pool_chunk *chunk, unsigned long virt,
+			 phys_addr_t phys, size_t size, bool external,
+			 void *owner)
+{
+	chunk->phys_addr = phys;
+	chunk->start_addr = virt;
+	chunk->end_addr = virt + size - 1;
+	chunk->external = external;
+	chunk->owner = owner;
+	atomic_long_set(&chunk->avail, size);
+}
+
+void gen_pool_add_chunk(struct gen_pool *pool, struct gen_pool_chunk *chunk)
+{
+	spin_lock(&pool->lock);
+	list_add_rcu(&chunk->next_chunk, &pool->chunks);
+	spin_unlock(&pool->lock);
+}
+
 /**
  * gen_pool_add_owner- add a new chunk of special memory to the pool
  * @pool: pool to add new memory chunk to
@@ -183,23 +210,14 @@ int gen_pool_add_owner(struct gen_pool *pool, unsigned long virt, phys_addr_t phys,
		 size_t size, int nid, void *owner)
 {
	struct gen_pool_chunk *chunk;
-	unsigned long nbits = size >> pool->min_alloc_order;
-	unsigned long nbytes = sizeof(struct gen_pool_chunk) +
-				BITS_TO_LONGS(nbits) * sizeof(long);
+	unsigned long nbytes = gen_pool_chunk_size(size, pool->min_alloc_order);

	chunk = vzalloc_node(nbytes, nid);
	if (unlikely(chunk == NULL))
		return -ENOMEM;

-	chunk->phys_addr = phys;
-	chunk->start_addr = virt;
-	chunk->end_addr = virt + size - 1;
-	chunk->owner = owner;
-	atomic_long_set(&chunk->avail, size);
-
-	spin_lock(&pool->lock);
-	list_add_rcu(&chunk->next_chunk, &pool->chunks);
-	spin_unlock(&pool->lock);
+	gen_pool_init_chunk(chunk, virt, phys, size, false, owner);
+	gen_pool_add_chunk(pool, chunk);

	return 0;
 }
@@ -248,6 +266,9 @@ void gen_pool_destroy(struct gen_pool *pool)
		chunk = list_entry(_chunk, struct gen_pool_chunk, next_chunk);
		list_del(&chunk->next_chunk);

+		if (chunk->external)
+			continue;
+
		end_bit = chunk_size(chunk) >> order;
		bit = find_first_bit(chunk->bits,
end_bit);
		BUG_ON(bit < end_bit);
--
2.25.1

From nobody Wed Dec 17 08:03:18 2025
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 04/10] mm/prmem: Implement a page allocator for persistent memory
Date: Mon, 16 Oct 2023 18:32:09 -0500
Message-Id:
<20231016233215.13090-5-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

Define the following convenience wrapper functions for allocating and
freeing pages:

	- prmem_alloc_pages()
	- prmem_free_pages()

The functions look similar to alloc_pages() and __free_pages(). However,
the only GFP flag that is processed is __GFP_ZERO, to zero out the
allocated memory.

Signed-off-by: Madhavan T. Venkataraman
---
 include/linux/prmem.h          |  7 ++++
 kernel/prmem/Makefile          |  1 +
 kernel/prmem/prmem_allocator.c | 74 ++++++++++++++++++++++++++++++++++
 kernel/prmem/prmem_init.c      |  2 +
 4 files changed, 84 insertions(+)
 create mode 100644 kernel/prmem/prmem_allocator.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index f43f5b0d2b9c..108683933c82 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -75,6 +75,7 @@ extern unsigned long prmem_metadata;
 extern unsigned long prmem_pa;
 extern size_t prmem_size;
 extern bool prmem_inited;
+extern spinlock_t prmem_lock;

 /* Kernel API. */
 void prmem_reserve_early(void);
@@ -83,11 +84,17 @@ void prmem_init(void);
 void prmem_fini(void);
 int prmem_cmdline_size(void);

+/* Allocator API. */
+struct page *prmem_alloc_pages(unsigned int order, gfp_t gfp);
+void prmem_free_pages(struct page *pages, unsigned int order);
+
 /* Internal functions.
 */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
 bool prmem_create_pool(struct prmem_region *region, bool new_region);
 void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align);
 void prmem_free_pool(struct prmem_region *region, void *va, size_t size);
+void *prmem_alloc_pages_locked(unsigned int order);
+void prmem_free_pages_locked(void *va, unsigned int order);
 unsigned long prmem_checksum(void *start, size_t size);
 bool __init prmem_validate(void);
 void prmem_cmdline(char *cmdline);
diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 9b0a693bfee1..99bb19f0afd3 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0

 obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
+obj-y += prmem_allocator.o
diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
new file mode 100644
index 000000000000..07a5a430630c
--- /dev/null
+++ b/kernel/prmem/prmem_allocator.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory feature (prmem) - Allocator.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include
+
+/* Page Allocation functions.
+ */
+
+void *prmem_alloc_pages_locked(unsigned int order)
+{
+	struct prmem_region *region;
+	void *va;
+	size_t size = (1UL << order) << PAGE_SHIFT;
+
+	list_for_each_entry(region, &prmem->regions, node) {
+		va = prmem_alloc_pool(region, size, size);
+		if (va)
+			return va;
+	}
+	return NULL;
+}
+
+struct page *prmem_alloc_pages(unsigned int order, gfp_t gfp)
+{
+	void *va;
+	size_t size = (1UL << order) << PAGE_SHIFT;
+	bool zero = !!(gfp & __GFP_ZERO);
+
+	if (!prmem_inited || order > MAX_ORDER)
+		return NULL;
+
+	spin_lock(&prmem_lock);
+	va = prmem_alloc_pages_locked(order);
+	spin_unlock(&prmem_lock);
+
+	if (va) {
+		if (zero)
+			memset(va, 0, size);
+		return virt_to_page(va);
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(prmem_alloc_pages);
+
+void prmem_free_pages_locked(void *va, unsigned int order)
+{
+	struct prmem_region *region;
+	size_t size = (1UL << order) << PAGE_SHIFT;
+	void *eva = va + size;
+	void *region_va;
+
+	list_for_each_entry(region, &prmem->regions, node) {
+		/* The region structure is at the base of the region memory.
+		 */
+		region_va = region;
+		if (va >= region_va && eva <= (region_va + region->size)) {
+			prmem_free_pool(region, va, size);
+			return;
+		}
+	}
+}
+
+void prmem_free_pages(struct page *pages, unsigned int order)
+{
+	if (!prmem_inited || order > MAX_ORDER)
+		return;
+
+	spin_lock(&prmem_lock);
+	prmem_free_pages_locked(page_to_virt(pages), order);
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_free_pages);
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 56df1e6d3ebc..d23833d296fe 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -9,6 +9,8 @@

 bool prmem_inited;

+DEFINE_SPINLOCK(prmem_lock);
+
 void __init prmem_init(void)
 {
	if (!prmem)
--
2.25.1

From nobody Wed Dec 17 08:03:18 2025
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 05/10] mm/prmem: Implement a buffer allocator for persistent memory
Date: Mon, 16 Oct 2023 18:32:10 -0500
Message-Id: <20231016233215.13090-6-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

Implement functions that can allocate and free memory smaller than a
page:

	- prmem_alloc()
	- prmem_free()

These functions look like kmalloc() and kfree(). However, the only GFP
flag that is processed is __GFP_ZERO, to zero out the allocated memory.

To keep the implementation simple, create allocation caches for
different object sizes:

	8, 16, 32, 64, ..., PAGE_SIZE

For a given size, allocate from the appropriate cache. This idea is
borrowed from the kmem allocator.

To fill the cache of a specific size, allocate a page, break it up into
equal-sized objects and add the objects to the cache.

This is a very simple allocator. It does not attempt to do sophisticated
things like cache coloring, or coalescing objects that belong to the same
page so that the page can be freed, etc.
Signed-off-by: Madhavan T. Venkataraman
---
 include/linux/prmem.h          |  12 ++++
 kernel/prmem/prmem_allocator.c | 112 ++++++++++++++++++++++++++++++++-
 2 files changed, 123 insertions(+), 1 deletion(-)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 108683933c82..1cb4660cf35e 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -50,6 +50,8 @@ struct prmem_region {
	struct gen_pool_chunk *chunk;
 };

+#define PRMEM_MAX_CACHES	14
+
 /*
  * PRMEM metadata.
  *
@@ -60,6 +62,9 @@ struct prmem_region {
  * size	Size of initial memory allocated to prmem.
  *
  * regions	List of memory regions.
+ *
+ * caches	Caches for different object sizes. For allocations smaller
+ *		than PAGE_SIZE, these caches are used.
  */
 struct prmem {
	unsigned long checksum;
@@ -68,6 +73,9 @@ struct prmem {

	/* Persistent Regions. */
	struct list_head regions;
+
+	/* Allocation caches. */
+	void *caches[PRMEM_MAX_CACHES];
 };

 extern struct prmem *prmem;
@@ -85,6 +95,8 @@ int prmem_cmdline_size(void);
 /* Allocator API. */
 struct page *prmem_alloc_pages(unsigned int order, gfp_t gfp);
 void prmem_free_pages(struct page *pages, unsigned int order);
+void *prmem_alloc(size_t size, gfp_t gfp);
+void prmem_free(void *va, size_t size);

 /* Internal functions.
 */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
@@ -95,6 +105,8 @@ void *prmem_alloc_pool(struct prmem_region *region, size_t size, int align);
 void prmem_free_pool(struct prmem_region *region, void *va, size_t size);
 void *prmem_alloc_pages_locked(unsigned int order);
 void prmem_free_pages_locked(void *va, unsigned int order);
+void *prmem_alloc_locked(size_t size);
+void prmem_free_locked(void *va, size_t size);
 unsigned long prmem_checksum(void *start, size_t size);
 bool __init prmem_validate(void);
 void prmem_cmdline(char *cmdline);
diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
index 07a5a430630c..f12975bc6777 100644
--- a/kernel/prmem/prmem_allocator.c
+++ b/kernel/prmem/prmem_allocator.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
- * Persistent-Across-Kexec memory feature (prmem) - Allocator.
+ * Persistent-Across-Kexec memory (prmem) - Allocator.
  *
  * Copyright (C) 2023 Microsoft Corporation
  * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
@@ -72,3 +72,113 @@ void prmem_free_pages(struct page *pages, unsigned int order)
	spin_unlock(&prmem_lock);
 }
 EXPORT_SYMBOL_GPL(prmem_free_pages);
+
+/* Buffer allocation functions. */
+
+#if PAGE_SIZE > 65536
+#error "Page size is too big"
+#endif
+
+static size_t prmem_cache_sizes[PRMEM_MAX_CACHES] = {
+	8, 16, 32, 64, 128, 256, 512,
+	1024, 2048, 4096, 8192, 16384, 32768, 65536,
+};
+
+static int prmem_cache_index(size_t size)
+{
+	int i;
+
+	for (i = 0; i < PRMEM_MAX_CACHES; i++) {
+		if (size <= prmem_cache_sizes[i])
+			return i;
+	}
+	BUG();
+}
+
+static void prmem_refill(void **cache, size_t size)
+{
+	void *va;
+	int i, n = PAGE_SIZE / size;
+
+	/* Allocate a page. */
+	va = prmem_alloc_pages_locked(0);
+	if (!va)
+		return;
+
+	/* Break up the page into pieces and put them in the cache.
+	 */
+	for (i = 0; i < n; i++, va += size) {
+		*((void **)va) = *cache;
+		*cache = va;
+	}
+}
+
+void *prmem_alloc_locked(size_t size)
+{
+	void *va;
+	int index;
+	void **cache;
+
+	index = prmem_cache_index(size);
+	size = prmem_cache_sizes[index];
+
+	cache = &prmem->caches[index];
+	if (!*cache) {
+		/* Refill the cache. */
+		prmem_refill(cache, size);
+	}
+
+	/* Allocate one from the cache. */
+	va = *cache;
+	if (va)
+		*cache = *((void **)va);
+	return va;
+}
+
+void *prmem_alloc(size_t size, gfp_t gfp)
+{
+	void *va;
+	bool zero = !!(gfp & __GFP_ZERO);
+
+	if (!prmem_inited || !size)
+		return NULL;
+
+	/* This function is only for sizes up to PAGE_SIZE. */
+	if (size > PAGE_SIZE)
+		return NULL;
+
+	spin_lock(&prmem_lock);
+	va = prmem_alloc_locked(size);
+	spin_unlock(&prmem_lock);
+
+	if (va && zero)
+		memset(va, 0, size);
+	return va;
+}
+EXPORT_SYMBOL_GPL(prmem_alloc);
+
+void prmem_free_locked(void *va, size_t size)
+{
+	int index;
+	void **cache;
+
+	/* Free the object into its cache. */
+	index = prmem_cache_index(size);
+	cache = &prmem->caches[index];
+	*((void **)va) = *cache;
+	*cache = va;
+}
+
+void prmem_free(void *va, size_t size)
+{
+	if (!prmem_inited || !va || !size)
+		return;
+
+	/* This function is only for sizes up to PAGE_SIZE.
+	 */
+	if (size > PAGE_SIZE)
+		return;
+
+	spin_lock(&prmem_lock);
+	prmem_free_locked(va, size);
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_free);
--
2.25.1

From nobody Wed Dec 17 08:03:18 2025
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 06/10] mm/prmem: Implement persistent XArray (and Radix Tree)
Date: Mon, 16 Oct 2023 18:32:11 -0500
Message-Id: <20231016233215.13090-7-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

Consumers can persist their data structures by allocating persistent
memory for them. Data structures are connected to one another using
pointers, arrays, linked lists, RB nodes, etc. These can all be
persisted by allocating memory for them from persistent memory. E.g., a
linked list is persisted if the data structures that embed the list head
and the list nodes are allocated from persistent memory. Ditto for RB
trees.

One important exception is the XArray. The XArray itself can be embedded
in a persistent data structure. However, the XA nodes are allocated
using the kmem allocator.

Implement a persistent XArray. Introduce a new field, xa_persistent, in
the XArray, and an accessor function to set the field. If xa_persistent
is true, allocate XA nodes using the prmem allocator instead of the kmem
allocator. This makes the whole XArray persistent.

Since Radix Trees (lib/radix-tree.c) are implemented on top of the
XArray, we also get persistent Radix Trees. The only difference is that
preloading is not supported for persistent Radix Tree nodes.

Signed-off-by: Madhavan T.
Venkataraman
---
 include/linux/radix-tree.h |  4 ++++
 include/linux/xarray.h     | 15 ++++++++++++
 lib/radix-tree.c           | 49 +++++++++++++++++++++++++++++-------
 lib/xarray.c               | 11 +++++----
 4 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index eae67015ce51..74f0bdc60bea 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -82,6 +82,7 @@ static inline bool radix_tree_is_internal_node(void *ptr)
	struct radix_tree_root name = RADIX_TREE_INIT(name, mask)

 #define INIT_RADIX_TREE(root, mask) xa_init_flags(root, mask)
+#define PERSIST_RADIX_TREE(root) xa_persistent(root)

 static inline bool radix_tree_empty(const struct radix_tree_root *root)
 {
@@ -254,6 +255,9 @@ unsigned int radix_tree_gang_lookup_tag_slot(const struct radix_tree_root *,
		void __rcu ***results, unsigned long first_index,
		unsigned int max_items, unsigned int tag);
 int radix_tree_tagged(const struct radix_tree_root *, unsigned int tag);
+struct radix_tree_node *radix_node_alloc(struct radix_tree_root *root,
+					 struct list_lru *lru, gfp_t gfp);
+void radix_node_free(struct radix_tree_node *node);

 static inline void radix_tree_preload_end(void)
 {
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 741703b45f61..3176a5f62caf 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -295,6 +295,7 @@ enum xa_lock_type {
  */
 struct xarray {
	spinlock_t	xa_lock;
+	bool		xa_persistent;
 /* private: The rest of the data structure is not to be used directly.
  */
	gfp_t		xa_flags;
	void __rcu *	xa_head;
@@ -302,6 +303,7 @@

 #define XARRAY_INIT(name, flags) {				\
	.xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock),		\
+	.xa_persistent = false,					\
	.xa_flags = flags,					\
	.xa_head = NULL,					\
 }
@@ -378,6 +380,7 @@ void xa_destroy(struct xarray *);
 static inline void xa_init_flags(struct xarray *xa, gfp_t flags)
 {
	spin_lock_init(&xa->xa_lock);
+	xa->xa_persistent = false;
	xa->xa_flags = flags;
	xa->xa_head = NULL;
 }
@@ -395,6 +398,17 @@ static inline void xa_init(struct xarray *xa)
	xa_init_flags(xa, 0);
 }

+/**
+ * xa_persistent() - Allocate xa_root and xa_node from persistent memory.
+ * @xa: XArray.
+ *
+ * Context: Any context.
+ */
+static inline void xa_persistent(struct xarray *xa)
+{
+	xa->xa_persistent = true;
+}
+
 /**
  * xa_empty() - Determine if an array has any present entries.
  * @xa: XArray.
@@ -1142,6 +1156,7 @@ struct xa_node {
	unsigned char	offset;		/* Slot offset in parent */
	unsigned char	count;		/* Total entry count */
	unsigned char	nr_values;	/* Value entry count */
+	bool		persistent;	/* Allocated from persistent memory.
 */
	struct xa_node __rcu *parent;	/* NULL at top of tree */
	struct xarray	*array;		/* The array we belong to */
	union {
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 976b9bd02a1b..d3af6ff6c625 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include	/* in_interrupt() */
+#include <linux/prmem.h>
 #include
 #include
 #include
@@ -225,6 +226,36 @@ static unsigned long next_index(unsigned long index,
	return (index & ~node_maxindex(node)) + (offset << node->shift);
 }

+static void radix_tree_node_ctor(void *arg);
+
+struct radix_tree_node *
+radix_node_alloc(struct radix_tree_root *root, struct list_lru *lru, gfp_t gfp)
+{
+	struct radix_tree_node *node;
+
+	if (root && root->xa_persistent) {
+		node = prmem_alloc(sizeof(struct radix_tree_node), gfp);
+		if (node) {
+			radix_tree_node_ctor(node);
+			node->persistent = true;
+		}
+	} else {
+		node = kmem_cache_alloc_lru(radix_tree_node_cachep, lru, gfp);
+		if (node)
+			node->persistent = false;
+	}
+	return node;
+}
+
+void radix_node_free(struct radix_tree_node *node)
+{
+	if (node->persistent) {
+		prmem_free(node, sizeof(*node));
+		return;
+	}
+	kmem_cache_free(radix_tree_node_cachep, node);
+}
+
 /*
  * This assumes that the caller has performed appropriate preallocation, and
  * that the caller has pinned this thread of control to the current CPU.
@@ -241,8 +272,11 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
	 * Preload code isn't irq safe and it doesn't make sense to use
	 * preloading during an interrupt anyway as all the allocations have
	 * to be atomic. So just do normal allocation when in interrupt.
	 *
	 * Also, there is no preloading for persistent trees.
	 */
-	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
+	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt() &&
+	    !root->xa_persistent) {
		struct radix_tree_preload *rtp;

		/*
@@ -250,8 +284,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
		 * cache first for the new node to get accounted to the memory
		 * cgroup.
		 */
-		ret = kmem_cache_alloc(radix_tree_node_cachep,
-				       gfp_mask | __GFP_NOWARN);
+		ret = radix_node_alloc(root, NULL, gfp_mask | __GFP_NOWARN);
		if (ret)
			goto out;

@@ -273,7 +306,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
		kmemleak_update_trace(ret);
		goto out;
	}
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = radix_node_alloc(root, NULL, gfp_mask);
 out:
	BUG_ON(radix_tree_is_internal_node(ret));
	if (ret) {
@@ -301,7 +334,7 @@ void radix_tree_node_rcu_free(struct rcu_head *head)
	memset(node->tags, 0, sizeof(node->tags));
	INIT_LIST_HEAD(&node->private_list);

-	kmem_cache_free(radix_tree_node_cachep, node);
+	radix_node_free(node);
 }

 static inline void
@@ -335,7 +368,7 @@ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
	rtp = this_cpu_ptr(&radix_tree_preloads);
	while (rtp->nr < nr) {
		local_unlock(&radix_tree_preloads.lock);
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = radix_node_alloc(NULL, NULL, gfp_mask);
		if (node == NULL)
			goto out;
		local_lock(&radix_tree_preloads.lock);
@@ -345,7 +378,7 @@ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
			rtp->nodes = node;
			rtp->nr++;
		} else {
-			kmem_cache_free(radix_tree_node_cachep, node);
+			radix_node_free(node);
		}
	}
	ret = 0;
@@ -1585,7 +1618,7 @@ static int radix_tree_cpu_dead(unsigned int cpu)
	while (rtp->nr) {
		node = rtp->nodes;
		rtp->nodes = node->parent;
-		kmem_cache_free(radix_tree_node_cachep, node);
+		radix_node_free(node);
		rtp->nr--;
	}
	return 0;
diff --git a/lib/xarray.c b/lib/xarray.c
index
2071a3718f4e..33a74b713e6a 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -9,6 +9,7 @@
 #include
 #include
 #include
+#include <linux/prmem.h>
 #include
 #include

@@ -303,7 +304,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
	}
	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
		gfp |= __GFP_ACCOUNT;
-	xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+	xas->xa_alloc = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
	if (!xas->xa_alloc)
		return false;
	xas->xa_alloc->parent = NULL;
@@ -335,10 +336,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
		gfp |= __GFP_ACCOUNT;
	if (gfpflags_allow_blocking(gfp)) {
		xas_unlock_type(xas, lock_type);
-		xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		xas->xa_alloc = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
		xas_lock_type(xas, lock_type);
	} else {
-		xas->xa_alloc = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+		xas->xa_alloc = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
	}
	if (!xas->xa_alloc)
		return false;
@@ -372,7 +373,7 @@ static void *xas_alloc(struct xa_state *xas, unsigned int shift)
	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
		gfp |= __GFP_ACCOUNT;

-	node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+	node = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
	if (!node) {
		xas_set_err(xas, -ENOMEM);
		return NULL;
@@ -1017,7 +1018,7 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
	void *sibling = NULL;
	struct xa_node *node;

-	node = kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp);
+	node = radix_node_alloc(xas->xa, xas->xa_lru, gfp);
	if (!node)
		goto nomem;
	node->array = xas->xa;
--
2.25.1

From nobody Wed Dec 17 08:03:18 2025
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 07/10] mm/prmem: Implement named Persistent Instances
Date: Mon, 16 Oct 2023 18:32:12 -0500
Message-Id: <20231016233215.13090-8-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>
References: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

To persist any data, a consumer needs to do the following:

- Create a persistent instance for the data. The instance gets recorded
  in the metadata.
- Name the instance.
- Record the instance data in the instance.
- Retrieve the instance by name after kexec.
- Retrieve the instance data.

Implement the following API for consumers:

prmem_get(subsystem, name, create)

	Get or create a persistent instance. The consumer provides the
	name of the subsystem and the name of the instance within the
	subsystem. E.g., for a persistent ramdisk block device:

		subsystem = "ramdisk"
		instance  = "pram0"

prmem_set_data()

	Record a data pointer and a size in the instance. An instance
	may contain many data structures connected to each other via
	pointers. A consumer is expected to record the top-level data
	structure in the instance. All other data structures must be
	reachable from the top-level data structure.

prmem_get_data()

	Retrieve the data pointer and the size for the instance.

prmem_put()

	Destroy a persistent instance. The instance data must be NULL
	at this point. That is, the consumer is responsible for freeing
	the instance data and setting it to NULL in the instance prior
	to destroying the instance.

prmem_list()

	Walk the instances of a subsystem and call a callback for each.
	This allows a consumer to enumerate all of the instances
	associated with a subsystem.

Signed-off-by: Madhavan T.
Venkataraman
---
 include/linux/prmem.h         |  36 +++++++++
 kernel/prmem/Makefile         |   2 +-
 kernel/prmem/prmem_init.c     |   1 +
 kernel/prmem/prmem_instance.c | 139 ++++++++++++++++++++++++++++++++++
 4 files changed, 177 insertions(+), 1 deletion(-)
 create mode 100644 kernel/prmem/prmem_instance.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index 1cb4660cf35e..c7034690f7cb 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -50,6 +50,28 @@ struct prmem_region {
 	struct gen_pool_chunk *chunk;
 };

+#define PRMEM_MAX_NAME	32
+
+/*
+ * To persist any data, a persistent instance is created for it and the data is
+ * "remembered" in the instance.
+ *
+ * node		List node
+ * subsystem	Subsystem/driver/module that created the instance. E.g.,
+ *		"ramdisk" for the ramdisk driver.
+ * name		Instance name within the subsystem/driver/module. E.g., "pram0"
+ *		for a persistent ramdisk instance.
+ * data		Pointer to data. E.g., the radix tree of pages in a ram disk.
+ * size		Size of data.
+ */
+struct prmem_instance {
+	struct list_head node;
+	char subsystem[PRMEM_MAX_NAME];
+	char name[PRMEM_MAX_NAME];
+	void *data;
+	size_t size;
+};
+
 #define PRMEM_MAX_CACHES	14

 /*
@@ -63,6 +85,8 @@ struct prmem_region {
  *
  * regions	List of memory regions.
  *
+ * instances	Persistent instances.
+ *
  * caches	Caches for different object sizes. For allocations smaller than
  *		PAGE_SIZE, these caches are used.
  */
@@ -74,6 +98,9 @@ struct prmem {
 	/* Persistent Regions. */
 	struct list_head regions;

+	/* Persistent Instances. */
+	struct list_head instances;
+
 	/* Allocation caches. */
 	void *caches[PRMEM_MAX_CACHES];
 };
@@ -85,6 +112,8 @@
 extern size_t prmem_size;
 extern bool prmem_inited;
 extern spinlock_t prmem_lock;

+typedef int (*prmem_list_func_t)(struct prmem_instance *instance, void *arg);
+
 /* Kernel API.
  */
 void prmem_reserve_early(void);
 void prmem_reserve(void);
@@ -98,6 +127,13 @@ void prmem_free_pages(struct page *pages, unsigned int order);
 void *prmem_alloc(size_t size, gfp_t gfp);
 void prmem_free(void *va, size_t size);

+/* Persistent Instance API. */
+void *prmem_get(char *subsystem, char *name, bool create);
+void prmem_set_data(struct prmem_instance *instance, void *data, size_t size);
+void prmem_get_data(struct prmem_instance *instance, void **data, size_t *size);
+bool prmem_put(struct prmem_instance *instance);
+int prmem_list(char *subsystem, prmem_list_func_t func, void *arg);
+
 /* Internal functions. */
 struct prmem_region *prmem_add_region(unsigned long pa, size_t size);
 bool prmem_create_pool(struct prmem_region *region, bool new_region);

diff --git a/kernel/prmem/Makefile b/kernel/prmem/Makefile
index 99bb19f0afd3..0ed7976580d6 100644
--- a/kernel/prmem/Makefile
+++ b/kernel/prmem/Makefile
@@ -1,4 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0

 obj-y += prmem_parse.o prmem_reserve.o prmem_init.o prmem_region.o prmem_misc.o
-obj-y += prmem_allocator.o
+obj-y += prmem_allocator.o prmem_instance.o

diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index d23833d296fe..166fca688ab3 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -21,6 +21,7 @@ void __init prmem_init(void)
 	prmem->metadata = prmem_metadata;
 	prmem->size = prmem_size;
 	INIT_LIST_HEAD(&prmem->regions);
+	INIT_LIST_HEAD(&prmem->instances);

 	if (!prmem_add_region(prmem_pa, prmem_size))
 		return;

diff --git a/kernel/prmem/prmem_instance.c b/kernel/prmem/prmem_instance.c
new file mode 100644
index 000000000000..ee3554d0ab8b
--- /dev/null
+++ b/kernel/prmem/prmem_instance.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Persistent-Across-Kexec memory (prmem) - Persistent instances.
+ *
+ * Copyright (C) 2023 Microsoft Corporation
+ * Author: Madhavan T.
Venkataraman (madvenka@linux.microsoft.com)
+ */
+#include
+
+static struct prmem_instance *prmem_find(char *subsystem, char *name)
+{
+	struct prmem_instance *instance;
+
+	list_for_each_entry(instance, &prmem->instances, node) {
+		if (!strcmp(instance->subsystem, subsystem) &&
+		    !strcmp(instance->name, name)) {
+			return instance;
+		}
+	}
+	return NULL;
+}
+
+void *prmem_get(char *subsystem, char *name, bool create)
+{
+	int subsystem_len = strlen(subsystem);
+	int name_len = strlen(name);
+	struct prmem_instance *instance;
+
+	/*
+	 * In early boot, you are allowed to get an existing instance. But
+	 * you are not allowed to create one until prmem is fully initialized.
+	 */
+	if (!prmem || (!prmem_inited && create))
+		return NULL;
+
+	if (!subsystem_len || subsystem_len >= PRMEM_MAX_NAME ||
+	    !name_len || name_len >= PRMEM_MAX_NAME) {
+		return NULL;
+	}
+
+	spin_lock(&prmem_lock);
+
+	/* Check if it already exists. */
+	instance = prmem_find(subsystem, name);
+	if (instance || !create)
+		goto unlock;
+
+	instance = prmem_alloc_locked(sizeof(*instance));
+	if (!instance)
+		goto unlock;
+
+	strcpy(instance->subsystem, subsystem);
+	strcpy(instance->name, name);
+	instance->data = NULL;
+	instance->size = 0;
+
+	list_add_tail(&instance->node, &prmem->instances);
+unlock:
+	spin_unlock(&prmem_lock);
+	return instance;
+}
+EXPORT_SYMBOL_GPL(prmem_get);
+
+void prmem_set_data(struct prmem_instance *instance, void *data, size_t size)
+{
+	if (!prmem_inited)
+		return;
+
+	spin_lock(&prmem_lock);
+	instance->data = data;
+	instance->size = size;
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_set_data);
+
+void prmem_get_data(struct prmem_instance *instance, void **data, size_t *size)
+{
+	if (!prmem)
+		return;
+
+	spin_lock(&prmem_lock);
+	*data = instance->data;
+	*size = instance->size;
+	spin_unlock(&prmem_lock);
+}
+EXPORT_SYMBOL_GPL(prmem_get_data);
+
+bool prmem_put(struct prmem_instance *instance)
+{
+	if (!prmem_inited)
+		return true;
+
+	spin_lock(&prmem_lock);
+
+	if (instance->data) {
+		/*
+		 * Caller is responsible for freeing instance data and setting
+		 * it to NULL.
+		 */
+		spin_unlock(&prmem_lock);
+		return false;
+	}
+
+	/* Free instance. */
+	list_del(&instance->node);
+	prmem_free_locked(instance, sizeof(*instance));
+
+	spin_unlock(&prmem_lock);
+	return true;
+}
+EXPORT_SYMBOL_GPL(prmem_put);
+
+int prmem_list(char *subsystem, prmem_list_func_t func, void *arg)
+{
+	int subsystem_len = strlen(subsystem);
+	struct prmem_instance *instance;
+	int ret = 0;
+
+	if (!prmem)
+		return 0;
+
+	if (!subsystem_len || subsystem_len >= PRMEM_MAX_NAME)
+		return -EINVAL;
+
+	spin_lock(&prmem_lock);
+
+	list_for_each_entry(instance, &prmem->instances, node) {
+		if (strcmp(instance->subsystem, subsystem))
+			continue;
+
+		ret = func(instance, arg);
+		if (ret)
+			break;
+	}
+
+	spin_unlock(&prmem_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(prmem_list);
-- 
2.25.1
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 08/10] mm/prmem: Implement Persistent Ramdisk instances.
Date: Mon, 16 Oct 2023 18:32:13 -0500
Message-Id: <20231016233215.13090-9-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

Using the prmem APIs, any kernel subsystem can persist its data. Persisting user data, however, requires a filesystem. Implement persistent ramdisk block device instances so that any filesystem can be created on them.

Normal ramdisk devices are named "ram0", "ram1", "ram2", etc. Persistent ramdisk devices are named "pram0", "pram1", "pram2", etc. For normal ramdisks, pages are allocated using alloc_pages(). For persistent ones, pages are allocated using prmem_alloc_pages().

Each ram disk has a device structure - struct brd_device.
For persistent ram disks, allocate this structure from persistent memory and record it as the instance data of the ram disk instance. The structure contains an XArray of the pages allocated to the ram disk. Make it a persistent XArray.

The disk size for all normal ramdisks is specified via the module parameter "rd_size", which forces all of the ramdisks to have the same size. For persistent ram disks, take a different approach: define a module parameter called "prd_sizes" which specifies a comma-separated list of sizes. The sizes are applied, in the order in which they are listed, to "pram0", "pram1", etc.

Ram Disk Usage
--------------

Create two ram disks with the specified sizes, so that /dev/pram0 has a size of 1G and /dev/pram1 has a size of 2G:

	sudo modprobe brd prd_sizes="1G,2G"

Make filesystems on the persistent ram disks:

	sudo mkfs.ext4 /dev/pram0
	sudo mkfs.ext4 /dev/pram1

Mount them somewhere:

	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 /dev/pram1 /path/to/mountpoint1

Unmount the filesystems:

	sudo umount /path/to/mountpoint0
	sudo umount /path/to/mountpoint1

After kexec
-----------

Recreate the previously created persistent ram disks ("prd_sizes" may be omitted):

	sudo modprobe brd

Mount the same filesystems:

	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 /dev/pram1 /path/to/mountpoint1

The maximum number of persistent ram disk instances is specified via CONFIG_BLK_DEV_PRAM_MAX. By default, this is zero.

Signed-off-by: Madhavan T. Venkataraman
---
 drivers/block/Kconfig |  11 +++
 drivers/block/brd.c   | 214 +++++++++++++++++++++++++++++++++++++++---
 2 files changed, 213 insertions(+), 12 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5b9d4aaebb81..08fa40f6e2de 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -256,6 +256,17 @@ config BLK_DEV_RAM_SIZE
	  The default value is 4096 kilobytes. Only change this if you know
	  what you are doing.
=20 +config BLK_DEV_PRAM_MAX + int "Maximum number of Persistent RAM disks" + default "0" + depends on BLK_DEV_RAM + help + This allows the creation of persistent RAM disks. Persistent RAM + disks are used to remember data across a kexec reboot. The default + value is 0 Persistent RAM disks. Change this if you know what you + are doing. The sizes of the ram disks are specified via the boot + arg "prd_sizes" as a comma-separated list of sizes. + config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media (DEPRECATED)" depends on !UML diff --git a/drivers/block/brd.c b/drivers/block/brd.c index 970bd6ff38c4..3a05e56ca16f 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -24,9 +24,12 @@ #include #include #include +#include =20 #include =20 +enum brd_type { BRD_NORMAL =3D 0, BRD_PERSISTENT, }; + /* * Each block ramdisk device has a xarray brd_pages of pages that stores * the pages containing the block device's contents. A brd page's ->index = is @@ -36,6 +39,7 @@ */ struct brd_device { int brd_number; + enum brd_type brd_type; struct gendisk *brd_disk; struct list_head brd_list; =20 @@ -46,6 +50,15 @@ struct brd_device { u64 brd_nr_pages; }; =20 +/* Each of these functions performs an action based on brd_type. */ +static struct brd_device *brd_alloc_device(int i, enum brd_type type); +static void brd_free_device(struct brd_device *brd); +static struct page *brd_alloc_page(struct brd_device *brd, gfp_t gfp); +static void brd_free_page(struct brd_device *brd, struct page *page); +static void brd_xa_init(struct brd_device *brd); +static void brd_init_name(struct brd_device *brd, char *name); +static void brd_set_capacity(struct brd_device *brd); + /* * Look up and return a brd's page for a given sector. 
*/ @@ -75,7 +88,7 @@ static int brd_insert_page(struct brd_device *brd, sector= _t sector, gfp_t gfp) if (page) return 0; =20 - page =3D alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM); + page =3D brd_alloc_page(brd, gfp | __GFP_ZERO | __GFP_HIGHMEM); if (!page) return -ENOMEM; =20 @@ -87,7 +100,7 @@ static int brd_insert_page(struct brd_device *brd, secto= r_t sector, gfp_t gfp) cur =3D __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp); =20 if (unlikely(cur)) { - __free_page(page); + brd_free_page(brd, page); ret =3D xa_err(cur); if (!ret && (cur->index !=3D idx)) ret =3D -EIO; @@ -110,7 +123,7 @@ static void brd_free_pages(struct brd_device *brd) pgoff_t idx; =20 xa_for_each(&brd->brd_pages, idx, page) { - __free_page(page); + brd_free_page(brd, page); cond_resched(); } =20 @@ -287,6 +300,18 @@ unsigned long rd_size =3D CONFIG_BLK_DEV_RAM_SIZE; module_param(rd_size, ulong, 0444); MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes."); =20 +/* Sizes of persistent ram disks are specified in a comma-separated list. = */ +static char *prd_sizes; +module_param(prd_sizes, charp, 0444); +MODULE_PARM_DESC(prd_sizes, "Sizes of persistent RAM disks."); + +/* Persistent ram disk specific data. */ +struct prd_data { + struct prmem_instance *instance; + unsigned long size; +}; +static struct prd_data prd_data[CONFIG_BLK_DEV_PRAM_MAX]; + static int max_part =3D 1; module_param(max_part, int, 0444); MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices"); @@ -295,6 +320,32 @@ MODULE_LICENSE("GPL"); MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR); MODULE_ALIAS("rd"); =20 +void __init brd_parse(void) +{ + unsigned long size; + char *cur, *tmp; + int i =3D 0; + + if (!CONFIG_BLK_DEV_PRAM_MAX || !prd_sizes) + return; + + /* Parse persistent ram disk sizes. */ + cur =3D prd_sizes; + do { + /* Get the size of a ramdisk. Sanity check it. 
*/ + size =3D memparse(cur, &tmp); + if (cur =3D=3D tmp || !size) { + pr_warn("%s: Memory value expected\n", __func__); + return; + } + cur =3D tmp; + + /* Add the ramdisk size. */ + prd_data[i++].size =3D size; + + } while (*cur++ =3D=3D ',' && i < CONFIG_BLK_DEV_PRAM_MAX); +} + #ifndef MODULE /* Legacy boot options - nonmodular */ static int __init ramdisk_size(char *str) @@ -314,23 +365,33 @@ static struct dentry *brd_debugfs_dir; =20 static int brd_alloc(int i) { + int brd_number; + enum brd_type brd_type; struct brd_device *brd; struct gendisk *disk; char buf[DISK_NAME_LEN]; int err =3D -ENOMEM; =20 + if (i < rd_nr) { + brd_number =3D i; + brd_type =3D BRD_NORMAL; + } else { + brd_number =3D i - rd_nr; + brd_type =3D BRD_PERSISTENT; + } + list_for_each_entry(brd, &brd_devices, brd_list) - if (brd->brd_number =3D=3D i) + if (brd->brd_number =3D=3D i && brd->brd_type =3D=3D brd_type) return -EEXIST; - brd =3D kzalloc(sizeof(*brd), GFP_KERNEL); + brd =3D brd_alloc_device(brd_number, brd_type); if (!brd) return -ENOMEM; - brd->brd_number =3D i; + brd->brd_number =3D brd_number; list_add_tail(&brd->brd_list, &brd_devices); =20 - xa_init(&brd->brd_pages); + brd_xa_init(brd); =20 - snprintf(buf, DISK_NAME_LEN, "ram%d", i); + brd_init_name(brd, buf); if (!IS_ERR_OR_NULL(brd_debugfs_dir)) debugfs_create_u64(buf, 0444, brd_debugfs_dir, &brd->brd_nr_pages); @@ -345,7 +406,7 @@ static int brd_alloc(int i) disk->fops =3D &brd_fops; disk->private_data =3D brd; strscpy(disk->disk_name, buf, DISK_NAME_LEN); - set_capacity(disk, rd_size * 2); + brd_set_capacity(brd); =09 /* * This is so fdisk will align partitions on 4k, because of @@ -370,7 +431,7 @@ static int brd_alloc(int i) put_disk(disk); out_free_dev: list_del(&brd->brd_list); - kfree(brd); + brd_free_device(brd); return err; } =20 @@ -390,7 +451,7 @@ static void brd_cleanup(void) put_disk(brd->brd_disk); brd_free_pages(brd); list_del(&brd->brd_list); - kfree(brd); + brd_free_device(brd); } } =20 @@ -427,13 +488,21 @@ 
static int __init brd_init(void) goto out_free; } =20 + /* Parse persistent ram disk sizes. */ + brd_parse(); + + /* Create persistent ram disks. */ + for (i =3D 0; i < CONFIG_BLK_DEV_PRAM_MAX; i++) + brd_alloc(i + rd_nr); + /* * brd module now has a feature to instantiate underlying device * structure on-demand, provided that there is an access dev node. * * (1) if rd_nr is specified, create that many upfront. else * it defaults to CONFIG_BLK_DEV_RAM_COUNT - * (2) User can further extend brd devices by create dev node themselves + * (2) if prd_sizes is specified, create that many upfront. + * (3) User can further extend brd devices by create dev node themselves * and have kernel automatically instantiate actual device * on-demand. Example: * mknod /path/devnod_name b 1 X # 1 is the rd major @@ -469,3 +538,124 @@ static void __exit brd_exit(void) module_init(brd_init); module_exit(brd_exit); =20 +/* Each of these functions performs an action based on brd_type. */ + +static struct brd_device *brd_alloc_device(int i, enum brd_type type) +{ + char name[PRMEM_MAX_NAME]; + struct brd_device *brd; + struct prmem_instance *instance; + size_t size; + bool create; + + if (type =3D=3D BRD_NORMAL) + return kzalloc(sizeof(struct brd_device), GFP_KERNEL); + + /* + * Get the persistent ramdisk instance. If it does not exist, it will + * be created, if a size has been specified. + */ + create =3D !!prd_data[i].size; + snprintf(name, PRMEM_MAX_NAME, "pram%d", i); + instance =3D prmem_get("ramdisk", name, create); + if (!instance) + return NULL; + + prmem_get_data(instance, (void **) &brd, &size); + if (brd) { + /* Existing instance. Ignore the module parameter. */ + prd_data[i].size =3D size; + prd_data[i].instance =3D instance; + return brd; + } + + /* + * New instance. Allocate brd from persistent memory and set it as + * instance data. 
+ */ + brd =3D prmem_alloc(sizeof(*brd), __GFP_ZERO); + if (!brd) { + prmem_put(instance); + return NULL; + } + brd->brd_type =3D BRD_PERSISTENT; + prmem_set_data(instance, brd, prd_data[i].size); + + prd_data[i].instance =3D instance; + return brd; +} + +static void brd_free_device(struct brd_device *brd) +{ + struct prmem_instance *instance; + + if (brd->brd_type =3D=3D BRD_NORMAL) { + kfree(brd); + return; + } + + instance =3D prd_data[brd->brd_number].instance; + prmem_set_data(instance, NULL, 0); + prmem_free(brd, sizeof(*brd)); + prmem_put(instance); +} + +static struct page *brd_alloc_page(struct brd_device *brd, gfp_t gfp) +{ + if (brd->brd_type =3D=3D BRD_NORMAL) + return alloc_page(gfp); + return prmem_alloc_pages(0, gfp); +} + +static void brd_free_page(struct brd_device *brd, struct page *page) +{ + if (brd->brd_type =3D=3D BRD_NORMAL) + __free_page(page); + else + prmem_free_pages(page, 0); +} + +static void brd_xa_init(struct brd_device *brd) +{ + if (brd->brd_type =3D=3D BRD_NORMAL) { + xa_init(&brd->brd_pages); + return; + } + + if (brd->brd_nr_pages) { + /* Existing persistent instance. */ + struct page *page; + pgoff_t idx; + + /* + * The xarray of pages is persistent. However, the page + * indexes are not. Set them here. + */ + xa_for_each(&brd->brd_pages, idx, page) { + page->index =3D idx; + } + } else { + /* New persistent instance. 
+		 */
+		xa_init(&brd->brd_pages);
+		xa_persistent(&brd->brd_pages);
+	}
+}
+
+static void brd_init_name(struct brd_device *brd, char *name)
+{
+	if (brd->brd_type == BRD_NORMAL)
+		snprintf(name, DISK_NAME_LEN, "ram%d", brd->brd_number);
+	else
+		snprintf(name, DISK_NAME_LEN, "pram%d", brd->brd_number);
+}
+
+static void brd_set_capacity(struct brd_device *brd)
+{
+	unsigned long disksize;
+
+	if (brd->brd_type == BRD_NORMAL)
+		disksize = rd_size;
+	else
+		disksize = prd_data[brd->brd_number].size;
+	set_capacity(brd->brd_disk, disksize * 2);
+}
-- 
2.25.1
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 09/10] mm/prmem: Implement DAX support for Persistent Ramdisks.
Date: Mon, 16 Oct 2023 18:32:14 -0500
Message-Id: <20231016233215.13090-10-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

One problem with using a ramdisk is that the page cache will contain redundant copies of ramdisk data. To avoid this, implement DAX support for persistent ramdisks. To make use of it, the filesystem installed on the ramdisk must support DAX (e.g., ext4). Mount the filesystem with the dax option:

	sudo mount -t ext4 -o dax /dev/pram0 /path/to/mountpoint

Signed-off-by: Madhavan T. Venkataraman
---
 drivers/block/brd.c | 106 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3a05e56ca16f..d4a42d3bd212 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -25,6 +25,9 @@
 #include
 #include
 #include
+#include
+#include
+#include

 #include

@@ -42,6 +45,7 @@ struct brd_device {
 	enum brd_type brd_type;
 	struct gendisk *brd_disk;
 	struct list_head brd_list;
+	struct dax_device *brd_dax;

 	/*
 	 * Backing store of pages. This is the contents of the block device.
@@ -58,6 +62,8 @@ static void brd_free_page(struct brd_device *brd, struct = page *page); static void brd_xa_init(struct brd_device *brd); static void brd_init_name(struct brd_device *brd, char *name); static void brd_set_capacity(struct brd_device *brd); +static int brd_dax_init(struct brd_device *brd); +static void brd_dax_cleanup(struct brd_device *brd); =20 /* * Look up and return a brd's page for a given sector. @@ -408,6 +414,9 @@ static int brd_alloc(int i) strscpy(disk->disk_name, buf, DISK_NAME_LEN); brd_set_capacity(brd); =09 + if (brd_dax_init(brd)) + goto out_clean_dax; + /* * This is so fdisk will align partitions on 4k, because of * direct_access API needing 4k alignment, returning a PFN @@ -421,6 +430,8 @@ static int brd_alloc(int i) blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue); blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue); blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue); + if (brd->brd_dax) + blk_queue_flag_set(QUEUE_FLAG_DAX, disk->queue); err =3D add_disk(disk); if (err) goto out_cleanup_disk; @@ -429,6 +440,8 @@ static int brd_alloc(int i) =20 out_cleanup_disk: put_disk(disk); +out_clean_dax: + brd_dax_cleanup(brd); out_free_dev: list_del(&brd->brd_list); brd_free_device(brd); @@ -447,6 +460,7 @@ static void brd_cleanup(void) debugfs_remove_recursive(brd_debugfs_dir); =20 list_for_each_entry_safe(brd, next, &brd_devices, brd_list) { + brd_dax_cleanup(brd); del_gendisk(brd->brd_disk); put_disk(brd->brd_disk); brd_free_pages(brd); @@ -659,3 +673,95 @@ static void brd_set_capacity(struct brd_device *brd) disksize =3D prd_data[brd->brd_number].size; set_capacity(brd->brd_disk, disksize * 2); } + +static bool prd_dax_enabled =3D IS_ENABLED(CONFIG_FS_DAX); + +static long brd_dax_direct_access(struct dax_device *dax_dev, + pgoff_t pgoff, long nr_pages, + enum dax_access_mode mode, + void **kaddr, pfn_t *pfn); +static int brd_dax_zero_page_range(struct dax_device *dax_dev, + pgoff_t pgoff, size_t nr_pages); + +static const struct 
dax_operations brd_dax_ops =3D { + .direct_access =3D brd_dax_direct_access, + .zero_page_range =3D brd_dax_zero_page_range, +}; + +static int brd_dax_init(struct brd_device *brd) +{ + if (!prd_dax_enabled || brd->brd_type =3D=3D BRD_NORMAL) + return 0; + + brd->brd_dax =3D alloc_dax(brd, &brd_dax_ops); + if (IS_ERR(brd->brd_dax)) { + pr_warn("%s: DAX failed\n", __func__); + brd->brd_dax =3D NULL; + return -ENOMEM; + } + + if (dax_add_host(brd->brd_dax, brd->brd_disk)) { + pr_warn("%s: DAX add failed\n", __func__); + return -ENOMEM; + } + return 0; +} + +static void brd_dax_cleanup(struct brd_device *brd) +{ + if (!prd_dax_enabled || brd->brd_type =3D=3D BRD_NORMAL) + return; + + if (brd->brd_dax) { + dax_remove_host(brd->brd_disk); + kill_dax(brd->brd_dax); + put_dax(brd->brd_dax); + } +} +static int brd_dax_zero_page_range(struct dax_device *dax_dev, + pgoff_t pgoff, size_t nr_pages) +{ + long rc; + void *kaddr; + + rc =3D dax_direct_access(dax_dev, pgoff, nr_pages, DAX_ACCESS, + &kaddr, NULL); + if (rc < 0) + return rc; + memset(kaddr, 0, nr_pages << PAGE_SHIFT); + return 0; +} + +static long __brd_direct_access(struct brd_device *brd, pgoff_t pgoff, + long nr_pages, void **kaddr, pfn_t *pfn) +{ + struct page *page; + sector_t sector =3D (sector_t) pgoff << PAGE_SECTORS_SHIFT; + int ret; + + if (!brd) + return -ENODEV; + + ret =3D brd_insert_page(brd, sector, GFP_NOWAIT); + if (ret) + return ret; + + page =3D brd_lookup_page(brd, sector); + if (!page) + return -ENOSPC; + + *kaddr =3D page_address(page); + if (pfn) + *pfn =3D page_to_pfn_t(page); + + return 1; +} + +static long brd_dax_direct_access(struct dax_device *dax_dev, + pgoff_t pgoff, long nr_pages, enum dax_access_mode mode, + void **kaddr, pfn_t *pfn) +{ + struct brd_device *brd =3D dax_get_private(dax_dev); + + return __brd_direct_access(brd, pgoff, nr_pages, kaddr, pfn); +} --=20 2.25.1 From nobody Wed Dec 17 08:03:18 2025 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on 
From: madvenka@linux.microsoft.com
To: gregkh@linuxfoundation.org, pbonzini@redhat.com, rppt@kernel.org, jgowans@amazon.com, graf@amazon.de, arnd@arndb.de, keescook@chromium.org, stanislav.kinsburskii@gmail.com, anthony.yznaga@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, madvenka@linux.microsoft.com, jamorris@linux.microsoft.com
Subject: [RFC PATCH v1 10/10] mm/prmem: Implement dynamic expansion of prmem.
Date: Mon, 16 Oct 2023 18:32:15 -0500
Message-Id: <20231016233215.13090-11-madvenka@linux.microsoft.com>
In-Reply-To: <20231016233215.13090-1-madvenka@linux.microsoft.com>
References: <20231016233215.13090-1-madvenka@linux.microsoft.com>

From: "Madhavan T. Venkataraman"

For some use cases, it is hard to predict how much memory is actually
needed to store persistent data; it depends on the workload. Without
dynamic expansion, memory for persistent data would have to be
overcommitted. Instead, allow prmem memory to expand dynamically.

Implement dynamic expansion of prmem. When the allocator runs out of
memory, it calls alloc_pages() to allocate a MAX_ORDER page, creates a
region for that memory and adds the region to the list of regions. The
allocator can then allocate from that region.

To allow this, extend the command line parameter:

	prmem=size[KMG][,max_size[KMG]]

"size" is allocated up front as before. Between size and max_size,
prmem is expanded dynamically as described above. If max_size is
omitted, no dynamic expansion happens.

Choosing a MAX_ORDER page means that no fragmentation is created for
transparent huge pages and kmem slabs. Fragmentation may, however, be
created for 1GB pages. This is not a problem for 1GB pages that are
reserved up front, but it could be a problem for 1GB pages that are
allocated dynamically at run time.

Signed-off-by: Madhavan T. Venkataraman
---
 include/linux/prmem.h          |  8 +++++++
 kernel/prmem/prmem_allocator.c | 38 ++++++++++++++++++++++++++++++++++
 kernel/prmem/prmem_init.c      |  1 +
 kernel/prmem/prmem_misc.c      |  3 ++-
 kernel/prmem/prmem_parse.c     | 20 +++++++++++++++++-
 kernel/prmem/prmem_region.c    |  1 +
 kernel/prmem/prmem_reserve.c   |  1 +
 7 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index c7034690f7cb..bb552946cb5b 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -83,6 +83,9 @@ struct prmem_instance {
  *	metadata	Physical address of the metadata page.
  *	size		Size of initial memory allocated to prmem.
  *
+ *	cur_size	Current amount of memory allocated to prmem.
+ *	max_size	Maximum amount of memory that can be allocated to prmem.
+ *
  *	regions		List of memory regions.
  *
  *	instances	Persistent instances.
@@ -95,6 +98,10 @@ struct prmem {
 	unsigned long		metadata;
 	size_t			size;
 
+	/* Dynamic expansion. */
+	size_t			cur_size;
+	size_t			max_size;
+
 	/* Persistent Regions. */
 	struct list_head	regions;
 
@@ -109,6 +116,7 @@ extern struct prmem *prmem;
 extern unsigned long prmem_metadata;
 extern unsigned long prmem_pa;
 extern size_t prmem_size;
+extern size_t prmem_max_size;
 extern bool prmem_inited;
 extern spinlock_t prmem_lock;
 
diff --git a/kernel/prmem/prmem_allocator.c b/kernel/prmem/prmem_allocator.c
index f12975bc6777..1cb3eae8a3e7 100644
--- a/kernel/prmem/prmem_allocator.c
+++ b/kernel/prmem/prmem_allocator.c
@@ -9,17 +9,55 @@
 
 /* Page Allocation functions. */
 
+static void prmem_expand(void)
+{
+	struct prmem_region *region;
+	struct page *pages;
+	unsigned int order = MAX_ORDER;
+	size_t size = (1UL << order) << PAGE_SHIFT;
+
+	if (prmem->cur_size + size > prmem->max_size)
+		return;
+
+	spin_unlock(&prmem_lock);
+	pages = alloc_pages(GFP_NOWAIT, order);
+	spin_lock(&prmem_lock);
+
+	if (!pages)
+		return;
+
+	/* cur_size may have changed. Recheck. */
+	if (prmem->cur_size + size > prmem->max_size)
+		goto free;
+
+	region = prmem_add_region(page_to_phys(pages), size);
+	if (!region)
+		goto free;
+
+	pr_warn("%s: prmem expanded by %ld\n", __func__, size);
+	return;
+free:
+	__free_pages(pages, order);
+}
+
 void *prmem_alloc_pages_locked(unsigned int order)
 {
 	struct prmem_region *region;
 	void *va;
 	size_t size = (1UL << order) << PAGE_SHIFT;
+	bool expand = true;
 
+retry:
 	list_for_each_entry(region, &prmem->regions, node) {
 		va = prmem_alloc_pool(region, size, size);
 		if (va)
 			return va;
 	}
+	if (expand) {
+		expand = false;
+		prmem_expand();
+		goto retry;
+	}
 	return NULL;
 }
 
diff --git a/kernel/prmem/prmem_init.c b/kernel/prmem/prmem_init.c
index 166fca688ab3..f4814cc88508 100644
--- a/kernel/prmem/prmem_init.c
+++ b/kernel/prmem/prmem_init.c
@@ -20,6 +20,7 @@ void __init prmem_init(void)
 		/* Cold boot. */
 		prmem->metadata = prmem_metadata;
 		prmem->size = prmem_size;
+		prmem->max_size = prmem_max_size;
 		INIT_LIST_HEAD(&prmem->regions);
 		INIT_LIST_HEAD(&prmem->instances);
 
diff --git a/kernel/prmem/prmem_misc.c b/kernel/prmem/prmem_misc.c
index 49b6a7232c1a..3100662d2cbe 100644
--- a/kernel/prmem/prmem_misc.c
+++ b/kernel/prmem/prmem_misc.c
@@ -68,7 +68,8 @@ bool __init prmem_validate(void)
 	unsigned long checksum;
 
 	/* Sanity check the boot parameter. */
-	if (prmem_metadata != prmem->metadata || prmem_size != prmem->size) {
+	if (prmem_metadata != prmem->metadata || prmem_size != prmem->size ||
+	    prmem_max_size != prmem->max_size) {
 		pr_warn("%s: Boot parameter mismatch\n", __func__);
 		return false;
 	}
diff --git a/kernel/prmem/prmem_parse.c b/kernel/prmem/prmem_parse.c
index 6c1a23c6b84e..3a57b37fa191 100644
--- a/kernel/prmem/prmem_parse.c
+++ b/kernel/prmem/prmem_parse.c
@@ -8,9 +8,11 @@
 #include
 
 /*
- * Syntax: prmem=size[KMG]
+ * Syntax: prmem=size[KMG][,max_size[KMG]]
  *
  * Specifies the size of the initial memory to be allocated to prmem.
+ * Optionally, specifies the maximum amount of memory to be allocated to
+ * prmem. prmem will expand dynamically between size and max_size.
  */
 static int __init prmem_size_parse(char *cmdline)
 {
@@ -28,6 +30,22 @@ static int __init prmem_size_parse(char *cmdline)
 	}
 
 	prmem_size = size;
+	prmem_max_size = size;
+
+	cur = tmp;
+	if (*cur++ == ',') {
+		/* Get max size. */
+		size = memparse(cur, &tmp);
+		if (cur == tmp || !size || size & (PAGE_SIZE - 1) ||
+		    size <= prmem_size) {
+			prmem_size = 0;
+			prmem_max_size = 0;
+			pr_warn("%s: Incorrect max size %lx\n", __func__, size);
+			return -EINVAL;
+		}
+		prmem_max_size = size;
+	}
+
 	return 0;
 }
 early_param("prmem", prmem_size_parse);
diff --git a/kernel/prmem/prmem_region.c b/kernel/prmem/prmem_region.c
index 6dc88c74d9c8..390329a34b74 100644
--- a/kernel/prmem/prmem_region.c
+++ b/kernel/prmem/prmem_region.c
@@ -82,5 +82,6 @@ struct prmem_region *prmem_add_region(unsigned long pa, size_t size)
 		return NULL;
 
 	list_add_tail(&region->node, &prmem->regions);
+	prmem->cur_size += size;
 	return region;
 }
diff --git a/kernel/prmem/prmem_reserve.c b/kernel/prmem/prmem_reserve.c
index 8000fff05402..c5ae5d7d8f0a 100644
--- a/kernel/prmem/prmem_reserve.c
+++ b/kernel/prmem/prmem_reserve.c
@@ -11,6 +11,7 @@ struct prmem *prmem;
 unsigned long prmem_metadata;
 unsigned long prmem_pa;
 unsigned long prmem_size;
+unsigned long prmem_max_size;
 
 void __init prmem_reserve_early(void)
 {
-- 
2.25.1
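The prmem=size[KMG][,max_size[KMG]] validation in prmem_size_parse() can be exercised outside the kernel. The sketch below is a minimal userspace re-creation, not the kernel code itself: memparse_sketch() is a hypothetical stand-in for the kernel's memparse() that only handles the K/M/G suffixes used here, and prmem_parse_sketch() mirrors the patch's checks that both sizes are non-zero and page aligned, and that max_size, when present, is strictly greater than size.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/* Parse a number with an optional K/M/G suffix; advance *retptr past it.
 * Hypothetical stand-in for the kernel's memparse(). */
unsigned long memparse_sketch(const char *ptr, char **retptr)
{
	unsigned long val = strtoul(ptr, retptr, 0);

	switch (**retptr) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		(*retptr)++;
		break;
	}
	return val;
}

/* Mirror the patch's validation of prmem=size[,max_size].
 * Returns 0 and fills size/max_size on success, -1 on a bad parameter. */
int prmem_parse_sketch(const char *cmdline, unsigned long *size,
		       unsigned long *max_size)
{
	char *tmp;
	unsigned long sz = memparse_sketch(cmdline, &tmp);

	/* Size must be present, non-zero and page aligned. */
	if (tmp == cmdline || !sz || (sz & (SKETCH_PAGE_SIZE - 1)))
		return -1;
	*size = *max_size = sz;	/* no ",max_size" means no expansion */

	if (*tmp == ',') {
		const char *cur = tmp + 1;

		/* max_size must also be non-zero, page aligned, and
		 * strictly greater than size. */
		sz = memparse_sketch(cur, &tmp);
		if (tmp == cur || !sz || (sz & (SKETCH_PAGE_SIZE - 1)) ||
		    sz <= *size)
			return -1;
		*max_size = sz;
	}
	return 0;
}
```

As in the patch, a malformed max_size invalidates the whole parameter rather than silently falling back to the fixed size.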