[RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)

madvenka@linux.microsoft.com posted 10 patches 2 years, 2 months ago
arch/x86/kernel/kexec-bzimage64.c |   5 +-
arch/x86/kernel/setup.c           |   4 +
drivers/block/Kconfig             |  11 +
drivers/block/brd.c               | 320 ++++++++++++++++++++++++++++--
include/linux/genalloc.h          |   6 +
include/linux/memblock.h          |   2 +
include/linux/prmem.h             | 158 +++++++++++++++
include/linux/radix-tree.h        |   4 +
include/linux/xarray.h            |  15 ++
kernel/Makefile                   |   1 +
kernel/prmem/Makefile             |   4 +
kernel/prmem/prmem_allocator.c    | 222 +++++++++++++++++++++
kernel/prmem/prmem_init.c         |  48 +++++
kernel/prmem/prmem_instance.c     | 139 +++++++++++++
kernel/prmem/prmem_misc.c         |  86 ++++++++
kernel/prmem/prmem_parse.c        |  80 ++++++++
kernel/prmem/prmem_region.c       |  87 ++++++++
kernel/prmem/prmem_reserve.c      | 125 ++++++++++++
kernel/reboot.c                   |   2 +
lib/genalloc.c                    |  45 +++--
lib/radix-tree.c                  |  49 ++++-
lib/xarray.c                      |  11 +-
mm/memblock.c                     |  12 ++
mm/mm_init.c                      |   2 +
24 files changed, 1400 insertions(+), 38 deletions(-)
create mode 100644 include/linux/prmem.h
create mode 100644 kernel/prmem/Makefile
create mode 100644 kernel/prmem/prmem_allocator.c
create mode 100644 kernel/prmem/prmem_init.c
create mode 100644 kernel/prmem/prmem_instance.c
create mode 100644 kernel/prmem/prmem_misc.c
create mode 100644 kernel/prmem/prmem_parse.c
create mode 100644 kernel/prmem/prmem_region.c
create mode 100644 kernel/prmem/prmem_reserve.c
[RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
Posted by madvenka@linux.microsoft.com 2 years, 2 months ago
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduction
============

This feature can be used to persist kernel and user data across kexec reboots
in RAM for various uses. E.g., persisting:

	- cached data. E.g., database caches.
	- state. E.g., KVM guest states.
	- historical information since the last cold boot. E.g., events, logs
	  and journals.
	- measurements for integrity checks on the next boot.
	- driver data.
	- IOMMU mappings.
	- MMIO config information.

This is useful on systems where there is no non-volatile storage or
non-volatile storage is too small or too slow.

The following sections describe the implementation.

I have enhanced the ram disk block device driver to provide persistent ram
disks on which any filesystem can be created. This is for persisting user data.
I have also implemented DAX support for the persistent ram disks.

I am also working on making ZRAM persistent.

I have also briefly discussed the following use cases:

	- Persisting IOMMU mappings
	- Remembering DMA pages
	- Reserving pages that encounter memory errors
	- Remembering IMA measurements for integrity checks
	- Remembering MMIO config info
	- Implementing prmemfs (special filesystem tailored for persistence)

Allocate metadata
=================

Define a metadata structure to store all persistent memory related information.
The metadata fits into one page. On a cold boot, allocate and initialize the
metadata page.

Allocate data
=============

On a cold boot, allocate some memory for storing persistent data. Call it
persistent memory. Specify the size in a command line parameter:

	prmem=size[KMG][,max_size[KMG]]

	size		Initial amount of memory allocated to prmem during boot
	max_size	Maximum amount of memory that can be allocated to prmem

When the initial memory is exhausted via allocations, expand prmem dynamically
up to max_size. Expansion is done by allocating from the buddy allocator.
Record all allocations in the metadata.
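
As an illustration only (this is not the code from the patches), the
"prmem=size[KMG][,max_size[KMG]]" syntax could be parsed roughly as below.
This is a user-space sketch; parse_size(), parse_prmem_arg() and
struct prmem_sizes are hypothetical names. It assumes that an absent
max_size is recorded as 0, meaning "no dynamic expansion".

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "123", "64K", "512M" or "2G" into bytes.
 * Similar in spirit to the kernel's memparse(), but written for user space.
 */
static uint64_t parse_size(const char *s, const char **end)
{
	char *p;
	uint64_t v = strtoull(s, &p, 0);

	switch (*p) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; p++; break;
	default: break;
	}
	*end = p;
	return v;
}

struct prmem_sizes {
	uint64_t size;		/* initial allocation */
	uint64_t max_size;	/* assumption: 0 means expansion is disabled */
};

/* Parse "size[KMG][,max_size[KMG]]". Returns 0 on success, -1 on error. */
static int parse_prmem_arg(const char *arg, struct prmem_sizes *out)
{
	const char *p;

	out->size = parse_size(arg, &p);
	if (p == arg || out->size == 0)
		return -1;

	out->max_size = 0;
	if (*p == ',') {
		const char *q = p + 1;

		out->max_size = parse_size(q, &p);
		/* A maximum below the initial size makes no sense. */
		if (p == q || out->max_size < out->size)
			return -1;
	}
	return *p == '\0' ? 0 : -1;
}
```

For example, "512M,2G" would yield an initial size of 512 MiB and a limit
of 2 GiB, while "2G,1G" would be rejected.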

Remember the metadata
=====================

On all (kexec) reboots, remember the metadata page address. This is done via
a new kernel command line parameter:

	prmem_meta=address

When a kexec image is loaded, the kexec command line is set up. Append the
above parameter to the command line automatically.

In early boot, extract the metadata page address from the command line and
reserve the metadata page. From the metadata, get the persistent memory that
has been allocated before and reserve it as well.
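
A rough user-space sketch of the two halves of this step: appending the
parameter when the kexec command line is built, and extracting it again in
early boot. The function names are hypothetical, and the sketch assumes the
address is passed as a hexadecimal number; the patches may encode it
differently.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Append " prmem_meta=0x<addr>" to a NUL-terminated command line held in a
 * buffer of 'cap' bytes. Returns 0 on success, -1 if it does not fit (the
 * command line is left unchanged in that case).
 */
static int append_prmem_meta(char *cmdline, size_t cap, uint64_t meta_addr)
{
	size_t len = strlen(cmdline);
	int n = snprintf(cmdline + len, cap - len, " prmem_meta=0x%llx",
			 (unsigned long long)meta_addr);

	if (n < 0 || (size_t)n >= cap - len) {
		cmdline[len] = '\0';	/* undo the partial append */
		return -1;
	}
	return 0;
}

/* Return the address given by prmem_meta=, or 0 if the parameter is absent. */
static uint64_t find_prmem_meta(const char *cmdline)
{
	static const char key[] = "prmem_meta=";
	const char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept the key at the start of a word. */
		if (p == cmdline || p[-1] == ' ')
			return strtoull(p + sizeof(key) - 1, NULL, 0);
		p += sizeof(key) - 1;
	}
	return 0;
}
```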

Manage persistent memory
========================

Manage persistent memory with the Gen Pool allocator (lib/genalloc.c). This
is so we don't have to implement a new allocator. Make the Gen Pool
persistent so allocations can be remembered across kexecs.

Provide functions for allocating and freeing persistent memory. These are
just wrappers around the Gen Pool functions:

	prmem_alloc_pages()	(looks like alloc_pages())
	prmem_free_pages()	(looks like __free_pages())
	prmem_alloc()		(looks like kmalloc())
	prmem_free()		(looks like kfree())
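
To illustrate why a bitmap-based pool lends itself to persistence, here is
a toy user-space allocator in the same style as the Gen Pool: one bit per
page, first-fit search. It is not lib/genalloc.c and not the prmem
wrappers; all names and the 64-page pool size are made up for the example.
The point is that the whole allocation state is the bitmap, so if the
bitmap itself lives in persistent memory, the allocations survive.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SHIFT	12	/* one bit tracks one 4 KiB page */
#define POOL_PAGES	64	/* toy pool: 64 pages, one 64-bit bitmap */

struct toy_pool {
	uint64_t bitmap;	/* bit i set => page i is allocated */
	unsigned char *base;	/* start of the managed region */
};

/* First-fit search for npages consecutive free pages. Returns index or -1. */
static int toy_find(const struct toy_pool *p, unsigned int npages)
{
	unsigned int start, i;

	if (npages == 0 || npages > POOL_PAGES)
		return -1;
	for (start = 0; start + npages <= POOL_PAGES; start++) {
		for (i = 0; i < npages; i++)
			if (p->bitmap & (1ULL << (start + i)))
				break;
		if (i == npages)
			return (int)start;
	}
	return -1;
}

static void *toy_alloc_pages(struct toy_pool *p, unsigned int npages)
{
	int start = toy_find(p, npages);
	unsigned int i;

	if (start < 0)
		return NULL;
	for (i = 0; i < npages; i++)
		p->bitmap |= 1ULL << ((unsigned int)start + i);
	return p->base + ((size_t)start << CHUNK_SHIFT);
}

static void toy_free_pages(struct toy_pool *p, void *addr, unsigned int npages)
{
	size_t start = (size_t)((unsigned char *)addr - p->base) >> CHUNK_SHIFT;
	unsigned int i;

	for (i = 0; i < npages; i++)
		p->bitmap &= ~(1ULL << (start + i));
}
```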

Create persistent instances
===========================

Consumers store information in the form of data structures. To persist a data
structure across a kexec, a consumer has to do two things:

	1. Allocate persistent memory for the data structure.

	2. Remember the data structure in a named persistent instance.

A persistent instance has the following attributes:

	Subsystem name    Name of the subsystem/module/driver that created the
			  instance. E.g., "ramdisk" for the ramdisk driver.
	Instance name     Name of the instance within the subsystem. E.g.,
			  "pram0" for a persistent ram disk.
	Data		  Pointer to instance data.
	Size		  Size of instance data.

Provide functions to create and manage persistent instances:

	prmem_get()		Get/Create a persistent instance.
	prmem_set_data()	Record the instance data pointer and size.
	prmem_get_data()	Retrieve the instance data pointer and size.
	prmem_put()		Destroy a persistent instance.
	prmem_list()		Enumerate the instances of a subsystem.
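
The instance functions can be pictured as operations on a small table keyed
by (subsystem name, instance name). The user-space sketch below is an
assumption about the shape of such a table, not the prmem API: the inst_*
names, the signatures and the fixed-size table are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_INSTANCES	8
#define NAME_LEN	16

struct instance {
	bool used;
	char subsystem[NAME_LEN];	/* e.g. "ramdisk" */
	char name[NAME_LEN];		/* e.g. "pram0" */
	void *data;			/* top level data structure */
	size_t size;
};

/* In prmem terms, this table would itself live in persistent memory. */
struct instance_table {
	struct instance slots[MAX_INSTANCES];
};

/* Look up (subsystem, name). If missing and 'create' is set, create it. */
static struct instance *inst_get(struct instance_table *t,
				 const char *subsystem, const char *name,
				 bool create)
{
	struct instance *free_slot = NULL;
	int i;

	for (i = 0; i < MAX_INSTANCES; i++) {
		struct instance *in = &t->slots[i];

		if (!in->used) {
			if (!free_slot)
				free_slot = in;
			continue;
		}
		if (!strcmp(in->subsystem, subsystem) && !strcmp(in->name, name))
			return in;
	}
	if (!create || !free_slot)
		return NULL;
	if (strlen(subsystem) >= NAME_LEN || strlen(name) >= NAME_LEN)
		return NULL;

	memset(free_slot, 0, sizeof(*free_slot));
	strcpy(free_slot->subsystem, subsystem);
	strcpy(free_slot->name, name);
	free_slot->used = true;
	return free_slot;
}

static void inst_set_data(struct instance *in, void *data, size_t size)
{
	in->data = data;
	in->size = size;
}

static void *inst_get_data(const struct instance *in, size_t *size)
{
	if (size)
		*size = in->size;
	return in->data;
}

/* Destroy the instance; its slot becomes reusable. */
static void inst_put(struct instance *in)
{
	memset(in, 0, sizeof(*in));
}
```

After a kexec, a consumer would look up its instance by name without
creating it and, if found, pick up its data pointer and size from there.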

Complex data structures
=======================

A persistent instance may have more than one data structure to remember across
kexec.

Data structures can be connected to other data structures using pointers,
arrays, linked lists, RB trees, etc. As long as each structure is placed in
persistent memory, the whole set of data structures can be remembered
across a kexec.

It is expected that a consumer will create a top level data structure for
an instance from which all other data structures belonging to the instance
can be reached. So, only the top level data structure needs to be registered
as instance data.

Linked list nodes and RB nodes are embedded in data structures. So, persisting
linked lists and RB trees is straightforward. But the XArray needs a little
more work. The XArray itself can be embedded in a persistent data structure.
But the XA nodes are currently allocated from normal memory using the kmem
allocator. Enhance XArrays to include a persistent option so that the XA nodes
can also be allocated from persistent memory. Then, the whole XArray becomes
persistent.

Since Radix Trees are implemented with XArrays, we get persistent Radix
Trees as well.

The ram disk uses an XArray. Some other use cases can also use an XArray.

Persistent virtual addresses
============================

Apart from consumer data structures, prmem metadata structures must be
persisted as well. In either case, data structures point to one another
using virtual addresses.

To keep the implementation simple, the virtual addresses used within persistent
memory must not change on a kexec. The alternative is to remap everything on
each kexec. This can be complex and cumbersome.

prmem uses direct map addresses for this reason. However, if PAGE_OFFSET is
randomized by KASLR, this will not work. Until I find an alternative for this,
prmem is currently not supported if kernel memory randomization is enabled.
prmem checks for this at runtime and disables itself. So, for now, include
"nokaslr" in the command line to allow prmem.

Note that kernel text randomization does not affect prmem. So, if an
architecture does not support randomization of PAGE_OFFSET, then there is
no need to include "nokaslr" in the command line.
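
The dependency on PAGE_OFFSET can be shown with a little arithmetic. In the
direct map, the virtual address of a physical address is simply
PAGE_OFFSET + phys, so a pointer stored in persistent memory by the old
kernel is only meaningful to the new kernel if both use the same
PAGE_OFFSET. The helper names below are hypothetical; the sketch is plain
user-space C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Direct map translation: virtual address = PAGE_OFFSET + physical address. */
static uint64_t direct_map_virt(uint64_t page_offset, uint64_t phys)
{
	return page_offset + phys;
}

/*
 * A pointer saved by the old kernel still points at the same physical page
 * in the new kernel only if both kernels map that page at the same virtual
 * address, i.e. only if PAGE_OFFSET did not change across the kexec.
 */
static bool stored_pointer_survives(uint64_t old_page_offset,
				    uint64_t new_page_offset, uint64_t phys)
{
	return direct_map_virt(old_page_offset, phys) ==
	       direct_map_virt(new_page_offset, phys);
}
```

With the non-randomized x86-64 base of 0xffff888000000000 (4-level paging)
in both kernels, the stored pointer survives; if randomization shifts the
base in the new kernel, it does not.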

Validation of metadata
======================

The metadata must be validated on a kexec before it can be used. To allow this,
compute a checksum on the metadata just before the kexec reboot and store it in
the metadata.

After kexec, in early boot, use the checksum to validate the metadata. If the
validation fails, discard the metadata. Treat it as a cold boot. That is,
allocate a new metadata page and initial region and start over.

This means that all persistent data will be lost on a validation failure.
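
A minimal sketch of the seal-then-validate idea, in user-space C. The
struct layout, the names and the checksum algorithm are all assumptions
made for the example; the patches may use a different layout and a
different checksum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much simplified metadata layout. */
struct meta {
	uint32_t checksum;		/* covers everything after this field */
	uint32_t nregions;
	uint64_t region_start[4];
	uint64_t region_size[4];
};

/* Toy multiplicative checksum over all bytes following the checksum field. */
static uint32_t meta_sum(const struct meta *m)
{
	const unsigned char *p =
		(const unsigned char *)m + offsetof(struct meta, nregions);
	size_t len = sizeof(*m) - offsetof(struct meta, nregions);
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = sum * 31 + p[i];
	return sum;
}

/* Called just before the kexec reboot. */
static void meta_seal(struct meta *m)
{
	m->checksum = meta_sum(m);
}

/* Called in early boot after kexec; false means "treat as a cold boot". */
static bool meta_valid(const struct meta *m)
{
	return m->checksum == meta_sum(m);
}
```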

Dynamic Expansion
=================

For some use cases, it may be hard to predict how much actual memory is
needed to store persistent data. This may depend on the workload. We would
either have to overcommit memory for persistent data or allow dynamic
expansion of prmem memory.

Implement dynamic expansion of prmem. When there is no free persistent memory,
call alloc_pages(MAX_ORDER) to allocate a max order page. Add it to prmem.

Choosing a max order page means that no fragmentation is created for
transparent huge pages or kmem slabs. But fragmentation may be created for
1GB pages. This is not a problem for 1GB pages that are reserved up front
during boot. This could be a problem for 1GB pages that are allocated at run
time dynamically.

As mentioned before, dynamic expansion is optional. If a max_size is not
specified in the command line, then dynamic expansion does not happen.
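
The expansion policy can be sketched as simple accounting. This user-space
example only tracks byte counts; it ignores the fact that a real allocation
must fit within one contiguous chunk. The names are hypothetical, and the
4 MiB chunk size is an assumption (a max order page with 4 KiB base pages
on a typical x86-64 configuration).

```c
#include <stdbool.h>
#include <stdint.h>

#define EXPAND_CHUNK	(4ULL << 20)	/* assumed size of one max order page */

struct prmem_acct {
	uint64_t total;		/* bytes currently owned by prmem */
	uint64_t used;		/* bytes handed out to consumers */
	uint64_t max_size;	/* 0 => dynamic expansion disabled */
};

/* Grow the pool by one chunk if the limit allows it. */
static bool acct_expand(struct prmem_acct *a)
{
	if (a->max_size == 0 || a->total + EXPAND_CHUNK > a->max_size)
		return false;
	/* Stands in for alloc_pages() plus adding the page to the pool. */
	a->total += EXPAND_CHUNK;
	return true;
}

/* Allocate 'bytes', expanding the pool on demand. */
static bool acct_alloc(struct prmem_acct *a, uint64_t bytes)
{
	while (a->total - a->used < bytes)
		if (!acct_expand(a))
			return false;
	a->used += bytes;
	return true;
}
```

With an 8 MiB initial size and a 16 MiB limit, two 6 MiB requests succeed
(the second one triggers a single expansion to 12 MiB), while a further
8 MiB request fails once the limit is reached.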

Persistent Ramdisks
===================

I have implemented one main use case in this patchset - persistent ram disks.
Any filesystem can be installed on a persistent ram disk. User data can be
persisted on the filesystem.

One problem with using a ramdisk is that the page cache will contain redundant
copies of ramdisk pages. To avoid this, I have implemented DAX support for
persistent ramdisks. To take advantage of it, install a filesystem with DAX
support on the ram disks.

Normal ramdisk devices are named "ram0", "ram1", "ram2", etc. Persistent
ramdisk devices will be named "pram0", "pram1", "pram2", etc.

For normal ramdisks, ramdisk pages are allocated using alloc_pages(). For
persistent ones, ramdisk pages are allocated using prmem_alloc_pages().

Each ram disk has a device structure (struct brd_device). This is allocated
from kmem for normal ram disks and from persistent memory for persistent ram
disks. This becomes the instance data. This structure contains an XArray
of pages allocated to the ram disk. A persistent XArray will be used.

The disk size for all normal ramdisks is specified via a module parameter
"rd_size". This forces all of the ramdisks to have the same size. For
persistent ram disks, take a different approach. Define a module parameter
called "prd_sizes" which specifies a comma-separated list of sizes. The
sizes are applied in the order in which they appear to "pram0", "pram1",
etc.

	Persistent Ram Disk Usage:

	sudo modprobe brd prd_sizes="1G,2G"

		This creates two persistent ram disks with the specified sizes.
		That is, /dev/pram0 will have a size of 1G. /dev/pram1 will
		have a size of 2G.

	sudo mkfs.ext4 /dev/pram0
	sudo mkfs.ext4 /dev/pram1

		Make filesystems on the persistent ram disks.

	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 -o dax /dev/pram1 /path/to/mountpoint1

		Mount them somewhere. Note that the -o dax option can be used
		to enable DAX.

	sudo umount /path/to/mountpoint0
	sudo umount /path/to/mountpoint1

		Unmount the filesystems.

On subsequent kexecs, you can load the module with or without specifying the
sizes. The previous devices and sizes will be remembered. After that, simply
mount the filesystems and use them.

	sudo modprobe brd
	sudo mount -t ext4 /dev/pram0 /path/to/mountpoint0
	sudo mount -t ext4 -o dax /dev/pram1 /path/to/mountpoint1

The persistent ramdisk devices are destroyed when the module is explicitly
unloaded (rmmod). But if a reboot happens without the module being unloaded,
the devices are persisted.

Other use cases
===============

I believe that it is possible to implement most use cases. I have listed some
examples below. I am not an expert in these areas. These are just suggestions.
Please let me know if there are any mistakes. Comments are most welcome.

- IOMMU mappings
	The IOVA to PFN mappings can be remembered using a persistent XArray.

- DMA pages
	Someone mentioned this use case. IIUC, DMA operations may be in flight
	when a kexec happens. Instead of waiting for the DMA operations to
	complete, drivers could remember the DMA pages in a persistent XArray.
	Then, in early boot, retrieve the XArray from prmem and reserve those
	individual pages early. Once the DMA operations complete, the pages can
	be removed from the XArray and freed into the buddy allocator.

- Pages that encounter memory errors
	These could be remembered in a persistent XArray. Then, in early boot,
	retrieve the XArray from prmem and reserve the pages so they are never
	used.

- IMA
	IMA tries to remember measurements across a kexec so that integrity
	checks can be performed on a kexec reboot. Currently, IIUC, IMA
	uses a kexec buffer to remember measurements. However, the buffer
	has to be allocated up front when the kexec image is loaded. If the gap
	between loading a kexec image and executing it is large, the
	measurements that come in during that time may not fit into the
	pre-allocated buffer.

	The solution could be to remember measurements using prmem. I am
	working on this. I will add this in a future version of this patchset.

- ZRAM
	The ZRAM block device is a candidate for persistence. This is still
	work in progress. I will add this in a future version of this patchset
	once I get it working.

- MMIO
	I am not familiar with what exactly needs to be persisted for this.
	I will state my understanding of the use case. Please correct me if
	I am wrong. IIUC, during PCI discovery, I/O devices are enumerated,
	memory space allocation is done and the I/O devices are configured.
	If the enumerated devices and their configuration can be remembered
	across kexec, then the discovery phase can be skipped after kexec.
	This will speed up PCI init.

	I believe the MMIO config info can be persisted using prmem.

- prmemfs
	It may be simpler and more efficient if we could implement a special
	filesystem that is tailored for persistence. We don't have to support
	anything that is not required for persistent data. E.g., FIFOs,
	special files, hard links, using the page cache, etc. When files are
	deleted, the memory can be freed back into prmem.

	The instance data for the filesystem would be the superblock. The
	following need to be allocated from persistent memory - the superblock,
	the inodes and the data pages. The data pages can be remembered in a
	persistent XArray.

	I am looking into this as well.

TBD
===

- Reservations.
	Consumers must be able to reserve persistent memory to guarantee
	sizes for their instances. E.g., for a persistent ramdisk.

- NUMA support.

- Memory Leak detection.
	Something similar to kmemleak may need to be implemented to detect
	memory leaks in persistent memory.

---

Madhavan T. Venkataraman (10):
  mm/prmem: Allocate memory during boot for storing persistent data
  mm/prmem: Reserve metadata and persistent regions in early boot after
    kexec
  mm/prmem: Manage persistent memory with the gen pool allocator.
  mm/prmem: Implement a page allocator for persistent memory
  mm/prmem: Implement a buffer allocator for persistent memory
  mm/prmem: Implement persistent XArray (and Radix Tree)
  mm/prmem: Implement named Persistent Instances.
  mm/prmem: Implement Persistent Ramdisk instances.
  mm/prmem: Implement DAX support for Persistent Ramdisks.
  mm/prmem: Implement dynamic expansion of prmem.


base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
-- 
2.25.1
Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
Posted by Alexander Graf 2 years, 2 months ago
Hey Madhavan!

This patch set looks super exciting - thanks a lot for putting it 
together. We've been poking at a very similar direction for a while as 
well and will discuss the fundamental problem of how to persist kernel 
metadata across kexec at LPC:

   https://lpc.events/event/17/contributions/1485/

It would be great to have you in the room as well then.

Some more comments inline.

On 17.10.23 01:32, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ============
>
> This feature can be used to persist kernel and user data across kexec reboots
> in RAM for various uses. E.g., persisting:
>
>          - cached data. E.g., database caches.
>          - state. E.g., KVM guest states.
>          - historical information since the last cold boot. E.g., events, logs
>            and journals.
>          - measurements for integrity checks on the next boot.
>          - driver data.
>          - IOMMU mappings.
>          - MMIO config information.
>
> This is useful on systems where there is no non-volatile storage or
> non-volatile storage is too small or too slow.


This is useful in more situations. We for example need it to do a kexec 
while a virtual machine is in suspended state, but has IOMMU mappings 
intact (Live Update). For that, we need to ensure DMA can still reach 
the VM memory and that everything gets reassembled identically and 
without interruptions on the receiving end.


> The following sections describe the implementation.
>
> I have enhanced the ram disk block device driver to provide persistent ram
> disks on which any filesystem can be created. This is for persisting user data.
> I have also implemented DAX support for the persistent ram disks.


This is probably the least interesting of the enablements, right? You 
can already today reserve RAM on boot as DAX block device and use it for 
that purpose.


> I am also working on making ZRAM persistent.
>
> I have also briefly discussed the following use cases:
>
>          - Persisting IOMMU mappings
>          - Remembering DMA pages
>          - Reserving pages that encounter memory errors
>          - Remembering IMA measurements for integrity checks
>          - Remembering MMIO config info
>          - Implementing prmemfs (special filesystem tailored for persistence)
>
> Allocate metadata
> =================
>
> Define a metadata structure to store all persistent memory related information.
> The metadata fits into one page. On a cold boot, allocate and initialize the
> metadata page.
>
> Allocate data
> =============
>
> On a cold boot, allocate some memory for storing persistent data. Call it
> persistent memory. Specify the size in a command line parameter:
>
>          prmem=size[KMG][,max_size[KMG]]
>
>          size            Initial amount of memory allocated to prmem during boot
>          max_size        Maximum amount of memory that can be allocated to prmem
>
> When the initial memory is exhausted via allocations, expand prmem dynamically
> up to max_size. Expansion is done by allocating from the buddy allocator.
> Record all allocations in the metadata.


I don't understand why we need a separate allocator. Why can't we just 
use normal Linux allocations and serialize their location for handover? 
We would obviously still need to find a large contiguous piece of memory 
for the target kernel to bootstrap itself into until it can read which 
pages it can and can not use, but we can do that allocation in the 
source environment using CMA, no?

What I'm trying to say is: I think we're better off separating the 
handover mechanism from the allocation mechanism. If we can implement 
handover without a new allocator, we can use it for simple things with a 
slight runtime penalty. To accelerate the handover then, we can later 
add a compacting allocator that can use the handover mechanism we 
already built to persist itself.



I have a WIP branch where I'm toying with such a handover mechanism that 
uses device tree to serialize/deserialize state. By standardizing the 
property naming, we can in the receiving kernel mark all persistent 
allocations as reserved and then slowly either free them again or mark 
them as in-use one by one:

https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42

I used ftrace as example payload to persist: With the handover mechanism 
in place, we serialize/deserialize ftrace ring buffer metadata and are 
thus able to read traces of the previous system after kexec. This way, 
you can for example profile the kexec exit path.

It's not even in RFC state yet, there are a few things where I would 
need a couple days to think hard about data structures, layouts and 
other problems :). But I believe from the patch you get the idea.

One user of this kexec handover (kho) mechanism could be a new allocator like prmem, and each 
subsystem's serialization code could choose to rely on the prmem 
subsystem to persist data instead of doing it themselves. That way you 
get a very non-intrusive enablement path for kexec handover, easily 
amendable data structures that can change compatibly over time as well 
as the ability to recreate ephemeral data structure based on persistent 
information - which will be necessary to persist VFIO containers.


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


Re: [RFC PATCH v1 00/10] mm/prmem: Implement the Persistent-Across-Kexec memory feature (prmem)
Posted by Madhavan T. Venkataraman 2 years, 2 months ago
Hey Alex,

Thanks a lot for your comments!

On 10/17/23 03:31, Alexander Graf wrote:
> Hey Madhavan!
> 
> This patch set looks super exciting - thanks a lot for putting it together. We've been poking at a very similar direction for a while as well and will discuss the fundamental problem of how to persist kernel metadata across kexec at LPC:
> 
>   https://lpc.events/event/17/contributions/1485/
> 
> It would be great to have you in the room as well then.
> 

Yes, I am planning to attend, but virtually, as I am not able to travel.

> Some more comments inline.
> 
> On 17.10.23 01:32, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> Introduction
>> ============
>>
>> This feature can be used to persist kernel and user data across kexec reboots
>> in RAM for various uses. E.g., persisting:
>>
>>          - cached data. E.g., database caches.
>>          - state. E.g., KVM guest states.
>>          - historical information since the last cold boot. E.g., events, logs
>>            and journals.
>>          - measurements for integrity checks on the next boot.
>>          - driver data.
>>          - IOMMU mappings.
>>          - MMIO config information.
>>
>> This is useful on systems where there is no non-volatile storage or
>> non-volatile storage is too small or too slow.
> 
> 
> This is useful in more situations. We for example need it to do a kexec while a virtual machine is in suspended state, but has IOMMU mappings intact (Live Update). For that, we need to ensure DMA can still reach the VM memory and that everything gets reassembled identically and without interruptions on the receiving end.
> 
> 

I see.

>> The following sections describe the implementation.
>>
>> I have enhanced the ram disk block device driver to provide persistent ram
>> disks on which any filesystem can be created. This is for persisting user data.
>> I have also implemented DAX support for the persistent ram disks.
> 
> 
> This is probably the least interesting of the enablements, right? You can already today reserve RAM on boot as DAX block device and use it for that purpose.
> 

Yes. pmem provides that functionality.

There are a few differences though. However, I don't have a good feel for how important these differences are to users. Maybe they are not very significant. E.g.,

	- pmem regions need some setup using the ndctl command.
	- IIUC, one needs to specify a starting address and a size for a pmem region. Having to specify a starting address may make it somewhat less flexible from a configuration point of view.
	- In the case of pmem, the entire range of memory is set aside. In the case of the prmem persistent ram disk, pages are allocated as needed. So, persistent memory is shared among multiple consumers more flexibly.

Also, Greg H. wanted to see a filesystem-based use case presented for persistent memory so we can see how it all comes together. I am working on prmemfs (a special FS tailored for persistence), but that will take some time. So, I wanted to present this ram disk use case as a more flexible alternative to pmem.

But you are right. They are equivalent for all practical purposes.

> 
>> I am also working on making ZRAM persistent.
>>
>> I have also briefly discussed the following use cases:
>>
>>          - Persisting IOMMU mappings
>>          - Remembering DMA pages
>>          - Reserving pages that encounter memory errors
>>          - Remembering IMA measurements for integrity checks
>>          - Remembering MMIO config info
>>          - Implementing prmemfs (special filesystem tailored for persistence)
>>
>> Allocate metadata
>> =================
>>
>> Define a metadata structure to store all persistent memory related information.
>> The metadata fits into one page. On a cold boot, allocate and initialize the
>> metadata page.
>>
>> Allocate data
>> =============
>>
>> On a cold boot, allocate some memory for storing persistent data. Call it
>> persistent memory. Specify the size in a command line parameter:
>>
>>          prmem=size[KMG][,max_size[KMG]]
>>
>>          size            Initial amount of memory allocated to prmem during boot
>>          max_size        Maximum amount of memory that can be allocated to prmem
>>
>> When the initial memory is exhausted via allocations, expand prmem dynamically
>> up to max_size. Expansion is done by allocating from the buddy allocator.
>> Record all allocations in the metadata.
> 
> 
> I don't understand why we need a separate allocator. Why can't we just use normal Linux allocations and serialize their location for handover? We would obviously still need to find a large contiguous piece of memory for the target kernel to bootstrap itself into until it can read which pages it can and can not use, but we can do that allocation in the source environment using CMA, no?
> 
> What I'm trying to say is: I think we're better off separating the handover mechanism from the allocation mechanism. If we can implement handover without a new allocator, we can use it for simple things with a slight runtime penalty. To accelerate the handover then, we can later add a compacting allocator that can use the handover mechanism we already built to persist itself.
> 
> 
> 
> I have a WIP branch where I'm toying with such a handover mechanism that uses device tree to serialize/deserialize state. By standardizing the property naming, we can in the receiving kernel mark all persistent allocations as reserved and then slowly either free them again or mark them as in-use one by one:
> 
> https://github.com/agraf/linux/commit/fd5736a21d549a9a86c178c91acb29ed7f364f42
> 
> I used ftrace as example payload to persist: With the handover mechanism in place, we serialize/deserialize ftrace ring buffer metadata and are thus able to read traces of the previous system after kexec. This way, you can for example profile the kexec exit path.
> 
> It's not even in RFC state yet, there are a few things where I would need a couple days to think hard about data structures, layouts and other problems :). But I believe from the patch you get the idea.
> 
> One such user of kho could be a new allocator like prmem and each subsystem's serialization code could choose to rely on the prmem subsystem to persist data instead of doing it themselves. That way you get a very non-intrusive enablement path for kexec handover, easily amendable data structures that can change compatibly over time as well as the ability to recreate ephemeral data structure based on persistent information - which will be necessary to persist VFIO containers.
> 

OK. I will study your changes and your comments. I will send my feedback as well.

Thanks again!

Madhavan

> 
> Alex