Sending out this RFC in part to gauge community interest.
This patchset implements preserved-over-kexec memory storage or PKRAM as a
method for saving memory pages of the currently executing kernel so that
they may be restored after kexec into a new kernel. The patches are adapted
from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
introduce the PKRAM kernel API.
One use case for PKRAM is preserving guest memory and/or auxiliary
supporting data (e.g. iommu data) across kexec to support reboot of the
host with minimal disruption to the guest. PKRAM provides a flexible way
for doing this without requiring that the amount of memory used be a
fixed size created a priori. Another use case is for databases to
preserve their
block caches in shared memory across reboot.
Changes since RFC v2
- Rebased onto 6.3
- Updated API to save/load folios rather than file pages
- Omitted previous patches for implementing and optimizing preservation
and restoration of shmem files to reduce the number of patches and
focus on core functionality.
Changes since RFC v1
- Rebased onto 5.12-rc4
- Refined the API to reduce the number of calls
and better support multithreading.
- Allow preserving byte data of arbitrary length
(was previously limited to one page).
- Build a new memblock reserved list with the
preserved ranges and then substitute it for
the existing one. (Mike Rapoport)
- Use mem_avoid_overlap() to avoid kaslr stepping
on preserved ranges. (Kees Cook)
-- Implementation details --
* To aid in quickly finding contiguous ranges of memory containing
preserved pages a pseudo physical mapping pagetable is populated
with pages as they are preserved.
* If a page to be preserved is found to be in range of memory that was
previously reserved during early boot or in range of memory where the
kernel will be loaded to on kexec, the page will be copied to a page
outside of those ranges and the new page will be preserved. A compound
page will be copied to and preserved as individual base pages.
Note that this means that a page that cannot be moved (e.g. pinned for
DMA) currently cannot safely be preserved. This could be addressed by
adding functionality to kexec to reconfigure the destination addresses
for the sections of an already-loaded kexec kernel.
* A single page is allocated for the PKRAM super block. For the next kernel
kexec boot to find preserved memory metadata, the pfn of the PKRAM super
block, which is exported via /sys/kernel/pkram, is passed in the 'pkram'
boot option.
* In the newly booted kernel, PKRAM adds all preserved pages to the memblock
reserve list during early boot so that they will not be recycled.
* Since kexec may load the new kernel code to any memory region, it could
destroy preserved memory. When the kernel selects the memory region
(kexec_file_load syscall), kexec will avoid preserved pages. When the
user selects the kexec memory region to use (kexec_load syscall), kexec
load will fail if there is conflict with preserved pages. Pages preserved
after a kexec kernel is loaded will be relocated if they conflict with
the selected memory region.
[1] https://lkml.org/lkml/2013/7/1/211
Anthony Yznaga (21):
mm: add PKRAM API stubs and Kconfig
mm: PKRAM: implement node load and save functions
mm: PKRAM: implement object load and save functions
mm: PKRAM: implement folio stream operations
mm: PKRAM: implement byte stream operations
mm: PKRAM: link nodes by pfn before reboot
mm: PKRAM: introduce super block
PKRAM: track preserved pages in a physical mapping pagetable
PKRAM: pass a list of preserved ranges to the next kernel
PKRAM: prepare for adding preserved ranges to memblock reserved
mm: PKRAM: reserve preserved memory at boot
PKRAM: free the preserved ranges list
PKRAM: prevent inadvertent use of a stale superblock
PKRAM: provide a way to ban pages from use by PKRAM
kexec: PKRAM: prevent kexec clobbering preserved pages in some cases
PKRAM: provide a way to check if a memory range has preserved pages
kexec: PKRAM: avoid clobbering already preserved pages
mm: PKRAM: allow preserved memory to be freed from userspace
PKRAM: disable feature when running the kdump kernel
x86/KASLR: PKRAM: support physical kaslr
x86/boot/compressed/64: use 1GB pages for mappings
arch/x86/boot/compressed/Makefile | 3 +
arch/x86/boot/compressed/ident_map_64.c | 9 +-
arch/x86/boot/compressed/kaslr.c | 10 +-
arch/x86/boot/compressed/misc.h | 10 +
arch/x86/boot/compressed/pkram.c | 110 ++
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/init_64.c | 3 +
include/linux/pkram.h | 116 ++
kernel/kexec.c | 9 +
kernel/kexec_core.c | 3 +
kernel/kexec_file.c | 15 +
mm/Kconfig | 9 +
mm/Makefile | 2 +
mm/pkram.c | 1753 +++++++++++++++++++++++++++++++
mm/pkram_pagetable.c | 375 +++++++
15 files changed, 2424 insertions(+), 6 deletions(-)
create mode 100644 arch/x86/boot/compressed/pkram.c
create mode 100644 include/linux/pkram.h
create mode 100644 mm/pkram.c
create mode 100644 mm/pkram_pagetable.c
--
1.9.4
On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
> Sending out this RFC in part to gauge community interest.
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API.
>
> One use case for PKRAM is preserving guest memory and/or auxiliary
> supporting data (e.g. iommu data) across kexec to support reboot of the
> host with minimal disruption to the guest.

Hi Anthony,

Thanks for re-posting this - I've been wanting to re-kindle the discussion
on preserving memory across kexec for a while now.

There are a few aspects at play in this space of memory management
designed specifically for the virtualisation and live update (kexec)
use-case which I think we should consider:

1. Preserving userspace-accessible memory across kexec: this is what pkram
addresses.

2. Preserving kernel state: This would include memory required for kexec
with DMA passthrough devices, like IOMMU root page and page tables,
DMA-able buffers for drivers, etc. Also certain structures for improved
kernel boot performance after kexec, like a PCI device cache, clock LPJ
and possibly others, sort of what Xen breadcrumbs [0] achieves. The pkram
RFC indicates that this should be possible, though IMO this could be more
straightforward to do with a new filesystem with first-class support for
kernel persistence via something like inode types for kernel data.

3. Ensuring huge/gigantic memory allocations: to improve the TLB
performance of 2-stage translations it's beneficial to allocate guest
memory in large contiguous blocks, preferably PUD-level blocks for
multi-GiB guests. If the buddy allocator is used this may be a challenge
both from an implementation and a fragmentation perspective, and it may
be desirable to have stronger guarantees about allocation sizes.

4. Removing struct page overhead: When doing the huge/gigantic
allocations, in general it won't be necessary to have 4 KiB struct
pages. This is something that dmemfs [1, 2] tries to achieve by using a
large chunk of reserved memory and managing it with a new filesystem.

5. More "advanced" memory management APIs/ioctls for virtualisation:
Being able to support things like DMA-driven post-copy live migration,
memory oversubscription, carving out chunks of memory from a VM to
launch side-car VMs, more fine-grained control of IOMMU or MMU
permissions, etc. This may be easier to achieve with a new filesystem,
rather than coupling to tmpfs semantics and ioctls.

Overall, with the above in mind, my take is that we may have a smoother
path to implement a more comprehensive solution by going the route of a
new purpose-built filesystem on top of reserved memory. Sort of like
dmemfs with persistence and specifically support for kernel persistence.

Does my take here make sense?

I'm hoping to put together an RFC for something like the above (dmemfs
with persistence) soon, focusing on how the IOMMU persistence will work.
This is an important differentiating factor to cover in the RFC, IMO.

> PKRAM provides a flexible way
> for doing this without requiring that the amount of memory used be a
> fixed size created a priori.

AFAICT the main down-side of what I'm suggesting here compared to pkram
is that, as you say here, pkram doesn't require the up-front reserving
of memory - allocations from the global shared pool are dynamic. I'm on
the fence as to whether this is actually a desirable property though.
Carving out a large chunk of system memory as reserved memory for a
persisted filesystem (as I'm suggesting) has the advantages of removing
struct page overhead, providing better guarantees about huge/gigantic
page allocations, and probably makes the kexec restore path simpler and
more self-contained.

I think there's an argument to be made that having a clearly-defined
large range of memory which is persisted, and the rest is normal
"ephemeral" kernel memory, may be preferable.

Keen to hear your (and others') thoughts!

JG

[0] http://david.woodhou.se/live-update-handover.pdf
[1] https://lwn.net/Articles/839216/
[2] https://lkml.org/lkml/2020/12/7/342
On 5/26/23 6:57 AM, Gowans, James wrote:
> On Wed, 2023-04-26 at 17:08 -0700, Anthony Yznaga wrote:
>> Sending out this RFC in part to gauge community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxiliary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest.
> Hi Anthony,

Hi James,

Thank you for looking at this.

> Thanks for re-posting this - I've been wanting to re-kindle the discussion
> on preserving memory across kexec for a while now.
>
> There are a few aspects at play in this space of memory management
> designed specifically for the virtualisation and live update (kexec)
> use-case which I think we should consider:
>
> 1. Preserving userspace-accessible memory across kexec: this is what pkram
> addresses.
>
> 2. Preserving kernel state: This would include memory required for kexec
> with DMA passthrough devices, like IOMMU root page and page tables,
> DMA-able buffers for drivers, etc. Also certain structures for improved
> kernel boot performance after kexec, like a PCI device cache, clock LPJ
> and possibly others, sort of what Xen breadcrumbs [0] achieves. The pkram
> RFC indicates that this should be possible, though IMO this could be more
> straightforward to do with a new filesystem with first-class support for
> kernel persistence via something like inode types for kernel data.

PKRAM as it is now can preserve kernel data by streaming bytes to a
PKRAM object, but the data must be location independent since the data
is stored in allocated 4k pages rather than being preserved in place.
This really isn't usable for things like page tables or memory expected
not to move because of DMA, etc.

One issue with preserving non-relocatable, regular memory that is not
partitioned from the kernel is the risk that a kexec kernel has already
been loaded and that its pre-computed destination, where it will be
copied to on reboot, will overwrite the preserved memory. Either some
way of re-processing the kexec kernel to load somewhere else would be
needed, or kexec load would need to be restricted from loading where
memory might be preserved. Plusses for a partitioning approach.

> 3. Ensuring huge/gigantic memory allocations: to improve the TLB
> performance of 2-stage translations it's beneficial to allocate guest
> memory in large contiguous blocks, preferably PUD-level blocks for
> multi-GiB guests. If the buddy allocator is used this may be a challenge
> both from an implementation and a fragmentation perspective, and it may
> be desirable to have stronger guarantees about allocation sizes.

Agreed that guaranteeing large blocks and fragmentation are issues for
PKRAM. One possible avenue to address this could be to support
preserving hugetlb pages.

> 4. Removing struct page overhead: When doing the huge/gigantic
> allocations, in general it won't be necessary to have 4 KiB struct
> pages. This is something that dmemfs [1, 2] tries to achieve by using a
> large chunk of reserved memory and managing it with a new filesystem.

Has using DAX been considered? Not familiar with dmemfs but it sounds
functionally similar.

> 5. More "advanced" memory management APIs/ioctls for virtualisation:
> Being able to support things like DMA-driven post-copy live migration,
> memory oversubscription, carving out chunks of memory from a VM to
> launch side-car VMs, more fine-grained control of IOMMU or MMU
> permissions, etc. This may be easier to achieve with a new filesystem,
> rather than coupling to tmpfs semantics and ioctls.
>
> Overall, with the above in mind, my take is that we may have a smoother
> path to implement a more comprehensive solution by going the route of a
> new purpose-built filesystem on top of reserved memory. Sort of like
> dmemfs with persistence and specifically support for kernel persistence.
>
> Does my take here make sense?

Yes, I believe so. There are some serious issues with PKRAM to address
before it could be truly viable (fragmentation, relocation, etc), so a
memory partitioning approach might be the way to go.

> I'm hoping to put together an RFC for something like the above (dmemfs
> with persistence) soon, focusing on how the IOMMU persistence will work.
> This is an important differentiating factor to cover in the RFC, IMO.

Great! I'll keep an eye out for it.

Anthony

>> PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used be a
>> fixed size created a priori.
> AFAICT the main down-side of what I'm suggesting here compared to pkram
> is that, as you say here, pkram doesn't require the up-front reserving
> of memory - allocations from the global shared pool are dynamic. I'm on
> the fence as to whether this is actually a desirable property though.
> Carving out a large chunk of system memory as reserved memory for a
> persisted filesystem (as I'm suggesting) has the advantages of removing
> struct page overhead, providing better guarantees about huge/gigantic
> page allocations, and probably makes the kexec restore path simpler and
> more self-contained.
>
> I think there's an argument to be made that having a clearly-defined
> large range of memory which is persisted, and the rest is normal
> "ephemeral" kernel memory, may be preferable.
>
> Keen to hear your (and others') thoughts!
>
> JG
>
> [0] http://david.woodhou.se/live-update-handover.pdf
> [1] https://lwn.net/Articles/839216/
> [2] https://lkml.org/lkml/2020/12/7/342
On 04/26/23 at 05:08pm, Anthony Yznaga wrote:
> Sending out this RFC in part to gauge community interest.
> This patchset implements preserved-over-kexec memory storage or PKRAM as a
> method for saving memory pages of the currently executing kernel so that
> they may be restored after kexec into a new kernel. The patches are adapted
> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
> introduce the PKRAM kernel API.
>
> One use case for PKRAM is preserving guest memory and/or auxiliary
> supporting data (e.g. iommu data) across kexec to support reboot of the
> host with minimal disruption to the guest. PKRAM provides a flexible way
> for doing this without requiring that the amount of memory used be a
> fixed size created a priori. Another use case is for databases to
> preserve their block caches in shared memory across reboot.

If so, I have some confusion; can anyone help clarify:

1) Why was kexec reboot introduced? What do we expect kexec reboot to do?

2) If we need to keep this data and that data, can we simply not reboot?
Then the data would definitely stay in place without any concern.

3) What if systems for AI, edge computing, HPC, etc. also want to carry
various kinds of data from userspace or the kernel, system status,
registers, etc. when a kexec reboot is needed, enlightened by this
patch?

Thanks
Baoquan
On 5/31/23 7:15 PM, Baoquan He wrote:
> On 04/26/23 at 05:08pm, Anthony Yznaga wrote:
>> Sending out this RFC in part to gauge community interest.
>> This patchset implements preserved-over-kexec memory storage or PKRAM as a
>> method for saving memory pages of the currently executing kernel so that
>> they may be restored after kexec into a new kernel. The patches are adapted
>> from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They
>> introduce the PKRAM kernel API.
>>
>> One use case for PKRAM is preserving guest memory and/or auxiliary
>> supporting data (e.g. iommu data) across kexec to support reboot of the
>> host with minimal disruption to the guest. PKRAM provides a flexible way
>> for doing this without requiring that the amount of memory used be a
>> fixed size created a priori. Another use case is for databases to
>> preserve their block caches in shared memory across reboot.
> If so, I have some confusion; can anyone help clarify:
>
> 1) Why was kexec reboot introduced? What do we expect kexec reboot to do?
>
> 2) If we need to keep this data and that data, can we simply not reboot?
> Then the data would definitely stay in place without any concern.
>
> 3) What if systems for AI, edge computing, HPC, etc. also want to carry
> various kinds of data from userspace or the kernel, system status,
> registers, etc. when a kexec reboot is needed, enlightened by this
> patch?

Hi Baoquan,

Avoiding a more significant disruption from having to halt or migrate
VMs, fail over services, etc. when a reboot is necessary to pick up
security fixes is one motivation for exploring preserving memory across
the reboot.

Anthony

> Thanks
> Baoquan