Hi,

this series implements a way to reserve additional crash kernel memory
using CMA.

Link to the v1 discussion:
https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
See below for the changes since v1 and how concerns from the discussion
have been addressed.

Currently, none of the memory reserved for the crash kernel is usable
by the 1st (production) kernel. It is also unmapped so that it can't be
corrupted by the fault that will eventually trigger the crash. This
makes sense for the memory actually used by the kexec-loaded crash
kernel image and initrd and the data prepared during the load
(vmcoreinfo, ...). However, the reserved space needs to be much larger
than that to provide enough run-time memory for the crash kernel and
the kdump userspace.

Estimating the amount of memory to reserve is difficult. Being too
careful makes kdump likely to end in OOM, being too generous takes even
more memory from the production system. Also, the reservation only
allows reserving a single contiguous block (or two with the "low"
suffix). I've seen systems where this fails because the physical memory
is fragmented.

By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just large enough to fit the kernel and
initrd image, minimizing the memory taken away from the production
system. Most of the run-time memory for the crash kernel will be memory
previously available to userspace in the production system. As this
memory is no longer wasted, the reservation can be done with a generous
margin, making kdump more reliable.

Kernel memory that we need to preserve for dumping is normally not
allocated from CMA, unless it is explicitly allocated as movable.
Currently this is only the case for memory ballooning and zswap. Such
movable memory will be missing from the vmcore. User data is typically
not dumped by makedumpfile. When dumping of user data is intended, this
new CMA reservation cannot be used.

There are five patches in this series:

The first adds a new ",cma" suffix to the recently introduced generic
crashkernel parsing code. parse_crashkernel() takes one more argument
to store the CMA reservation size.

The second patch implements reserve_crashkernel_cma() which performs
the reservation. If the requested size is not available in a single
range, multiple smaller ranges will be reserved.

The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.

The fourth patch adds a short delay before booting the kdump kernel,
allowing pending DMA transfers to finish.

The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available through
  /proc/vmcore by excluding them from the vmcoreinfo PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as this
series is merged.

With this series applied, specifying

	crashkernel=100M crashkernel=1G,cma

on the command line will make a standard crashkernel reservation of
100M, where kexec will load the kernel and initrd. An additional 1G
will be reserved from CMA, still usable by the production system. The
crash kernel will have 1.1G memory available. The 100M can be reliably
predicted based on the size of the kernel and initrd.

The new ",cma" suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
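To make the per-arch steps concrete, here is a minimal sketch of the
boot-time hookup, assuming the helper names from this series
(parse_crashkernel() with its new cma_size argument,
reserve_crashkernel_cma()); the argument order and the surrounding arch
code are illustrative, not verbatim from the patches:

	/*
	 * Illustrative sketch only: helper names follow this series, but
	 * the parse_crashkernel() argument order and the surrounding
	 * arch code are simplified, not copied from the patches.
	 */
	static void __init arch_reserve_crashkernel(void)
	{
		unsigned long long crash_size, crash_base, low_size = 0;
		unsigned long long cma_size = 0;	/* filled in by ",cma" */
		bool high = false;

		/* parse_crashkernel() gained one more output argument */
		if (parse_crashkernel(boot_command_line,
				      memblock_phys_mem_size(),
				      &crash_size, &crash_base,
				      &low_size, &cma_size, &high))
			return;

		/* ... existing reservation of the main crashkernel range ... */

		/* new: back the crash kernel's run-time memory with CMA */
		reserve_crashkernel_cma(cma_size);
	}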
---
Changes since v4:
- v5 is identical to v4 for all patches except patch 4/5, where v5
  incorporates feedback from David Hildenbrand

---
Changes since v3:
- updated for 6.15
- reworked the delay patch:
  - delay changed to 10 s based on David Hildenbrand's comments
  - constant changed to variable so that the delay can be easily made
    configurable in the future
- made reserve_crashkernel_cma() return early when cma_size == 0 to
  avoid printing out the 0-sized cma allocation

---
Changes since v2:
based on feedback from Baoquan He and David Hildenbrand:
- kept original formatting of suffix_tbl[]
- updated documentation to mention movable pages missing from vmcore
- fixed whitespace in documentation
- moved the call to crash_cma_clear_pending_dma() after
  machine_crash_shutdown() so that non-crash CPUs and timers are shut
  down before the delay

---
Changes since v1:

The key concern raised in the v1 discussion was that pages in the CMA
region may be pinned and used for a DMA transfer, potentially
corrupting the new kernel's memory. When the cma suffix is used, kdump
may be less reliable and the corruption hard to debug.

This v2 series addresses this concern in two ways:

1) Clearly stating the potential problem in the updated Documentation
   and setting the expectation (patch 3/5)

   Documentation now explicitly states that:
   - the risk of kdump failure is increased
   - the CMA reservation is intended for users who cannot or don't want
     to sacrifice enough memory for a standard crashkernel reservation
     and who prefer less reliable kdump to no kdump at all

   This is consistent with the documentation of the
   crash_kexec_post_notifiers option, which can also increase the risk
   of kdump failure, yet may be the only way to use kdump on some
   systems. And just like the crash_kexec_post_notifiers option, the
   cma crashkernel suffix is completely optional: the series has zero
   effect when the suffix is not used.

2) Giving DMA time to finish before booting the kdump kernel (patch 4/5)

   Pages can be pinned for long-term use using the FOLL_LONGTERM flag.
   Then they are migrated outside the CMA region. Pinning without this
   flag shows that the intent of their user is to only use them for
   short-lived DMA transfers. Delay the boot of the kdump kernel when
   the CMA reservation is used, giving potential pending DMA transfers
   time to finish.

Other minor changes since v1:
- updated for 6.14-rc2
- moved #ifdefs and #defines to header files
- added __always_unused in parse_crashkernel() to silence a false
  unused variable warning

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
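As a rough illustration of the reworked delay patch described in the
changelog above, here is a sketch of what patch 4/5 boils down to;
crash_cma_clear_pending_dma() and crashk_cma_cnt are names from the
series, while the variable and the call-site details are paraphrased
from the changelog, not verbatim from the patch:

	/* 10 s delay, kept in a variable rather than a constant so it
	 * can later be made configurable (see "Changes since v3") */
	static unsigned int cma_dma_timeout_sec = 10;

	void crash_cma_clear_pending_dma(void)
	{
		/* no ",cma" reservation: no extra delay */
		if (!crashk_cma_cnt)
			return;

		/*
		 * Called after machine_crash_shutdown(), i.e. with the
		 * other CPUs and timers already stopped, so all we can
		 * do is busy-wait for pending DMA to drain.
		 */
		mdelay(cma_dma_timeout_sec * MSEC_PER_SEC);
	}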
Hello Jiri,

On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> The fifth patch enables the functionality for x86 as a proof of
> concept. There are just three things every arch needs to do:
> - call reserve_crashkernel_cma()
> - include the CMA-reserved ranges in the physical memory map
> - exclude the CMA-reserved ranges from the memory available
>   through /proc/vmcore by excluding them from the vmcoreinfo
>   PT_LOAD ranges.

First, thank you for making this change; it’s very helpful.

I haven’t come across anything regarding arm64 support. Is this on
anyone’s to-do list?

--breno
Hi Breno,

On Wed, Aug 20, 2025 at 08:46:54AM -0700, Breno Leitao wrote:
> Hello Jiri,
>
> First, thank you for making this change; it’s very helpful.
> I haven’t come across anything regarding arm64 support. Is this on
> anyone’s to-do list?

Yes, I plan to implement this at least for ppc64, arm64 and s390x,
hopefully in time for 6.18.

Regards,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
Hello Jiri,

On Wed, Aug 20, 2025 at 06:20:13PM +0200, Jiri Bohac wrote:
> On Wed, Aug 20, 2025 at 08:46:54AM -0700, Breno Leitao wrote:
> > First, thank you for making this change; it’s very helpful.
> > I haven’t come across anything regarding arm64 support. Is this on
> > anyone’s to-do list?
>
> Yes, I plan to implement this at least for ppc64, arm64 and s390x,
> hopefully in time for 6.18.

Thanks!

I have another question. I assume it’s not possible to allocate only
the CMA crashkernel area for the kdump kernel, since we need to keep
the loaded kernel in the crashkernel area while the system is running.
Therefore, specifying crashkernel=X (without ',cma') is necessary.

At the same time, since the crashdump environment will use CMA, the
crashkernel area itself doesn’t need to be very large, as the CMA
space will be allocated later.

With that in mind, how do I find what the recommended size for the
crashkernel area is, assuming the CMA area will be more than
sufficient at runtime?

Does it need to be much higher than the size of the kdump kernel and
initrd?

Thanks
--breno
On Thu, Aug 21, 2025 at 01:35:35AM -0700, Breno Leitao wrote:
> I have another question. I assume it’s not possible to allocate only
> the CMA crashkernel area for the kdump kernel, since we need to keep
> the loaded kernel in the crashkernel area while the system is running.
> Therefore, specifying crashkernel=X (without ',cma') is necessary.

exactly

> At the same time, since the crashdump environment will use CMA, the
> crashkernel area itself doesn’t need to be very large, as the CMA
> space will be allocated later.
>
> With that in mind, how do I find what the recommended size for the
> crashkernel area is, assuming the CMA area will be more than
> sufficient at runtime?
>
> Does it need to be much higher than the size of the kdump kernel and
> initrd?

I don't have a good answer now - I have this on my to-do list and I
need to investigate this more to come up with a formula to calculate
the required size, so that I can include it in the kdump tool I
maintain for SUSE.

When testing the kernel part I used

	crashkernel=100M crashkernel=XXX,cma

on a machine that would normally require something like
crashkernel=430M.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
Hello Jiri,

On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> Currently this is only the case for memory ballooning and zswap. Such movable
> memory will be missing from the vmcore. User data is typically not dumped by
> makedumpfile.

For zswap and zsmalloc pages, I'm wondering whether these pages will
be missing from the vmcore, or if there's a possibility they might be
present but corrupted, especially since they could reside in the CMA
region, which may be overwritten by the kdump environment.

My main question is:

Do we need to explicitly teach makedumpfile to ignore the CMA area in
the vmcore, since it is already being overwritten and thus unreliable?

Or does makedumpfile already have mechanisms in place to automatically
ignore these special zswap/zsmalloc pages that may have been
overwritten if they were located in the CMA region?
On 03.10.25 17:51, Breno Leitao wrote:
> Hello Jiri,
>
> On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
>
>> Currently this is only the case for memory ballooning and zswap. Such movable
>> memory will be missing from the vmcore. User data is typically not dumped by
>> makedumpfile.
>
> For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> from the vmcore, or if there's a possibility they might be present but
> corrupted, especially since they could reside in the CMA region, which may be
> overwritten by the kdump environment.

That's no different from ordinary user pages residing in these areas,
right?

-- 
Cheers

David / dhildenb
On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
> On 03.10.25 17:51, Breno Leitao wrote:
> > Hello Jiri,
> >
> > On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> >
> > > Currently this is only the case for memory ballooning and zswap. Such movable
> > > memory will be missing from the vmcore. User data is typically not dumped by
> > > makedumpfile.
> >
> > For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> > from the vmcore, or if there's a possibility they might be present but
> > corrupted, especially since they could reside in the CMA region, which may be
> > overwritten by the kdump environment.
>
> That's no different from ordinary user pages residing in these areas,
> right?

Will zsmalloc on CMA pages be marked as "userpages"?

makedumpfile iterates over the pfns and checks a few flags before
"copying" them to disk.

In makedumpfile, userpages are basically discarded if they are
anonymous pages:

	#define isAnon(mapping, flags, _mapcount) \
		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 \
		 && !isSlab(flags, _mapcount))

https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164

called from:
https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671

For zsmalloc pages in the CMA region, the page struct (pfn) still
carries the metadata from the first kernel, but the content has been
replaced by the kdump environment (the second kernel).

So, whatever decision makedumpfile makes based on the PFN, it will
dump incorrect data, given that the page content no longer matches
the metadata.

If my understanding is valid, we don't want to dump any page backed
by those PFNs, because they will probably contain garbage.

That said, I see two options:

1) Ignore the CMA area completely in makedumpfile.
   - I don't think there is any way to find that area today. The kernel
     might need to print the CMA region somewhere (/proc/iomem?)

2) Given that most of the memory in CMA will be anonymous memory, and
   already discarded by other rules, just add an additional entry for
   zsmalloc pages.

   Talking to Kirill offline, it seems we can piggyback on the
   MovableOps page flag.
On 06.10.25 18:25, Breno Leitao wrote:
> On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
>> On 03.10.25 17:51, Breno Leitao wrote:
>>> Hello Jiri,
>>>
>>> On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
>>>
>>>> Currently this is only the case for memory ballooning and zswap. Such movable
>>>> memory will be missing from the vmcore. User data is typically not dumped by
>>>> makedumpfile.
>>>
>>> For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
>>> from the vmcore, or if there's a possibility they might be present but
>>> corrupted, especially since they could reside in the CMA region, which may be
>>> overwritten by the kdump environment.
>>
>> That's no different from ordinary user pages residing in these areas,
>> right?
>
> Will zsmalloc on CMA pages be marked as "userpages"?

No, but they should have the zsmalloc page type set.

> makedumpfile iterates over the pfns and checks a few flags before
> "copying" them to disk.
>
> In makedumpfile, userpages are basically discarded if they are
> anonymous pages:
>
> 	#define isAnon(mapping, flags, _mapcount) \
> 		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 \
> 		 && !isSlab(flags, _mapcount))
>
> https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164
>
> called from:
> https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671
>
> For zsmalloc pages in the CMA region, the page struct (pfn) still
> carries the metadata from the first kernel, but the content has been
> replaced by the kdump environment (the second kernel).

Right.

> So, whatever decision makedumpfile makes based on the PFN, it will
> dump incorrect data, given that the page content no longer matches
> the metadata.
>
> If my understanding is valid, we don't want to dump any page backed
> by those PFNs, because they will probably contain garbage.

My theory is that barely anybody will go ahead and check compressed
page content, but I agree: we should filter them out.

> That said, I see two options:
>
> 1) Ignore the CMA area completely in makedumpfile.
>    - I don't think there is any way to find that area today. The kernel
>      might need to print the CMA region somewhere (/proc/iomem?)

/proc/iomem in the new kernel should indicate the memory region as
System RAM (for the new kernel). That can just be filtered out:
dumping memory of the new kernel does not make sense in any case.

> 2) Given that most of the memory in CMA will be anonymous memory, and
>    already discarded by other rules, just add an additional entry for
>    zsmalloc pages.
>
>    Talking to Kirill offline, it seems we can piggyback on the
>    MovableOps page flag.

We should likely check the page type instead if we go down that path.

-- 
Cheers

David / dhildenb
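A hedged sketch of the page-type check David suggests, in
makedumpfile-style C. The encoding shown (typed pages keep
PAGE_TYPE_BASE in page->page_type with one bit cleared per type)
follows the kernel's page-type convention, but the concrete PG_zsmalloc
bit value is an assumption here and would really have to come from the
running kernel's headers or vmcoreinfo; isZsmalloc() is a hypothetical
helper, not existing makedumpfile code:

	/*
	 * HYPOTHETICAL sketch, not existing makedumpfile code.  Typed
	 * pages keep 0xf0000000 in page->page_type with one bit cleared
	 * per type; the PG_zsmalloc value below is an assumed
	 * placeholder and must be taken from the kernel, not hardcoded.
	 */
	#define PAGE_TYPE_BASE	0xf0000000
	#define PG_zsmalloc	0x00000100	/* assumption: verify! */

	static int
	isZsmalloc(unsigned int page_type)
	{
		return (page_type & PAGE_TYPE_BASE) == PAGE_TYPE_BASE
		    && !(page_type & PG_zsmalloc);
	}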
On 10/06/25 at 06:45pm, David Hildenbrand wrote:
> On 06.10.25 18:25, Breno Leitao wrote:
> > On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
> > > On 03.10.25 17:51, Breno Leitao wrote:
> > > > Hello Jiri,
> > > >
> > > > On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> > > >
> > > > > Currently this is only the case for memory ballooning and zswap. Such movable
> > > > > memory will be missing from the vmcore. User data is typically not dumped by
> > > > > makedumpfile.
> > > >
> > > > For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> > > > from the vmcore, or if there's a possibility they might be present but
> > > > corrupted, especially since they could reside in the CMA region, which may be
> > > > overwritten by the kdump environment.
> > >
> > > That's no different from ordinary user pages residing in these areas,
> > > right?
> >
> > Will zsmalloc on CMA pages be marked as "userpages"?
>
> No, but they should have the zsmalloc page type set.
>
> > makedumpfile iterates over the pfns and checks a few flags before
> > "copying" them to disk.
> >
> > In makedumpfile, userpages are basically discarded if they are
> > anonymous pages:
> >
> > 	#define isAnon(mapping, flags, _mapcount) \
> > 		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 \
> > 		 && !isSlab(flags, _mapcount))
> >
> > https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164
> >
> > called from:
> > https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671
> >
> > For zsmalloc pages in the CMA region, the page struct (pfn) still
> > carries the metadata from the first kernel, but the content has been
> > replaced by the kdump environment (the second kernel).
>
> Right.
>
> > So, whatever decision makedumpfile makes based on the PFN, it will
> > dump incorrect data, given that the page content no longer matches
> > the metadata.
> >
> > If my understanding is valid, we don't want to dump any page backed
> > by those PFNs, because they will probably contain garbage.
>
> My theory is that barely anybody will go ahead and check compressed
> page content, but I agree: we should filter them out.
>
> > That said, I see two options:
> >
> > 1) Ignore the CMA area completely in makedumpfile.
> >    - I don't think there is any way to find that area today. The kernel
> >      might need to print the CMA region somewhere (/proc/iomem?)
>
> /proc/iomem in the new kernel should indicate the memory region as
> System RAM (for the new kernel). That can just be filtered out:
> dumping memory of the new kernel does not make sense in any case.

Agree. And I saw Jiri has excluded the crashk_cma_ranges[] from the
dumped content via elf_header_exclude_ranges().

Have you encountered a real problem with the dumping, or are you just
worried about it?

> > 2) Given that most of the memory in CMA will be anonymous memory, and
> >    already discarded by other rules, just add an additional entry for
> >    zsmalloc pages.
> >
> >    Talking to Kirill offline, it seems we can piggyback on the
> >    MovableOps page flag.
>
> We should likely check the page type instead if we go down that path.

Talking about the pages in CMA other than the excluded
crashk_cma_ranges[], zsmalloc/zswap pages are truly anon memory and can
be discarded. I am wondering if there are any driver or kernel pages
residing in CMA that would be worth dumping out.
On Tue, Oct 07, 2025 at 11:55:36AM +0800, Baoquan He wrote:
> On 10/06/25 at 06:45pm, David Hildenbrand wrote:
>
> Have you encountered a real problem with the dumping, or are you just
> worried about it?

I haven't encountered any issues so far, and I already have a set of
machines running with this configuration.

I'm planning to roll out this feature to a larger group of servers, so
I'm currently performing due diligence.

Thanks!
--breno
On Tue, Oct 07, 2025 at 11:55:36AM +0800, Baoquan He wrote:
> And I saw Jiri has excluded the crashk_cma_ranges[] from the dumped
> content via elf_header_exclude_ranges().

Exactly, thanks for pointing this out while I was away from my e-mail.

The crashkernel CMA reservation ranges will not be seen at all by
makedumpfile.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
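For reference, the exclusion Jiri describes amounts to roughly the
following; crashk_cma_ranges[]/crashk_cma_cnt and
crash_exclude_mem_range() are real names from the kernel and this
series, while the wrapper itself is an illustrative sketch rather than
the exact elf_header_exclude_ranges() code:

	/*
	 * Illustrative sketch: drop the CMA-backed ranges from the set
	 * of PT_LOAD ranges before the vmcore ELF header is written, so
	 * the kdump kernel never exposes them via /proc/vmcore.
	 */
	static int exclude_crashk_cma_ranges(struct crash_mem *cmem)
	{
		int i, ret;

		for (i = 0; i < crashk_cma_cnt; i++) {
			ret = crash_exclude_mem_range(cmem,
					crashk_cma_ranges[i].start,
					crashk_cma_ranges[i].end);
			if (ret)
				return ret;
		}
		return 0;
	}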
Hi David,

On Tue, Oct 7, 2025 at 5:45 AM David Hildenbrand <dhildenb@redhat.com> wrote:
>
> On 06.10.25 18:25, Breno Leitao wrote:
> > On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
> >> On 03.10.25 17:51, Breno Leitao wrote:
> >>> Hello Jiri,
> >>>
> >>> On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> >>>
> >>>> Currently this is only the case for memory ballooning and zswap. Such movable
> >>>> memory will be missing from the vmcore. User data is typically not dumped by
> >>>> makedumpfile.
> >>>
> >>> For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> >>> from the vmcore, or if there's a possibility they might be present but
> >>> corrupted, especially since they could reside in the CMA region, which may be
> >>> overwritten by the kdump environment.
> >>
> >> That's no different from ordinary user pages residing in these areas,
> >> right?
> >
> > Will zsmalloc on CMA pages be marked as "userpages"?
>
> No, but they should have the zsmalloc page type set.
>
> > makedumpfile iterates over the pfns and checks a few flags before
> > "copying" them to disk.
> >
> > In makedumpfile, userpages are basically discarded if they are
> > anonymous pages:
> >
> > 	#define isAnon(mapping, flags, _mapcount) \
> > 		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 \
> > 		 && !isSlab(flags, _mapcount))
> >
> > https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164
> >
> > called from:
> > https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671
> >
> > For zsmalloc pages in the CMA region, the page struct (pfn) still
> > carries the metadata from the first kernel, but the content has been
> > replaced by the kdump environment (the second kernel).
>
> Right.
>
> > So, whatever decision makedumpfile makes based on the PFN, it will
> > dump incorrect data, given that the page content no longer matches
> > the metadata.
> >
> > If my understanding is valid, we don't want to dump any page backed
> > by those PFNs, because they will probably contain garbage.
>
> My theory is that barely anybody will go ahead and check compressed
> page content, but I agree: we should filter them out.
>
> > That said, I see two options:
> >
> > 1) Ignore the CMA area completely in makedumpfile.
> >    - I don't think there is any way to find that area today. The kernel
> >      might need to print the CMA region somewhere (/proc/iomem?)
>
> /proc/iomem in the new kernel should indicate the memory region as
> System RAM (for the new kernel). That can just be filtered out:
> dumping memory of the new kernel does not make sense in any case.
>
> > 2) Given that most of the memory in CMA will be anonymous memory, and
> >    already discarded by other rules, just add an additional entry for
> >    zsmalloc pages.
> >
> >    Talking to Kirill offline, it seems we can piggyback on the
> >    MovableOps page flag.
>
> We should likely check the page type instead if we go down that path.

If choosing a proper page type/flag is hard, maybe an ongoing new
feature for makedumpfile can help with that. In short, if we can get a
workable page flag for the CMA pages to be filtered, then we proceed
as usual. If we cannot, then we can use eppic/btf/kallsyms [1] in
makedumpfile to programmatically determine the page type and filter it
out. See the program for determining amdgpu's mm pages [2].
[1]: https://lore.kernel.org/kexec/20250610095743.18073-1-ltao@redhat.com/T/#m901bf1413b844648c86e8a84d75b66d0531b9f92
[2]: https://lore.kernel.org/kexec/20250610095743.18073-1-ltao@redhat.com/T/#m38362d258e3b0bdc14a64e54a6acd5b85810ca26

Cheers,
Tao Liu

> --
> Cheers
>
> David / dhildenb
>