Hi all,
(I've CC'ed the KMSAN and x86 EFI maintainers as an FYI; the only code change
I'm proposing is in memblock.)
I've run into a case where pages are not released from memblock to the buddy
allocator. If deferred struct page init is enabled, and memblock_free_late() is
called before page_alloc_init_late() has run, and the pages being freed are in
the deferred init range, then the pages are never released. memblock_free_late()
calls memblock_free_pages(), which only releases the pages if they are not in
the deferred range. That is correct for free pages, because those will be
initialized and released by page_alloc_init_late(), but memblock_free_late() is
dealing with reserved pages. If memblock_free_late() doesn't release those
pages, they will remain reserved forever. All reserved pages were initialized by
memblock_free_all(),
so I believe the fix is to simply have memblock_free_late() call
__free_pages_core() directly instead of memblock_free_pages().
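Concretely, the patched memblock_free_late() ends up looking roughly like this
(lightly simplified; the kmemleak call and pfn arithmetic are unchanged from
the current mm/memblock.c):

void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
{
        phys_addr_t cursor, end;

        kmemleak_free_part_phys(base, size);
        cursor = PFN_UP(base);
        end = PFN_DOWN(base + size);

        for (; cursor < end; cursor++) {
                /*
                 * Reserved pages are always initialized by the end of
                 * memblock_free_all(), so it is safe to call
                 * __free_pages_core() directly here, bypassing the
                 * deferred-init check in memblock_free_pages().
                 */
                __free_pages_core(pfn_to_page(cursor), 0);
                totalram_pages_inc();
        }
}

Note that __free_pages_core() takes just (page, order); as far as I can tell,
the pfn argument of memblock_free_pages() only exists to support the
deferred-init check that we want to skip here.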
In addition, there was a recent change (3c20650982609 "init: kmsan: call KMSAN
initialization routines") that added a call to kmsan_memblock_free_pages() in
memblock_free_pages(). It looks to me like it would also be incorrect to make
that call in the memblock_free_late() case, because the KMSAN metadata was
already initialized for all reserved pages by kmsan_init_shadow(), which runs
before memblock_free_all(). Having memblock_free_late() call __free_pages_core()
directly also fixes this issue.
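For reference, memblock_free_pages() (mm/page_alloc.c) looks roughly like this
in v6.2-rc2 (paraphrased, so treat it as a sketch; helper names may differ
slightly between versions):

void __init memblock_free_pages(struct page *page, unsigned long pfn,
                                unsigned int order)
{
        /*
         * Skip pages in the deferred init range on the assumption that
         * page_alloc_init_late() will initialize and release them.
         * Correct for free pages, but wrong for the reserved pages that
         * memblock_free_late() passes in.
         */
        if (early_page_uninitialised(pfn))
                return;

        /* Added by 3c20650982609. */
        if (!kmsan_memblock_free_pages(page, order)) {
                /* KMSAN will take care of these pages. */
                return;
        }

        __free_pages_core(page, order);
}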
I encountered this issue when I tried to switch some x86_64 VMs I was running
from BIOS boot to EFI boot. The x86 EFI code reserves all EFI boot services
ranges via memblock_reserve() (part of setup_arch()), and it frees them later
via memblock_free_late() (part of efi_enter_virtual_mode()). The EFI
implementation of the VM I was attempting this on, an Amazon EC2 t3.micro
instance, maps north of 170 MB in boot services ranges that happen to fall in
the deferred init range. I certainly noticed when that much memory went missing
on a 1 GB VM.
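In condensed form, that reserve/free pairing is (adapted from
arch/x86/platform/efi/quirks.c; the real functions have quirk handling and
special cases that I've omitted, so this is a sketch, not the exact code):

efi_memory_desc_t *md;

/* setup_arch() -> efi_reserve_boot_services() */
for_each_efi_memory_desc(md) {
        if (md->type == EFI_BOOT_SERVICES_CODE ||
            md->type == EFI_BOOT_SERVICES_DATA)
                memblock_reserve(md->phys_addr,
                                 md->num_pages << EFI_PAGE_SHIFT);
}

/* much later: efi_enter_virtual_mode() -> efi_free_boot_services() */
for_each_efi_memory_desc(md) {
        if (md->type == EFI_BOOT_SERVICES_CODE ||
            md->type == EFI_BOOT_SERVICES_DATA)
                memblock_free_late(md->phys_addr,
                                   md->num_pages << EFI_PAGE_SHIFT);
}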
I've tested the patch on EC2 instances, qemu/KVM VMs with OVMF, and some real
x86_64 EFI systems, and they all look good to me. However, the physical systems
that I have don't actually trigger this issue because they all have more than 4
GB of RAM, so their deferred init range starts above 4 GB (it's always in the
highest zone and ZONE_DMA32 ends at 4 GB) while their EFI boot services mappings
are below 4 GB.
Deferred struct page init can't be enabled on x86_32 so those systems are
unaffected. I haven't found any other code paths that would trigger this issue,
though I can't promise that there aren't any. I did run with this patch on an
arm64 VM as a sanity check, but memblock=debug didn't show any calls to
memblock_free_late() so that system was unaffected as well.
I am guessing that this change should also go to the stable kernels but it may not
apply cleanly (__free_pages_core() was __free_pages_boot_core() and
memblock_free_pages() was __free_pages_bootmem() when this issue was first
introduced). I haven't gone through that process before so please let me know if
I can help with that.
This is the end result on an EC2 t3.micro instance booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone      DMA
        spanned  4095
        present  3999
        managed  3840
Node 0, zone    DMA32
        spanned  246652
        present  245868
        managed  178867

v6.2-rc2 + patch:

# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone      DMA
        spanned  4095
        present  3999
        managed  3840
Node 0, zone    DMA32
        spanned  246652
        present  245868
        managed  222816
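(That's 222816 - 178867 = 43949 pages recovered in ZONE_DMA32, about 171.7 MB
at 4 KiB per page, which lines up with the ~170 MB of boot services ranges
mentioned above.)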
Aaron Thompson (1):
  mm: Always release pages to the buddy allocator in
    memblock_free_late().

 mm/memblock.c                     | 2 +-
 tools/testing/memblock/internal.h | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)
--
2.30.2
Changelog:

v2:
  - Add comment in memblock_free_late() (suggested by Mike Rapoport)
  - Improve commit message, including an explanation of the x86_64 EFI boot
    issue (suggested by Mike Rapoport and David Rientjes)
Aaron Thompson (1):
  mm: Always release pages to the buddy allocator in
    memblock_free_late().

 mm/memblock.c                     | 8 +++++++-
 tools/testing/memblock/internal.h | 4 ++++
 2 files changed, 11 insertions(+), 1 deletion(-)
--
2.30.2
On Wed, 4 Jan 2023, Aaron Thompson wrote:

> [...]
>
> This is the end result on an EC2 t3.micro instance booting via EFI:
>
> v6.2-rc2:
> # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
> Node 0, zone      DMA
>         spanned  4095
>         present  3999
>         managed  3840
> Node 0, zone    DMA32
>         spanned  246652
>         present  245868
>         managed  178867
>
> v6.2-rc2 + patch:
> # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
> Node 0, zone      DMA
>         spanned  4095
>         present  3999
>         managed  3840
> Node 0, zone    DMA32
>         spanned  246652
>         present  245868
>         managed  222816

The above before + after seems useful information to include in the commit
description of the change.