From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b
EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
? set_init_arg+0x70/0x70
? load_ucode_bsp+0x13c/0x1a8
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
Modules linked in:
CR2: 00000000f75fe000
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
high memory was also clamped to the end of ZONE_HIGHMEM but after
6faea3422e3b memblock_free_all() tries to free memory above the end of
ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
the end of the memory map.
Discard the memory after max_pfn from memblock on 32-bit systems so that
core MM would be aware only of actually usable memory.
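The relationship between the memory ranges and the size of mem_map[] can
be illustrated with a small userspace model. This is only a sketch and not
kernel code: struct region, pages_beyond_mem_map() and discard_after() are
hypothetical names made up for the illustration, and mem_map[] is assumed
to cover PFNs [0, max_pfn).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12

/* Hypothetical stand-in for a memory range: [base, base + size). */
struct region {
	uint64_t base;
	uint64_t size;
};

/*
 * Count the pages in @regions whose PFN is >= @max_pfn, i.e. pages that
 * have no mem_map[] entry when mem_map[] covers PFNs [0, max_pfn).
 */
static uint64_t pages_beyond_mem_map(const struct region *regions, size_t nr,
				     uint64_t max_pfn)
{
	uint64_t count = 0;

	for (size_t i = 0; i < nr; i++) {
		uint64_t start_pfn = regions[i].base >> PAGE_SHIFT;
		uint64_t end_pfn = (regions[i].base + regions[i].size) >> PAGE_SHIFT;

		assert(start_pfn <= end_pfn);
		if (end_pfn <= max_pfn)
			continue;
		if (start_pfn < max_pfn)
			start_pfn = max_pfn;
		count += end_pfn - start_pfn;
	}
	return count;
}

/*
 * Model of "discard everything after max_pfn": truncate or empty each
 * region so that no page at or above @max_pfn remains.
 */
static void discard_after(struct region *regions, size_t nr, uint64_t max_pfn)
{
	uint64_t limit = max_pfn << PAGE_SHIFT;

	for (size_t i = 0; i < nr; i++) {
		if (regions[i].base >= limit)
			regions[i].size = 0;
		else if (regions[i].base + regions[i].size > limit)
			regions[i].size = limit - regions[i].base;
	}
}
```

With one range below 4G and one range at 4G..5G, and max_pfn at 0x100000,
the first helper reports 0x40000 pages without a mem_map[] entry; after
discard_after() it reports none.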
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/e820.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..5f673bd6c7d7 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+ /*
+ * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+ * to even less without it.
+ * Discard memory after max_pfn - the actual limit detected at runtime.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ memblock_remove(PFN_PHYS(max_pfn), -1);
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
--
2.47.2
Hi,
On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Dave Hansen reports the following crash on a 32-bit system with
> CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
>
> > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> > obviously wasn't allocated, thus the oops.
>
> BUG: unable to handle page fault for address: f75fe000
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
> Oops: Oops: 0002 [#1] SMP NOPTI
> CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> EIP: __free_pages_core+0x3c/0x74
> Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b
>
> EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> Call Trace:
> memblock_free_pages+0x11/0x2c
> memblock_free_all+0x2ce/0x3a0
> mm_core_init+0xf5/0x320
> start_kernel+0x296/0x79c
> ? set_init_arg+0x70/0x70
> ? load_ucode_bsp+0x13c/0x1a8
> i386_start_kernel+0xad/0xb0
> startup_32_smp+0x151/0x154
> Modules linked in:
> CR2: 00000000f75fe000
>
> The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
> by max_pfn.
>
> Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
> high memory was also clamped to the end of ZONE_HIGHMEM but after
> 6faea3422e3b memblock_free_all() tries to free memory above the of
> ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
> the end of the memory map.
>
> Discard the memory after max_pfn from memblock on 32-bit systems so that
> core MM would be aware only of actually usable memory.
>
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Tested-by: Arnd Bergmann <arnd@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
With this patch in pending-fixes (v6.15-rc2-434-g93ced5296772),
all my i386 test runs crash.
[ 0.020893] Kernel panic - not syncing: ioapic_setup_resources: Failed to allocate 0x0000002b bytes
[ 0.021248] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc2-00434-g93ced5296772 #1 PREEMPT(undef)
[ 0.021373] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 0.021549] Call Trace:
[ 0.021711] dump_stack_lvl+0x20/0x104
[ 0.022023] dump_stack+0x12/0x18
[ 0.022064] panic+0x2c1/0x2d8
[ 0.022116] ? vprintk_default+0x29/0x30
[ 0.022163] __memblock_alloc_or_panic+0x57/0x58
[ 0.022221] io_apic_init_mappings+0x2e/0x1a8
[ 0.022284] setup_arch+0x909/0xdac
[ 0.022338] ? vprintk_default+0x29/0x30
[ 0.022410] start_kernel+0x63/0x760
[ 0.022457] ? load_ucode_bsp+0x12c/0x198
[ 0.022507] i386_start_kernel+0x74/0x74
[ 0.022548] startup_32_smp+0x151/0x154
[ 0.023089] ---[ end Kernel panic - not syncing: ioapic_setup_resources: Failed to allocate 0x0000002b bytes ]---
Reverting this patch fixes the problem. Bisect log is attached for reference.
Guenter
---
# bad: [93ced5296772b7b704f48e4bad9fcfdf0633c780] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [8ffd015db85fea3e15a77027fda6c02ced4d2444] Linux 6.15-rc2
git bisect start 'HEAD' 'v6.15-rc2'
# good: [5d6f363fc974e32dd9930fecaae63958b68a1df4] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git
git bisect good 5d6f363fc974e32dd9930fecaae63958b68a1df4
# good: [1790b4a242fe119fead08fccc5bf923423c7449a] Merge branch 'dma-mapping-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux.git
git bisect good 1790b4a242fe119fead08fccc5bf923423c7449a
# good: [5d37ee8a1d6455968ea3134d78223090d487c7f4] Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git
git bisect good 5d37ee8a1d6455968ea3134d78223090d487c7f4
# good: [9d4de5ae5208548eb9c6a490ac454601f4fbf00b] Merge branch 'i2c/i2c-host-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux.git
git bisect good 9d4de5ae5208548eb9c6a490ac454601f4fbf00b
# bad: [f737ab93945fb8f0213e1cccc39d028eb5d880e0] Merge branch into tip/master: 'x86/urgent'
git bisect bad f737ab93945fb8f0213e1cccc39d028eb5d880e0
# good: [2e7a2843d0de7677b7bb908ca006dc435e52c416] Merge branch into tip/master: 'irq/urgent'
git bisect good 2e7a2843d0de7677b7bb908ca006dc435e52c416
# good: [d466304c4322ad391797437cd84cca7ce1660de0] x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
git bisect good d466304c4322ad391797437cd84cca7ce1660de0
# good: [39893b1e4ad7c4380abe4cfddaa58b34c4363bf4] Merge branch into tip/master: 'timers/urgent'
git bisect good 39893b1e4ad7c4380abe4cfddaa58b34c4363bf4
# bad: [1e07b9fad022e0e02215150ca1e20912e78e8ec1] x86/e820: Discard high memory that can't be addressed by 32-bit systems
git bisect bad 1e07b9fad022e0e02215150ca1e20912e78e8ec1
# first bad commit: [1e07b9fad022e0e02215150ca1e20912e78e8ec1] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Hi Mike,
On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
...
> arch/x86/kernel/e820.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 57120f0749cc..5f673bd6c7d7 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> memblock_add(entry->addr, entry->size);
> }
>
> + /*
> + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> + * to even less without it.
> + * Discard memory after max_pfn - the actual limit detected at runtime.
> + */
> + if (IS_ENABLED(CONFIG_X86_32))
> + memblock_remove(PFN_PHYS(max_pfn), -1);
> +
> /* Throw away partial pages: */
> memblock_trim_memory(PAGE_SIZE);
Our CI noticed a boot failure after this change as commit 1e07b9fad022
("x86/e820: Discard high memory that can't be addressed by 32-bit
systems") in -tip when booting i386_defconfig with a simple buildroot
initrd.
$ make -skj"$(nproc)" ARCH=i386 CROSS_COMPILE=i386-linux- mrproper defconfig bzImage
$ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/x86-rootfs.cpio.zst | zstd -d >rootfs.cpio
$ qemu-system-i386 \
-display none \
-nodefaults \
-M q35 \
-d unimp,guest_errors \
-append 'console=ttyS0 earlycon=uart8250,io,0x3f8' \
-kernel arch/x86/boot/bzImage \
-initrd rootfs.cpio \
-cpu host \
-enable-kvm \
-m 512m \
-smp 8 \
-serial mon:stdio
[ 0.000000] Linux version 6.15.0-rc1-00177-g1e07b9fad022 (nathan@ax162) (i386-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP PREEMPT_DYNAMIC Thu Apr 17 09:02:19 MST 2025
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001ffdffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000001ffe0000-0x000000001fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] earlycon: uart8250 at I/O port 0x3f8 (options '')
[ 0.000000] printk: legacy bootconsole [uart8250] enabled
[ 0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
[ 0.000000] APIC: Static calls initialized
[ 0.000000] SMBIOS 2.8 present.
[ 0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 0.000000] DMI: Memory slots populated: 1/1
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: using sched offset of 196444860 cycles
[ 0.000589] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.002401] tsc: Detected 2750.000 MHz processor
[ 0.003126] last_pfn = 0x1ffe0 max_arch_pfn = 0x100000
[ 0.003728] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
[ 0.004664] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.007149] found SMP MP-table at [mem 0x000f5480-0x000f548f]
[ 0.007802] No sub-1M memory is available for the trampoline
[ 0.008435] Failed to release memory for alloc_low_pages()
[ 0.008438] RAMDISK: [mem 0x1fa5f000-0x1ffdffff]
[ 0.009571] Kernel panic - not syncing: Cannot find place for new RAMDISK of size 5771264
[ 0.010486] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00177-g1e07b9fad022 #1 PREEMPT(undef)
[ 0.011601] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 0.012857] Call Trace:
[ 0.013135] dump_stack_lvl+0x43/0x58
[ 0.013555] dump_stack+0xd/0x10
[ 0.013919] panic+0xa5/0x221
[ 0.014252] setup_arch+0x86f/0x9f0
[ 0.014650] ? vprintk_default+0x29/0x30
[ 0.015089] start_kernel+0x4b/0x570
[ 0.015487] i386_start_kernel+0x65/0x68
[ 0.015919] startup_32_smp+0x151/0x154
[ 0.016344] ---[ end Kernel panic - not syncing: Cannot find place for new RAMDISK of size 5771264 ]---
At the parent change with the same command, the boot completes fine.
[ 0.000000] Linux version 6.15.0-rc1-00176-gd466304c4322 (nathan@ax162) (i386-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP PREEMPT_DYNAMIC Thu Apr 17 09:00:12 MST 2025
[ 0.000000] BIOS-provided physical RAM map:
...
[ 0.000000] earlycon: uart8250 at I/O port 0x3f8 (options '')
[ 0.000000] printk: legacy bootconsole [uart8250] enabled
[ 0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
[ 0.000000] APIC: Static calls initialized
[ 0.000000] SMBIOS 2.8 present.
[ 0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 0.000000] DMI: Memory slots populated: 1/1
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000001] kvm-clock: using sched offset of 429786443 cycles
[ 0.000806] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.003278] tsc: Detected 2750.000 MHz processor
[ 0.004730] last_pfn = 0x1ffe0 max_arch_pfn = 0x100000
[ 0.006220] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
[ 0.009169] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
[ 0.012840] found SMP MP-table at [mem 0x000f5480-0x000f548f]
[ 0.014310] RAMDISK: [mem 0x1fa5f000-0x1ffdffff]
[ 0.015141] ACPI: Early table checksum verification disabled
...
[ 0.046564] 511MB LOWMEM available.
[ 0.047421] mapped low ram: 0 - 1ffe0000
[ 0.048431] low ram: 0 - 1ffe0000
[ 0.049289] Zone ranges:
[ 0.049934] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.051184] Normal [mem 0x0000000001000000-0x000000001ffdffff]
[ 0.053087] Movable zone start for each node
[ 0.054409] Early memory node ranges
[ 0.055513] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.057411] node 0: [mem 0x0000000000100000-0x000000001ffdffff]
[ 0.059176] Initmem setup node 0 [mem 0x0000000000001000-0x000000001ffdffff]
...
Is this an invalid configuration or virtual setup that is being tested
here or is there something else problematic with this change?
Cheers,
Nathan
* Nathan Chancellor <nathan@kernel.org> wrote:
> Hi Mike,
>
> On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> ...
> > arch/x86/kernel/e820.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index 57120f0749cc..5f673bd6c7d7 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > memblock_add(entry->addr, entry->size);
> > }
> >
> > + /*
> > + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> > + * to even less without it.
> > + * Discard memory after max_pfn - the actual limit detected at runtime.
> > + */
> > + if (IS_ENABLED(CONFIG_X86_32))
> > + memblock_remove(PFN_PHYS(max_pfn), -1);
> > +
> > /* Throw away partial pages: */
> > memblock_trim_memory(PAGE_SIZE);
>
> Our CI noticed a boot failure after this change as commit 1e07b9fad022
> ("x86/e820: Discard high memory that can't be addressed by 32-bit
> systems") in -tip when booting i386_defconfig with a simple buildroot
> initrd.
I've zapped this commit from tip:x86/urgent for the time being:
1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
until these bugs are better understood.
Thanks,
Ingo
On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
>
> * Nathan Chancellor <nathan@kernel.org> wrote:
>
> > Hi Mike,
> >
> > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > ...
> > > arch/x86/kernel/e820.c | 8 ++++++++
> > > 1 file changed, 8 insertions(+)
> > >
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index 57120f0749cc..5f673bd6c7d7 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > memblock_add(entry->addr, entry->size);
> > > }
> > >
> > > + /*
> > > + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> > > + * to even less without it.
> > > + * Discard memory after max_pfn - the actual limit detected at runtime.
> > > + */
> > > + if (IS_ENABLED(CONFIG_X86_32))
> > > + memblock_remove(PFN_PHYS(max_pfn), -1);
> > > +
> > > /* Throw away partial pages: */
> > > memblock_trim_memory(PAGE_SIZE);
> >
> > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > systems") in -tip when booting i386_defconfig with a simple buildroot
> > initrd.
>
> I've zapped this commit from tip:x86/urgent for the time being:
>
> 1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
>
> until these bugs are better understood.
With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
overflows and we get memblock_remove(0, -1) :(
Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
under 4G and max_pfn should never overflow.
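The overflow can be reproduced in userspace. The sketch below models
PFN_PHYS() as a left shift by PAGE_SHIFT on a 32-bit and on a 64-bit
physical address type. pfn_phys_32() and pfn_phys_64() are hypothetical
helpers for the illustration, and MAX_NONPAE_PFN is assumed to be 4G
expressed in 4K pages (1 << 20).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
/* Assumption: 4 GiB expressed in 4 KiB pages. */
#define MAX_NONPAE_PFN	(1ULL << 20)

/* Model of PFN_PHYS() when phys_addr_t is 32 bits wide (no PAE). */
static uint32_t pfn_phys_32(uint64_t pfn)
{
	assert(pfn <= UINT32_MAX);	/* the PFN itself fits in 32 bits */
	/* The shift is done in 32 bits, so bits above bit 31 are lost. */
	return (uint32_t)pfn << PAGE_SHIFT;
}

/* Model of PFN_PHYS() when phys_addr_t is 64 bits wide (PAE). */
static uint64_t pfn_phys_64(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}
```

Here pfn_phys_32(MAX_NONPAE_PFN) evaluates to 0 while
pfn_phys_64(MAX_NONPAE_PFN) evaluates to 0x100000000, and a PFN below the
4G boundary such as last_pfn = 0x1ffe0 from the log above converts to
0x1ffe0000 in both.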
Another option is to skip e820 entries above 4G and not add them to
memblock in the first place, e.g.
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..2b617f36f11a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1297,6 +1297,17 @@ void __init e820__memblock_setup(void)
if (entry->type != E820_TYPE_RAM)
continue;
+#ifdef CONFIG_X86_32
+ /*
+ * Discard memory above 4GB because 32-bit systems are limited
+ * to 4GB of memory even with HIGHMEM.
+ */
+ if (entry->addr > SZ_4G)
+ continue;
+ if (entry->addr + entry->size > SZ_4G)
+ entry->size = SZ_4G - entry->addr;
+#endif
+
memblock_add(entry->addr, entry->size);
}
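The clamping logic in the hunk above can be exercised on its own with a
standalone sketch. struct range and clamp_below_4g() are hypothetical names
for the illustration, and SZ_4G is defined locally. Unlike the hunk, the
sketch uses ">=" in the first test so that a range starting exactly at 4G
is skipped rather than shortened to zero size; that is a choice of this
sketch and not part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4G	0x100000000ULL

/* Hypothetical stand-in for the addr/size pair of an e820 entry. */
struct range {
	uint64_t addr;
	uint64_t size;
};

/*
 * Clamp @r to the first 4 GiB. Returns false when the range lies entirely
 * at or above 4 GiB and should be skipped, true when a (possibly
 * shortened) range remains.
 */
static bool clamp_below_4g(struct range *r)
{
	assert(r->addr + r->size >= r->addr);	/* no 64-bit wrap-around */

	if (r->addr >= SZ_4G)
		return false;
	if (r->addr + r->size > SZ_4G)
		r->size = SZ_4G - r->addr;
	return true;
}
```

A range from 3G to 5G comes out shortened to 3G..4G, a range entirely
below 4G is left untouched, and a range starting at 4G is rejected.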
> Thanks,
>
> Ingo
--
Sincerely yours,
Mike.
* Mike Rapoport <rppt@kernel.org> wrote:
> On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> >
> > * Nathan Chancellor <nathan@kernel.org> wrote:
> >
> > > Hi Mike,
> > >
> > > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > > ...
> > > > arch/x86/kernel/e820.c | 8 ++++++++
> > > > 1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index 57120f0749cc..5f673bd6c7d7 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > > memblock_add(entry->addr, entry->size);
> > > > }
> > > >
> > > > + /*
> > > > + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> > > > + * to even less without it.
> > > > + * Discard memory after max_pfn - the actual limit detected at runtime.
> > > > + */
> > > > + if (IS_ENABLED(CONFIG_X86_32))
> > > > + memblock_remove(PFN_PHYS(max_pfn), -1);
> > > > +
> > > > /* Throw away partial pages: */
> > > > memblock_trim_memory(PAGE_SIZE);
> > >
> > > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > > systems") in -tip when booting i386_defconfig with a simple buildroot
> > > initrd.
> >
> > I've zapped this commit from tip:x86/urgent for the time being:
> >
> > 1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> >
> > until these bugs are better understood.
>
> With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
> overflows and we get memblock_remove(0, -1) :(
>
> Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
> under 4G and max_pfn should never overflow.
So why don't we use max_pfn like your -v1 fix did IIRC?
Ingo
On Fri, Apr 18, 2025 at 02:59:05PM +0200, Ingo Molnar wrote:
>
> * Mike Rapoport <rppt@kernel.org> wrote:
>
> > On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> > >
> > > * Nathan Chancellor <nathan@kernel.org> wrote:
> > >
> > > > Hi Mike,
> > > >
> > > > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > > > ...
> > > > > arch/x86/kernel/e820.c | 8 ++++++++
> > > > > 1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > > index 57120f0749cc..5f673bd6c7d7 100644
> > > > > --- a/arch/x86/kernel/e820.c
> > > > > +++ b/arch/x86/kernel/e820.c
> > > > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > > > memblock_add(entry->addr, entry->size);
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> > > > > + * to even less without it.
> > > > > + * Discard memory after max_pfn - the actual limit detected at runtime.
> > > > > + */
> > > > > + if (IS_ENABLED(CONFIG_X86_32))
> > > > > + memblock_remove(PFN_PHYS(max_pfn), -1);
> > > > > +
> > > > > /* Throw away partial pages: */
> > > > > memblock_trim_memory(PAGE_SIZE);
> > > >
> > > > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > > > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > > > systems") in -tip when booting i386_defconfig with a simple buildroot
> > > > initrd.
> > >
> > > I've zapped this commit from tip:x86/urgent for the time being:
> > >
> > > 1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> > >
> > > until these bugs are better understood.
> >
> > With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
> > overflows and we get memblock_remove(0, -1) :(
> >
> > Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
> > under 4G and max_pfn should never overflow.
>
> So why don't we use max_pfn like your -v1 fix did IIRC?
Dave didn't like max_pfn. I don't feel strongly about using max_pfn or
skipping e820 ranges above 4G and not adding them to memblock.
> Ingo
--
Sincerely yours,
Mike.
On 4/18/25 12:25, Mike Rapoport wrote:
>> So why don't we use max_pfn like your -v1 fix did IIRC?
> Dave didn't like max_pfn. I don't feel strongly about using max_pfn or
> skipping e820 ranges above 4G and not adding them to memblock.
I feel more strongly about fixing the bug than avoiding max_pfn. ;)
Going back to v1 is fine with me.
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 1e07b9fad022e0e02215150ca1e20912e78e8ec1
Gitweb: https://git.kernel.org/tip/1e07b9fad022e0e02215150ca1e20912e78e8ec1
Author: Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate: Sun, 13 Apr 2025 11:08:58 +03:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 16 Apr 2025 09:51:02 +02:00
x86/e820: Discard high memory that can't be addressed by 32-bit systems
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
...
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
The bug was introduced by this recent commit:
6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.
To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.
[ mingo: Fixed build failure. ]
Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
arch/x86/kernel/e820.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..2f38175 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+#ifdef CONFIG_X86_32
+ /*
+ * Discard memory above 4GB because 32-bit systems are limited to 4GB
+ * of memory even with HIGHMEM.
+ */
+ memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+#endif
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 3f0036c0b5f850d1200dbfa7365ed24197a0f157
Gitweb: https://git.kernel.org/tip/3f0036c0b5f850d1200dbfa7365ed24197a0f157
Author: Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate: Sun, 13 Apr 2025 11:08:58 +03:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 13 Apr 2025 11:09:39 +02:00
x86/e820: Discard high memory that can't be addressed by 32-bit systems
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
...
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
The bug was introduced by this recent commit:
6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.
To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.
Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
arch/x86/kernel/e820.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..9920122 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+ /*
+ * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+ * to even less without it.
+ * Discard memory after max_pfn - the actual limit detected at runtime.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ memblock_remove(PFN_PHYS(max_pfn), -1);
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
On 4/13/25 02:23, tip-bot2 for Mike Rapoport (Microsoft) wrote:
> + /*
> + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> + * to even less without it.
> + * Discard memory after max_pfn - the actual limit detected at runtime.
> + */
> + if (IS_ENABLED(CONFIG_X86_32))
> + memblock_remove(PFN_PHYS(max_pfn), -1);
Mike, thanks for the quick fix! I did verify that this gets my silly
test VM booting again.
The patch obviously _works_. But in the case I was hitting max_pfn was
set to MAX_NONPAE_PFN. The unfortunate part about this hunk is that it's
far away from the related warning:
> if (max_pfn > MAX_NONPAE_PFN) {
> max_pfn = MAX_NONPAE_PFN;
> printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
> }
and it's logically doing the same thing: truncating memory at
MAX_NONPAE_PFN.
How about we reuse 'MAX_NONPAE_PFN' like this:
if (IS_ENABLED(CONFIG_X86_32))
memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
Would that make the connection more obvious?
On Mon, Apr 14, 2025 at 07:19:02AM -0700, Dave Hansen wrote:
> On 4/13/25 02:23, tip-bot2 for Mike Rapoport (Microsoft) wrote:
> > + /*
> > + * 32-bit systems are limited to 4BG of memory even with HIGHMEM and
> > + * to even less without it.
> > + * Discard memory after max_pfn - the actual limit detected at runtime.
> > + */
> > + if (IS_ENABLED(CONFIG_X86_32))
> > + memblock_remove(PFN_PHYS(max_pfn), -1);
>
> Mike, thanks for the quick fix! I did verify that this gets my silly
> test VM booting again.
>
> The patch obviously _works_. But in the case I was hitting max_pfn was
> set MAX_NONPAE_PFN. The unfortunate part about this hunk is that it's
> far away from the related warning:
Yeah, my first instinct was to put memblock_remove() in the same 'if',
but there's no memblock there yet :)
> > if (max_pfn > MAX_NONPAE_PFN) {
> > max_pfn = MAX_NONPAE_PFN;
> > printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
> > }
>
> and it's logically doing the same thing: truncating memory at
> MAX_NONPAE_PFN.
>
> How about we reuse 'MAX_NONPAE_PFN' like this:
>
> if (IS_ENABLED(CONFIG_X86_32))
> memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
>
> Would that make the connection more obvious?
Yes, that's better. Here's the updated patch:
From a235764221e4a849fa274a546ff2a3d9f15da2a9 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Sun, 13 Apr 2025 10:36:17 +0300
Subject: [PATCH v2] x86/e820: discard high memory that can't be addressed by
32-bit systems
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b
EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
? set_init_arg+0x70/0x70
? load_ucode_bsp+0x13c/0x1a8
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
Modules linked in:
CR2: 00000000f75fe000
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
high memory was also clamped to the end of ZONE_HIGHMEM but after
6faea3422e3b memblock_free_all() tries to free memory above the end of
ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
the end of the memory map.
Discard the memory after MAX_NONPAE_PFN from memblock on 32-bit systems
so that core MM would be aware only of actually usable memory.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/e820.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..5e6b1034e6f1 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,13 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+ /*
+ * Discard memory above 4GB because 32-bit systems are limited to 4GB
+ * of memory even with HIGHMEM.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
--
2.47.2
--
Sincerely yours,
Mike.
On 4/15/25 00:18, Mike Rapoport wrote:
>> How about we reuse 'MAX_NONPAE_PFN' like this:
>>
>> 	if (IS_ENABLED(CONFIG_X86_32))
>> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
>>
>> Would that make the connection more obvious?
> Yes, that's better. Here's the updated patch:
Looks, great. Thanks for the update and the quick turnaround on the
first one after the bug report!
Tested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
* Dave Hansen <dave.hansen@intel.com> wrote:
> On 4/15/25 00:18, Mike Rapoport wrote:
> >> How about we reuse 'MAX_NONPAE_PFN' like this:
> >>
> >> 	if (IS_ENABLED(CONFIG_X86_32))
> >> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> >>
> >> Would that make the connection more obvious?
> > Yes, that's better. Here's the updated patch:
>
> Looks, great. Thanks for the update and the quick turnaround on the
> first one after the bug report!
>
> Tested-by: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
I've amended the fix in tip:x86/urgent accordingly and added your tags,
thanks!
	Ingo
* Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dave Hansen <dave.hansen@intel.com> wrote:
>
> > On 4/15/25 00:18, Mike Rapoport wrote:
> > >> How about we reuse 'MAX_NONPAE_PFN' like this:
> > >>
> > >> 	if (IS_ENABLED(CONFIG_X86_32))
> > >> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> > >>
> > >> Would that make the connection more obvious?
> > > Yes, that's better. Here's the updated patch:
> >
> > Looks, great. Thanks for the update and the quick turnaround on the
> > first one after the bug report!
> >
> > Tested-by: Dave Hansen <dave.hansen@intel.com>
> > Acked-by: Dave Hansen <dave.hansen@intel.com>
>
> I've amended the fix in tip:x86/urgent accordingly and added your tags,
> thanks!
So I had to apply the fix below as well, due to this build failure on
x86-defconfig:
  arch/x86/kernel/e820.c:1307:42: error: ‘MAX_NONPAE_PFN’ undeclared (first use in this function); did you mean ‘MAX_DMA_PFN’?
IS_ENABLED(CONFIG_X86_32) can only be used when the code is
syntactically correct on !CONFIG_X86_32 kernels too - which it wasn't.
So I went for the straightforward #ifdef block instead.
Thanks,
	Ingo
===========>
 arch/x86/kernel/e820.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index de6238886cb2..c984be8ee060 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,13 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
+#ifdef CONFIG_X86_32
 	/*
 	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
 	 * of memory even with HIGHMEM.
 	 */
-	if (IS_ENABLED(CONFIG_X86_32))
-		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+	memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+#endif
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
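The difference between the two guards can be reproduced outside the kernel. The following is a minimal sketch with hypothetical names: DEMO_32BIT plays the role of CONFIG_X86_32, DEMO_IS_32BIT imitates an IS_ENABLED()-style constant, and DEMO_LIMIT_PFN stands in for a constant that only the 32-bit headers declare. It is compiled here in the "64-bit" configuration, that is, with DEMO_32BIT left undefined.

```c
#include <assert.h>

/* Leave DEMO_32BIT undefined to model a 64-bit build (hypothetical name). */
/* #define DEMO_32BIT 1 */

#ifdef DEMO_32BIT
#define DEMO_LIMIT_PFN (1ULL << 20)	/* only declared on "32-bit" builds */
#define DEMO_IS_32BIT  1
#else
#define DEMO_IS_32BIT  0
#endif

unsigned long long demo_clamp_pfn(unsigned long long max_pfn)
{
	/*
	 * An "if (DEMO_IS_32BIT)" guard would not work here: the compiler
	 * still parses the body of a constant-false branch, so the
	 * undeclared DEMO_LIMIT_PFN would be a build error before any
	 * dead-code elimination takes place.
	 */
#ifdef DEMO_32BIT
	/* The preprocessor drops this block entirely on "64-bit" builds. */
	if (max_pfn > DEMO_LIMIT_PFN)
		max_pfn = DEMO_LIMIT_PFN;
#endif
	return max_pfn;
}
```

With DEMO_32BIT undefined the function returns its argument unchanged; defining it makes the clamp take effect. Either guard would compile if the constant were visible in both configurations, which is the condition the IS_ENABLED() form depends on.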
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: e71b6094c20f5dc9c43dc89af8a569ffa511d676
Gitweb: https://git.kernel.org/tip/e71b6094c20f5dc9c43dc89af8a569ffa511d676
Author: Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate: Sun, 13 Apr 2025 11:08:58 +03:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 16 Apr 2025 09:16:02 +02:00
x86/e820: Discard high memory that can't be addressed by 32-bit systems
Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
...
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
The bug was introduced by this recent commit:
6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.
To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.
Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
arch/x86/kernel/e820.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..de62388 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,13 @@ void __init e820__memblock_setup(void)
memblock_add(entry->addr, entry->size);
}
+ /*
+ * Discard memory above 4GB because 32-bit systems are limited to 4GB
+ * of memory even with HIGHMEM.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+
/* Throw away partial pages: */
memblock_trim_memory(PAGE_SIZE);
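The effect of discarding everything above a fixed physical limit can be pictured with a toy range list. The sketch below is not the kernel's memblock implementation; struct toy_region and toy_remove_above() are hypothetical names, and the model only assumes that regions are simple base/size pairs small enough that base + size does not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a memory region: [base, base + size). */
struct toy_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Keep only the parts of each region below 'limit', compacting the array
 * in place. Returns the number of regions that remain.
 */
size_t toy_remove_above(struct toy_region *r, size_t n, uint64_t limit)
{
	size_t out = 0;

	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t end = r[i].base + r[i].size;

		if (r[i].base >= limit)
			continue;	/* entirely above the limit: drop */
		if (end > limit)
			end = limit;	/* straddles the limit: truncate */
		r[out].base = r[i].base;
		r[out].size = end - r[i].base;
		out++;
	}
	return out;
}
```

For a map with one region below 4GB, one that straddles 4GB and one entirely above it, calling toy_remove_above() with a limit of 0x100000000 keeps the first region, shortens the second so that it ends at the limit, and drops the third.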