[PATCH 05/11] x86: remove HIGHMEM64G support

Arnd Bergmann posted 11 patches 1 year ago
There is a newer version of this series
Posted by Arnd Bergmann 1 year ago
From: Arnd Bergmann <arnd@arndb.de>

HIGHMEM64G support was added in linux-2.3.25 for the (then) high-end
Pentium Pro and Pentium III Xeon servers with more than 4GB of
addressing, at the time when NUMA and PCI-X slots started appearing.

I have found no evidence of this ever being used in regular dual-socket
servers or consumer devices. All the users seem obsolete these days,
even by i386 standards:

 - Support for NUMA servers (NUMA-Q, IBM x440, unisys) was already
   removed ten years ago.

 - 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
   ServerWorks ServerSet/GrandChampion could theoretically still work
   with 8GB, but these were exceptionally rare even 20 years ago and
   would have usually been equipped with less than the maximum amount of
   RAM.

 - Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
   could still work in a Socket 775 mainboard designed for the later
   Core 2 Duo and 8GB. Apparently most BIOSes at the time only allowed
   64-bit CPUs.

 - In the early days of x86-64 hardware, there was sometimes the need
   to run a 32-bit kernel to work around bugs in the hardware drivers,
   or in the syscall emulation for 32-bit userspace. This likely still
   works but there should never be a need for this any more.

Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
PAE mode is still required to get access to the 'NX' bit on Atom,
'Pentium M' and 'Core Duo' CPUs.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 Documentation/admin-guide/kdump/kdump.rst     |  4 --
 Documentation/arch/x86/usb-legacy-support.rst | 11 +----
 arch/x86/Kconfig                              | 46 +++----------------
 arch/x86/configs/xen.config                   |  2 -
 arch/x86/include/asm/page_32_types.h          |  4 +-
 arch/x86/mm/init_32.c                         |  9 +---
 6 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5376890adbeb..1f7f14c6e184 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -180,10 +180,6 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
 1) On i386, enable high memory support under "Processor type and
    features"::
 
-	CONFIG_HIGHMEM64G=y
-
-   or::
-
 	CONFIG_HIGHMEM4G
 
 2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel
diff --git a/Documentation/arch/x86/usb-legacy-support.rst b/Documentation/arch/x86/usb-legacy-support.rst
index e01c08b7c981..b17bf122270a 100644
--- a/Documentation/arch/x86/usb-legacy-support.rst
+++ b/Documentation/arch/x86/usb-legacy-support.rst
@@ -20,11 +20,7 @@ It has several drawbacks, though:
    features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
    not be available.
 
-2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
-   system crashes, because the SMM BIOS is not expecting to be in PAE mode.
-   The Intel E7505 is a typical machine where this happens.
-
-3) If AMD64 64-bit mode is enabled, again system crashes often happen,
+2) If AMD64 64-bit mode is enabled, again system crashes often happen,
    because the SMM BIOS isn't expecting the CPU to be in 64-bit mode.  The
    BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
    yet.
@@ -38,11 +34,6 @@ Problem 1)
   compiled-in, too.
 
 Problem 2)
-  can currently only be solved by either disabling HIGHMEM64G
-  in the kernel config or USB Legacy support in the BIOS. A BIOS update
-  could help, but so far no such update exists.
-
-Problem 3)
   is usually fixed by a BIOS update. Check the board
   manufacturers web site. If an update is not available, disable USB
   Legacy support in the BIOS. If this alone doesn't help, try also adding
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 42494739344d..b373db8a8176 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1383,15 +1383,11 @@ config X86_CPUID
 	  with major 203 and minors 0 to 31 for /dev/cpu/0/cpuid to
 	  /dev/cpu/31/cpuid.
 
-choice
-	prompt "High Memory Support"
-	default HIGHMEM4G
+config HIGHMEM4G
+	bool "High Memory Support"
 	depends on X86_32
-
-config NOHIGHMEM
-	bool "off"
 	help
-	  Linux can use up to 64 Gigabytes of physical memory on x86 systems.
+	  Linux can use up to 4 Gigabytes of physical memory on x86 systems.
 	  However, the address space of 32-bit x86 processors is only 4
 	  Gigabytes large. That means that, if you have a large amount of
 	  physical memory, not all of it can be "permanently mapped" by the
@@ -1407,38 +1403,9 @@ config NOHIGHMEM
 	  possible.
 
 	  If the machine has between 1 and 4 Gigabytes physical RAM, then
-	  answer "4GB" here.
+	  answer "Y" here.
 
-	  If more than 4 Gigabytes is used then answer "64GB" here. This
-	  selection turns Intel PAE (Physical Address Extension) mode on.
-	  PAE implements 3-level paging on IA32 processors. PAE is fully
-	  supported by Linux, PAE mode is implemented on all recent Intel
-	  processors (Pentium Pro and better). NOTE: If you say "64GB" here,
-	  then the kernel will not boot on CPUs that don't support PAE!
-
-	  The actual amount of total physical memory will either be
-	  auto detected or can be forced by using a kernel command line option
-	  such as "mem=256M". (Try "man bootparam" or see the documentation of
-	  your boot loader (lilo or loadlin) about how to pass options to the
-	  kernel at boot time.)
-
-	  If unsure, say "off".
-
-config HIGHMEM4G
-	bool "4GB"
-	help
-	  Select this if you have a 32-bit processor and between 1 and 4
-	  gigabytes of physical RAM.
-
-config HIGHMEM64G
-	bool "64GB"
-	depends on X86_HAVE_PAE
-	select X86_PAE
-	help
-	  Select this if you have a 32-bit processor and more than 4
-	  gigabytes of physical RAM.
-
-endchoice
+	  If unsure, say N.
 
 choice
 	prompt "Memory split" if EXPERT
@@ -1484,8 +1451,7 @@ config PAGE_OFFSET
 	depends on X86_32
 
 config HIGHMEM
-	def_bool y
-	depends on X86_32 && (HIGHMEM64G || HIGHMEM4G)
+	def_bool HIGHMEM4G
 
 config X86_PAE
 	bool "PAE (Physical Address Extension) Support"
diff --git a/arch/x86/configs/xen.config b/arch/x86/configs/xen.config
index 581296255b39..d5d091e03bd3 100644
--- a/arch/x86/configs/xen.config
+++ b/arch/x86/configs/xen.config
@@ -1,6 +1,4 @@
 # global x86 required specific stuff
-# On 32-bit HIGHMEM4G is not allowed
-CONFIG_HIGHMEM64G=y
 CONFIG_64BIT=y
 
 # These enable us to allow some of the
diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
index faf9cc1c14bb..25c32652f404 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -11,8 +11,8 @@
  * a virtual address space of one gigabyte, which limits the
  * amount of physical memory you can use to about 950MB.
  *
- * If you want more physical memory than this then see the CONFIG_HIGHMEM4G
- * and CONFIG_HIGHMEM64G options in the kernel configuration.
+ * If you want more physical memory than this then see the CONFIG_VMSPLIT_2G
+ * and CONFIG_HIGHMEM4G options in the kernel configuration.
  */
 #define __PAGE_OFFSET_BASE	_AC(CONFIG_PAGE_OFFSET, UL)
 #define __PAGE_OFFSET		__PAGE_OFFSET_BASE
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index ac41b1e0940d..f288aad8dc74 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -582,7 +582,7 @@ static void __init lowmem_pfn_init(void)
 	"only %luMB highmem pages available, ignoring highmem size of %luMB!\n"
 
 #define MSG_HIGHMEM_TRIMMED \
-	"Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
+	"Warning: only 4GB will be used. Support for for CONFIG_HIGHMEM64G was removed!\n"
 /*
  * We have more RAM than fits into lowmem - we try to put it into
  * highmem, also taking the highmem=x boot parameter into account:
@@ -606,18 +606,13 @@ static void __init highmem_pfn_init(void)
 #ifndef CONFIG_HIGHMEM
 	/* Maximum memory usable is what is directly addressable */
 	printk(KERN_WARNING "Warning only %ldMB will be used.\n", MAXMEM>>20);
-	if (max_pfn > MAX_NONPAE_PFN)
-		printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n");
-	else
-		printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
+	printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
 	max_pfn = MAXMEM_PFN;
 #else /* !CONFIG_HIGHMEM */
-#ifndef CONFIG_HIGHMEM64G
 	if (max_pfn > MAX_NONPAE_PFN) {
 		max_pfn = MAX_NONPAE_PFN;
 		printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
 	}
-#endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
 }
 
-- 
2.39.5
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Dave Hansen 8 months, 1 week ago
Has anyone run into any problems on 6.15-rc1 with this stuff?

0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
wasn't allocated, thus the oops. Looks like the memblock for the >4GB
memory didn't get removed although the pgdats seem correct.

I'll dig into it some more. Just wanted to make sure there wasn't a fix
out there already.

The way I'm triggering this is booting qemu with a 32-bit PAE kernel,
and "-m 4096" (or more).

> [    0.003806] Warning: only 4GB will be used. Support for for CONFIG_HIGHMEM64G was removed!
...
> [    0.561310] BUG: unable to handle page fault for address: f75fe000
> [    0.562226] #PF: supervisor write access in kernel mode
> [    0.562947] #PF: error_code(0x0002) - not-present page
> [    0.563653] *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000 
> [    0.564728] Oops: Oops: 0002 [#1] SMP NOPTI
> [    0.565315] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef) 
> [    0.567428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [    0.568777] EIP: __free_pages_core+0x3c/0x74
> [    0.569378] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> [    0.571943] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> [    0.572806] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> [    0.573776] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> [    0.574606] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> [    0.575464] Call Trace:
> [    0.575816]  memblock_free_pages+0x11/0x2c
> [    0.576392]  memblock_free_all+0x2ce/0x3a0
> [    0.576955]  mm_core_init+0xf5/0x320
> [    0.577423]  start_kernel+0x296/0x79c
> [    0.577950]  ? set_init_arg+0x70/0x70
> [    0.578478]  ? load_ucode_bsp+0x13c/0x1a8
> [    0.579059]  i386_start_kernel+0xad/0xb0
> [    0.579614]  startup_32_smp+0x151/0x154
> [    0.580100] Modules linked in:
> [    0.580358] CR2: 00000000f75fe000
> [    0.580630] ---[ end trace 0000000000000000 ]---
> [    0.581111] EIP: __free_pages_core+0x3c/0x74
> [    0.581455] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> [    0.584767] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> [    0.585651] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> [    0.586530] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> [    0.587480] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> [    0.588344] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.589435] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Arnd Bergmann 8 months, 1 week ago
On Sat, Apr 12, 2025, at 01:44, Dave Hansen wrote:
> Has anyone run into any problems on 6.15-rc1 with this stuff?
>
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
> wasn't allocated, thus the oops. Looks like the memblock for the >4GB
> memory didn't get removed although the pgdats seem correct.
>
> I'll dig into it some more. Just wanted to make sure there wasn't a fix
> out there already.
>
> The way I'm triggering this is booting qemu with a 32-bit PAE kernel,
> and "-m 4096" (or more).

I have reproduced the bug now and found that it is not caused by
my series. Bisection points to Mike Rapoport's highmem series,
specifically 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing").

There was a related bug caused by an earlier version of my series,
in which I also removed CONFIG_PHYS_ADDR_T_64BIT:
https://lore.kernel.org/all/202412201005.77fb063-lkp@intel.com/

    Arnd
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Mike Rapoport 8 months, 1 week ago
On Fri, Apr 11, 2025 at 04:44:13PM -0700, Dave Hansen wrote:
> Has anyone run into any problems on 6.15-rc1 with this stuff?
> 
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
> wasn't allocated, thus the oops. Looks like the memblock for the >4GB
> memory didn't get removed although the pgdats seem correct.

That's apparently because of 6faea3422e3b ("arch, mm: streamline HIGHMEM
freeing").
Freeing of high memory used to be clamped to the end of ZONE_HIGHMEM, which
is 4G. After 6faea3422e3b there's no more clamping, so memblock_free_all()
tries to free memory >4G as well.
 
> I'll dig into it some more. Just wanted to make sure there wasn't a fix
> out there already.

This should fix it.

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..4b24c0ccade4 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,8 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 
> The way I'm triggering this is booting qemu with a 32-bit PAE kernel,
> and "-m 4096" (or more).
> 
> > [    0.003806] Warning: only 4GB will be used. Support for for CONFIG_HIGHMEM64G was removed!
> ...
> > [    0.561310] BUG: unable to handle page fault for address: f75fe000
> > [    0.562226] #PF: supervisor write access in kernel mode
> > [    0.562947] #PF: error_code(0x0002) - not-present page
> > [    0.563653] *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000 
> > [    0.564728] Oops: Oops: 0002 [#1] SMP NOPTI
> > [    0.565315] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef) 
> > [    0.567428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > [    0.568777] EIP: __free_pages_core+0x3c/0x74
> > [    0.569378] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> > [    0.571943] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> > [    0.572806] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> > [    0.573776] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> > [    0.574606] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> > [    0.575464] Call Trace:
> > [    0.575816]  memblock_free_pages+0x11/0x2c
> > [    0.576392]  memblock_free_all+0x2ce/0x3a0
> > [    0.576955]  mm_core_init+0xf5/0x320
> > [    0.577423]  start_kernel+0x296/0x79c
> > [    0.577950]  ? set_init_arg+0x70/0x70
> > [    0.578478]  ? load_ucode_bsp+0x13c/0x1a8
> > [    0.579059]  i386_start_kernel+0xad/0xb0
> > [    0.579614]  startup_32_smp+0x151/0x154
> > [    0.580100] Modules linked in:
> > [    0.580358] CR2: 00000000f75fe000
> > [    0.580630] ---[ end trace 0000000000000000 ]---
> > [    0.581111] EIP: __free_pages_core+0x3c/0x74
> > [    0.581455] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> > [    0.584767] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> > [    0.585651] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> > [    0.586530] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> > [    0.587480] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> > [    0.588344] Kernel panic - not syncing: Attempted to kill the idle task!
> > [    0.589435] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> 

-- 
Sincerely yours,
Mike.
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Ingo Molnar 8 months, 1 week ago
* Mike Rapoport <rppt@kernel.org> wrote:

> On Fri, Apr 11, 2025 at 04:44:13PM -0700, Dave Hansen wrote:
> > Has anyone run into any problems on 6.15-rc1 with this stuff?
> > 
> > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
> > wasn't allocated, thus the oops. Looks like the memblock for the >4GB
> > memory didn't get removed although the pgdats seem correct.
> 
> That's apparently because of 6faea3422e3b ("arch, mm: streamline HIGHMEM
> freeing"). 
> Freeing of high memory was clamped to the end of ZONE_HIGHMEM which is 4G
> and after 6faea3422e3b there's no more clamping, so memblock_free_all()
> tries to free memory >4G as well.
>  
> > I'll dig into it some more. Just wanted to make sure there wasn't a fix
> > out there already.
> 
> This should fix it.
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 57120f0749cc..4b24c0ccade4 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1300,6 +1300,8 @@ void __init e820__memblock_setup(void)
>  		memblock_add(entry->addr, entry->size);
>  	}
>  
> +	memblock_remove(PFN_PHYS(max_pfn), -1);
> +
>  	/* Throw away partial pages: */
>  	memblock_trim_memory(PAGE_SIZE);

Mind sending a full patch with changelog, SOB, Arnd's Tested-by, Dave's 
Reported-by, etc.?

Thanks,

	Ingo
[PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Mike Rapoport 8 months, 1 week ago
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b

  EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
  ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
  CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   ? set_init_arg+0x70/0x70
   ? load_ucode_bsp+0x13c/0x1a8
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154
  Modules linked in:
  CR2: 00000000f75fe000

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
high memory was also clamped to the end of ZONE_HIGHMEM but after
6faea3422e3b memblock_free_all() tries to free memory above the end of
ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
the end of the memory map.

Discard the memory after max_pfn from memblock on 32-bit systems so that
core MM would be aware only of actually usable memory.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..5f673bd6c7d7 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+	 * to even less without it.
+	 * Discard memory after max_pfn - the actual limit detected at runtime.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 

base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
-- 
2.47.2
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Guenter Roeck 8 months ago
Hi,

On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Dave Hansen reports the following crash on a 32-bit system with
> CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 
>   > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
>   > obviously wasn't allocated, thus the oops.
> 
>   BUG: unable to handle page fault for address: f75fe000
>   #PF: supervisor write access in kernel mode
>   #PF: error_code(0x0002) - not-present page
>   *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
>   Oops: Oops: 0002 [#1] SMP NOPTI
>   CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>   EIP: __free_pages_core+0x3c/0x74
>   Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b
> 
>   EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
>   ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
>   DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
>   CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
>   Call Trace:
>    memblock_free_pages+0x11/0x2c
>    memblock_free_all+0x2ce/0x3a0
>    mm_core_init+0xf5/0x320
>    start_kernel+0x296/0x79c
>    ? set_init_arg+0x70/0x70
>    ? load_ucode_bsp+0x13c/0x1a8
>    i386_start_kernel+0xad/0xb0
>    startup_32_smp+0x151/0x154
>   Modules linked in:
>   CR2: 00000000f75fe000
> 
> The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
> by max_pfn.
> 
> Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
> high memory was also clamped to the end of ZONE_HIGHMEM but after
> 6faea3422e3b memblock_free_all() tries to free memory above the end of
> ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
> the end of the memory map.
> 
> Discard the memory after max_pfn from memblock on 32-bit systems so that
> core MM would be aware only of actually usable memory.
> 
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Tested-by: Arnd Bergmann <arnd@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

With this patch in pending-fixes (v6.15-rc2-434-g93ced5296772),
all my i386 test runs crash.

[    0.020893] Kernel panic - not syncing: ioapic_setup_resources: Failed to allocate 0x0000002b bytes
[    0.021248] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc2-00434-g93ced5296772 #1 PREEMPT(undef)
[    0.021373] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.021549] Call Trace:
[    0.021711]  dump_stack_lvl+0x20/0x104
[    0.022023]  dump_stack+0x12/0x18
[    0.022064]  panic+0x2c1/0x2d8
[    0.022116]  ? vprintk_default+0x29/0x30
[    0.022163]  __memblock_alloc_or_panic+0x57/0x58
[    0.022221]  io_apic_init_mappings+0x2e/0x1a8
[    0.022284]  setup_arch+0x909/0xdac
[    0.022338]  ? vprintk_default+0x29/0x30
[    0.022410]  start_kernel+0x63/0x760
[    0.022457]  ? load_ucode_bsp+0x12c/0x198
[    0.022507]  i386_start_kernel+0x74/0x74
[    0.022548]  startup_32_smp+0x151/0x154
[    0.023089] ---[ end Kernel panic - not syncing: ioapic_setup_resources: Failed to allocate 0x0000002b bytes ]---

Reverting this patch fixes the problem. Bisect log is attached for reference.

Guenter

---
# bad: [93ced5296772b7b704f48e4bad9fcfdf0633c780] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [8ffd015db85fea3e15a77027fda6c02ced4d2444] Linux 6.15-rc2
git bisect start 'HEAD' 'v6.15-rc2'
# good: [5d6f363fc974e32dd9930fecaae63958b68a1df4] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git
git bisect good 5d6f363fc974e32dd9930fecaae63958b68a1df4
# good: [1790b4a242fe119fead08fccc5bf923423c7449a] Merge branch 'dma-mapping-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux.git
git bisect good 1790b4a242fe119fead08fccc5bf923423c7449a
# good: [5d37ee8a1d6455968ea3134d78223090d487c7f4] Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git
git bisect good 5d37ee8a1d6455968ea3134d78223090d487c7f4
# good: [9d4de5ae5208548eb9c6a490ac454601f4fbf00b] Merge branch 'i2c/i2c-host-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux.git
git bisect good 9d4de5ae5208548eb9c6a490ac454601f4fbf00b
# bad: [f737ab93945fb8f0213e1cccc39d028eb5d880e0] Merge branch into tip/master: 'x86/urgent'
git bisect bad f737ab93945fb8f0213e1cccc39d028eb5d880e0
# good: [2e7a2843d0de7677b7bb908ca006dc435e52c416] Merge branch into tip/master: 'irq/urgent'
git bisect good 2e7a2843d0de7677b7bb908ca006dc435e52c416
# good: [d466304c4322ad391797437cd84cca7ce1660de0] x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
git bisect good d466304c4322ad391797437cd84cca7ce1660de0
# good: [39893b1e4ad7c4380abe4cfddaa58b34c4363bf4] Merge branch into tip/master: 'timers/urgent'
git bisect good 39893b1e4ad7c4380abe4cfddaa58b34c4363bf4
# bad: [1e07b9fad022e0e02215150ca1e20912e78e8ec1] x86/e820: Discard high memory that can't be addressed by 32-bit systems
git bisect bad 1e07b9fad022e0e02215150ca1e20912e78e8ec1
# first bad commit: [1e07b9fad022e0e02215150ca1e20912e78e8ec1] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Nathan Chancellor 8 months ago
Hi Mike,

On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
...
>  arch/x86/kernel/e820.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 57120f0749cc..5f673bd6c7d7 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
>  		memblock_add(entry->addr, entry->size);
>  	}
>  
> +	/*
> +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> +	 * to even less without it.
> +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_32))
> +		memblock_remove(PFN_PHYS(max_pfn), -1);
> +
>  	/* Throw away partial pages: */
>  	memblock_trim_memory(PAGE_SIZE);

Our CI noticed a boot failure after this change as commit 1e07b9fad022
("x86/e820: Discard high memory that can't be addressed by 32-bit
systems") in -tip when booting i386_defconfig with a simple buildroot
initrd.

  $ make -skj"$(nproc)" ARCH=i386 CROSS_COMPILE=i386-linux- mrproper defconfig bzImage

  $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/x86-rootfs.cpio.zst | zstd -d >rootfs.cpio

  $ qemu-system-i386 \
      -display none \
      -nodefaults \
      -M q35 \
      -d unimp,guest_errors \
      -append 'console=ttyS0 earlycon=uart8250,io,0x3f8' \
      -kernel arch/x86/boot/bzImage \
      -initrd rootfs.cpio \
      -cpu host \
      -enable-kvm \
      -m 512m \
      -smp 8 \
      -serial mon:stdio
  [    0.000000] Linux version 6.15.0-rc1-00177-g1e07b9fad022 (nathan@ax162) (i386-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP PREEMPT_DYNAMIC Thu Apr 17 09:02:19 MST 2025
  [    0.000000] BIOS-provided physical RAM map:
  [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
  [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
  [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001ffdffff] usable
  [    0.000000] BIOS-e820: [mem 0x000000001ffe0000-0x000000001fffffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
  [    0.000000] earlycon: uart8250 at I/O port 0x3f8 (options '')
  [    0.000000] printk: legacy bootconsole [uart8250] enabled
  [    0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
  [    0.000000] APIC: Static calls initialized
  [    0.000000] SMBIOS 2.8 present.
  [    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [    0.000000] DMI: Memory slots populated: 1/1
  [    0.000000] Hypervisor detected: KVM
  [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
  [    0.000000] kvm-clock: using sched offset of 196444860 cycles
  [    0.000589] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
  [    0.002401] tsc: Detected 2750.000 MHz processor
  [    0.003126] last_pfn = 0x1ffe0 max_arch_pfn = 0x100000
  [    0.003728] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
  [    0.004664] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
  [    0.007149] found SMP MP-table at [mem 0x000f5480-0x000f548f]
  [    0.007802] No sub-1M memory is available for the trampoline
  [    0.008435] Failed to release memory for alloc_low_pages()
  [    0.008438] RAMDISK: [mem 0x1fa5f000-0x1ffdffff]
  [    0.009571] Kernel panic - not syncing: Cannot find place for new RAMDISK of size 5771264
  [    0.010486] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00177-g1e07b9fad022 #1 PREEMPT(undef)
  [    0.011601] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [    0.012857] Call Trace:
  [    0.013135]  dump_stack_lvl+0x43/0x58
  [    0.013555]  dump_stack+0xd/0x10
  [    0.013919]  panic+0xa5/0x221
  [    0.014252]  setup_arch+0x86f/0x9f0
  [    0.014650]  ? vprintk_default+0x29/0x30
  [    0.015089]  start_kernel+0x4b/0x570
  [    0.015487]  i386_start_kernel+0x65/0x68
  [    0.015919]  startup_32_smp+0x151/0x154
  [    0.016344] ---[ end Kernel panic - not syncing: Cannot find place for new RAMDISK of size 5771264 ]---

At the parent change with the same command, the boot completes fine.

  [    0.000000] Linux version 6.15.0-rc1-00176-gd466304c4322 (nathan@ax162) (i386-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP PREEMPT_DYNAMIC Thu Apr 17 09:00:12 MST 2025
  [    0.000000] BIOS-provided physical RAM map:
  ...
  [    0.000000] earlycon: uart8250 at I/O port 0x3f8 (options '')
  [    0.000000] printk: legacy bootconsole [uart8250] enabled
  [    0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
  [    0.000000] APIC: Static calls initialized
  [    0.000000] SMBIOS 2.8 present.
  [    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [    0.000000] DMI: Memory slots populated: 1/1
  [    0.000000] Hypervisor detected: KVM
  [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
  [    0.000001] kvm-clock: using sched offset of 429786443 cycles
  [    0.000806] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
  [    0.003278] tsc: Detected 2750.000 MHz processor
  [    0.004730] last_pfn = 0x1ffe0 max_arch_pfn = 0x100000
  [    0.006220] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
  [    0.009169] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
  [    0.012840] found SMP MP-table at [mem 0x000f5480-0x000f548f]
  [    0.014310] RAMDISK: [mem 0x1fa5f000-0x1ffdffff]
  [    0.015141] ACPI: Early table checksum verification disabled
  ...
  [    0.046564] 511MB LOWMEM available.
  [    0.047421]   mapped low ram: 0 - 1ffe0000
  [    0.048431]   low ram: 0 - 1ffe0000
  [    0.049289] Zone ranges:
  [    0.049934]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  [    0.051184]   Normal   [mem 0x0000000001000000-0x000000001ffdffff]
  [    0.053087] Movable zone start for each node
  [    0.054409] Early memory node ranges
  [    0.055513]   node   0: [mem 0x0000000000001000-0x000000000009efff]
  [    0.057411]   node   0: [mem 0x0000000000100000-0x000000001ffdffff]
  [    0.059176] Initmem setup node 0 [mem 0x0000000000001000-0x000000001ffdffff]
  ...

Is this an invalid configuration or virtual setup that is being tested
here or is there something else problematic with this change?

Cheers,
Nathan
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Ingo Molnar 8 months ago
* Nathan Chancellor <nathan@kernel.org> wrote:

> Hi Mike,
> 
> On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> ...
> >  arch/x86/kernel/e820.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index 57120f0749cc..5f673bd6c7d7 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> >  		memblock_add(entry->addr, entry->size);
> >  	}
> >  
> > +	/*
> > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > +	 * to even less without it.
> > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_X86_32))
> > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > +
> >  	/* Throw away partial pages: */
> >  	memblock_trim_memory(PAGE_SIZE);
> 
> Our CI noticed a boot failure after this change as commit 1e07b9fad022
> ("x86/e820: Discard high memory that can't be addressed by 32-bit
> systems") in -tip when booting i386_defconfig with a simple buildroot
> initrd.

I've zapped this commit from tip:x86/urgent for the time being:

  1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")

until these bugs are better understood.

Thanks,

	Ingo
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Mike Rapoport 8 months ago
On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> 
> * Nathan Chancellor <nathan@kernel.org> wrote:
> 
> > Hi Mike,
> > 
> > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > ...
> > >  arch/x86/kernel/e820.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index 57120f0749cc..5f673bd6c7d7 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > >  		memblock_add(entry->addr, entry->size);
> > >  	}
> > >  
> > > +	/*
> > > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > > +	 * to even less without it.
> > > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_X86_32))
> > > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > > +
> > >  	/* Throw away partial pages: */
> > >  	memblock_trim_memory(PAGE_SIZE);
> > 
> > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > systems") in -tip when booting i386_defconfig with a simple buildroot
> > initrd.
> 
> I've zapped this commit from tip:x86/urgent for the time being:
> 
>   1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> 
> until these bugs are better understood.

With X86_PAE disabled, phys_addr_t is 32 bits wide, so
PFN_PHYS(MAX_NONPAE_PFN) overflows to 0 and we get memblock_remove(0, -1) :(

Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
under 4G, so PFN_PHYS(max_pfn) should never overflow.
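
As a rough userspace illustration of that overflow (not kernel code: the
demo_ names are invented here, and both the 4K page size and
MAX_NONPAE_PFN being the page frame number at 4G are assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages, as on x86. */
#define DEMO_PAGE_SHIFT 12

/* Hypothetical stand-in for MAX_NONPAE_PFN: the page frame number at 4G. */
#define DEMO_MAX_NONPAE_PFN (1ULL << (32 - DEMO_PAGE_SHIFT))

/* Models PFN_PHYS() with a 32-bit phys_addr_t (X86_PAE disabled). */
static uint32_t demo_pfn_phys32(uint64_t pfn)
{
	/* The cast happens before the shift, so the top bits are lost. */
	return (uint32_t)((uint32_t)pfn << DEMO_PAGE_SHIFT);
}

/* Models PFN_PHYS() with a 64-bit phys_addr_t (X86_PAE enabled). */
static uint64_t demo_pfn_phys64(uint64_t pfn)
{
	return pfn << DEMO_PAGE_SHIFT;
}

/* Small self-check showing the wrap-around at exactly 4G. */
static void demo_pfn_phys_selfcheck(void)
{
	assert(demo_pfn_phys32(DEMO_MAX_NONPAE_PFN) == 0);
	assert(demo_pfn_phys64(DEMO_MAX_NONPAE_PFN) == 0x100000000ULL);
}
```

A PFN such as 0x1ffe0 (the last_pfn in Nathan's log) converts to
0x1ffe0000 in either width, which is why a start address derived from the
detected limit does not wrap.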

Another option is to skip e820 entries above 4G and not add them to
memblock in the first place, e.g.

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..2b617f36f11a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1297,6 +1297,17 @@ void __init e820__memblock_setup(void)
 		if (entry->type != E820_TYPE_RAM)
 			continue;
 
+#ifdef CONFIG_X86_32
+		/*
+		 * Discard memory above 4GB because 32-bit systems are limited
+		 * to 4GB of memory even with HIGHMEM.
+		 */
+		if (entry->addr >= SZ_4G)
+			continue;
+		if (entry->addr + entry->size > SZ_4G)
+			entry->size = SZ_4G - entry->addr;
+#endif
+
 		memblock_add(entry->addr, entry->size);
 	}
 
 
> Thanks,
> 
> 	Ingo
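
For anyone who wants to poke at the clamping idea outside the kernel, here
is a minimal userspace sketch. It is not the patch itself: struct
demo_range, DEMO_SZ_4G and demo_clamp_below_4g() are made-up names, and the
sketch works on a copy of a range instead of an e820 table entry. It treats
a range that starts exactly at 4G as out of reach and compares sizes
instead of computing addr + size, so the arithmetic cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for SZ_4G. */
#define DEMO_SZ_4G 0x100000000ULL

/* Hypothetical, simplified view of one RAM range from the firmware map. */
struct demo_range {
	uint64_t addr;
	uint64_t size;
};

/*
 * Returns false when the range lies entirely at or above 4G and should be
 * skipped. Otherwise trims the range so that it ends at 4G at the latest
 * and returns true.
 */
static bool demo_clamp_below_4g(struct demo_range *r)
{
	if (r->addr >= DEMO_SZ_4G)
		return false;

	/* Compare against the remaining room instead of adding addr + size. */
	if (r->size > DEMO_SZ_4G - r->addr)
		r->size = DEMO_SZ_4G - r->addr;

	return true;
}

/* Small self-check: a range straddling 4G is cut down to end at 4G. */
static void demo_clamp_selfcheck(void)
{
	struct demo_range r = { 0xc0000000ULL, 0x80000000ULL };

	assert(demo_clamp_below_4g(&r));
	assert(r.size == 0x40000000ULL);
}
```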

-- 
Sincerely yours,
Mike.
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Ingo Molnar 8 months ago
* Mike Rapoport <rppt@kernel.org> wrote:

> On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> > 
> > * Nathan Chancellor <nathan@kernel.org> wrote:
> > 
> > > Hi Mike,
> > > 
> > > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > > ...
> > > >  arch/x86/kernel/e820.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index 57120f0749cc..5f673bd6c7d7 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > >  		memblock_add(entry->addr, entry->size);
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > > > +	 * to even less without it.
> > > > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > > > +	 */
> > > > +	if (IS_ENABLED(CONFIG_X86_32))
> > > > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > > > +
> > > >  	/* Throw away partial pages: */
> > > >  	memblock_trim_memory(PAGE_SIZE);
> > > 
> > > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > > systems") in -tip when booting i386_defconfig with a simple buildroot
> > > initrd.
> > 
> > I've zapped this commit from tip:x86/urgent for the time being:
> > 
> >   1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> > 
> > until these bugs are better understood.
> 
> With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
> overflows and we get memblock_remove(0, -1) :(
> 
> Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
> under 4G and max_pfn should never overflow.

So why don't we use max_pfn like your -v1 fix did IIRC?

	Ingo
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Mike Rapoport 8 months ago
On Fri, Apr 18, 2025 at 02:59:05PM +0200, Ingo Molnar wrote:
> 
> * Mike Rapoport <rppt@kernel.org> wrote:
> 
> > On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> > > 
> > > * Nathan Chancellor <nathan@kernel.org> wrote:
> > > 
> > > > Hi Mike,
> > > > 
> > > > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > > > ...
> > > > >  arch/x86/kernel/e820.c | 8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > > 
> > > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > > index 57120f0749cc..5f673bd6c7d7 100644
> > > > > --- a/arch/x86/kernel/e820.c
> > > > > +++ b/arch/x86/kernel/e820.c
> > > > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > > >  		memblock_add(entry->addr, entry->size);
> > > > >  	}
> > > > >  
> > > > > +	/*
> > > > > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > > > > +	 * to even less without it.
> > > > > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > > > > +	 */
> > > > > +	if (IS_ENABLED(CONFIG_X86_32))
> > > > > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > > > > +
> > > > >  	/* Throw away partial pages: */
> > > > >  	memblock_trim_memory(PAGE_SIZE);
> > > > 
> > > > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > > > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > > > systems") in -tip when booting i386_defconfig with a simple buildroot
> > > > initrd.
> > > 
> > > I've zapped this commit from tip:x86/urgent for the time being:
> > > 
> > >   1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> > > 
> > > until these bugs are better understood.
> > 
> > With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
> > overflows and we get memblock_remove(0, -1) :(
> > 
> > Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
> > under 4G and max_pfn should never overflow.
> 
> So why don't we use max_pfn like your -v1 fix did IIRC?

Dave didn't like max_pfn. I don't feel strongly about using max_pfn or
skipping e820 ranges above 4G and not adding them to memblock.
 
> 	Ingo

-- 
Sincerely yours,
Mike.
Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
Posted by Dave Hansen 8 months ago
On 4/18/25 12:25, Mike Rapoport wrote:
>> So why don't we use max_pfn like your -v1 fix did IIRC?
> Dave didn't like max_pfn. I don't feel strongly about using max_pfn or
> skipping e820 ranges above 4G and not adding them to memblock.

I feel more strongly about fixing the bug than avoiding max_pfn. ;)

Going back to v1 is fine with me.
[tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by tip-bot2 for Mike Rapoport (Microsoft) 8 months ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     1e07b9fad022e0e02215150ca1e20912e78e8ec1
Gitweb:        https://git.kernel.org/tip/1e07b9fad022e0e02215150ca1e20912e78e8ec1
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 16 Apr 2025 09:51:02 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

[ mingo: Fixed build failure. ]

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..2f38175 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+#ifdef CONFIG_X86_32
+	/*
+	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
+	 * of memory even with HIGHMEM.
+	 */
+	memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+#endif
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
[tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by tip-bot2 for Mike Rapoport (Microsoft) 8 months, 1 week ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     3f0036c0b5f850d1200dbfa7365ed24197a0f157
Gitweb:        https://git.kernel.org/tip/3f0036c0b5f850d1200dbfa7365ed24197a0f157
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 13 Apr 2025 11:09:39 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..9920122 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+	 * to even less without it.
+	 * Discard memory after max_pfn - the actual limit detected at runtime.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by Dave Hansen 8 months, 1 week ago
On 4/13/25 02:23, tip-bot2 for Mike Rapoport (Microsoft) wrote:
> +	/*
> +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> +	 * to even less without it.
> +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_32))
> +		memblock_remove(PFN_PHYS(max_pfn), -1);

Mike, thanks for the quick fix! I did verify that this gets my silly
test VM booting again.

The patch obviously _works_. But in the case I was hitting, max_pfn was
set to MAX_NONPAE_PFN. The unfortunate part about this hunk is that it's
far away from the related warning:

>         if (max_pfn > MAX_NONPAE_PFN) {
>                 max_pfn = MAX_NONPAE_PFN;
>                 printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
>         }

and it's logically doing the same thing: truncating memory at
MAX_NONPAE_PFN.

How about we reuse 'MAX_NONPAE_PFN' like this:

	if (IS_ENABLED(CONFIG_X86_32))
		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);

Would that make the connection more obvious?
Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by Mike Rapoport 8 months, 1 week ago
On Mon, Apr 14, 2025 at 07:19:02AM -0700, Dave Hansen wrote:
> On 4/13/25 02:23, tip-bot2 for Mike Rapoport (Microsoft) wrote:
> > +	/*
> > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > +	 * to even less without it.
> > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_X86_32))
> > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> 
> Mike, thanks for the quick fix! I did verify that this gets my silly
> test VM booting again.
> 
> The patch obviously _works_. But in the case I was hitting, max_pfn was
> set to MAX_NONPAE_PFN. The unfortunate part about this hunk is that it's
> far away from the related warning:

Yeah, my first instinct was to put memblock_remove() in the same 'if',
but there's no memblock there yet :)
 
> >         if (max_pfn > MAX_NONPAE_PFN) {
> >                 max_pfn = MAX_NONPAE_PFN;
> >                 printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
> >         }
> 
> and it's logically doing the same thing: truncating memory at
> MAX_NONPAE_PFN.
> 
> How about we reuse 'MAX_NONPAE_PFN' like this:
> 
> 	if (IS_ENABLED(CONFIG_X86_32))
> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> 
> Would that make the connection more obvious?

Yes, that's better. Here's the updated patch:

From a235764221e4a849fa274a546ff2a3d9f15da2a9 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Sun, 13 Apr 2025 10:36:17 +0300
Subject: [PATCH v2] x86/e820: discard high memory that can't be addressed by
 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b

  EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
  ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
  CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   ? set_init_arg+0x70/0x70
   ? load_ucode_bsp+0x13c/0x1a8
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154
  Modules linked in:
  CR2: 00000000f75fe000

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
high memory was also clamped to the end of ZONE_HIGHMEM but after
6faea3422e3b memblock_free_all() tries to free memory above the end of
ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
the end of the memory map.

Discard the memory after MAX_NONPAE_PFN from memblock on 32-bit systems
so that core MM would be aware only of actually usable memory.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/kernel/e820.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..5e6b1034e6f1 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,13 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
+	 * of memory even with HIGHMEM.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 
-- 
2.47.2

-- 
Sincerely yours,
Mike.
Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by Dave Hansen 8 months, 1 week ago
On 4/15/25 00:18, Mike Rapoport wrote:
>> How about we reuse 'MAX_NONPAE_PFN' like this:
>>
>> 	if (IS_ENABLED(CONFIG_X86_32))
>> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
>>
>> Would that make the connection more obvious?
> Yes, that's better. Here's the updated patch:

Looks great. Thanks for the update and the quick turnaround on the
first one after the bug report!

Tested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by Ingo Molnar 8 months ago
* Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/15/25 00:18, Mike Rapoport wrote:
> >> How about we reuse 'MAX_NONPAE_PFN' like this:
> >>
> >> 	if (IS_ENABLED(CONFIG_X86_32))
> >> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> >>
> >> Would that make the connection more obvious?
> > Yes, that's better. Here's the updated patch:
> 
> > Looks great. Thanks for the update and the quick turnaround on the
> first one after the bug report!
> 
> Tested-by: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>

I've amended the fix in tip:x86/urgent accordingly and added your tags, 
thanks!

	Ingo
Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by Ingo Molnar 8 months ago
* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Dave Hansen <dave.hansen@intel.com> wrote:
> 
> > On 4/15/25 00:18, Mike Rapoport wrote:
> > >> How about we reuse 'MAX_NONPAE_PFN' like this:
> > >>
> > >> 	if (IS_ENABLED(CONFIG_X86_32))
> > >> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> > >>
> > >> Would that make the connection more obvious?
> > > Yes, that's better. Here's the updated patch:
> > 
> > > Looks great. Thanks for the update and the quick turnaround on the
> > first one after the bug report!
> > 
> > Tested-by: Dave Hansen <dave.hansen@intel.com>
> > Acked-by: Dave Hansen <dave.hansen@intel.com>
> 
> I've amended the fix in tip:x86/urgent accordingly and added your tags, 
> thanks!

So I had to apply the fix below as well, due to this build failure on 
x86-defconfig:

  arch/x86/kernel/e820.c:1307:42: error: ‘MAX_NONPAE_PFN’ undeclared (first use in this function); did you mean ‘MAX_DMA_PFN’?

IS_ENABLED(CONFIG_X86_32) can only be used when the code is 
syntactically correct on !CONFIG_X86_32 kernels too - which it wasn't.

So I went for the straightforward #ifdef block instead.
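
A small userspace sketch of the difference, with made-up names
(DEMO_CONFIG_32BIT and DEMO_LIMIT_PFN are hypothetical: they stand in for a
config option that is off and a macro that only exists when it is on):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a Kconfig symbol that is disabled. */
#define DEMO_CONFIG_32BIT 0

/* Like a macro from a 32-bit-only header: defined only when enabled. */
#if DEMO_CONFIG_32BIT
#define DEMO_LIMIT_PFN (1ULL << 20)
#endif

/* Preprocessor guard: the body is dropped before the compiler parses it. */
static uint64_t demo_limit_ifdef(uint64_t max_pfn)
{
#if DEMO_CONFIG_32BIT
	if (max_pfn > DEMO_LIMIT_PFN)
		max_pfn = DEMO_LIMIT_PFN;
#endif
	return max_pfn;
}

/*
 * The if () form would look like this. It is optimized away when the
 * option is off, but the compiler still has to resolve every identifier
 * in it, so it fails to build here because DEMO_LIMIT_PFN is undeclared:
 *
 *	if (DEMO_CONFIG_32BIT && max_pfn > DEMO_LIMIT_PFN)
 *		max_pfn = DEMO_LIMIT_PFN;
 */

/* With the option off, the guarded version leaves the value untouched. */
static void demo_limit_selfcheck(void)
{
	assert(demo_limit_ifdef(0x200000) == 0x200000);
}
```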

Thanks,

	Ingo

===========>
 arch/x86/kernel/e820.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index de6238886cb2..c984be8ee060 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,13 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+#ifdef CONFIG_X86_32
 	/*
 	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
 	 * of memory even with HIGHMEM.
 	 */
-	if (IS_ENABLED(CONFIG_X86_32))
-		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+	memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+#endif
 
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
[tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
Posted by tip-bot2 for Mike Rapoport (Microsoft) 8 months ago
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     e71b6094c20f5dc9c43dc89af8a569ffa511d676
Gitweb:        https://git.kernel.org/tip/e71b6094c20f5dc9c43dc89af8a569ffa511d676
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 16 Apr 2025 09:16:02 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
 arch/x86/kernel/e820.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..de62388 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,13 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
+	 * of memory even with HIGHMEM.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
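
The effect of the added memblock_remove() call can be modelled in a few
lines of user-space C. This is only a sketch under stated assumptions, not
the memblock implementation: struct demo_region, demo_remove_above() and
the DEMO_* macros are hypothetical, 4 KiB pages are assumed, and the size
argument of -1 is read here as "everything up to the end of the address
space".

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT      12            /* assumed 4 KiB pages */
#define DEMO_MAX_NONPAE_PFN  (1ULL << 20)  /* first PFN at 4 GiB */
#define DEMO_PFN_PHYS(pfn)   ((uint64_t)(pfn) << DEMO_PAGE_SHIFT)

/* Hypothetical stand-in for one memory region known to the allocator. */
struct demo_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Discard all memory at or above 'limit'. Regions entirely above the
 * limit are dropped, regions straddling it are trimmed. Returns the
 * number of regions left in the array.
 */
static size_t demo_remove_above(struct demo_region *r, size_t n,
				uint64_t limit)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t end = r[i].base + r[i].size;

		if (r[i].base >= limit)
			continue;	/* entirely above the limit */
		if (end > limit)
			end = limit;	/* straddles the limit: trim */

		r[out].base = r[i].base;
		r[out].size = end - r[i].base;
		out++;
	}
	return out;
}
```

With a limit of DEMO_PFN_PHYS(DEMO_MAX_NONPAE_PFN), which is 0x100000000
(4 GiB), a region starting at 4 GiB disappears from the list, so later
code that walks the regions never touches page frames beyond the end of
the memory map.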
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Arnd Bergmann 8 months, 1 week ago
On Sat, Apr 12, 2025, at 12:05, Mike Rapoport wrote:
> On Fri, Apr 11, 2025 at 04:44:13PM -0700, Dave Hansen wrote:
>> Has anyone run into any problems on 6.15-rc1 with this stuff?
>> 
>> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
>> wasn't allocated, thus the oops. Looks like the memblock for the >4GB
>> memory didn't get removed although the pgdats seem correct.
>
> That's apparently because of 6faea3422e3b ("arch, mm: streamline HIGHMEM
> freeing"). 
> Freeing of high memory was clamped to the end of ZONE_HIGHMEM which is 4G
> and after 6faea3422e3b there's no more clamping, so memblock_free_all()
> tries to free memory >4G as well.

Ah, I should have waited with my bisection, you found it first...

>> I'll dig into it some more. Just wanted to make sure there wasn't a fix
>> out there already.
>
> This should fix it.

Confirmed on 6.15-rc1.

     Arnd
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Ingo Molnar 8 months, 1 week ago
* Dave Hansen <dave.hansen@intel.com> wrote:

> Has anyone run into any problems on 6.15-rc1 with this stuff?
> 
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It 
> obviously wasn't allocated, thus the oops. Looks like the memblock 
> for the >4GB memory didn't get removed although the pgdats seem 
> correct.
> 
> I'll dig into it some more. Just wanted to make sure there wasn't a 
> fix out there already.

Not that I'm aware of.

> The way I'm triggering this is booting qemu with a 32-bit PAE kernel, 
> and "-m 4096" (or more).

That's a new regression indeed.

Thanks,

	Ingo
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Brian Gerst 1 year ago
On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> The HIGHMEM64G support was added in linux-2.3.25 to support (then)
> high-end Pentium Pro and Pentium III Xeon servers with more than 4GB of
> addressing, NUMA and PCI-X slots started appearing.
>
> I have found no evidence of this ever being used in regular dual-socket
> servers or consumer devices, all the users seem obsolete these days,
> even by i386 standards:
>
>  - Support for NUMA servers (NUMA-Q, IBM x440, unisys) was already
>    removed ten years ago.
>
>  - 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
>    ServerWorks ServerSet/GrandChampion could theoretically still work
>    with 8GB, but these were exceptionally rare even 20 years ago and
>    would have usually been equipped with less than the maximum amount of
>    RAM.
>
>  - Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
>    could still work in a Socket 775 mainboard designed for the later
>    Core 2 Duo and 8GB. Apparently most BIOSes at the time only allowed
>    64-bit CPUs.
>
>  - In the early days of x86-64 hardware, there was sometimes the need
>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>    or in the syscall emulation for 32-bit userspace. This likely still
>    works but there should never be a need for this any more.
>
> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
> PAE mode is still required to get access to the 'NX' bit on Atom
> 'Pentium M' and 'Core Duo' CPUs.

8GB of memory is still useful for 32-bit guest VMs.


Brian Gerst
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by H. Peter Anvin 1 year ago
On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
>On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> The HIGHMEM64G support was added in linux-2.3.25 to support (then)
>> high-end Pentium Pro and Pentium III Xeon servers with more than 4GB of
>> addressing, NUMA and PCI-X slots started appearing.
>>
>> I have found no evidence of this ever being used in regular dual-socket
>> servers or consumer devices, all the users seem obsolete these days,
>> even by i386 standards:
>>
>>  - Support for NUMA servers (NUMA-Q, IBM x440, unisys) was already
>>    removed ten years ago.
>>
>>  - 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
>>    ServerWorks ServerSet/GrandChampion could theoretically still work
>>    with 8GB, but these were exceptionally rare even 20 years ago and
>>    would have usually been equipped with less than the maximum amount of
>>    RAM.
>>
>>  - Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
>>    could still work in a Socket 775 mainboard designed for the later
>>    Core 2 Duo and 8GB. Apparently most BIOSes at the time only allowed
>>    64-bit CPUs.
>>
>>  - In the early days of x86-64 hardware, there was sometimes the need
>>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>>    or in the syscall emulation for 32-bit userspace. This likely still
>>    works but there should never be a need for this any more.
>>
>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>> PAE mode is still required to get access to the 'NX' bit on Atom
>> 'Pentium M' and 'Core Duo' CPUs.
>
>8GB of memory is still useful for 32-bit guest VMs.
>
>
>Brian Gerst
>

By the way, there are 64-bit machines which require swiotlb.
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Arnd Bergmann 1 year ago
On Wed, Dec 4, 2024, at 17:37, H. Peter Anvin wrote:
> On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
>>>
>>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>>> PAE mode is still required to get access to the 'NX' bit on Atom
>>> 'Pentium M' and 'Core Duo' CPUs.
>
> By the way, there are 64-bit machines which require swiotlb.

What I meant to write here was that CONFIG_X86_PAE no longer
needs to select PHYS_ADDR_T_64BIT and SWIOTLB. I ended up
splitting that change out to patch 06/11 with a better explanation,
so the sentence above is just wrong now and I've removed it
in my local copy now.

Obviously 64-bit kernels still generally need swiotlb.

       Arnd
Re: [PATCH 05/11] x86: remove HIGHMEM64G support
Posted by Andy Shevchenko 1 year ago
On Wed, Dec 4, 2024 at 6:57 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Dec 4, 2024, at 17:37, H. Peter Anvin wrote:
> > On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
> >>>
> >>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
> >>> PAE mode is still required to get access to the 'NX' bit on Atom
> >>> 'Pentium M' and 'Core Duo' CPUs.
> >
> > By the way, there are 64-bit machines which require swiotlb.
>
> What I meant to write here was that CONFIG_X86_PAE no longer
> needs to select PHYS_ADDR_T_64BIT and SWIOTLB. I ended up
> splitting that change out to patch 06/11 with a better explanation,
> so the sentence above is just wrong now and I've removed it
> in my local copy now.
>
> Obviously 64-bit kernels still generally need swiotlb.

Theoretically swiotlb can be useful on 32-bit machines as well for the
DMA controllers that have < 32-bit mask. Dunno if swiotlb was designed
to run on 32-bit machines at all.


-- 
With Best Regards,
Andy Shevchenko
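
The point about DMA masks narrower than 32 bits can be illustrated with a
small check. This is a user-space sketch, not the swiotlb or DMA API code:
demo_needs_bounce() is a hypothetical helper, and it assumes a non-zero
buffer size and a device that can only address the low mask_bits bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: would a buffer at physical address 'addr' of
 * 'size' bytes (size > 0) be out of reach for a device whose DMA mask
 * covers only the low 'mask_bits' address bits?
 */
static bool demo_needs_bounce(uint64_t addr, uint64_t size,
			      unsigned int mask_bits)
{
	uint64_t mask = mask_bits >= 64 ? UINT64_MAX
					: (1ULL << mask_bits) - 1;

	/* The last byte of the buffer must still fit under the mask. */
	return addr + size - 1 > mask;
}
```

For a 24-bit mask, a page at 16 MiB is already out of reach, while any page
below 4 GiB fits a 32-bit mask. That is why a bounce buffer can matter even
when all of RAM is addressable with 32 bits.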