[PATCH v2 1/2] x86/boot: Fix page table access in 5-level to 4-level paging transition
Posted by Usama Arif 3 months, 1 week ago
When transitioning from 5-level to 4-level paging, the existing code
incorrectly accesses page table entries by directly dereferencing CR3
and applying PAGE_MASK. This approach has several issues:

- __native_read_cr3() returns the raw CR3 register value, which on
  x86_64 includes not just the physical address but also flags. Bits
  above the physical address width of the system (i.e. above
  __PHYSICAL_MASK_SHIFT) are also not masked.
- The PGD entry is masked with PAGE_MASK, which doesn't take into
  account the higher bits such as _PAGE_BIT_NOPTISHADOW (illustrated
  in the sketch below).

Replace this with proper accessor functions:
- native_read_cr3_pa(): Uses CR3_ADDR_MASK, properly clearing the SME
  encryption bit and extracting only the physical address portion.
- mask the pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting
  for flags above the physical address (_PAGE_BIT_NOPTISHADOW in
  particular).

Fixes: e9d0e6330eb8 ("x86/boot/compressed/64: Prepare new top-level page table for trampoline")
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
---
 arch/x86/boot/compressed/pgtable_64.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index bdd26050dff77..f812b81a538c2 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -3,6 +3,7 @@
 #include <asm/bootparam.h>
 #include <asm/bootparam_utils.h>
 #include <asm/e820/types.h>
+#include <asm/pgtable.h>
 #include <asm/processor.h>
 #include "../string.h"
 #include "efi.h"
@@ -168,9 +169,10 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 		 * For 4- to 5-level paging transition, set up current CR3 as
 		 * the first and the only entry in a new top-level page table.
 		 */
-		*trampoline_32bit = __native_read_cr3() | _PAGE_TABLE_NOENC;
+		*trampoline_32bit = native_read_cr3_pa() | _PAGE_TABLE_NOENC;
 	} else {
-		unsigned long src;
+		u64 *new_cr3;
+		pgd_t *pgdp;
 
 		/*
 		 * For 5- to 4-level paging transition, copy page table pointed
@@ -180,8 +182,9 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 		 * We cannot just point to the page table from trampoline as it
 		 * may be above 4G.
 		 */
-		src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
-		memcpy(trampoline_32bit, (void *)src, PAGE_SIZE);
+		pgdp = (pgd_t *)native_read_cr3_pa();
+		new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
+		memcpy(trampoline_32bit, new_cr3, PAGE_SIZE);
 	}
 
 	toggle_la57(trampoline_32bit);
-- 
2.47.3
Re: [PATCH v2 1/2] x86/boot: Fix page table access in 5-level to 4-level paging transition
Posted by Dave Hansen 3 months, 1 week ago
On 10/28/25 03:55, Usama Arif wrote:
> - native_read_cr3_pa(): Uses CR3_ADDR_MASK, properly clearing the SME
>   encryption bit and extracting only the physical address portion.

I guess we can apply these as-is. They do fix a bug.

But I find these descriptions a bit unsatisfying. CR3_ADDR_MASK happens
to work here on 64-bit. Interestingly enough, it wouldn't have been as
good of a fix on PAE paging because it ignores those upper bits instead
of reserving them.

But CR3_ADDR_MASK doesn't "extract... only the physical address
portion". It also extracts reserved bits.

It also doesn't mention the LAM bits. It's not just SME.

This would be better:

 - native_read_cr3_pa(): Uses CR3_ADDR_MASK to additionally mask
   metadata out of CR3 (like SME or LAM bits). All remaining bits are
   real address bits or reserved and must be 0.

> - mask the pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting
>   for flags above the physical address (_PAGE_BIT_NOPTISHADOW in
>   particular).

This also isn't _quite_ right. The "flags above physical" address are
dynamic. They move because the max physical address (MAXPHYADDR) is
enumerated and changes from CPU to CPU.

It's OK in this case because moving MAXPHYADDR down just changes bits
from address bits to reserved (must be 0).

In a perfect world, we would construct a kexec CR3 with the dynamic
MAXPHYADDR (plus masking out the lower 12 bits). That would be obviously
correct for *all* 32-bit and 64-bit cases and wouldn't even rely on
knowing where the boundary is between ignored and reserved. The approach
in these patches is a fine improvement. We don't need to be perfect.
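
(For reference, a rough sketch of that approach; dynamic_cr3_addr_mask()
is a made-up name, the CPUID plumbing is simplified, and it assumes
CPUID leaf 0x80000008 is available:)

	/* Hypothetical: derive the address mask from the enumerated
	 * MAXPHYADDR (CPUID 0x80000008, EAX[7:0]) instead of assuming
	 * a fixed 52-bit boundary: */
	static u64 dynamic_cr3_addr_mask(void)
	{
		unsigned int eax, ebx, ecx, edx;
		unsigned int maxphyaddr;

		cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
		maxphyaddr = eax & 0xff;

		/* Keep bits [MAXPHYADDR-1:12]: everything above is
		 * metadata or reserved, bits 0-11 are flags. */
		return ((1ULL << maxphyaddr) - 1) & ~0xfffULL;
	}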

Ideally this second bullet would be:

 - mask the pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting
   for flags above bit 51 (_PAGE_BIT_NOPTISHADOW in particular). Bits
   below 51, but above the max physical address, are reserved and must
   be 0.

But it's fine-ish as-is.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>