When transitioning from 5-level to 4-level paging, the existing code
incorrectly accesses page table entries by directly dereferencing CR3
and applying PAGE_MASK. This approach has two issues:
- __native_read_cr3() returns the raw CR3 register value, which on
x86_64 includes not just the physical address but also flags. Bits
above the physical address width of the system (i.e. above
__PHYSICAL_MASK_SHIFT) are also not masked.
- The PGD entry is masked with PAGE_MASK, which only clears the low
bits and doesn't take into account the higher flag bits such as
_PAGE_BIT_NOPTISHADOW (see the sketch below).
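To make the second issue concrete, here is a minimal userspace sketch
(not kernel code; the PGD value is fabricated, bit 58 is an arbitrary
stand-in for a flag above bit 51, and __PHYSICAL_MASK_SHIFT == 52 is
assumed):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_MASK_64    (~((uint64_t)4096 - 1))
    #define PTE_PFN_MASK_64 ((((uint64_t)1 << 52) - 1) & PAGE_MASK_64)

    int main(void)
    {
            /* Fabricated PGD entry: page-aligned address, low flag bits
             * (0x67), and bit 58 standing in for a software flag above
             * bit 51 such as _PAGE_NOPTISHADOW (position chosen
             * arbitrarily for illustration). */
            uint64_t pgd = 0x0000000123456000ULL | 0x67 | ((uint64_t)1 << 58);

            printf("& PAGE_MASK:    %#018llx\n",
                   (unsigned long long)(pgd & PAGE_MASK_64));   /* bit 58 survives */
            printf("& PTE_PFN_MASK: %#018llx\n",
                   (unsigned long long)(pgd & PTE_PFN_MASK_64)); /* bit 58 cleared */
            return 0;
    }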
Replace this with proper accessor functions:
- native_read_cr3_pa(): Uses CR3_ADDR_MASK, properly clearing the SME
encryption bit and extracting only the physical address portion (see
the sketch after this list).
- mask the pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting
for flags above the physical address (_PAGE_BIT_NOPTISHADOW in
particular).
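For the first bullet, a similar sketch (again not kernel code; the CR3
value is fabricated and the mask mirrors the kernel's 64-bit
CR3_ADDR_MASK constant, bits 12..62) of what the masking strips:

    #include <stdio.h>
    #include <stdint.h>

    /* Mirrors the kernel's 64-bit CR3_ADDR_MASK constant. */
    #define CR3_ADDR_MASK_64 0x7ffffffffffff000ULL

    int main(void)
    {
            /* Fabricated CR3 value: page-aligned address, PWT|PCD in the
             * low bits, and bit 63 standing in for a metadata bit (real
             * SME/LAM bit positions depend on the CPU). */
            uint64_t cr3 = 0x0000000123456000ULL | 0x18 | ((uint64_t)1 << 63);

            printf("raw CR3:    %#018llx\n", (unsigned long long)cr3);
            printf("masked CR3: %#018llx\n",
                   (unsigned long long)(cr3 & CR3_ADDR_MASK_64));
            return 0;
    }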
Fixes: e9d0e6330eb8 ("x86/boot/compressed/64: Prepare new top-level page table for trampoline")
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
---
arch/x86/boot/compressed/pgtable_64.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index bdd26050dff77..f812b81a538c2 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -3,6 +3,7 @@
#include <asm/bootparam.h>
#include <asm/bootparam_utils.h>
#include <asm/e820/types.h>
+#include <asm/pgtable.h>
#include <asm/processor.h>
#include "../string.h"
#include "efi.h"
@@ -168,9 +169,10 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
* For 4- to 5-level paging transition, set up current CR3 as
* the first and the only entry in a new top-level page table.
*/
- *trampoline_32bit = __native_read_cr3() | _PAGE_TABLE_NOENC;
+ *trampoline_32bit = native_read_cr3_pa() | _PAGE_TABLE_NOENC;
} else {
- unsigned long src;
+ u64 *new_cr3;
+ pgd_t *pgdp;
/*
* For 5- to 4-level paging transition, copy page table pointed
@@ -180,8 +182,9 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
* We cannot just point to the page table from trampoline as it
* may be above 4G.
*/
- src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
- memcpy(trampoline_32bit, (void *)src, PAGE_SIZE);
+ pgdp = (pgd_t *)native_read_cr3_pa();
+ new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
+ memcpy(trampoline_32bit, new_cr3, PAGE_SIZE);
}
toggle_la57(trampoline_32bit);
--
2.47.3
On 10/28/25 03:55, Usama Arif wrote:
> - native_read_cr3_pa(): Uses CR3_ADDR_MASK, properly clearing the SME
>   encryption bit and extracting only the physical address portion.

I guess we can apply these as-is. They do fix a bug. But I find these
descriptions a bit unsatisfying.

CR3_ADDR_MASK happens to work here on 64-bit. Interestingly enough, it
wouldn't have been as good of a fix on PAE paging because it ignores
those upper bits instead of reserving them.

But CR3_ADDR_MASK doesn't "extract... only the physical address
portion". It also extracts reserved bits. It also doesn't mention the
LAM bits. It's not just SME. This would be better:

 - native_read_cr3_pa(): Uses CR3_ADDR_MASK to additionally mask
   metadata out of CR3 (like SME or LAM bits). All remaining bits are
   real address bits or reserved and must be 0.

> - mask the pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting
>   for flags above the physical address (_PAGE_BIT_NOPTISHADOW in
>   particular).

This also isn't _quite_ right. The "flags above the physical address"
are dynamic. They move because the max physical address (MAXPHYADDR) is
enumerated and changes from CPU to CPU. It's OK in this case because
moving MAXPHYADDR down just changes bits from address bits to reserved
(must be 0).

In a perfect world, we would construct a kexec CR3 with the dynamic
MAXPHYADDR (plus masking out the lower 12 bits). That would be
obviously correct for *all* 32-bit and 64-bit cases and wouldn't even
rely on knowing where the boundary is between ignored and reserved.

The approach in these patches is a fine improvement. We don't need to
be perfect. Ideally this second bullet would be:

 - mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting
   for flags above bit 51 (_PAGE_BIT_NOPTISHADOW in particular). Bits
   below 51, but above the max physical address, are reserved and must
   be 0.

But it's fine-ish as-is.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
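To make the "perfect world" suggestion above concrete, here is a
minimal userspace sketch, assuming GCC/Clang's <cpuid.h> and that CPUID
leaf 0x80000008 EAX[7:0] enumerates the physical address width:

    #include <stdio.h>
    #include <stdint.h>
    #include <cpuid.h>

    /* Derive the CR3 address mask from the CPU's enumerated MAXPHYADDR
     * instead of a fixed constant. */
    static uint64_t dynamic_cr3_addr_mask(void)
    {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
                    return 0x000ffffffffff000ULL; /* fall back: assume 52 bits */

            unsigned int maxphyaddr = eax & 0xff;

            /* Keep bits [maxphyaddr-1:12]: address bits only, with the
             * low 12 bits cleared. */
            return (((uint64_t)1 << maxphyaddr) - 1) & ~(uint64_t)0xfff;
    }

    int main(void)
    {
            printf("dynamic CR3 address mask: %#018llx\n",
                   (unsigned long long)dynamic_cr3_addr_mask());
            return 0;
    }

A mask built this way clears everything above MAXPHYADDR, so it does
not depend on knowing where the ignored bits end and the reserved bits
begin.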