From: Ashish Kalra <ashish.kalra@amd.com>
With SNP guest kexec observe the following efi memmap corruption :
[ 0.000000] efi: EFI v2.7 by EDK II
[ 0.000000] efi: SMBIOS=0x7e33f000 SMBIOS 3.0=0x7e33d000 ACPI=0x7e57e000 ACPI 2.0=0x7e57e014 MEMATTR=0x7cc3c018 Unaccepted=0x7c09e018
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem03: [type=269370880|attr=0x0e42100e42180e41] range=[0x0486200e41038c18-0x200e898a0eee713ac17] (invalid)
[ 0.000000] efi: mem04: [type=12336|attr=0x0e410686300e4105] range=[0x100e420000000176-0x8c290f26248d200e175] (invalid)
[ 0.000000] efi: mem06: [type=1124304408|attr=0x000030b400000028] range=[0x0e51300e45280e77-0xb44ed2142f460c1e76] (invalid)
[ 0.000000] efi: mem08: [type=68|attr=0x300e540583280e41] range=[0x0000011affff3cd8-0x486200e54b38c0bcd7] (invalid)
[ 0.000000] efi: mem10: [type=1107529240|attr=0x0e42280e41300e41] range=[0x300e41058c280e42-0x38010ae54c5c328ee41] (invalid)
[ 0.000000] efi: mem11: [type=189335566|attr=0x048d200e42038e18] range=[0x0000318c00000048-0xe42029228ce4200047] (invalid)
[ 0.000000] efi: mem12: [type=239142534|attr=0x0000002400000b4b] range=[0x0e41380e0a7d700e-0x80f26238f22bfe500d] (invalid)
[ 0.000000] efi: mem14: [type=239207055|attr=0x0e41300e43380e0a] range=[0x8c280e42048d200e-0xc70b028f2f27cc0a00d] (invalid)
[ 0.000000] efi: mem15: [type=239210510|attr=0x00080e660b47080e] range=[0x0000324c0000001c-0xa78028634ce490001b] (invalid)
[ 0.000000] efi: mem16: [type=4294848528|attr=0x0000329400000014] range=[0x0e410286100e4100-0x80f252036a218f20ff] (invalid)
[ 0.000000] efi: mem19: [type=2250772033|attr=0x42180e42200e4328] range=[0x41280e0ab9020683-0xe0e538c28b39e62682] (invalid)
[ 0.000000] efi: mem20: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x00000008ffff4438-0xffff44340090333c437] (invalid)
[ 0.000000] efi: mem22: [Reserved |attr=0x000000c1ffff4420] range=[0xffff442400003398-0x1033a04240003f397] (invalid)
[ 0.000000] efi: mem23: [type=1141080856|attr=0x080e41100e43180e] range=[0x280e66300e4b280e-0x440dc5ee7141f4c080d] (invalid)
[ 0.000000] efi: mem25: [Reserved |attr=0x0000000affff44a0] range=[0xffff44a400003428-0x1034304a400013427] (invalid)
[ 0.000000] efi: mem28: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4488-0xffff448400b034bc487] (invalid)
[ 0.000000] efi: mem30: [Reserved |attr=0x0000000affff4470] range=[0xffff447400003518-0x10352047400013517] (invalid)
[ 0.000000] efi: mem33: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4458-0xffff445400b035ac457] (invalid)
[ 0.000000] efi: mem35: [type=269372416|attr=0x0e42100e42180e41] range=[0x0486200e44038c18-0x200e8b8a0eee823ac17] (invalid)
[ 0.000000] efi: mem37: [type=2351435330|attr=0x0e42100e42180e42] range=[0x470783380e410686-0x2002b2a041c2141e685] (invalid)
[ 0.000000] efi: mem38: [type=1093668417|attr=0x100e420000000270] range=[0x42100e42180e4220-0xfff366a4e421b78c21f] (invalid)
[ 0.000000] efi: mem39: [type=76357646|attr=0x180e42200e42280e] range=[0x0e410686300e4105-0x4130f251a0710ae5104] (invalid)
[ 0.000000] efi: mem40: [type=940444268|attr=0x0e42200e42280e41] range=[0x180e42200e42280e-0x300fc71c300b4f2480d] (invalid)
[ 0.000000] efi: mem41: [MMIO |attr=0x8c280e42048d200e] range=[0xffff479400003728-0x42138e0c87820292727] (invalid)
[ 0.000000] efi: mem42: [type=1191674680|attr=0x0000004c0000000b] range=[0x300e41380e0a0246-0x470b0f26238f22b8245] (invalid)
[ 0.000000] efi: mem43: [type=2010|attr=0x0301f00e4d078338] range=[0x45038e180e42028f-0xe4556bf118f282528e] (invalid)
[ 0.000000] efi: mem44: [type=1109921345|attr=0x300e44000000006c] range=[0x44080e42100e4218-0xfff39254e42138ac217] (invalid)
...
This EFI memap corruption is happening with efi_arch_mem_reserve() invocation in case of kexec boot.
( efi_arch_mem_reserve() is invoked with the following call-stack: )
[ 0.310010] efi_arch_mem_reserve+0xb1/0x220
[ 0.311382] efi_mem_reserve+0x36/0x60
[ 0.311973] efi_bgrt_init+0x17d/0x1a0
[ 0.313265] acpi_parse_bgrt+0x12/0x20
[ 0.313858] acpi_table_parse+0x77/0xd0
[ 0.314463] acpi_boot_init+0x362/0x630
[ 0.315069] setup_arch+0xa88/0xf80
[ 0.315629] start_kernel+0x68/0xa90
[ 0.316194] x86_64_start_reservations+0x1c/0x30
[ 0.316921] x86_64_start_kernel+0xbf/0x110
[ 0.317582] common_startup_64+0x13e/0x141
efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
EFI memory map and due to early allocation it uses memblock allocation.
Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
in case of a kexec-ed kernel boot.
This function kexec_enter_virtual_mode() installs the new EFI memory map by
calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
Subsequently, when memblock is freed later in boot flow, this remapped
efi_memmap will have random corruption (similar to a use-after-free scenario).
The corrupted EFI memory map is then passed to the next kexec-ed kernel
which causes a panic when trying to use the corrupted EFI memory map.
Fix this EFI memory map corruption by skipping efi_arch_mem_reserve() for kexec.
Additionally, efi_mem_reserve() is used to reserve boot service memory
eg. bgrt, but it is not necessary for kexec boot, as there are no
boot services in kexec reboot at all after the first kernel ExitBootServices().
The UEFI memmap passed to kexec kernel includes not only the runtime
service memory map but also the boot service memory ranges which were
reserved by the first kernel with efi_mem_reserve, and those boot service
memory ranges have already been marked "EFI_MEMORY_RUNTIME" attribute.
This is the additional reason why efi_mem_reserve can be skipped
for kexec booting and by checking the set EFI_MEMORY_RUNTIME attribute.
Suggested-by: Dave Young <dyoung@redhat.com>
[Dave Young: checking the md attribute instead of checking the efi_setup]
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
arch/x86/platform/efi/quirks.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..6f398c59278a 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
struct efi_memory_map_data data = { 0 };
struct efi_mem_range mr;
efi_memory_desc_t md;
- int num_entries;
+ int num_entries, ret;
void *new;
- if (efi_mem_desc_lookup(addr, &md) ||
- md.type != EFI_BOOT_SERVICES_DATA) {
+ /*
+ * efi_mem_reserve() is used to reserve boot service memory, eg. bgrt,
+ * but it is not neccasery for kexec, as there are no boot services in
+ * kexec reboot at all after the first kernel's ExitBootServices().
+ *
+ * Additionally kexec_enter_virtual_mode() during late init will remap
+ * the efi_memmap physical pages allocated here via memblock & then
+ * subsequently cause random EFI memmap corruption once memblock is freed.
+ *
+ * Therefore, skip efi_mem_reserve for kexec booting by checking the
+ * EFI_MEMORY_RUNTIME attribute which indicates boot service memory
+ * ranges reserved by the first kernel using efi_mem_reserve and marked
+ * with EFI_MEMORY_RUNTIME attribute.
+ */
+
+ ret = efi_mem_desc_lookup(addr, &md);
+ if (ret) {
pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}
+ if (md.type != EFI_BOOT_SERVICES_DATA) {
+ pr_err("Skip reserving non EFI Boot Service Data memory for %pa\n", &addr);
+ return;
+ }
+
+ /* Kexec copied the efi memmap from the first kernel, thus skip the case */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;
--
2.34.1
On Thu, May 30, 2024 at 11:36:55PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> With SNP guest kexec observe the following efi memmap corruption :
>
> [ 0.000000] efi: EFI v2.7 by EDK II
> [ 0.000000] efi: SMBIOS=0x7e33f000 SMBIOS 3.0=0x7e33d000 ACPI=0x7e57e000 ACPI 2.0=0x7e57e014 MEMATTR=0x7cc3c018 Unaccepted=0x7c09e018
> [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
> [ 0.000000] efi: mem03: [type=269370880|attr=0x0e42100e42180e41] range=[0x0486200e41038c18-0x200e898a0eee713ac17] (invalid)
> [ 0.000000] efi: mem04: [type=12336|attr=0x0e410686300e4105] range=[0x100e420000000176-0x8c290f26248d200e175] (invalid)
> [ 0.000000] efi: mem06: [type=1124304408|attr=0x000030b400000028] range=[0x0e51300e45280e77-0xb44ed2142f460c1e76] (invalid)
> [ 0.000000] efi: mem08: [type=68|attr=0x300e540583280e41] range=[0x0000011affff3cd8-0x486200e54b38c0bcd7] (invalid)
> [ 0.000000] efi: mem10: [type=1107529240|attr=0x0e42280e41300e41] range=[0x300e41058c280e42-0x38010ae54c5c328ee41] (invalid)
> [ 0.000000] efi: mem11: [type=189335566|attr=0x048d200e42038e18] range=[0x0000318c00000048-0xe42029228ce4200047] (invalid)
> [ 0.000000] efi: mem12: [type=239142534|attr=0x0000002400000b4b] range=[0x0e41380e0a7d700e-0x80f26238f22bfe500d] (invalid)
> [ 0.000000] efi: mem14: [type=239207055|attr=0x0e41300e43380e0a] range=[0x8c280e42048d200e-0xc70b028f2f27cc0a00d] (invalid)
> [ 0.000000] efi: mem15: [type=239210510|attr=0x00080e660b47080e] range=[0x0000324c0000001c-0xa78028634ce490001b] (invalid)
> [ 0.000000] efi: mem16: [type=4294848528|attr=0x0000329400000014] range=[0x0e410286100e4100-0x80f252036a218f20ff] (invalid)
> [ 0.000000] efi: mem19: [type=2250772033|attr=0x42180e42200e4328] range=[0x41280e0ab9020683-0xe0e538c28b39e62682] (invalid)
> [ 0.000000] efi: mem20: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x00000008ffff4438-0xffff44340090333c437] (invalid)
> [ 0.000000] efi: mem22: [Reserved |attr=0x000000c1ffff4420] range=[0xffff442400003398-0x1033a04240003f397] (invalid)
> [ 0.000000] efi: mem23: [type=1141080856|attr=0x080e41100e43180e] range=[0x280e66300e4b280e-0x440dc5ee7141f4c080d] (invalid)
> [ 0.000000] efi: mem25: [Reserved |attr=0x0000000affff44a0] range=[0xffff44a400003428-0x1034304a400013427] (invalid)
> [ 0.000000] efi: mem28: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4488-0xffff448400b034bc487] (invalid)
> [ 0.000000] efi: mem30: [Reserved |attr=0x0000000affff4470] range=[0xffff447400003518-0x10352047400013517] (invalid)
> [ 0.000000] efi: mem33: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4458-0xffff445400b035ac457] (invalid)
> [ 0.000000] efi: mem35: [type=269372416|attr=0x0e42100e42180e41] range=[0x0486200e44038c18-0x200e8b8a0eee823ac17] (invalid)
> [ 0.000000] efi: mem37: [type=2351435330|attr=0x0e42100e42180e42] range=[0x470783380e410686-0x2002b2a041c2141e685] (invalid)
> [ 0.000000] efi: mem38: [type=1093668417|attr=0x100e420000000270] range=[0x42100e42180e4220-0xfff366a4e421b78c21f] (invalid)
> [ 0.000000] efi: mem39: [type=76357646|attr=0x180e42200e42280e] range=[0x0e410686300e4105-0x4130f251a0710ae5104] (invalid)
> [ 0.000000] efi: mem40: [type=940444268|attr=0x0e42200e42280e41] range=[0x180e42200e42280e-0x300fc71c300b4f2480d] (invalid)
> [ 0.000000] efi: mem41: [MMIO |attr=0x8c280e42048d200e] range=[0xffff479400003728-0x42138e0c87820292727] (invalid)
> [ 0.000000] efi: mem42: [type=1191674680|attr=0x0000004c0000000b] range=[0x300e41380e0a0246-0x470b0f26238f22b8245] (invalid)
> [ 0.000000] efi: mem43: [type=2010|attr=0x0301f00e4d078338] range=[0x45038e180e42028f-0xe4556bf118f282528e] (invalid)
> [ 0.000000] efi: mem44: [type=1109921345|attr=0x300e44000000006c] range=[0x44080e42100e4218-0xfff39254e42138ac217] (invalid)
> ...
>
> This EFI memap corruption is happening with efi_arch_mem_reserve() invocation in case of kexec boot.
>
> ( efi_arch_mem_reserve() is invoked with the following call-stack: )
>
> [ 0.310010] efi_arch_mem_reserve+0xb1/0x220
> [ 0.311382] efi_mem_reserve+0x36/0x60
> [ 0.311973] efi_bgrt_init+0x17d/0x1a0
> [ 0.313265] acpi_parse_bgrt+0x12/0x20
> [ 0.313858] acpi_table_parse+0x77/0xd0
> [ 0.314463] acpi_boot_init+0x362/0x630
> [ 0.315069] setup_arch+0xa88/0xf80
> [ 0.315629] start_kernel+0x68/0xa90
> [ 0.316194] x86_64_start_reservations+0x1c/0x30
> [ 0.316921] x86_64_start_kernel+0xbf/0x110
> [ 0.317582] common_startup_64+0x13e/0x141
>
> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> EFI memory map and due to early allocation it uses memblock allocation.
>
> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> in case of a kexec-ed kernel boot.
>
> This function kexec_enter_virtual_mode() installs the new EFI memory map by
> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
>
> Subsequently, when memblock is freed later in boot flow, this remapped
> efi_memmap will have random corruption (similar to a use-after-free scenario).
>
> The corrupted EFI memory map is then passed to the next kexec-ed kernel
> which causes a panic when trying to use the corrupted EFI memory map.
This sounds fishy: memblock allocated memory is not freed later in the
boot - it remains reserved. Only free memory is freed from memblock to
the buddy allocator.
Or is the problem that memblock-allocated memory cannot be memremapped
because *raisins*?
Mike?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 6/3/2024 3:56 AM, Borislav Petkov wrote >> EFI memory map and due to early allocation it uses memblock allocation. >> >> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode() >> in case of a kexec-ed kernel boot. >> >> This function kexec_enter_virtual_mode() installs the new EFI memory map by >> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated >> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation. >> >> Subsequently, when memblock is freed later in boot flow, this remapped >> efi_memmap will have random corruption (similar to a use-after-free scenario). >> >> The corrupted EFI memory map is then passed to the next kexec-ed kernel >> which causes a panic when trying to use the corrupted EFI memory map. > This sounds fishy: memblock allocated memory is not freed later in the > boot - it remains reserved. Only free memory is freed from memblock to > the buddy allocator. > > Or is the problem that memblock-allocated memory cannot be memremapped > because *raisins*? This is what seems to be happening: efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for EFI memory map and due to early allocation it uses memblock allocation. And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode() in case of a kexec-ed kernel boot. This function kexec_enter_virtual_mode() installs the new EFI memory map by calling efi_memmap_init_late() which does memremap() on memblock-allocated memory. Thanks, Ashish > > Mike? >
On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote: > On 6/3/2024 3:56 AM, Borislav Petkov wrote > > > > EFI memory map and due to early allocation it uses memblock allocation. > > > > > > Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode() > > > in case of a kexec-ed kernel boot. > > > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by > > > calling efi_memmap_init_late() which remaps the efi_memmap physically allocated > > > in efi_arch_mem_reserve(), but this remapping is still using memblock allocation. > > > > > > Subsequently, when memblock is freed later in boot flow, this remapped > > > efi_memmap will have random corruption (similar to a use-after-free scenario). > > > > > > The corrupted EFI memory map is then passed to the next kexec-ed kernel > > > which causes a panic when trying to use the corrupted EFI memory map. > > This sounds fishy: memblock allocated memory is not freed later in the > > boot - it remains reserved. Only free memory is freed from memblock to > > the buddy allocator. > > > > Or is the problem that memblock-allocated memory cannot be memremapped > > because *raisins*? > > This is what seems to be happening: > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for > EFI memory map and due to early allocation it uses memblock allocation. > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode() > in case of a kexec-ed kernel boot. > > This function kexec_enter_virtual_mode() installs the new EFI memory map by > calling efi_memmap_init_late() which does memremap() on memblock-allocated memory. Does the issue happen only with SNP? I didn't really dig, but my theory would be that it has something to do with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c > Thanks, Ashish -- Sincerely yours, Mike.
On 6/3/2024 8:39 AM, Mike Rapoport wrote: > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote: >> On 6/3/2024 3:56 AM, Borislav Petkov wrote >> >>>> EFI memory map and due to early allocation it uses memblock allocation. >>>> >>>> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode() >>>> in case of a kexec-ed kernel boot. >>>> >>>> This function kexec_enter_virtual_mode() installs the new EFI memory map by >>>> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated >>>> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation. >>>> >>>> Subsequently, when memblock is freed later in boot flow, this remapped >>>> efi_memmap will have random corruption (similar to a use-after-free scenario). >>>> >>>> The corrupted EFI memory map is then passed to the next kexec-ed kernel >>>> which causes a panic when trying to use the corrupted EFI memory map. >>> This sounds fishy: memblock allocated memory is not freed later in the >>> boot - it remains reserved. Only free memory is freed from memblock to >>> the buddy allocator. >>> >>> Or is the problem that memblock-allocated memory cannot be memremapped >>> because *raisins*? >> This is what seems to be happening: >> >> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for >> EFI memory map and due to early allocation it uses memblock allocation. >> >> And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode() >> in case of a kexec-ed kernel boot. >> >> This function kexec_enter_virtual_mode() installs the new EFI memory map by >> calling efi_memmap_init_late() which does memremap() on memblock-allocated memory. > Does the issue happen only with SNP? This is observed under SNP as efi_arch_mem_reserve() is only being called with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map using memblock. If we skip efi_arch_mem_reserve() (which should probably be anyway skipped for kexec case), then for kexec boot, EFI memmap is memremapped in the same virtual address as the first kernel and not the allocated memblock address. Thanks, Ashish > > I didn't really dig, but my theory would be that it has something to do > with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c > >> Thanks, Ashish
On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote: > On 6/3/2024 8:39 AM, Mike Rapoport wrote: > > > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote: > > > On 6/3/2024 3:56 AM, Borislav Petkov wrote > > > > > > > > EFI memory map and due to early allocation it uses memblock allocation. > > > > > > > > > > Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode() > > > > > in case of a kexec-ed kernel boot. > > > > > > > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by > > > > > calling efi_memmap_init_late() which remaps the efi_memmap physically allocated > > > > > in efi_arch_mem_reserve(), but this remapping is still using memblock allocation. > > > > > > > > > > Subsequently, when memblock is freed later in boot flow, this remapped > > > > > efi_memmap will have random corruption (similar to a use-after-free scenario). > > > > > > > > > > The corrupted EFI memory map is then passed to the next kexec-ed kernel > > > > > which causes a panic when trying to use the corrupted EFI memory map. > > > > This sounds fishy: memblock allocated memory is not freed later in the > > > > boot - it remains reserved. Only free memory is freed from memblock to > > > > the buddy allocator. > > > > > > > > Or is the problem that memblock-allocated memory cannot be memremapped > > > > because *raisins*? > > > This is what seems to be happening: > > > > > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for > > > EFI memory map and due to early allocation it uses memblock allocation. > > > > > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode() > > > in case of a kexec-ed kernel boot. > > > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by > > > calling efi_memmap_init_late() which does memremap() on memblock-allocated memory. > > Does the issue happen only with SNP? > > This is observed under SNP as efi_arch_mem_reserve() is only being called > with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map > using memblock. I don't see how efi_arch_mem_reserve() is only called with SNP. What did I miss? > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped > for kexec case), then for kexec boot, EFI memmap is memremapped in the same > virtual address as the first kernel and not the allocated memblock address. Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we still need to understand what's causing memory corruption. > Thanks, Ashish > > > > > I didn't really dig, but my theory would be that it has something to do > > with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c > > > Thanks, Ashish -- Sincerely yours, Mike.
On 6/3/2024 10:29 AM, Mike Rapoport wrote: > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote: >> On 6/3/2024 8:39 AM, Mike Rapoport wrote: >> >>> On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote: >>>> On 6/3/2024 3:56 AM, Borislav Petkov wrote >>>> >>>>>> EFI memory map and due to early allocation it uses memblock allocation. >>>>>> >>>>>> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode() >>>>>> in case of a kexec-ed kernel boot. >>>>>> >>>>>> This function kexec_enter_virtual_mode() installs the new EFI memory map by >>>>>> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated >>>>>> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation. >>>>>> >>>>>> Subsequently, when memblock is freed later in boot flow, this remapped >>>>>> efi_memmap will have random corruption (similar to a use-after-free scenario). >>>>>> >>>>>> The corrupted EFI memory map is then passed to the next kexec-ed kernel >>>>>> which causes a panic when trying to use the corrupted EFI memory map. >>>>> This sounds fishy: memblock allocated memory is not freed later in the >>>>> boot - it remains reserved. Only free memory is freed from memblock to >>>>> the buddy allocator. >>>>> >>>>> Or is the problem that memblock-allocated memory cannot be memremapped >>>>> because *raisins*? >>>> This is what seems to be happening: >>>> >>>> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for >>>> EFI memory map and due to early allocation it uses memblock allocation. >>>> >>>> And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode() >>>> in case of a kexec-ed kernel boot. >>>> >>>> This function kexec_enter_virtual_mode() installs the new EFI memory map by >>>> calling efi_memmap_init_late() which does memremap() on memblock-allocated memory. >>> Does the issue happen only with SNP? >> This is observed under SNP as efi_arch_mem_reserve() is only being called >> with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map >> using memblock. > I don't see how efi_arch_mem_reserve() is only called with SNP. What did I > miss? > This is the call stack for efi_arch_mem_reserve(): [ 0.310010] efi_arch_mem_reserve+0xb1/0x220 [ 0.311382] efi_mem_reserve+0x36/0x60 [ 0.311973] efi_bgrt_init+0x17d/0x1a0 [ 0.313265] acpi_parse_bgrt+0x12/0x20 [ 0.313858] acpi_table_parse+0x77/0xd0 [ 0.314463] acpi_boot_init+0x362/0x630 [ 0.315069] setup_arch+0xa88/0xf80 [ 0.315629] start_kernel+0x68/0xa90 [ 0.316194] x86_64_start_reservations+0x1c/0x30 [ 0.316921] x86_64_start_kernel+0xbf/0x110 [ 0.317582] common_startup_64+0x13e/0x141 So, probably it is being invoked specifically for AMD platform ? >> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped >> for kexec case), then for kexec boot, EFI memmap is memremapped in the same >> virtual address as the first kernel and not the allocated memblock address. > Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we > still need to understand what's causing memory corruption. When, efi_arch_mem_reserve() allocates memory for EFI memory map using memblock and then later in boot, kexec_enter_virtual_mode() does memremap on this memblock allocated memory, subsequently after this i see EFI memory map corruption, so are there are any issues doing memremap on memblock-allocated memory ? Thanks, Ashish >>> I didn't really dig, but my theory would be that it has something to do >>> with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c >>>> Thanks, Ashish
On Mon, Jun 03, 2024 at 11:56:01AM -0500, Kalra, Ashish wrote: > On 6/3/2024 10:29 AM, Mike Rapoport wrote: > > > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote: > > > On 6/3/2024 8:39 AM, Mike Rapoport wrote: > > > > > > > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote: > > > > > On 6/3/2024 3:56 AM, Borislav Petkov wrote > > > > > > > > > > > > EFI memory map and due to early allocation it uses memblock allocation. > > > > > > > > > > > > > > Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode() > > > > > > > in case of a kexec-ed kernel boot. > > > > > > > > > > > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by > > > > > > > calling efi_memmap_init_late() which remaps the efi_memmap physically allocated > > > > > > > in efi_arch_mem_reserve(), but this remapping is still using memblock allocation. > > > > > > > > > > > > > > Subsequently, when memblock is freed later in boot flow, this remapped > > > > > > > efi_memmap will have random corruption (similar to a use-after-free scenario). > > > > > > > > > > > > > > The corrupted EFI memory map is then passed to the next kexec-ed kernel > > > > > > > which causes a panic when trying to use the corrupted EFI memory map. > > > > > > This sounds fishy: memblock allocated memory is not freed later in the > > > > > > boot - it remains reserved. Only free memory is freed from memblock to > > > > > > the buddy allocator. > > > > > > > > > > > > Or is the problem that memblock-allocated memory cannot be memremapped > > > > > > because *raisins*? > > > > > This is what seems to be happening: > > > > > > > > > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for > > > > > EFI memory map and due to early allocation it uses memblock allocation. > > > > > > > > > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode() > > > > > in case of a kexec-ed kernel boot. > > > > > > > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by > > > > > calling efi_memmap_init_late() which does memremap() on memblock-allocated memory. > > > > Does the issue happen only with SNP? > > > This is observed under SNP as efi_arch_mem_reserve() is only being called > > > with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map > > > using memblock. > > I don't see how efi_arch_mem_reserve() is only called with SNP. What did I > > miss? > > This is the call stack for efi_arch_mem_reserve(): > > [ 0.310010] efi_arch_mem_reserve+0xb1/0x220 > [ 0.311382] efi_mem_reserve+0x36/0x60 > [ 0.311973] efi_bgrt_init+0x17d/0x1a0 > [ 0.313265] acpi_parse_bgrt+0x12/0x20 > [ 0.313858] acpi_table_parse+0x77/0xd0 > [ 0.314463] acpi_boot_init+0x362/0x630 > [ 0.315069] setup_arch+0xa88/0xf80 > [ 0.315629] start_kernel+0x68/0xa90 > [ 0.316194] x86_64_start_reservations+0x1c/0x30 > [ 0.316921] x86_64_start_kernel+0xbf/0x110 > [ 0.317582] common_startup_64+0x13e/0x141 > > So, probably it is being invoked specifically for AMD platform ? AFAIU, efi_bgrt_init() can be called for any x86 platform, with or without encryption. So if my understating is correct, efi_arch_mem_reserve() will be called with SNP disabled as well. And if kexec works ok without SNP but fails with SNP this may give as a clue to the root cause of the failure. > > > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped > > > for kexec case), then for kexec boot, EFI memmap is memremapped in the same > > > virtual address as the first kernel and not the allocated memblock address. > > Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we > > still need to understand what's causing memory corruption. > > When, efi_arch_mem_reserve() allocates memory for EFI memory map using > memblock and then later in boot, kexec_enter_virtual_mode() does memremap on > this memblock allocated memory, subsequently after this i see EFI memory map > corruption, so are there are any issues doing memremap on memblock-allocated > memory ? memblock-allocated memory is just RAM, so my take is that memremap() cannot figure out the encryption bits properly. You can check if there are issues with memrmapp()ing memblock-allocated memory by sticking memblock_phys_alloc() somewhere, filling that memory with a pattern and then calling memremap(addr, size, MEMREMAP_WB) and checking if the pattern is still there. > Thanks, Ashish > > > > > I didn't really dig, but my theory would be that it has something to do > > > > with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c > > > > > Thanks, Ashish -- Sincerely yours, Mike.
On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> virtual address as the first kernel and not the allocated memblock address.
Are you saying that we should simply do
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index fdf07dd6f459..410cb0743289 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
return;
+ if (kexec_in_progress)
+ return;
+
if (!memblock_is_region_reserved(addr, size))
memblock_reserve(addr, size);
and skip that whole call?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote: > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote: > > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped > > for kexec case), then for kexec boot, EFI memmap is memremapped in the same > > virtual address as the first kernel and not the allocated memblock address. > > Are you saying that we should simply do > > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > index fdf07dd6f459..410cb0743289 100644 > --- a/drivers/firmware/efi/efi.c > +++ b/drivers/firmware/efi/efi.c > @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size) > if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT))) > return; > > + if (kexec_in_progress) > + return; > + > if (!memblock_is_region_reserved(addr, size)) > memblock_reserve(addr, size); > > and skip that whole call? I think Ashish suggested rather diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index fdf07dd6f459..eccc10ab15a4 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size) if (!memblock_is_region_reserved(addr, size)) memblock_reserve(addr, size); + if (kexec_in_progress) + return; + /* * Some architectures (x86) reserve all boot services ranges * until efi_free_boot_services() because of buggy firmware > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette -- Sincerely yours, Mike.
On Mon, 3 Jun 2024 at 23:33, Mike Rapoport <rppt@kernel.org> wrote: > > On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote: > > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote: > > > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped > > > for kexec case), then for kexec boot, EFI memmap is memremapped in the same > > > virtual address as the first kernel and not the allocated memblock address. > > > > Are you saying that we should simply do > > > > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > > index fdf07dd6f459..410cb0743289 100644 > > --- a/drivers/firmware/efi/efi.c > > +++ b/drivers/firmware/efi/efi.c > > @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size) > > if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT))) > > return; > > > > + if (kexec_in_progress) > > + return; > > + kexec_in_progress is only for checking if this is in a reboot (kexec) code path. But eif_mem_reserve is only called during the boot time so checking kexec_in_progress is meaningless here. current_kernel_is_booted_via_kexec != is_rebooting_with_kexec The code change below in the patch looks good to me, but I'm not sure what caused the memory corruption, it indeed worth some more digging, maybe SEV/SNP related. + if (md.attribute & EFI_MEMORY_RUNTIME) + return; Thanks Dave
On Tue, Jun 04, 2024 at 09:23:58AM +0800, Dave Young wrote:
> kexec_in_progress is only for checking if this is in a reboot (kexec) code path.
> But eif_mem_reserve is only called during the boot time so checking
> kexec_in_progress is meaningless here.
> current_kernel_is_booted_via_kexec != is_rebooting_with_kexec
That's exactly what I wanna check: whether this is a kexec-ed kernel. Or
is there a better helper for that?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, 4 Jun 2024 at 17:44, Borislav Petkov <bp@alien8.de> wrote: > > On Tue, Jun 04, 2024 at 09:23:58AM +0800, Dave Young wrote: > > kexec_in_progress is only for checking if this is in a reboot (kexec) code path. > > But eif_mem_reserve is only called during the boot time so checking > > kexec_in_progress is meaningless here. > > current_kernel_is_booted_via_kexec != is_rebooting_with_kexec > > That's exactly what I wanna check: whether this is a kexec-ed kernel. Or > is there a better helper for that? No general way to check if it is a kexec-ed kernel or not, for x86 one can check the efi_setup as Ashish's original patch did, as the kexec booted kernel (efi boot) will have efi setup_data passed in. Otherwise there is a type_of_loader field for x86 boot protocol, kexec-tools is 0x0D, the kexec_file_load also uses this. But adding the type_of_loader was only added in kexec-tools code when Yinghai worked on the kexec-tools bzImage64 load, so older kexec-tools will not set this field. Anyway the in-kernel kexec_file_load code for x86 added 0x0D as loader type from the beginning. Anyway there is not such a helper for all cases. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette >
On Tue, Jun 04, 2024 at 07:09:56PM +0800, Dave Young wrote:
> Anyway there is not such a helper for all cases.
But maybe there should be...
This is not the first case where the need arises to be able to say:
if (am I a kexeced kernel)
in code.
Perhaps we should have a global var kexeced or so which gets incremented
on each kexec-ed kernel, somewhere in very early boot of the kexec-ed
kernel we do
kexeced++;
and then other code can query it and know whether this is a kexec-ed
kernel and how many times it got kexec-ed...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, 5 Jun 2024 at 02:03, Borislav Petkov <bp@alien8.de> wrote: > > On Tue, Jun 04, 2024 at 07:09:56PM +0800, Dave Young wrote: > > Anyway there is not such a helper for all cases. > > But maybe there should be... > > This is not the first case where the need arises to be able to say: > > if (am I a kexeced kernel) > > in code. > > Perhaps we should have a global var kexeced or so which gets incremented > on each kexec-ed kernel, somewhere in very early boot of the kexec-ed > kernel we do > > kexeced++; > > and then other code can query it and know whether this is a kexec-ed > kernel and how many times it got kexec-ed... It's something good to have but not must for the time being, also no idea how to save the status across boot, for EFI boot case probably a EFI var can be used, but how can it be cleared in case of physical boot. Otherwise probably injecting some kernel parameters, anyway this needs more thinking. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette >
On Wed, Jun 05, 2024 at 10:53:44AM +0800, Dave Young wrote:
> It's something good to have but not must for the time being, also no
> idea how to save the status across boot, for EFI boot case probably a
> EFI var can be used;
Yes.
> but how can it be cleared in case of physical boot. Otherwise
> probably injecting some kernel parameters, anyway this needs more
> thinking.
Yeah, this'll need proper analysis whether we can even do that reliably.
We need to increment it only on the kexec reboot paths and clear it on
the normal reboot paths.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, 5 Jun 2024 at 09:43, Borislav Petkov <bp@alien8.de> wrote: > > On Wed, Jun 05, 2024 at 10:53:44AM +0800, Dave Young wrote: > > It's something good to have but not must for the time being, also no > > idea how to save the status across boot, for EFI boot case probably a > > EFI var can be used; > > Yes. > > > but how can it be cleared in case of physical boot. Otherwise > > probably injecting some kernel parameters, anyway this needs more > > thinking. > > Yeah, this'll need proper analysis whether we can even do that reliably. > > We need to increment it only on the kexec reboot paths and clear it on > the normal reboot paths. > I'd argue for the opposite: ideally, the difference between the first boot and not-the-first-boot should be abstracted away by the 'bootloader' side of kexec as much as possible, so that the tricky early startup code doesn't have to be riddled with different code paths depending on !kexec vs kexec. TDX is a good case in point here: rather than add more conditionals, I'd urge to remove them so the TDX startup code doesn't have to care about the difference at all. If there is anything special that needs to be done, it belongs in the kexec implementation of the previous kernel.
On Wed, Jun 05, 2024 at 10:17:22AM +0200, Ard Biesheuvel wrote:
> I'd argue for the opposite: ideally, the difference between the first
> boot and not-the-first-boot should be abstracted away by the
> 'bootloader' side of kexec as much as possible, so that the tricky
> early startup code doesn't have to be riddled with different code
> paths depending on !kexec vs kexec.
Well, off and on we end up needing to be able to ask whether the current
kernel is kexec-ed. So you need to be able to access that aspect in
kernel code - not in the bootloader. Perhaps read it from the
bootloader, sure.
But see my other mail from just now - it might end up not needing it
after all and I'd prefer if we never ever have to ask that question but
just from staring at EFI code it reminded me that we do need to ask that
question already:
if (efi_setup)
kexec_enter_virtual_mode();
else
__efi_enter_virtual_mode();
*exactly* because of EFI and that virtual_map call nonsense of allowing
it only once.
And we check efi_setup here because that works. But you can't use that
globally. And so on...
> TDX is a good case in point here: rather than add more conditionals,
> I'd urge to remove them so the TDX startup code doesn't have to care
> about the difference at all. If there is anything special that needs
> to be done, it belongs in the kexec implementation of the previous
> kernel.
Sure, but reality is not as easy sometimes.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 6/3/2024 10:31 AM, Mike Rapoport wrote:
> On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
>> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
>>> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
>>> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
>>> virtual address as the first kernel and not the allocated memblock address.
>> Are you saying that we should simply do
>>
>> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
>> index fdf07dd6f459..410cb0743289 100644
>> --- a/drivers/firmware/efi/efi.c
>> +++ b/drivers/firmware/efi/efi.c
>> @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
>> if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
>> return;
>>
>> + if (kexec_in_progress)
>> + return;
>> +
>> if (!memblock_is_region_reserved(addr, size))
>> memblock_reserve(addr, size);
>>
>> and skip that whole call?
> I think Ashish suggested rather
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index fdf07dd6f459..eccc10ab15a4 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
> if (!memblock_is_region_reserved(addr, size))
> memblock_reserve(addr, size);
>
> + if (kexec_in_progress)
> + return;
> +
> /*
> * Some architectures (x86) reserve all boot services ranges
> * until efi_free_boot_services() because of buggy firmware
>
Yes, something similar as above, as efi_mem_reserve() is used to reserve
boot service memory and is not necessary for kexec boot.
So, Dave Young (dyoung@redhat.com) had suggested that we skip
efi_arch_mem_reserve() for kexec by checking the set EFI_MEMORY_RUNTIME
attribute as below:
diff
<https://lore.kernel.org/lkml/Zl3HfiQ6oHdTdOdA@kernel.org/T/#iZ2e.:..:f4be03b8488665f56a1e5c6e6459f447352dfcf5.1717111180.git.ashish.kalra::40amd.com:1arch:x86:platform:efi:quirks.c>
--git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..6f398c59278a 100644 ---
a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@
-255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr,
u64 size) struct efi_memory_map_data data = { 0 };
struct efi_mem_range mr;
efi_memory_desc_t md;
- int num_entries; + int num_entries, ret; void *new;
- if (efi_mem_desc_lookup(addr, &md) || - md.type !=
EFI_BOOT_SERVICES_DATA) { + /* + * efi_mem_reserve() is used to reserve
boot service memory, eg. bgrt, + * but it is not neccasery for kexec, as
there are no boot services in + * kexec reboot at all after the first
kernel's ExitBootServices(). + * + * Therefore, skip efi_mem_reserve for
kexec booting by checking the + * EFI_MEMORY_RUNTIME attribute which
indicates boot service memory + * ranges reserved by the first kernel
using efi_mem_reserve and marked + * with EFI_MEMORY_RUNTIME attribute.
+ */ + + ret = efi_mem_desc_lookup(addr, &md); + if (ret) { pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}
+ if (md.type != EFI_BOOT_SERVICES_DATA) { + pr_err("Skip reserving non
EFI Boot Service Data memory for %pa\n", &addr); + return; + } + + /*
Kexec copied the efi memmap from the first kernel, thus skip the case */
+ if (md.attribute & EFI_MEMORY_RUNTIME) + return; + if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;
--
2.34.1
>> --
>> Regards/Gruss,
>> Boris.
>>
>> https://people.kernel.org/tglx/notes-about-netiquette
Re-sending this, last response got garbled.
On 6/3/2024 11:48 AM, Kalra, Ashish wrote:
> On 6/3/2024 10:31 AM, Mike Rapoport wrote:
>
>> On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
>>> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
>>>> If we skip efi_arch_mem_reserve() (which should probably be anyway
>>>> skipped
>>>> for kexec case), then for kexec boot, EFI memmap is memremapped in
>>>> the same
>>>> virtual address as the first kernel and not the allocated memblock
>>>> address.
>>> Are you saying that we should simply do
>>>
>>> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
>>> index fdf07dd6f459..410cb0743289 100644
>>> --- a/drivers/firmware/efi/efi.c
>>> +++ b/drivers/firmware/efi/efi.c
>>> @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr,
>>> u64 size)
>>> if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
>>> return;
>>> + if (kexec_in_progress)
>>> + return;
>>> +
>>> if (!memblock_is_region_reserved(addr, size))
>>> memblock_reserve(addr, size);
>>> and skip that whole call?
>> I think Ashish suggested rather
>>
>> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
>> index fdf07dd6f459..eccc10ab15a4 100644
>> --- a/drivers/firmware/efi/efi.c
>> +++ b/drivers/firmware/efi/efi.c
>> @@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64
>> size)
>> if (!memblock_is_region_reserved(addr, size))
>> memblock_reserve(addr, size);
>> + if (kexec_in_progress)
>> + return;
>> +
>> /*
>> * Some architectures (x86) reserve all boot services ranges
>> * until efi_free_boot_services() because of buggy firmware
> Yes, something similar as above, as efi_mem_reserve() is used to
> reserve boot service memory and is not necessary for kexec boot.
>
> So, Dave Young (dyoung@redhat.com) had suggested that we skip
> efi_arch_mem_reserve() for kexec by checking the set
> EFI_MEMORY_RUNTIME attribute as below:
>
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..6f398c59278a 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr,
u64 size)
struct efi_memory_map_data data = { 0 };
struct efi_mem_range mr;
efi_memory_desc_t md;
- int num_entries;
+ int num_entries, ret;
void *new;
- if (efi_mem_desc_lookup(addr, &md) ||
- md.type != EFI_BOOT_SERVICES_DATA) {
+ /*
+ * efi_mem_reserve() is used to reserve boot service memory, eg.
bgrt,
+ * but it is not neccasery for kexec, as there are no boot
services in
+ * kexec reboot at all after the first kernel's ExitBootServices().
+ *
+ * Therefore, skip efi_mem_reserve for kexec booting by checking the
+ * EFI_MEMORY_RUNTIME attribute which indicates boot service memory
+ * ranges reserved by the first kernel using efi_mem_reserve and
marked
+ * with EFI_MEMORY_RUNTIME attribute.
+ */
+
+ ret = efi_mem_desc_lookup(addr, &md);
+ if (ret) {
pr_err("Failed to lookup EFI memory descriptor for
%pa\n", &addr);
return;
}
+ if (md.type != EFI_BOOT_SERVICES_DATA) {
+ pr_err("Skip reserving non EFI Boot Service Data memory
for %pa\n", &addr);
+ return;
+ }
+
+ /* Kexec copied the efi memmap from the first kernel, thus skip
the case */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
if (addr + size > md.phys_addr + (md.num_pages <<
EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n",
&addr);
return;
Thanks, Ashish
On Mon, Jun 03, 2024 at 12:05:45PM -0500, Kalra, Ashish wrote:
> Re-sending this, last response got garbled.
And this got linewrapped.
Thunderbird section in Documentation/process/email-clients.rst.
> index f0cc00032751..6f398c59278a 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64
> size)
^^^
> struct efi_memory_map_data data = { 0 };
> struct efi_mem_range mr;
> efi_memory_desc_t md;
> - int num_entries;
> + int num_entries, ret;
> void *new;
>
> - if (efi_mem_desc_lookup(addr, &md) ||
> - md.type != EFI_BOOT_SERVICES_DATA) {
> + /*
> + * efi_mem_reserve() is used to reserve boot service memory, eg.
> bgrt,
^^^
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Jun 03, 2024 at 11:48:03AM -0500, Kalra, Ashish wrote:
> Yes, something similar as above, as efi_mem_reserve() is used to reserve
> boot service memory and is not necessary for kexec boot.
>
> So, Dave Young (dyoung@redhat.com) had suggested that we skip
> efi_arch_mem_reserve() for kexec by checking the set EFI_MEMORY_RUNTIME
> attribute as below:a
efi_arch_mem_reserve() or efi_mem_reserve() altogether?
Btw, that below got really gibberished by your mail client. Snipped.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 6/3/2024 11:57 AM, Borislav Petkov wrote: > On Mon, Jun 03, 2024 at 11:48:03AM -0500, Kalra, Ashish wrote: >> Yes, something similar as above, as efi_mem_reserve() is used to reserve >> boot service memory and is not necessary for kexec boot. >> >> So, Dave Young (dyoung@redhat.com) had suggested that we skip >> efi_arch_mem_reserve() for kexec by checking the set EFI_MEMORY_RUNTIME >> attribute as below:a > efi_arch_mem_reserve() or efi_mem_reserve() altogether? efi_arch_mem_reserve(). Thanks, Ashish > > Btw, that below got really gibberished by your mail client. Snipped. >
On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
> efi_arch_mem_reserve().
Now it only remains for you to explain why...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Re-sending as the earlier response got line-wrapped.
On 6/3/2024 12:12 PM, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
>> efi_arch_mem_reserve().
> Now it only remains for you to explain why...
Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:
Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:
...
[ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
[ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0
...
Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc() which does memblock_phys_alloc() to insert new EFI memory descriptor into efi.memap:
...
[ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
[ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80
...
Finally, at the end of boot, kexec_enter_virtual_mode() is called.
It does mapping of efi regions which were passed via setup_data.
So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:
( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).
[ 4.042160] efi: efi memmap phys map 0x27ffcaf80
So, kexec_enter_virtual_mode() does the following :
if (efi_memmap_init_late(efi.memmap.phys_map, <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
efi.memmap.desc_size * efi.memmap.nr_map)) { ...
This late init, does a memremap() on this memblock-allocated memory, but then immediately frees it :
drivers/firmware/efi/memmap.c:
int __init __efi_memmap_init(struct efi_memory_map_data *data)
{
..
phys_map = data->phys_map; <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
if (data->flags & EFI_MEMMAP_LATE)
map.map = memremap(phys_map, data->size, MEMREMAP_WB);
...
...
if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
__efi_memmap_free(efi.memmap.phys_map,
efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
}
...
map.phys_map = data->phys_map;
...
efi.memmap = map;
...
This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.
This is confirmed by memblock debugging:
[ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80
So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.
This corruption is seen just before kexec-ing into the new kernel:
...
[ 11.045522] PEFILE: Unsigned PE binary^M
[ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80^M
...
[ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001^M
[ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f^M
[ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0^M
[ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M <- CORRUPTION!!!
[ 11.061234] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061236] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061239] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061241] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061243] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061245] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061248] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061250] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061252] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061255] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061257] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061259] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061262] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061264] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061266] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061268] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061271] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061273] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061275] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061278] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061280] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061282] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061284] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061287] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061289] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061291] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061294] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061296] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061298] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061301] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061303] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061305] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061307] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061310] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061312] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0^M
[ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40^M
[ 11.061320] kexec-bzImage64: mmap entry, type = 0, va = 0x14f4b, pa = 0x65, np = 0x1, attr = 0x0^M
[ 11.061323] kexec-bzImage64: mmap entry, type = 85840, va = 0x0, pa = 0x2, np = 0x69, attr = 0x14f59^M
[ 11.061325] kexec-bzImage64: mmap entry, type = 0, va = 0x14f65, pa = 0x6c, np = 0x0, attr = 0x0^M
[ 11.061328] kexec-bzImage64: mmap entry, type = 85871, va = 0x0, pa = 0x0, np = 0x7a, attr = 0x14f7f^M
...
This EFI phys map address 0x27ffcaf80 is being mem-remapped and also getting freed and then in use after free condition (while setting up the EFI memory map for the next kernel with kexec -s) in the above logs confirm the use-after-free case.
Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
Thanks, Ashish
On Wed, 5 Jun 2024 at 06:36, Kalra, Ashish <ashish.kalra@amd.com> wrote:
>
> Re-sending as the earlier response got line-wrapped.
>
> On 6/3/2024 12:12 PM, Borislav Petkov wrote:
> > On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
> >> efi_arch_mem_reserve().
> > Now it only remains for you to explain why...
>
> Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:
>
> Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:
>
> ...
>
> [ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
> [ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0
>
> ...
>
> Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc() which does memblock_phys_alloc() to insert new EFI memory descriptor into efi.memap:
>
> ...
>
> [ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
> [ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80
>
> ...
>
> Finally, at the end of boot, kexec_enter_virtual_mode() is called.
>
> It does mapping of efi regions which were passed via setup_data.
>
> So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:
>
> ( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).
>
> [ 4.042160] efi: efi memmap phys map 0x27ffcaf80
>
> So, kexec_enter_virtual_mode() does the following :
>
> if (efi_memmap_init_late(efi.memmap.phys_map, <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
> efi.memmap.desc_size * efi.memmap.nr_map)) { ...
>
> This late init, does a memremap() on this memblock-allocated memory, but then immediately frees it :
>
> drivers/firmware/efi/memmap.c:
>
> int __init __efi_memmap_init(struct efi_memory_map_data *data)
> {
>
> ..
>
> phys_map = data->phys_map; <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
>
> if (data->flags & EFI_MEMMAP_LATE)
> map.map = memremap(phys_map, data->size, MEMREMAP_WB);
> ...
> ...
> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> __efi_memmap_free(efi.memmap.phys_map,
> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> }
From your debugging the memmap should not be freed. This piece of
code was added in below commit, added Dan Williams in cc list:
commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
Author: Dan Williams <dan.j.williams@intel.com>
Date: Mon Jan 13 18:22:44 2020 +0100
efi: Fix efi_memmap_alloc() leaks
With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
updated and replaced multiple times. When that happens a previous
dynamically allocated efi memory map can be garbage collected. Use the
new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
allocated memory map is being replaced.
>
> ...
> map.phys_map = data->phys_map;
>
> ...
>
> efi.memmap = map;
>
> ...
>
> This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.
>
> This is confirmed by memblock debugging:
>
> [ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80
>
> So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.
>
> This corruption is seen just before kexec-ing into the new kernel:
>
> ...
> [ 11.045522] PEFILE: Unsigned PE binary^M
> [ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80^M
> ...
> [ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001^M
> [ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f^M
> [ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0^M
> [ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M <- CORRUPTION!!!
> [ 11.061234] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061236] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061239] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061241] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061243] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061245] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061248] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061250] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061252] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061255] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061257] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061259] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061262] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061264] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061266] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061268] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061271] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061273] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061275] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061278] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061280] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061282] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061284] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061287] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061289] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061291] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061294] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061296] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061298] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061301] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061303] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061305] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061307] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061310] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061312] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
> [ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0^M
> [ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40^M
> [ 11.061320] kexec-bzImage64: mmap entry, type = 0, va = 0x14f4b, pa = 0x65, np = 0x1, attr = 0x0^M
> [ 11.061323] kexec-bzImage64: mmap entry, type = 85840, va = 0x0, pa = 0x2, np = 0x69, attr = 0x14f59^M
> [ 11.061325] kexec-bzImage64: mmap entry, type = 0, va = 0x14f65, pa = 0x6c, np = 0x0, attr = 0x0^M
> [ 11.061328] kexec-bzImage64: mmap entry, type = 85871, va = 0x0, pa = 0x0, np = 0x7a, attr = 0x14f7f^M
>
>
> ...
>
> This EFI phys map address 0x27ffcaf80 is being mem-remapped and also getting freed and then in use after free condition (while setting up the EFI memory map for the next kernel with kexec -s) in the above logs confirm the use-after-free case.
>
> Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
>
> Thanks, Ashish
>
On 6/4/2024 8:48 PM, Dave Young wrote:
> On Wed, 5 Jun 2024 at 06:36, Kalra, Ashish <ashish.kalra@amd.com> wrote:
>> Re-sending as the earlier response got line-wrapped.
>>
>> On 6/3/2024 12:12 PM, Borislav Petkov wrote:
>>> On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
>>>> efi_arch_mem_reserve().
>>> Now it only remains for you to explain why...
>> Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:
>>
>> Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:
>>
>> ...
>>
>> [ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
>> [ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0
>>
>> ...
>>
>> Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc() which does memblock_phys_alloc() to insert new EFI memory descriptor into efi.memap:
>>
>> ...
>>
>> [ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
>> [ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80
>>
>> ...
>>
>> Finally, at the end of boot, kexec_enter_virtual_mode() is called.
>>
>> It does mapping of efi regions which were passed via setup_data.
>>
>> So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:
>>
>> ( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).
>>
>> [ 4.042160] efi: efi memmap phys map 0x27ffcaf80
>>
>> So, kexec_enter_virtual_mode() does the following :
>>
>> if (efi_memmap_init_late(efi.memmap.phys_map, <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
>> efi.memmap.desc_size * efi.memmap.nr_map)) { ...
>>
>> This late init, does a memremap() on this memblock-allocated memory, but then immediately frees it :
>>
>> drivers/firmware/efi/memmap.c:
>>
>> int __init __efi_memmap_init(struct efi_memory_map_data *data)
>> {
>>
>> ..
>>
>> phys_map = data->phys_map; <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
>>
>> if (data->flags & EFI_MEMMAP_LATE)
>> map.map = memremap(phys_map, data->size, MEMREMAP_WB);
>> ...
>> ...
>> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
>> __efi_memmap_free(efi.memmap.phys_map,
>> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
>> }
> From your debugging the memmap should not be freed.
Yes, it looks like that it should not be freed, as the new and previous efi memory map can be same.
Thanks, Ashish
> This piece of
> code was added in below commit, added Dan Williams in cc list:
> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> Author: Dan Williams <dan.j.williams@intel.com>
> Date: Mon Jan 13 18:22:44 2020 +0100
>
> efi: Fix efi_memmap_alloc() leaks
>
> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> updated and replaced multiple times. When that happens a previous
> dynamically allocated efi memory map can be garbage collected. Use the
> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> allocated memory map is being replaced.
>
>
>> ...
>> map.phys_map = data->phys_map;
>>
>> ...
>>
>> efi.memmap = map;
>>
>> ...
>>
>> This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.
>>
>> This is confirmed by memblock debugging:
>>
>> [ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80
>>
>> So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.
>>
>> This corruption is seen just before kexec-ing into the new kernel:
>>
>> ...
>> [ 11.045522] PEFILE: Unsigned PE binary^M
>> [ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80^M
>> ...
>> [ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001^M
>> [ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f^M
>> [ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0^M
>> [ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M <- CORRUPTION!!!
>> [ 11.061234] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061236] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061239] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061241] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061243] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061245] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061248] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061250] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061252] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061255] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061257] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061259] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061262] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061264] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061266] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061268] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061271] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061273] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061275] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061278] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061280] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061282] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061284] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061287] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061289] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061291] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061294] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061296] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061298] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061301] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061303] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061305] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061307] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061310] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061312] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
>> [ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0^M
>> [ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40^M
>> [ 11.061320] kexec-bzImage64: mmap entry, type = 0, va = 0x14f4b, pa = 0x65, np = 0x1, attr = 0x0^M
>> [ 11.061323] kexec-bzImage64: mmap entry, type = 85840, va = 0x0, pa = 0x2, np = 0x69, attr = 0x14f59^M
>> [ 11.061325] kexec-bzImage64: mmap entry, type = 0, va = 0x14f65, pa = 0x6c, np = 0x0, attr = 0x0^M
>> [ 11.061328] kexec-bzImage64: mmap entry, type = 85871, va = 0x0, pa = 0x0, np = 0x7a, attr = 0x14f7f^M
>>
>>
>> ...
>>
>> This EFI phys map address 0x27ffcaf80 is being mem-remapped and also getting freed and then in use after free condition (while setting up the EFI memory map for the next kernel with kexec -s) in the above logs confirm the use-after-free case.
>>
>> Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
>>
>> Thanks, Ashish
>>
> > ...
> > if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> > __efi_memmap_free(efi.memmap.phys_map,
> > efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> > }
>
> From your debugging the memmap should not be freed. This piece of
> code was added in below commit, added Dan Williams in cc list:
> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> Author: Dan Williams <dan.j.williams@intel.com>
> Date: Mon Jan 13 18:22:44 2020 +0100
>
> efi: Fix efi_memmap_alloc() leaks
>
> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> updated and replaced multiple times. When that happens a previous
> dynamically allocated efi memory map can be garbage collected. Use the
> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> allocated memory map is being replaced.
>
Dan, probably those regions should be freed only for "fake" memmap?
On Wed, 5 Jun 2024 at 09:52, Dave Young <dyoung@redhat.com> wrote:
>
> > > ...
> > > if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> > > __efi_memmap_free(efi.memmap.phys_map,
> > > efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> > > }
> >
> > From your debugging the memmap should not be freed. This piece of
> > code was added in below commit, added Dan Williams in cc list:
> > commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> > Author: Dan Williams <dan.j.williams@intel.com>
> > Date: Mon Jan 13 18:22:44 2020 +0100
> >
> > efi: Fix efi_memmap_alloc() leaks
> >
> > With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> > updated and replaced multiple times. When that happens a previous
> > dynamically allocated efi memory map can be garbage collected. Use the
> > new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> > allocated memory map is being replaced.
> >
>
> Dan, probably those regions should be freed only for "fake" memmap?
Ashish, can you comment out the __efi_memmap_free see if it works for
you just confirm about the behavior.
Hello Dave,
On 6/4/2024 8:58 PM, Dave Young wrote:
> On Wed, 5 Jun 2024 at 09:52, Dave Young <dyoung@redhat.com> wrote:
>>>> ...
>>>> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
>>>> __efi_memmap_free(efi.memmap.phys_map,
>>>> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
>>>> }
>>> From your debugging the memmap should not be freed. This piece of
>>> code was added in below commit, added Dan Williams in cc list:
>>> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
>>> Author: Dan Williams <dan.j.williams@intel.com>
>>> Date: Mon Jan 13 18:22:44 2020 +0100
>>>
>>> efi: Fix efi_memmap_alloc() leaks
>>>
>>> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
>>> updated and replaced multiple times. When that happens a previous
>>> dynamically allocated efi memory map can be garbage collected. Use the
>>> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
>>> allocated memory map is being replaced.
>>>
>> Dan, probably those regions should be freed only for "fake" memmap?
> Ashish, can you comment out the __efi_memmap_free see if it works for
> you just confirm about the behavior.
Yes, i have already tried and tested that, if i avoid __efi_memmap_free(), then i don't see this memory map corruption.
Thanks, Ashish
On Wed, 5 Jun 2024 at 10:09, Kalra, Ashish <ashish.kalra@amd.com> wrote:
>
> Hello Dave,
>
> On 6/4/2024 8:58 PM, Dave Young wrote:
> > On Wed, 5 Jun 2024 at 09:52, Dave Young <dyoung@redhat.com> wrote:
> >>>> ...
> >>>> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> >>>> __efi_memmap_free(efi.memmap.phys_map,
> >>>> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> >>>> }
> >>> From your debugging the memmap should not be freed. This piece of
> >>> code was added in below commit, added Dan Williams in cc list:
> >>> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> >>> Author: Dan Williams <dan.j.williams@intel.com>
> >>> Date: Mon Jan 13 18:22:44 2020 +0100
> >>>
> >>> efi: Fix efi_memmap_alloc() leaks
> >>>
> >>> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> >>> updated and replaced multiple times. When that happens a previous
> >>> dynamically allocated efi memory map can be garbage collected. Use the
> >>> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> >>> allocated memory map is being replaced.
> >>>
> >> Dan, probably those regions should be freed only for "fake" memmap?
> > Ashish, can you comment out the __efi_memmap_free see if it works for
> > you just confirm about the behavior.
>
> Yes, i have already tried and tested that, if i avoid __efi_memmap_free(), then i don't see this memory map corruption.
Ok, thanks! I think the right way is creating two patches, one to
remove the __efi_memmap_free, another is skip efi_arch_mem_reserve
when the EFI_MEMORY_RUNTIME bit was set already. But the first one
should be the fix for the root cause.
efi fake mem is only for debugging purposes, the "memleak" mentioned
in commit 0f96a99dab36 should be solved in another way if needed (are
they really leaked? or just not useful anymore)
Anyway this is my opinion, please wait for x86 and efi reviewer's inputs.
>
> Thanks, Ashish
>
Moving Ard and Dan to To:
On Wed, Jun 05, 2024 at 10:28:18AM +0800, Dave Young wrote:
> Ok, thanks! I think the right way is creating two patches, one to
> remove the __efi_memmap_free,
Yap, that
f0ef6523475f ("efi: Fix efi_memmap_alloc() leaks")
needs revisiting.
So AFAIU, the flow is this:
In a kexec-ed kernel:
1. efi_arch_mem_reserve() gets called by bgrt, erst, mokvar... whatever
to hold on to boot services regions for longer otherwise EFI
"implementations" explode.
2. On same kexec-ed kernel, we call into kexec_enter_virtual_mode()
because it needs to get the runtime services regions from the first
kernel
3. As part of that call, it'll do
efi_memmap_init_late->__efi_memmap_init():
if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB))
__efi_memmap_free(efi.memmap.phys_map,
and the memory which got allocated in step 1 is gone, thus reverting
what efi_arch_mem_reserve() is trying to fix.
IOW, we need a
EFI_MEMMAP_DO_NOT_TOUCH_MY_MEMORY
flag which'll stop this from happening. But I'd prefer it if Ard decides
what the right thing to do here is.
> another is skip efi_arch_mem_reserve when the EFI_MEMORY_RUNTIME bit
> was set already.
Can that even happen?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, 5 Jun 2024 at 19:09, Borislav Petkov <bp@alien8.de> wrote:
>
> Moving Ard and Dan to To:
>
> On Wed, Jun 05, 2024 at 10:28:18AM +0800, Dave Young wrote:
> > Ok, thanks! I think the right way is creating two patches, one to
> > remove the __efi_memmap_free,
>
> Yap, that
>
> f0ef6523475f ("efi: Fix efi_memmap_alloc() leaks")
>
> needs revisiting.
>
> So AFAIU, the flow is this:
>
> In a kexec-ed kernel:
>
> 1. efi_arch_mem_reserve() gets called by bgrt, erst, mokvar... whatever
> to hold on to boot services regions for longer otherwise EFI
> "implementations" explode.
>
> 2. On same kexec-ed kernel, we call into kexec_enter_virtual_mode()
> because it needs to get the runtime services regions from the first
> kernel
>
> 3. As part of that call, it'll do
> efi_memmap_init_late->__efi_memmap_init():
>
> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB))
> __efi_memmap_free(efi.memmap.phys_map,
>
> and the memory which got allocated in step 1 is gone, thus reverting
> what efi_arch_mem_reserve() is trying to fix.
>
> IOW, we need a
>
> EFI_MEMMAP_DO_NOT_TOUCH_MY_MEMORY
>
> flag which'll stop this from happening. But I'd prefer it if Ard decides
> what the right thing to do here is.
>
> > another is skip efi_arch_mem_reserve when the EFI_MEMORY_RUNTIME bit
> > was set already.
>
> Can that even happen?
Yes, let's say we have two different cases both go through
drivers/firmware/efi/efi-bgrt.c -> efi_mem_reserve ->
efi_arch_mem_reserve
1. normal boot (non kexec-ed)
The bgrt region is reserved and mark as EFI_MEMORY_RUNTIME with a
new efi mem range which is inserted in the memmap, later kexec will
carry over to 2nd kernel (drop those boot service areas without
EFI_MEMORY_RUNTIME)
2. kexec-ed boot
In the same call path, the previous kernel saved bgrt region has
already set EFI_MEMORY_RUNTIME, but it is re-reserved with a new mem
entry in memmap, this is not necessary and duplicate. I did not
check the efi boot code if it will de-duplicate the memmap later, but
anyway this is useless and it should be skipped.
Thanks
Dave
On 6/3/2024 12:12 PM, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
>> efi_arch_mem_reserve().
> Now it only remains for you to explain why...
Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:
Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:
...
[ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110 [ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0
...
Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc() which does memblock_phys_alloc() to insert new EFI memory descriptor into efi.memap:
...
[ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0 [ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80
...
Finally, at the end of boot, kexec_enter_virtual_mode() is called.
It does mapping of efi regions which were passed via setup_data.
So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:
( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).
[ 4.042160] efi: efi memmap phys map 0x27ffcaf80
So, kexec_enter_virtual_mode() does the following :
if (efi_memmap_init_late(efi.memmap.phys_map, <---- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve(). efi.memmap.desc_size * efi.memmap.nr_map)) { ...
This late init, does a memremap() on this memblock-allocated memory, but then immediately frees it :
drivers/firmware/efi/memmap.c:
*/ int __init __efi_memmap_init(struct efi_memory_map_data *data) {
..
phys_map = data->phys_map; <----------------------- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
if (data->flags & EFI_MEMMAP_LATE) map.map = memremap(phys_map, data->size, MEMREMAP_WB); ... ... if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) { __efi_memmap_free(efi.memmap.phys_map, efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags); }
map.phys_map = data->phys_map;
...
efi.memmap = map;
...
This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.
This is confirmed by memblock debugging:
[ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80
So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.
This corruption is seen just before kexec-ing into the new kernel:
...
[ 11.045522] PEFILE: Unsigned PE binary^M [ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80 ... [ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001^M [ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f^M [ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0^M [ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M <---------------- CORRUPTED!!! [ 11.061234] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061236] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061239] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061241] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0,
attr = 0x0^M [ 11.061243] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061245] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061248] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061250] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061252] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061255] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061257] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061259] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061262] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061264] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061266]
kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061268] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061271] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061273] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061275] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061278] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061280] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061282] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061284] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061287] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061289] kexec-bzImage64: mmap entry, type = 0,
va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061291] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061294] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061296] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061298] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061301] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061303] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061305] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061307] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061310] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M [ 11.061312] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr =
0x0^M [ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0^M [ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40
...
This EFI memmapphys map address 0x27ffcaf80 being mem-remapped and also getting freed and then in use after free condition (while setting up the EFI memory map for the next kernel with kexec -s) in the above logs confirm the use-after-free case.
Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
Thanks, Ashish
© 2016 - 2025 Red Hat, Inc.