This series is a follow-up to a series I sent a bit more than a year
ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
for further hardening measures, such as fg-kaslr [1], as well as further
harmonization of the boot protocols between architectures [2].

The main sticking point is the fact that PIE linking on x86_64 requires
PIE codegen, and that was shot down before on the basis that
a) GOTs in fully linked binaries are stupid
b) the code size increase would be prohibitive
c) the performance would suffer.

This series implements PIE codegen without permitting the use of GOT
slots. The code size increase is between 0.2% (clang) and 0.5% (gcc),
and I could not identify any performance regressions (using hackbench)
on various different micro-architectures that I tried it on.
(Suggestions for other benchmarks/test cases are welcome)

So now that we have some actual numbers, I would like to try and revisit
this discussion, and get a conclusion on whether this is really a
non-starter. Note that only the KASLR kernel would rely on this, and
disabling CONFIG_RANDOMIZE_BASE will revert to the current situation
(provided that patch #4 is applied)

Some minor asm tweaks are needed too (patches #9 - #17), but those all
seem uncontroversial to me.

The first 5 patches are general cleanup, and could be taken into
consideration independently of the discussion around PIC codegen.

[1] There have been a few attempts at landing fine grained KASLR for
x86, but the main problem is that it was tied to the x86 relocation
format, which deviates from how fully linked relocatable ELF binaries
are generally constructed (using PIE). Implementing fgkaslr in the ELF
domain would make it suitable for other architectures too, as well as
other use cases (bare metal or hosted) where no dynamic linking is
performed (firmware, hypervisors). In order to implement this properly,
i.e., with debugging support etc, it needs support from the tooling
side. (Fine grained KASLR in combination with execute-only code mappings
makes it extremely difficult for an attacker to subvert the control flow
in the kernel in a way that can be meaningfully exploited).

[2] EFI zboot is already used by various architectures that have no
decompressor stage at all (arm64, RISC-V, LoongArch), and this format
can be combined with an ELF payload too. EFI zboot accommodates non-EFI
boot chains by describing the size, offset, payload type and compression
type in its header, so that it can be extracted and booted by other
means.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: linux-hardening@vger.kernel.org

Ard Biesheuvel (19):
  x86/idt: Move idt_table to __ro_after_init section
  x86/sev: Don't emit BSS_DECRYPT section unless it is in use
  x86: Combine .data with .bss in kernel mapping
  x86: Make the 64-bit bzImage always physically relocatable
  x86/efistub: Simplify early remapping of kernel text
  alloc_tag: Use __ prefixed ELF section names
  tools/objtool: Treat indirect ftrace calls as direct calls
  x86: Use PIE codegen for the relocatable 64-bit kernel
  x86/pm-trace: Use RIP-relative accesses for .tracedata
  x86/kvm: Use RIP-relative addressing
  x86/rethook: Use RIP-relative reference for fake return address
  x86/sync_core: Use RIP-relative addressing
  x86/entry_64: Use RIP-relative addressing
  x86/hibernate: Prefer RIP-relative accesses
  x64/acpi: Use PIC-compatible references in wakeup_64.S
  x86/kexec: Use 64-bit wide absolute reference from relocated code
  x86/head64: Avoid absolute references in startup asm
  x86/boot: Implement support for RELA/RELR/REL runtime relocations
  x86/kernel: Switch to PIE linking for the relocatable kernel

 arch/x86/Kconfig                        |  45 ++++---
 arch/x86/Makefile                       |  24 +++-
 arch/x86/boot/Makefile                  |   1 +
 arch/x86/boot/compressed/Makefile       |   2 +-
 arch/x86/boot/compressed/head_64.S      |   4 -
 arch/x86/boot/compressed/misc.c         |  85 +++++++++++--
 arch/x86/boot/header.S                  |   8 +-
 arch/x86/entry/calling.h                |   9 +-
 arch/x86/entry/entry_64.S               |  14 +--
 arch/x86/entry/vdso/Makefile            |   1 +
 arch/x86/include/asm/boot.h             |   2 -
 arch/x86/include/asm/pm-trace.h         |   4 +-
 arch/x86/include/asm/sync_core.h        |   3 +-
 arch/x86/kernel/acpi/wakeup_64.S        |  11 +-
 arch/x86/kernel/head_64.S               |  17 +--
 arch/x86/kernel/idt.c                   |   5 +-
 arch/x86/kernel/kvm.c                   |   5 +-
 arch/x86/kernel/relocate_kernel_64.S    |   2 +-
 arch/x86/kernel/rethook.c               |   6 +-
 arch/x86/kernel/vmlinux.lds.S           | 132 ++++++++++++--------
 arch/x86/mm/init_64.c                   |   5 +-
 arch/x86/mm/pat/set_memory.c            |   2 +-
 arch/x86/power/hibernate_asm_64.S       |   4 +-
 arch/x86/realmode/rm/Makefile           |   1 +
 drivers/base/power/trace.c              |   6 +-
 drivers/firmware/efi/libstub/x86-stub.c |   4 +-
 include/asm-generic/codetag.lds.h       |  14 ++-
 include/asm-generic/vmlinux.lds.h       |   1 +
 include/linux/alloc_tag.h               |  11 +-
 include/linux/hidden.h                  |   2 +
 include/uapi/linux/elf.h                |   3 +
 lib/alloc_tag.c                         |   6 +-
 tools/objtool/check.c                   |  32 ++++-
 33 files changed, 315 insertions(+), 156 deletions(-)

base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
--
2.47.3
From: Ard Biesheuvel <ardb@kernel.org> Date: Thu, 8 Jan 2026 09:25:27 +0000 > This series is a follow-up to a series I sent a bit more than a year > ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite > for further hardening measures, such as fg-kaslr [1], as well as further > harmonization of the boot protocols between architectures [2]. > > The main sticking point is the fact that PIE linking on x86_64 requires > PIE codegen, and that was shot down before on the basis that > a) GOTs in fully linked binaries are stupid > b) the code size increase would be prohibitive > c) the performance would suffer. > > This series implements PIE codegen without permitting the use of GOT > slots. The code size increase is between 0.2% (clang) and 0.5% (gcc), > and I could not identify any performance regressions (using hackbench) > on various different micro-architectures that I tried it on. > (Suggestions for other benchmarks/test cases are welcome) > > So now that we have some actual numbers, I would like to try and revisit > this discussion, and get a conclusion on whether this is really a > non-starter. Note that only the KASLR kernel would rely on this, and > disabling CONFIG_RANDOMIZE_BASE will revert to the current situation > (provided that patch #4 is applied) > > Some minor asm tweaks are needed too (patches #9 - #17), but those all > seem uncontroversial to me. > > The first 5 patches are general cleanup, and could be taken into > consideration independently of the discussion around PIC codegen. > > [1] There have been a few attempts at landing fine grained KASLR for > x86, but the main problem is that it was tied to the x86 relocation > format, which deviates from how fully linked relocatable ELF binaries > are generally constructed (using PIE). 
Implementing fgkaslr in the ELF
> domain would make it suitable for other architectures too, as well as
> other use cases (bare metal or hosted) where no dynamic linking is
> performed (firmware, hypervisors). In order to implement this properly,
> i.e., with debugging support etc, it needs support from the tooling
> side. (Fine grained KASLR in combination with execute-only code mappings
> makes it extremely difficult for an attacker to subvert the control flow
> in the kernel in a way that can be meaningfully exploited).

In case anybody is interested...

The latest (to my knowledge) experiment with FG-KASLR was my side
project reviving Kristen's old series (and then rewriting it
completely): [0]

I haven't worked on it since then, as I work in an XDP/netmem/whatever
team, i.e. networking, not x86, and free time for side projects has
shrunk severely since 2022. Maybe someone will pick it up again some
day, just like I picked up Kristen's series back then...

[0] https://github.com/alobakin/linux/commits/fgkaslr

Thanks,
Olek
On 2026-01-08 01:25, Ard Biesheuvel wrote: > This series is a follow-up to a series I sent a bit more than a year > ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite > for further hardening measures, such as fg-kaslr [1], as well as further > harmonization of the boot protocols between architectures [2]. Kristin Accardi had fg-kasrl running without that, didn't she? From your footnotes, it looks like what you are *really* asking for is to pessimize x86 code to benefit other architectures. That isn't inherently wrong, but stating it as you have above is dishonest. > The main sticking point is the fact that PIE linking on x86_64 requires > PIE codegen, and that was shot down before on the basis that > a) GOTs in fully linked binaries are stupid > b) the code size increase would be prohibitive > c) the performance would suffer. > > This series implements PIE codegen without permitting the use of GOT > slots. The code size increase is between 0.2% (clang) and 0.5% (gcc), > and I could not identify any performance regressions (using hackbench) > on various different micro-architectures that I tried it on. > (Suggestions for other benchmarks/test cases are welcome) Could you show some examples of how the code changes? -hpa > > [1] There have been a few attempts at landing fine grained KASLR for > x86, but the main problem is that it was tied to the x86 relocation > format, which deviates from how fully linked relocatable ELF binaries > are generally constructed (using PIE). Implementing fgkaslr in the ELF > domain would make it suitable for other architectures too, as well as > other use cases (bare metal or hosted) where no dynamic linking is > performed (firmware, hypervisors). In order to implement this properly, > i.e., with debugging support etc, it needs support from the tooling > side. 
(Fine grained KASLR in combination with execute-only code mappings > makes it extremely difficult for an attacker to subvert the control flow > in the kernel in a way that can be meaningfully exploited). > > [2] EFI zboot is already used by various architectures that have no > decompressor stage at all (arm64, RISC-V, LoongArch), and this format > can be combined with an ELF payload too. EFI zboot accommodates non-EFI > boot chains by describing the size, offset, payload type and compression > type in its header, so that it can be extracted and booted by other > means. The bzImage format already have that for all practical purposes. We *really* don't want to introduce a new binary format for the x86 kernel. A bunch of such attempts have been done in the past, and it is nothing but a mess that breaks things, because now you are encouraging different bootloaders to support a non-overlapping set of binary formats. STRONG NAK on that one. -hpa
On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote: > > On 2026-01-08 01:25, Ard Biesheuvel wrote: > > This series is a follow-up to a series I sent a bit more than a year > > ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite > > for further hardening measures, such as fg-kaslr [1], as well as further > > harmonization of the boot protocols between architectures [2]. > > Kristin Accardi had fg-kasrl running without that, didn't she? > Yes, as a proof of concept. But it is tied to the x86 approach of performing runtime relocations based on build time relocation data, which is problematic now that linkers have started to perform relaxations, as these cannot always be translated 1:1. For instance, we already have a latent bug in the x86 relocs tool, which ignores GOTPCREL relocations on the basis that the relocation is relative. However, this is only true for Clang/lld, which does not update the static relocation tables after performing relaxations. ld.bfd does attempt to keep those tables in sync, and so a GOTPCREL relocation should be flagged as a bug when encountered, because it means there is a GOT slot somewhere with no relocation associated with it. One could argue that this example is just a Clang bug, but it is very difficult to make that case with the toolchain developers, given that --emit-relocs (which is what tells the linker to emit the relocations that it received as input) has no specification, and some linker relaxations are not representable as static relocations to begin with (but to be fair, that currently mostly affects other architectures, but there is no reason this could never happen on x86) Doing fgkaslr properly (IMHO) means supporting things like live patch and debug seamlessly, and in a portable manner. Toolchain support is critical, and securing that for a one-off x86 implementation rather than one that can be used across architectures and other bare-metal projects is going to be difficult. 
> From your footnotes, it looks like what you are *really* asking for is to
> pessimize x86 code to benefit other architectures. That isn't inherently
> wrong, but stating it as you have above is dishonest.
>

I was hoping to save the ad-hominems for later in the thread, when
things *really* heat up.

The point is not to benefit other architectures. The point is to
implement something once, and deploy it on all architectures in the same
way. ELF is the greatest common denominator across the entire ecosystem,
and so using idiomatic ELF to describe how to load the image and how to
move it around in the virtual address space is an obvious choice.

> > The main sticking point is the fact that PIE linking on x86_64 requires
> > PIE codegen, and that was shot down before on the basis that
> > a) GOTs in fully linked binaries are stupid
> > b) the code size increase would be prohibitive
> > c) the performance would suffer.
> >
> > This series implements PIE codegen without permitting the use of GOT
> > slots. The code size increase is between 0.2% (clang) and 0.5% (gcc),
> > and I could not identify any performance regressions (using hackbench)
> > on various different micro-architectures that I tried it on.
> > (Suggestions for other benchmarks/test cases are welcome)
>
> Could you show some examples of how the code changes?
>

Taking the address of a symbol (same code size)

   0:   48 c7 c0 00 00 00 00    mov    $0x0,%rax
                        3: R_X86_64_32S sym
   7:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 0xe
                        a: R_X86_64_PC32

Loading a global variable from memory (one byte shorter in PIC)

   e:   48 8b 04 25 00 00 00    mov    0x0,%rax
  15:   00
                        12: R_X86_64_32S sym
  16:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 0x1d
                        19: R_X86_64_PC32 sym-0x4

Indexing a global array (3 bytes longer in PIC, needs an additional GPR
if source and destination are the same)

  1d:   48 8b 04 c5 00 00 00    mov    0x0(,%rax,8),%rax
  24:   00
                        21: R_X86_64_32S array
  25:   48 8d 15 00 00 00 00    lea    0x0(%rip),%rdx        # 0x2c
                        28: R_X86_64_PC32 array-0x4
  2c:   48 8b 04 c2             mov    (%rdx,%rax,8),%rax

Pushing the address of a symbol to the stack (3 bytes longer in PIC,
needs an additional GPR)

  30:   68 00 00 00 00          push   $0x0
                        31: R_X86_64_32S sym
  35:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 0x3c
                        38: R_X86_64_PC32 sym-0x4
  3c:   50                      push   %rax

Jump tables look completely different, but the table itself is only half
the size. Even for non-PIC, jump tables are problematic for objtool, and
so these need to be annotated by the compiler. I have some unfinished
Clang patches that implement this, which I hope to get back to soon.

The asm patches in the series should give a good impression of how the
code changes.

> >
> > [1] There have been a few attempts at landing fine grained KASLR for
> > x86, but the main problem is that it was tied to the x86 relocation
> > format, which deviates from how fully linked relocatable ELF binaries
> > are generally constructed (using PIE). Implementing fgkaslr in the ELF
> > domain would make it suitable for other architectures too, as well as
> > other use cases (bare metal or hosted) where no dynamic linking is
> > performed (firmware, hypervisors). In order to implement this properly,
> > i.e., with debugging support etc, it needs support from the tooling
> > side.
(Fine grained KASLR in combination with execute-only code mappings > > makes it extremely difficult for an attacker to subvert the control flow > > in the kernel in a way that can be meaningfully exploited). > > > > [2] EFI zboot is already used by various architectures that have no > > decompressor stage at all (arm64, RISC-V, LoongArch), and this format > > can be combined with an ELF payload too. EFI zboot accommodates non-EFI > > boot chains by describing the size, offset, payload type and compression > > type in its header, so that it can be extracted and booted by other > > means. > > The bzImage format already have that for all practical purposes. We *really* > don't want to introduce a new binary format for the x86 kernel. A bunch of > such attempts have been done in the past, and it is nothing but a mess that > breaks things, because now you are encouraging different bootloaders to > support a non-overlapping set of binary formats. > > STRONG NAK on that one. > I think it should be feasible to implement a hybrid bzImage/EFI zboot format. There is already prior art in loaders that decompress the ELF payload directly (Xen). Given that a x86_64 bootloader running in long mode needs to do very little beyond loading the ELF at some arbitrary 2M aligned offset and calling the entrypoint with a struct bootparams in %RDI, most of the logic in the decompressor is really only needed when booting in 32-bit mode. So I think there is value in having a generic boot format that can be consumed by EFI directly, or by a generic ELF vmlinux loader (library) that understands the EFI zboot format and knows how to extract the ELF payload. I'd strongly prefer only a single idiom for describing the relocations in the image. On other architectures (i.e., without decompressor), EFI zboot would be a prerequisite for fgkaslr, but it is up to the platform to decide whether to boot via EFI or load the ELF and apply the relocations. 
On x86_64, the same tooling would work seamlessly, but the decompressor could apply the relocations itself as well.
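
For readers who want to reproduce the addressing differences listed earlier in this message, a minimal C sketch follows. The globals `sym` and `array` are hypothetical stand-ins for the symbols in the listings, and the compiler invocations named in the comments are an assumption for experimentation, not necessarily the flags the series uses.

```c
#include <assert.h>
#include <stddef.h>

/* The listings scale the array index by 8, so assume a 64-bit long. */
static_assert(sizeof(long) == 8, "sketch assumes an LP64 target");

/* Hypothetical globals standing in for 'sym' and 'array'. */
long sym = 42;
long array[4] = { 10, 20, 30, 40 };

/*
 * Assumption: comparing the output of, say,
 *   gcc -O2 -S -mcmodel=kernel -fno-pie   (absolute, R_X86_64_32S)
 *   gcc -O2 -S -fpie                      (RIP-relative, R_X86_64_PC32)
 * should show code along the lines of the listings above.
 */

/* Taking the address of a symbol: absolute mov vs. RIP-relative lea. */
long *addr_of_sym(void)
{
	return &sym;
}

/* Loading a global variable: absolute mov vs. RIP-relative mov. */
long load_sym(void)
{
	return sym;
}

/*
 * Indexing a global array: a single mov with an absolute displacement
 * in the non-PIC case; lea of the base plus an indexed mov in PIC.
 */
long load_elem(size_t i)
{
	return array[i];
}
```

The behaviour of the three functions is identical under either code model; only the instruction encodings and relocation types differ.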
On Fri, Jan 09, 2026 at 10:21:49AM +0100, Ard Biesheuvel wrote: > On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote: > > > > On 2026-01-08 01:25, Ard Biesheuvel wrote: > > > This series is a follow-up to a series I sent a bit more than a year > > > ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite > > > for further hardening measures, such as fg-kaslr [1], as well as further > > > harmonization of the boot protocols between architectures [2]. > > > > Kristin Accardi had fg-kasrl running without that, didn't she? I understand "such as fg-kaslr" to have been just a terse way of saying "such as a complete multi-architectural fg-kaslr" > Yes, as a proof of concept. But it is tied to the x86 approach of > performing runtime relocations based on build time relocation data, > which is problematic now that linkers have started to perform > relaxations, as these cannot always be translated 1:1. For instance, > we already have a latent bug in the x86 relocs tool, which ignores > GOTPCREL relocations on the basis that the relocation is relative. > However, this is only true for Clang/lld, which does not update the > static relocation tables after performing relaxations. ld.bfd does > attempt to keep those tables in sync, and so a GOTPCREL relocation > should be flagged as a bug when encountered, because it means there is > a GOT slot somewhere with no relocation associated with it. Another historical bit of context is that one of the main reasons Kristen's fg-kaslr got stuck was the linker support needed for (the 65k worth of) section pass-through. That never got resolved, and the solutions either required huge linker files (that tickled performance flaws in the linkers) that resulted in 10 minute linking times, or to disable all the orphan section handling, which was a regression in our sanity checking and bug-finding. 
So, getting a well-behaved fg-kaslr still needs toolchain support, and getting there is going to need further design work. As far as PIE, this just makes the fg-kaslr toolchain work easier (fewer special cases), along with all the other benefits of moving to PIE. -Kees -- Kees Cook
On 2026-01-14 10:16, Kees Cook wrote: > On Fri, Jan 09, 2026 at 10:21:49AM +0100, Ard Biesheuvel wrote: >> On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote: >>> >>> On 2026-01-08 01:25, Ard Biesheuvel wrote: >>>> This series is a follow-up to a series I sent a bit more than a year >>>> ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite >>>> for further hardening measures, such as fg-kaslr [1], as well as further >>>> harmonization of the boot protocols between architectures [2]. >>> >>> Kristin Accardi had fg-kasrl running without that, didn't she? > > I understand "such as fg-kaslr" to have been just a terse way of saying > "such as a complete multi-architectural fg-kaslr" > >> Yes, as a proof of concept. But it is tied to the x86 approach of >> performing runtime relocations based on build time relocation data, >> which is problematic now that linkers have started to perform >> relaxations, as these cannot always be translated 1:1. For instance, >> we already have a latent bug in the x86 relocs tool, which ignores >> GOTPCREL relocations on the basis that the relocation is relative. >> However, this is only true for Clang/lld, which does not update the >> static relocation tables after performing relaxations. ld.bfd does >> attempt to keep those tables in sync, and so a GOTPCREL relocation >> should be flagged as a bug when encountered, because it means there is >> a GOT slot somewhere with no relocation associated with it. > > Another historical bit of context is that one of the main reasons > Kristen's fg-kaslr got stuck was the linker support needed for (the 65k > worth of) section pass-through. That never got resolved, and the solutions > either required huge linker files (that tickled performance flaws in the > linkers) that resulted in 10 minute linking times, or to disable all the > orphan section handling, which was a regression in our sanity checking > and bug-finding. 
> > So, getting a well-behaved fg-kaslr still needs toolchain support, > and getting there is going to need further design work. As far as PIE, > this just makes the fg-kaslr toolchain work easier (fewer special cases), > along with all the other benefits of moving to PIE. > As I *explicitly* stated earlier, there isn't anything inherently wrong with putting a small onus on x86 in order to make the general Linux code better -- but please, be honest about it *so we know what the actual tradeoffs are*. For x86, we really do want to maintain the kernel memory model, which allows us to directly reference symbols in complex address expressions and to directly jump across modules. This means the "PIE" will need to be different from the way PIE works in user space, which is in part designed to avoid needing to dirty readonly pages, which would inhibit sharing -- which is explicitly NOT a concern for the kernel. So that is one thing that the toolchain needs to be able to do. I fully expect that we will continue to need to have some kinds of overrides for specific symbols, too, because there aren't any really sane ways to express them to the toolchain; this especially applies to linker-script and some assembly symbols. For example, the real-mode code (which uses the reloc tool as well) has to support segment and segbase-relative relocations, which are something that ELF simply has no concept of. I have a lot more of an issue with trying to change the x86 boot protocol, simply because the way booting works in x86 has been incredibly successful; yes, the bzImage file format is ugly as ****, but that is a direct result of 34 years of continuous backwards compatibility. One of the reasons we have been able to do that is that we have *explicitly* rejected other boot models, such as Grub's self-declared Multiboot "standard" (which they have had to revise multiple times by now) and the early Xen boot model of booting vmlinux directly. 
We have added *many* capabilities to bzImage as needed, and it has
turned out to be quite flexible in the end.

That, in turn, has been possible exactly *because* the Linux kernel provides a
"prekernel". I don't even really like calling it the "decompressor" anymore;
it really has developed far beyond that.

Every time you introduce a new boot model you take a serious risk that a boot
loader author will say "hey, I'll just support this new model", and you *also*
take the serious risk that the boot model isn't adequate. When the Grub authors
say "we started using the 'modern' Linux entry point" -- meaning they would
bypass the entry stub and call the kernel entry point directly -- they
*reduced* the overall functionality, because history has shown that:

a. The kernel image is far easier for the end user to update than the
   potentially many bootloaders, because the bootloader depends on the
   deployment model (consider network booting, for example, or when the
   bootloader is in firmware.)

b. The kernel image provides ONE CENTRAL PLACE to add features, fix bugs, and
   develop workarounds for strange systems.

Thus, I will do anything I can to continue to veto changes to the x86 boot
model, unless they come with a VERY VERY good motivation.

We originally made the mistake with EFI to leave too much to the bootloader;
because of the downsides above we really needed to backpedal and take over
control much earlier in the flow, just as with BIOS -- apparently to the Grub
developers' utterly inexplicable chagrin.

-hpa
On Tue, 20 Jan 2026 at 21:46, H. Peter Anvin <hpa@zytor.com> wrote: > > On 2026-01-14 10:16, Kees Cook wrote: > > On Fri, Jan 09, 2026 at 10:21:49AM +0100, Ard Biesheuvel wrote: > >> On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote: > >>> > >>> On 2026-01-08 01:25, Ard Biesheuvel wrote: > >>>> This series is a follow-up to a series I sent a bit more than a year > >>>> ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite > >>>> for further hardening measures, such as fg-kaslr [1], as well as further > >>>> harmonization of the boot protocols between architectures [2]. > >>> > >>> Kristin Accardi had fg-kasrl running without that, didn't she? > > > > I understand "such as fg-kaslr" to have been just a terse way of saying > > "such as a complete multi-architectural fg-kaslr" > > > >> Yes, as a proof of concept. But it is tied to the x86 approach of > >> performing runtime relocations based on build time relocation data, > >> which is problematic now that linkers have started to perform > >> relaxations, as these cannot always be translated 1:1. For instance, > >> we already have a latent bug in the x86 relocs tool, which ignores > >> GOTPCREL relocations on the basis that the relocation is relative. > >> However, this is only true for Clang/lld, which does not update the > >> static relocation tables after performing relaxations. ld.bfd does > >> attempt to keep those tables in sync, and so a GOTPCREL relocation > >> should be flagged as a bug when encountered, because it means there is > >> a GOT slot somewhere with no relocation associated with it. > > > > Another historical bit of context is that one of the main reasons > > Kristen's fg-kaslr got stuck was the linker support needed for (the 65k > > worth of) section pass-through. 
That never got resolved, and the solutions > > either required huge linker files (that tickled performance flaws in the > > linkers) that resulted in 10 minute linking times, or to disable all the > > orphan section handling, which was a regression in our sanity checking > > and bug-finding. > > > > So, getting a well-behaved fg-kaslr still needs toolchain support, > > and getting there is going to need further design work. As far as PIE, > > this just makes the fg-kaslr toolchain work easier (fewer special cases), > > along with all the other benefits of moving to PIE. > > > > As I *explicitly* stated earlier, there isn't anything inherently wrong with > putting a small onus on x86 in order to make the general Linux code better -- > but please, be honest about it *so we know what the actual tradeoffs are*. > It is not just about the general Linux code. The x86 fgkaslr implementation was never merged because the toolchain side needs changes. And convincing the toolchain maintainers to take our changes is difficult if we keep using relocation tables that are not fit for purpose to perform runtime fixups on code that was built using the 'kernel' code model, which is explicitly position dependent. > For x86, we really do want to maintain the kernel memory model, which allows > us to directly reference symbols in complex address expressions and to > directly jump across modules. AIUI, those complex address expressions are mostly indexed loads from global arrays, which do get slightly less efficient, but not in a way that was noticeable in any benchmarking I did (or LKP for that matter, which generally sniffs out any performance regressions). This includes jump tables, but as I already explained, RIP-relative jump tables have an upside too, given that the table itself is only half the size. The ability to directly jump across modules is not affected at all by these changes. 
> This means the "PIE" will need to be different > from the way PIE works in user space, which is in part designed to avoid > needing to dirty readonly pages, which would inhibit sharing -- which is > explicitly NOT a concern for the kernel. > This is already implemented in this series: no GOT entries are permitted, and text relocations are allowed. > So that is one thing that the toolchain needs to be able to do. > It already can, and this series makes use of it. Note that the size of the relocation table taken from an allmodconfig bzImage drops from 7.3 M to 2.4 M (defconfig goes from 800k to 45k), so there is a minor intrinsic benefit to these changes as well. But it is mostly about moving away from bespoke tooling and formats that are becoming more of a maintenance burden as the number of supported toolchains and languages increases. > I fully expect that we will continue to need to have some kinds of overrides > for specific symbols, too, because there aren't any really sane ways to > express them to the toolchain; this especially applies to linker-script and > some assembly symbols. For example, the real-mode code (which uses the reloc > tool as well) has to support segment and segbase-relative relocations, which > are something that ELF simply has no concept of. > The real mode trampoline is not affected at all by these changes, given that it is built as a separate executable. Using a bespoke relocation format there is fine, because it is internal ABI. > I have a lot more of an issue with trying to change the x86 boot protocol, > simply because the way booting works in x86 has been incredibly successful; > yes, the bzImage file format is ugly as ****, but that is a direct result of > 34 years of continuous backwards compatibility. 
One of the reasons we have > been able to do that is that we have *explicitly* rejected other boot models, > such as Grub's self-declared Multiboot "standard" (which they have had to > revise multiple times by now) and the early Xen boot model of booting vmlinux > directly. We have added *many* capabilities to bzImage as needed, and it has > turned out to be quite flexible in the end. > > That, in turn, has been possible exactly *because* the Linux kernel provides a > "prekernel". I don't even really like calling it the "decompressor" anymore; > it really has developed far beyond that. > The decompressor is needed when booting the 64-bit kernel from a boot loader that calls it in 32-bit mode. When entering in long mode, with all memory mapped 1:1 (or at least, the kernel image itself, and all assets in memory that the bootloader exposes to the kernel), the decompressor does nothing useful, and all the problems it solves (by doing demand paging etc) only exist because it created them in the first place. SEV-SNP confidential compute made an even bigger mess of this, because it can trigger #VC exceptions too, which also need to be handled. Note that the EFI stub does not bother with the decompressor anymore, and unpacks and boots vmlinux directly. This was needed because the decompressor fundamentally relies on memory that is both writable and executable (as it moves its own executable image around in memory), which is difficult to reconcile with recent PC firmware implementations that are pedantic about mapping memory RWX. But actually, I am not proposing to get rid of bzImage. I am proposing to make it more transparent so generic bootloader components can be constructed that consume the ELF directly.
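
To illustrate what "consuming the ELF directly" could involve on the relocation side, here is a minimal sketch in C. It assumes the PIE image carries standard ELF RELA entries of type R_X86_64_RELATIVE; the struct and function names are hypothetical, and this is not the code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as Elf64_Rela in <elf.h>; redefined to stay freestanding. */
struct rela64 {
	uint64_t r_offset;	/* link-time address of the 64-bit slot to patch */
	uint64_t r_info;	/* relocation type lives in the low 32 bits */
	int64_t  r_addend;	/* link-time address the slot should point to */
};

#define R_X86_64_RELATIVE 8u

/*
 * Hypothetical helper: 'image' is the in-memory copy of the kernel,
 * 'link_base' the address it was linked at, and 'load_base' the
 * (randomized) address it will run at. Every R_X86_64_RELATIVE slot
 * receives r_addend plus the displacement between the two bases.
 * Returns the number of relocations applied.
 */
size_t apply_relative_relocs(uint8_t *image, size_t image_size,
			     uint64_t link_base, uint64_t load_base,
			     const struct rela64 *rela, size_t count)
{
	uint64_t delta = load_base - link_base;
	size_t applied = 0;

	assert(image_size >= sizeof(uint64_t));

	for (size_t i = 0; i < count; i++) {
		/* Skip any other relocation type in this sketch. */
		if ((uint32_t)rela[i].r_info != R_X86_64_RELATIVE)
			continue;

		uint64_t off = rela[i].r_offset - link_base;
		uint64_t val = (uint64_t)rela[i].r_addend + delta;

		/* Refuse to write outside the image. */
		assert(off <= image_size - sizeof(uint64_t));

		/* memcpy avoids alignment and aliasing assumptions. */
		memcpy(image + off, &val, sizeof(val));
		applied++;
	}
	return applied;
}
```

A loader following this pattern needs no knowledge of the instructions in the image: with RIP-relative code, only absolute 64-bit data slots require fixing up.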