Link the relocatable x86 kernel as PIE

[RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by Ard Biesheuvel 1 month ago

This series is a follow-up to a series I sent a bit more than a year
ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
for further hardening measures, such as fg-kaslr [1], as well as further
harmonization of the boot protocols between architectures [2].

The main sticking point is the fact that PIE linking on x86_64 requires
PIE codegen, and that was shot down before on the basis that
a) GOTs in fully linked binaries are stupid
b) the code size increase would be prohibitive
c) the performance would suffer.

This series implements PIE codegen without permitting the use of GOT
slots. The code size increase is between 0.2% (clang) and 0.5% (gcc),
and I could not identify any performance regressions (using hackbench)
on various different micro-architectures that I tried it on.
(Suggestions for other benchmarks/test cases are welcome)

So now that we have some actual numbers, I would like to try and revisit
this discussion, and get a conclusion on whether this is really a
non-starter. Note that only the KASLR kernel would rely on this, and
disabling CONFIG_RANDOMIZE_BASE will revert to the current situation
(provided that patch #4 is applied)

Some minor asm tweaks are needed too (patches #9 - #17), but those all
seem uncontroversial to me. 

The first 5 patches are general cleanup, and could be taken into
consideration independently of the discussion around PIC codegen.

[1] There have been a few attempts at landing fine grained KASLR for
x86, but the main problem is that it was tied to the x86 relocation
format, which deviates from how fully linked relocatable ELF binaries
are generally constructed (using PIE). Implementing fgkaslr in the ELF
domain would make it suitable for other architectures too, as well as
other use cases (bare metal or hosted) where no dynamic linking is
performed (firmware, hypervisors). In order to implement this properly,
i.e., with debugging support etc, it needs support from the tooling
side. (Fine grained KASLR in combination with execute-only code mappings
makes it extremely difficult for an attacker to subvert the control flow
in the kernel in a way that can be meaningfully exploited).

[2] EFI zboot is already used by various architectures that have no
decompressor stage at all (arm64, RISC-V, LoongArch), and this format
can be combined with an ELF payload too. EFI zboot accommodates non-EFI
boot chains by describing the size, offset, payload type and compression
type in its header, so that it can be extracted and booted by other
means.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: linux-hardening@vger.kernel.org

Ard Biesheuvel (19):
  x86/idt: Move idt_table to __ro_after_init section
  x86/sev: Don't emit BSS_DECRYPT section unless it is in use
  x86: Combine .data with .bss in kernel mapping
  x86: Make the 64-bit bzImage always physically relocatable
  x86/efistub: Simplify early remapping of kernel text
  alloc_tag: Use __ prefixed ELF section names
  tools/objtool: Treat indirect ftrace calls as direct calls
  x86: Use PIE codegen for the relocatable 64-bit kernel
  x86/pm-trace: Use RIP-relative accesses for .tracedata
  x86/kvm: Use RIP-relative addressing
  x86/rethook: Use RIP-relative reference for fake return address
  x86/sync_core: Use RIP-relative addressing
  x86/entry_64: Use RIP-relative addressing
  x86/hibernate: Prefer RIP-relative accesses
  x64/acpi: Use PIC-compatible references in wakeup_64.S
  x86/kexec: Use 64-bit wide absolute reference from relocated code
  x86/head64: Avoid absolute references in startup asm
  x86/boot: Implement support for RELA/RELR/REL runtime relocations
  x86/kernel: Switch to PIE linking for the relocatable kernel

 arch/x86/Kconfig                        |  45 ++++---
 arch/x86/Makefile                       |  24 +++-
 arch/x86/boot/Makefile                  |   1 +
 arch/x86/boot/compressed/Makefile       |   2 +-
 arch/x86/boot/compressed/head_64.S      |   4 -
 arch/x86/boot/compressed/misc.c         |  85 +++++++++++--
 arch/x86/boot/header.S                  |   8 +-
 arch/x86/entry/calling.h                |   9 +-
 arch/x86/entry/entry_64.S               |  14 +--
 arch/x86/entry/vdso/Makefile            |   1 +
 arch/x86/include/asm/boot.h             |   2 -
 arch/x86/include/asm/pm-trace.h         |   4 +-
 arch/x86/include/asm/sync_core.h        |   3 +-
 arch/x86/kernel/acpi/wakeup_64.S        |  11 +-
 arch/x86/kernel/head_64.S               |  17 +--
 arch/x86/kernel/idt.c                   |   5 +-
 arch/x86/kernel/kvm.c                   |   5 +-
 arch/x86/kernel/relocate_kernel_64.S    |   2 +-
 arch/x86/kernel/rethook.c               |   6 +-
 arch/x86/kernel/vmlinux.lds.S           | 132 ++++++++++++--------
 arch/x86/mm/init_64.c                   |   5 +-
 arch/x86/mm/pat/set_memory.c            |   2 +-
 arch/x86/power/hibernate_asm_64.S       |   4 +-
 arch/x86/realmode/rm/Makefile           |   1 +
 drivers/base/power/trace.c              |   6 +-
 drivers/firmware/efi/libstub/x86-stub.c |   4 +-
 include/asm-generic/codetag.lds.h       |  14 ++-
 include/asm-generic/vmlinux.lds.h       |   1 +
 include/linux/alloc_tag.h               |  11 +-
 include/linux/hidden.h                  |   2 +
 include/uapi/linux/elf.h                |   3 +
 lib/alloc_tag.c                         |   6 +-
 tools/objtool/check.c                   |  32 ++++-
 33 files changed, 315 insertions(+), 156 deletions(-)


base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
-- 
2.47.3

Re: [RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by Alexander Lobakin 1 month ago

From: Ard Biesheuvel <ardb@kernel.org>
Date: Thu,  8 Jan 2026 09:25:27 +0000

> This series is a follow-up to a series I sent a bit more than a year
> ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
> for further hardening measures, such as fg-kaslr [1], as well as further
> harmonization of the boot protocols between architectures [2].
> 
> The main sticking point is the fact that PIE linking on x86_64 requires
> PIE codegen, and that was shot down before on the basis that
> a) GOTs in fully linked binaries are stupid
> b) the code size increase would be prohibitive
> c) the performance would suffer.
> 
> This series implements PIE codegen without permitting the use of GOT
> slots. The code size increase is between 0.2% (clang) and 0.5% (gcc),
> and I could not identify any performance regressions (using hackbench)
> on various different micro-architectures that I tried it on.
> (Suggestions for other benchmarks/test cases are welcome)
> 
> So now that we have some actual numbers, I would like to try and revisit
> this discussion, and get a conclusion on whether this is really a
> non-starter. Note that only the KASLR kernel would rely on this, and
> disabling CONFIG_RANDOMIZE_BASE will revert to the current situation
> (provided that patch #4 is applied)
> 
> Some minor asm tweaks are needed too (patches #9 - #17), but those all
> seem uncontroversial to me. 
> 
> The first 5 patches are general cleanup, and could be taken into
> consideration independently of the discussion around PIC codegen.
> 
> [1] There have been a few attempts at landing fine grained KASLR for
> x86, but the main problem is that it was tied to the x86 relocation
> format, which deviates from how fully linked relocatable ELF binaries
> are generally constructed (using PIE). Implementing fgkaslr in the ELF
> domain would make it suitable for other architectures too, as well as
> other use cases (bare metal or hosted) where no dynamic linking is
> performed (firmware, hypervisors). In order to implement this properly,
> i.e., with debugging support etc, it needs support from the tooling
> side. (Fine grained KASLR in combination with execute-only code mappings
> makes it extremely difficult for an attacker to subvert the control flow
> in the kernel in a way that can be meaningfully exploited).

In case anybody is interested...
The latest (to my knowledge) experiment with FG-KASLR was my side
project reviving Kristen's old series (and then rewriting it
completely): [0]

I haven't worked on it since then, as I work in an
XDP/netmem/whatever team, i.e. networking, not x86, and free time for
side projects shrunk severely since 2022.

Maybe someone would pick it up again some day, just like I picked up
Kristen's series back then...

[0] https://github.com/alobakin/linux/commits/fgkaslr

Thanks,
Olek

Re: [RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by H. Peter Anvin 1 month ago

On 2026-01-08 01:25, Ard Biesheuvel wrote:
> This series is a follow-up to a series I sent a bit more than a year
> ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
> for further hardening measures, such as fg-kaslr [1], as well as further
> harmonization of the boot protocols between architectures [2].

Kristen Accardi had fg-kaslr running without that, didn't she?

From your footnotes, it looks like what you are *really* asking for is to
pessimize x86 code to benefit other architectures. That isn't inherently
wrong, but stating it as you have above is dishonest.

> The main sticking point is the fact that PIE linking on x86_64 requires
> PIE codegen, and that was shot down before on the basis that
> a) GOTs in fully linked binaries are stupid
> b) the code size increase would be prohibitive
> c) the performance would suffer.
> 
> This series implements PIE codegen without permitting the use of GOT
> slots. The code size increase is between 0.2% (clang) and 0.5% (gcc),
> and I could not identify any performance regressions (using hackbench)
> on various different micro-architectures that I tried it on.
> (Suggestions for other benchmarks/test cases are welcome)

Could you show some examples of how the code changes?

	-hpa
> 
> [1] There have been a few attempts at landing fine grained KASLR for
> x86, but the main problem is that it was tied to the x86 relocation
> format, which deviates from how fully linked relocatable ELF binaries
> are generally constructed (using PIE). Implementing fgkaslr in the ELF
> domain would make it suitable for other architectures too, as well as
> other use cases (bare metal or hosted) where no dynamic linking is
> performed (firmware, hypervisors). In order to implement this properly,
> i.e., with debugging support etc, it needs support from the tooling
> side. (Fine grained KASLR in combination with execute-only code mappings
> makes it extremely difficult for an attacker to subvert the control flow
> in the kernel in a way that can be meaningfully exploited).
> 
> [2] EFI zboot is already used by various architectures that have no
> decompressor stage at all (arm64, RISC-V, LoongArch), and this format
> can be combined with an ELF payload too. EFI zboot accommodates non-EFI
> boot chains by describing the size, offset, payload type and compression
> type in its header, so that it can be extracted and booted by other
> means.

The bzImage format already has that for all practical purposes. We *really*
don't want to introduce a new binary format for the x86 kernel. A bunch of
such attempts have been done in the past, and it is nothing but a mess that
breaks things, because now you are encouraging different bootloaders to
support a non-overlapping set of binary formats.

STRONG NAK on that one.

	-hpa

Re: [RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by Ard Biesheuvel 1 month ago

On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-01-08 01:25, Ard Biesheuvel wrote:
> > This series is a follow-up to a series I sent a bit more than a year
> > ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
> > for further hardening measures, such as fg-kaslr [1], as well as further
> > harmonization of the boot protocols between architectures [2].
>
> Kristen Accardi had fg-kaslr running without that, didn't she?
>

Yes, as a proof of concept. But it is tied to the x86 approach of
performing runtime relocations based on build time relocation data,
which is problematic now that linkers have started to perform
relaxations, as these cannot always be translated 1:1. For instance,
we already have a latent bug in the x86 relocs tool, which ignores
GOTPCREL relocations on the basis that the relocation is relative.
However, this is only true for Clang/lld, which does not update the
static relocation tables after performing relaxations. ld.bfd does
attempt to keep those tables in sync, and so a GOTPCREL relocation
should be flagged as a bug when encountered, because it means there is
a GOT slot somewhere with no relocation associated with it.

One could argue that this example is just a Clang bug, but it is very
difficult to make that case with the toolchain developers, given that
--emit-relocs (which is what tells the linker to emit the relocations
that it received as input) has no specification, and some linker
relaxations are not representable as static relocations to begin with.
(To be fair, that currently mostly affects other architectures, but
there is no reason this could never happen on x86.)

Doing fgkaslr properly (IMHO) means supporting things like live patch
and debug seamlessly, and in a portable manner. Toolchain support is
critical, and securing that for a one-off x86 implementation rather
than one that can be used across architectures and other bare-metal
projects is going to be difficult.

> From your footnotes, it looks like what you are *really* asking for is to
> pessimize x86 code to benefit other architectures. That isn't inherently
> wrong, but stating it as you have above is dishonest.
>

I was hoping to save the ad-hominems for later in the thread, when
things *really* heat up.

The point is not to benefit other architectures. The point is to
implement something once, and deploy it on all architectures in the
same way. ELF is the greatest common denominator across the entire
ecosystem, and so using idiomatic ELF to describe how to load the
image and how to move it around in the virtual address space is on
an obvious choice.

> > The main sticking point is the fact that PIE linking on x86_64 requires
> > PIE codegen, and that was shot down before on the basis that
> > a) GOTs in fully linked binaries are stupid
> > b) the code size increase would be prohibitive
> > c) the performance would suffer.
> >
> > This series implements PIE codegen without permitting the use of GOT
> > slots. The code size increase is between 0.2% (clang) and 0.5% (gcc),
> > and I could not identify any performance regressions (using hackbench)
> > on various different micro-architectures that I tried it on.
> > (Suggestions for other benchmarks/test cases are welcome)
>
> Could you show some examples of how the code changes?
>

Taking the address of a symbol (same code size)

   0: 48 c7 c0 00 00 00 00  mov    $0x0,%rax
            3: R_X86_64_32S sym

   7: 48 8d 05 00 00 00 00  lea    0x0(%rip),%rax        # 0xe
            a: R_X86_64_PC32

Loading a global variable from memory (one byte shorter in PIC)

   e: 48 8b 04 25 00 00 00  mov    0x0,%rax
  15: 00
            12: R_X86_64_32S sym

  16: 48 8b 05 00 00 00 00  mov    0x0(%rip),%rax        # 0x1d
            19: R_X86_64_PC32 sym-0x4

Indexing a global array (3 bytes longer in PIC, needs an additional
GPR if source and destination are the same)

  1d: 48 8b 04 c5 00 00 00  mov    0x0(,%rax,8),%rax
  24: 00
            21: R_X86_64_32S array

  25: 48 8d 15 00 00 00 00  lea    0x0(%rip),%rdx        # 0x2c
            28: R_X86_64_PC32 array-0x4
  2c: 48 8b 04 c2           mov    (%rdx,%rax,8),%rax

Pushing the address of a symbol to the stack (3 bytes longer in PIC,
needs an additional GPR)

  30: 68 00 00 00 00        push   $0x0
            31: R_X86_64_32S sym

  35: 48 8d 05 00 00 00 00  lea    0x0(%rip),%rax        # 0x3c
            38: R_X86_64_PC32 sym-0x4
  3c: 50                    push   %rax

Jump tables look completely different, but the table itself is only
half the size. Even for non-PIC, jump tables are problematic for
objtool, and so these need to be annotated by the compiler. I have
some unfinished Clang patches that implement this, which I hope to get
back to soon.

The asm patches in the series should give a good impression of how the
code changes.
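
For anyone who wants to reproduce listings of this kind, here is a
minimal C sketch of source that exercises the first three patterns. The
names sym, array and the three helper functions are hypothetical
stand-ins, not code from the series. Building it once with something
like "-O2 -mcmodel=kernel -fno-pie" and once with "-O2 -fpie", then
comparing "objdump -dr" output, should show the absolute and the
RIP-relative forms respectively. Exact instruction selection depends on
the compiler and version, and the series may rely on additional options
(such as hidden visibility) that are not modelled here.

```c
#include <stdint.h>

/* Hypothetical globals standing in for "sym" and "array" in the listings. */
long sym = 42;
long array[4] = { 10, 20, 30, 40 };

/* Taking the address of a symbol: absolute mov vs. RIP-relative lea. */
long *addr_of_sym(void)
{
	return &sym;
}

/* Loading a global variable: absolute mov vs. RIP-relative mov. */
long load_sym(void)
{
	return sym;
}

/*
 * Indexing a global array: a single absolute indexed mov in the
 * position dependent case, lea of the array base plus an indexed mov
 * in the PIC case.
 */
long index_array(unsigned long i)
{
	return array[i];
}
```

Pushing a symbol address with a single instruction has no direct C
equivalent, so that pattern is left out of the sketch.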

> >
> > [1] There have been a few attempts at landing fine grained KASLR for
> > x86, but the main problem is that it was tied to the x86 relocation
> > format, which deviates from how fully linked relocatable ELF binaries
> > are generally constructed (using PIE). Implementing fgkaslr in the ELF
> > domain would make it suitable for other architectures too, as well as
> > other use cases (bare metal or hosted) where no dynamic linking is
> > performed (firmware, hypervisors). In order to implement this properly,
> > i.e., with debugging support etc, it needs support from the tooling
> > side. (Fine grained KASLR in combination with execute-only code mappings
> > makes it extremely difficult for an attacker to subvert the control flow
> > in the kernel in a way that can be meaningfully exploited).
> >
> > [2] EFI zboot is already used by various architectures that have no
> > decompressor stage at all (arm64, RISC-V, LoongArch), and this format
> > can be combined with an ELF payload too. EFI zboot accommodates non-EFI
> > boot chains by describing the size, offset, payload type and compression
> > type in its header, so that it can be extracted and booted by other
> > means.
>
> The bzImage format already has that for all practical purposes. We *really*
> don't want to introduce a new binary format for the x86 kernel. A bunch of
> such attempts have been done in the past, and it is nothing but a mess that
> breaks things, because now you are encouraging different bootloaders to
> support a non-overlapping set of binary formats.
>
> STRONG NAK on that one.
>

I think it should be feasible to implement a hybrid bzImage/EFI zboot
format. There is already prior art in loaders that decompress the ELF
payload directly (Xen).

Given that an x86_64 bootloader running in long mode needs to do very
little beyond loading the ELF at some arbitrary 2M aligned offset and
calling the entrypoint with a struct boot_params in %RSI, most of the
logic in the decompressor is really only needed when booting in 32-bit
mode.
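
As a rough illustration of the "arbitrary 2M aligned offset" part only,
a loader could pick its load address along these lines. This is a
sketch under assumptions: LOAD_ALIGN, align_up_2m() and
pick_load_base() are hypothetical names, and a real loader would also
have to consult the memory map and the ELF program headers.

```c
#include <stdint.h>

#define LOAD_ALIGN (UINT64_C(2) << 20)	/* 2 MiB, a power of two */

/* Round addr up to the next multiple of LOAD_ALIGN. */
uint64_t align_up_2m(uint64_t addr)
{
	return (addr + LOAD_ALIGN - 1) & ~(LOAD_ALIGN - 1);
}

/*
 * Return the lowest 2 MiB aligned address in [region_start, region_end)
 * with room for image_size bytes, or 0 if the image does not fit.
 */
uint64_t pick_load_base(uint64_t region_start, uint64_t region_end,
			uint64_t image_size)
{
	uint64_t base = align_up_2m(region_start);

	/* base < region_start catches wrap-around in align_up_2m(). */
	if (base < region_start || base > region_end ||
	    region_end - base < image_size)
		return 0;
	return base;
}
```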

So I think there is value in having a generic boot format that can be
consumed by EFI directly, or by a generic ELF vmlinux loader (library)
that understands the EFI zboot format and knows how to extract the ELF
payload. I'd strongly prefer only a single idiom for describing the
relocations in the image.

On other architectures (i.e., without decompressor), EFI zboot would
be a prerequisite for fgkaslr, but it is up to the platform to decide
whether to boot via EFI or load the ELF and apply the relocations. On
x86_64, the same tooling would work seamlessly, but the decompressor
could apply the relocations itself as well.
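
To make "load the ELF and apply the relocations" concrete, here is a
minimal sketch of processing R_X86_64_RELATIVE entries from a RELA
table, which is the kind of relocation a fully linked PIE image
typically carries. It is not the code from patch 18: struct rela64
mirrors the Elf64_Rela layout but is redeclared so the example is
self-contained, and apply_rela_relative() and its parameters are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define R_X86_64_RELATIVE 8	/* relocation type from the x86-64 psABI */

/* Same layout as Elf64_Rela; redeclared to keep the sketch standalone. */
struct rela64 {
	uint64_t r_offset;	/* link-time address of the 64-bit slot */
	uint64_t r_info;	/* symbol index (high 32), type (low 32) */
	int64_t  r_addend;	/* link-time value the slot refers to */
};

/*
 * Patch every R_X86_64_RELATIVE slot in image so that it holds
 * addend + (load_base - link_base). Other relocation types and
 * out-of-bounds offsets are skipped. Returns the number of slots
 * patched.
 */
size_t apply_rela_relative(uint8_t *image, size_t image_size,
			   uint64_t link_base, uint64_t load_base,
			   const struct rela64 *rela, size_t count)
{
	uint64_t delta = load_base - link_base;
	size_t patched = 0;

	for (size_t i = 0; i < count; i++) {
		uint64_t off, val;

		if ((uint32_t)rela[i].r_info != R_X86_64_RELATIVE)
			continue;

		off = rela[i].r_offset - link_base;
		if (image_size < sizeof(val) ||
		    off > image_size - sizeof(val))
			continue;

		val = (uint64_t)rela[i].r_addend + delta;
		memcpy(image + off, &val, sizeof(val));
		patched++;
	}
	return patched;
}
```

The same loop works whether it runs in a decompressor, a firmware
loader or a hosted tool, which is the portability argument made above.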

Re: [RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by Kees Cook 3 weeks, 4 days ago

On Fri, Jan 09, 2026 at 10:21:49AM +0100, Ard Biesheuvel wrote:
> On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > On 2026-01-08 01:25, Ard Biesheuvel wrote:
> > > This series is a follow-up to a series I sent a bit more than a year
> > > ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
> > > for further hardening measures, such as fg-kaslr [1], as well as further
> > > harmonization of the boot protocols between architectures [2].
> >
> > > Kristen Accardi had fg-kaslr running without that, didn't she?

I understand "such as fg-kaslr" to have been just a terse way of saying
"such as a complete multi-architectural fg-kaslr"

> Yes, as a proof of concept. But it is tied to the x86 approach of
> performing runtime relocations based on build time relocation data,
> which is problematic now that linkers have started to perform
> relaxations, as these cannot always be translated 1:1. For instance,
> we already have a latent bug in the x86 relocs tool, which ignores
> GOTPCREL relocations on the basis that the relocation is relative.
> However, this is only true for Clang/lld, which does not update the
> static relocation tables after performing relaxations. ld.bfd does
> attempt to keep those tables in sync, and so a GOTPCREL relocation
> should be flagged as a bug when encountered, because it means there is
> a GOT slot somewhere with no relocation associated with it.

Another historical bit of context is that one of the main reasons
Kristen's fg-kaslr got stuck was the linker support needed for (the 65k
worth of) section pass-through. That never got resolved, and the solutions
either required huge linker files (that tickled performance flaws in the
linkers) that resulted in 10 minute linking times, or to disable all the
orphan section handling, which was a regression in our sanity checking
and bug-finding.

So, getting a well-behaved fg-kaslr still needs toolchain support,
and getting there is going to need further design work. As far as PIE,
this just makes the fg-kaslr toolchain work easier (fewer special cases),
along with all the other benefits of moving to PIE.

-Kees

-- 
Kees Cook

Re: [RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by H. Peter Anvin 2 weeks, 5 days ago

On 2026-01-14 10:16, Kees Cook wrote:
> On Fri, Jan 09, 2026 at 10:21:49AM +0100, Ard Biesheuvel wrote:
>> On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> On 2026-01-08 01:25, Ard Biesheuvel wrote:
>>>> This series is a follow-up to a series I sent a bit more than a year
>>>> ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
>>>> for further hardening measures, such as fg-kaslr [1], as well as further
>>>> harmonization of the boot protocols between architectures [2].
>>>
>>> Kristen Accardi had fg-kaslr running without that, didn't she?
> 
> I understand "such as fg-kaslr" to have been just a terse way of saying
> "such as a complete multi-architectural fg-kaslr"
> 
>> Yes, as a proof of concept. But it is tied to the x86 approach of
>> performing runtime relocations based on build time relocation data,
>> which is problematic now that linkers have started to perform
>> relaxations, as these cannot always be translated 1:1. For instance,
>> we already have a latent bug in the x86 relocs tool, which ignores
>> GOTPCREL relocations on the basis that the relocation is relative.
>> However, this is only true for Clang/lld, which does not update the
>> static relocation tables after performing relaxations. ld.bfd does
>> attempt to keep those tables in sync, and so a GOTPCREL relocation
>> should be flagged as a bug when encountered, because it means there is
>> a GOT slot somewhere with no relocation associated with it.
> 
> Another historical bit of context is that one of the main reasons
> Kristen's fg-kaslr got stuck was the linker support needed for (the 65k
> worth of) section pass-through. That never got resolved, and the solutions
> either required huge linker files (that tickled performance flaws in the
> linkers) that resulted in 10 minute linking times, or to disable all the
> orphan section handling, which was a regression in our sanity checking
> and bug-finding.
> 
> So, getting a well-behaved fg-kaslr still needs toolchain support,
> and getting there is going to need further design work. As far as PIE,
> this just makes the fg-kaslr toolchain work easier (fewer special cases),
> along with all the other benefits of moving to PIE.
> 

As I *explicitly* stated earlier, there isn't anything inherently wrong with
putting a small onus on x86 in order to make the general Linux code better --
but please, be honest about it *so we know what the actual tradeoffs are*.

For x86, we really do want to maintain the kernel memory model, which allows
us to directly reference symbols in complex address expressions and to
directly jump across modules. This means the "PIE" will need to be different
from the way PIE works in user space, which is in part designed to avoid
needing to dirty readonly pages, which would inhibit sharing -- which is
explicitly NOT a concern for the kernel.

So that is one thing that the toolchain needs to be able to do.

I fully expect that we will continue to need to have some kinds of overrides
for specific symbols, too, because there aren't any really sane ways to
express them to the toolchain; this especially applies to linker-script and
some assembly symbols. For example, the real-mode code (which uses the reloc
tool as well) has to support segment and segbase-relative relocations, which
are something that ELF simply has no concept of.

I have a lot more of an issue with trying to change the x86 boot protocol,
simply because the way booting works in x86 has been incredibly successful;
yes, the bzImage file format is ugly as ****, but that is a direct result of
34 years of continuous backwards compatibility. One of the reasons we have
been able to do that is that we have *explicitly* rejected other boot models,
such as Grub's self-declared Multiboot "standard" (which they have had to
revise multiple times by now) and the early Xen boot model of booting vmlinux
directly. We have added *many* capabilities to bzImage as needed, and it has
turned out to be quite flexible in the end.

That, in turn, has been possible exactly *because* the Linux kernel provides a
"prekernel". I don't even really like calling it the "decompressor" anymore;
it really has developed far beyond that.

Every time you introduce a new boot model you take a serious risk that a boot
loader author will say "hey, I'll just support this new model", and you *also*
take the serious risk that the boot model isn't adequate. When the Grub
authors say "we started using the 'modern' Linux entry point" -- meaning they
would bypass the entry stub and call the kernel entry point directly -- they
*reduced* the overall functionality, because history has shown that:

a. The kernel image is far easier for the end user to update than the
potentially many bootloaders, because the bootloader depends on the deployment
model (consider network booting, for example, or when the bootloader is in
firmware.)

b. The kernel image provides ONE CENTRAL PLACE to add features, fix bugs, and
develop workarounds for strange systems.

Thus, I will do anything I can to continue to veto changes to the x86 boot
model, unless they come with a VERY VERY good motivation.

We originally made the mistake with EFI to leave too much to the bootloader;
because of the downsides above we really needed to backpedal and take over
control much earlier in the flow, just as with BIOS -- apparently to the Grub
developers' utterly inexplicable chagrin.

	-hpa

Re: [RFC/RFT PATCH 00/19] Link the relocatable x86 kernel as PIE

Posted by Ard Biesheuvel 2 weeks, 4 days ago

On Tue, 20 Jan 2026 at 21:46, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-01-14 10:16, Kees Cook wrote:
> > On Fri, Jan 09, 2026 at 10:21:49AM +0100, Ard Biesheuvel wrote:
> >> On Fri, 9 Jan 2026 at 01:37, H. Peter Anvin <hpa@zytor.com> wrote:
> >>>
> >>> On 2026-01-08 01:25, Ard Biesheuvel wrote:
> >>>> This series is a follow-up to a series I sent a bit more than a year
> >>>> ago, to switch to PIE linking of x86_64 vmlinux, which is a prerequisite
> >>>> for further hardening measures, such as fg-kaslr [1], as well as further
> >>>> harmonization of the boot protocols between architectures [2].
> >>>
> >>> Kristen Accardi had fg-kaslr running without that, didn't she?
> >
> > I understand "such as fg-kaslr" to have been just a terse way of saying
> > "such as a complete multi-architectural fg-kaslr"
> >
> >> Yes, as a proof of concept. But it is tied to the x86 approach of
> >> performing runtime relocations based on build time relocation data,
> >> which is problematic now that linkers have started to perform
> >> relaxations, as these cannot always be translated 1:1. For instance,
> >> we already have a latent bug in the x86 relocs tool, which ignores
> >> GOTPCREL relocations on the basis that the relocation is relative.
> >> However, this is only true for Clang/lld, which does not update the
> >> static relocation tables after performing relaxations. ld.bfd does
> >> attempt to keep those tables in sync, and so a GOTPCREL relocation
> >> should be flagged as a bug when encountered, because it means there is
> >> a GOT slot somewhere with no relocation associated with it.
> >
> > Another historical bit of context is that one of the main reasons
> > Kristen's fg-kaslr got stuck was the linker support needed for (the 65k
> > worth of) section pass-through. That never got resolved, and the solutions
> > either required huge linker files (that tickled performance flaws in the
> > linkers) that resulted in 10 minute linking times, or to disable all the
> > orphan section handling, which was a regression in our sanity checking
> > and bug-finding.
> >
> > So, getting a well-behaved fg-kaslr still needs toolchain support,
> > and getting there is going to need further design work. As far as PIE,
> > this just makes the fg-kaslr toolchain work easier (fewer special cases),
> > along with all the other benefits of moving to PIE.
> >
>
> As I *explicitly* stated earlier, there isn't anything inherently wrong with
> putting a small onus on x86 in order to make the general Linux code better --
> but please, be honest about it *so we know what the actual tradeoffs are*.
>

It is not just about the general Linux code. The x86 fgkaslr
implementation was never merged because the toolchain side needs
changes. And convincing the toolchain maintainers to take our changes
is difficult if we keep using relocation tables that are not fit for
purpose to perform runtime fixups on code that was built using the
'kernel' code model, which is explicitly position dependent.

> For x86, we really do want to maintain the kernel memory model, which allows
> us to directly reference symbols in complex address expressions and to
> directly jump across modules.

AIUI, those complex address expressions are mostly indexed loads from
global arrays, which do get slightly less efficient, but not in a way
that was noticeable in any benchmarking I did (or LKP for that matter,
which generally sniffs out any performance regressions). This includes
jump tables, but as I already explained, RIP-relative jump tables have
an upside too, given that the table itself is only half the size.
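
The size argument can be illustrated in plain C, with the caveat that
this is only an analogy for what the compiler emits. An absolute table
stores one 64-bit pointer per entry, while a relative table stores
32-bit offsets from an anchor. The op_* functions, both tables and the
choice of op_add as the anchor are hypothetical; compiler-generated
tables are normally relative to the table itself and are built at link
time rather than at run time. The sketch assumes an LP64 target where
function pointers convert to intptr_t and the functions lie within 2 GiB
of each other.

```c
#include <stdint.h>

typedef int (*op_fn)(int, int);

int op_add(int a, int b) { return a + b; }
int op_sub(int a, int b) { return a - b; }
int op_mul(int a, int b) { return a * b; }

/* Absolute form: 8 bytes per entry, each one needs a runtime fixup. */
const op_fn abs_table[3] = { op_add, op_sub, op_mul };

/* Relative form: 4 bytes per entry, unaffected by where the image runs. */
int32_t rel_table[3];

/* Fill rel_table with offsets from the anchor function op_add. */
void rel_table_init(void)
{
	for (int i = 0; i < 3; i++)
		rel_table[i] = (int32_t)((intptr_t)abs_table[i] -
					 (intptr_t)op_add);
}

/* Dispatch by adding the stored offset back onto the anchor. */
int dispatch_rel(unsigned int idx, int a, int b)
{
	op_fn fn = (op_fn)((intptr_t)op_add + rel_table[idx]);

	return fn(a, b);
}
```

On a 64-bit build sizeof(abs_table) is twice sizeof(rel_table), which
is the "half the size" effect described above.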

The ability to directly jump across modules is not affected at all by
these changes.

> This means the "PIE" will need to be different
> from the way PIE works in user space, which is in part designed to avoid
> needing to dirty readonly pages, which would inhibit sharing -- which is
> explicitly NOT a concern for the kernel.
>

This is already implemented in this series: no GOT entries are
permitted, and text relocations are allowed.

> So that is one thing that the toolchain needs to be able to do.
>

It already can, and this series makes use of it.

Note that the size of the relocation table taken from an allmodconfig
bzImage drops from 7.3 M to 2.4 M (defconfig goes from 800k to 45k),
so there is a minor intrinsic benefit to these changes as well. But it
is mostly about moving away from bespoke tooling and formats that are
becoming more of a maintenance burden as the number of supported
toolchains and languages increases.
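
As background on the RELR format named in the title of patch 18, the
sketch below decodes a RELR table, which packs runs of relative
relocations into bitmaps. The thread does not say how much of the size
reduction quoted above is due to RELR as opposed to simply needing
fewer relocations, so no such claim is made here. apply_relr() and its
parameters are hypothetical, and the table is assumed to hold byte
offsets from the start of the image (i.e. a link base of zero).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * image: the loaded image viewed as an array of 64-bit words
 * relr:  RELR entries; an even entry is the offset of one word to
 *        relocate, an odd entry is a bitmap covering the next 63 words
 * delta: displacement between load address and link address
 * Returns the number of words patched.
 */
size_t apply_relr(uint64_t *image, size_t nwords,
		  const uint64_t *relr, size_t count, uint64_t delta)
{
	size_t where = 0;	/* index of the next word a bitmap covers */
	size_t patched = 0;

	for (size_t i = 0; i < count; i++) {
		uint64_t entry = relr[i];

		if ((entry & 1) == 0) {
			/* Address entry: relocate one word, then step past it. */
			where = entry / sizeof(uint64_t);
			if (where < nwords) {
				image[where] += delta;
				patched++;
			}
			where++;
		} else {
			/* Bitmap entry: bit n+1 selects word where + n. */
			uint64_t bitmap = entry >> 1;

			for (size_t n = 0; bitmap != 0; n++, bitmap >>= 1) {
				if ((bitmap & 1) && where + n < nwords) {
					image[where + n] += delta;
					patched++;
				}
			}
			where += 63;
		}
	}
	return patched;
}
```

One 8-byte bitmap entry can describe up to 63 relocated words, whereas
a RELA table spends 24 bytes per word.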

> I fully expect that we will continue to need to have some kinds of overrides
> for specific symbols, too, because there aren't any really sane ways to
> express them to the toolchain; this especially applies to linker-script and
> some assembly symbols. For example, the real-mode code (which uses the reloc
> tool as well) has to support segment and segbase-relative relocations, which
> are something that ELF simply has no concept of.
>

The real mode trampoline is not affected at all by these changes,
given that it is built as a separate executable. Using a bespoke
relocation format there is fine, because it is internal ABI.

> I have a lot more of an issue with trying to change the x86 boot protocol,
> simply because the way booting works in x86 has been incredibly successful;
> yes, the bzImage file format is ugly as ****, but that is a direct result of
> 34 years of continuous backwards compatibility. One of the reasons we have
> been able to do that is that we have *explicitly* rejected other boot models,
> such as Grub's self-declared Multiboot "standard" (which they have had to
> revise multiple times by now) and the early Xen boot model of booting vmlinux
> directly. We have added *many* capabilities to bzImage as needed, and it has
> turned out to be quite flexible in the end.
>
> That, in turn, has been possible exactly *because* the Linux kernel provides a
> "prekernel". I don't even really like calling it the "decompressor" anymore;
> it really has developed far beyond that.
>

The decompressor is needed when booting the 64-bit kernel from a boot
loader that calls it in 32-bit mode.

When entering in long mode, with all memory mapped 1:1 (or at least,
the kernel image itself, and all assets in memory that the bootloader
exposes to the kernel), the decompressor does nothing useful, and all
the problems it solves (by doing demand paging etc) only exist because
it created them in the first place.

SEV-SNP confidential compute made an even bigger mess of this, because
it can trigger #VC exceptions too, which also need to be handled.

Note that the EFI stub does not bother with the decompressor anymore,
and unpacks and boots vmlinux directly. This was needed because the
decompressor fundamentally relies on memory that is both writable and
executable (as it moves its own executable image around in memory),
which is difficult to reconcile with recent PC firmware
implementations that are pedantic about mapping memory RWX.

But actually, I am not proposing to get rid of bzImage. I am proposing
to make it more transparent so generic bootloader components can be
constructed that consume the ELF directly.