From: Ard Biesheuvel <ardb@kernel.org>

As an intermediate step towards enabling PIE linking for the 64-bit x86
kernel, enable PIE codegen for all objects that are linked into the
kernel proper.

This substantially reduces the number of relocations that need to be
processed when booting a relocatable KASLR kernel.

Before (size in bytes of the reloc table):

  797372 arch/x86/boot/compressed/vmlinux.relocs

After:

  400252 arch/x86/boot/compressed/vmlinux.relocs

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/Makefile                 | 11 ++++++++++-
 arch/x86/boot/Makefile            |  1 +
 arch/x86/boot/compressed/Makefile |  2 +-
 arch/x86/entry/vdso/Makefile      |  1 +
 arch/x86/realmode/rm/Makefile     |  1 +
 include/asm-generic/vmlinux.lds.h |  1 +
 6 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b78b7623a4a9..83d20f402535 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -193,13 +193,22 @@ else
KBUILD_RUSTFLAGS += -Cno-redzone=y
KBUILD_RUSTFLAGS += -Ccode-model=kernel
+ PIE_CFLAGS-y := -fpie -mcmodel=small \
+ -include $(srctree)/include/linux/hidden.h
+
+ PIE_CFLAGS-$(CONFIG_CC_IS_GCC) += $(call cc-option,-mdirect-extern-access)
+ PIE_CFLAGS-$(CONFIG_CC_IS_CLANG) += -fdirect-access-external-data
+
ifeq ($(CONFIG_STACKPROTECTOR),y)
KBUILD_CFLAGS += -mstack-protector-guard-symbol=fixed_percpu_data
+
+ # the 'small' C model defaults to %fs
+ PIE_CFLAGS-$(CONFIG_SMP) += -mstack-protector-guard-reg=gs
endif
# Don't emit relaxable GOTPCREL relocations
KBUILD_AFLAGS_KERNEL += -Wa,-mrelax-relocations=no
- KBUILD_CFLAGS_KERNEL += -Wa,-mrelax-relocations=no
+ KBUILD_CFLAGS_KERNEL += -Wa,-mrelax-relocations=no $(PIE_CFLAGS-y)
endif
#
diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index 9cc0ff6e9067..4d3ba35cb619 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -57,6 +57,7 @@ KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
KBUILD_CFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
KBUILD_CFLAGS += $(CONFIG_CC_IMPLICIT_FALLTHROUGH)
+KBUILD_CFLAGS_KERNEL :=
$(obj)/bzImage: asflags-y := $(SVGA_MODE)
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index f2051644de94..c362d36b5b69 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -73,7 +73,7 @@ LDFLAGS_vmlinux += -T
hostprogs := mkpiggy
HOST_EXTRACFLAGS += -I$(srctree)/tools/include
-sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__start_rodata\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
+sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABbCDdGRSTtVW] \(_text\|__start_rodata\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
quiet_cmd_voffset = VOFFSET $@
cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index c9216ac4fb1e..7af9fecf9abb 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -141,6 +141,7 @@ endif
endif
$(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
+$(obj)/vdso32.so.dbg: KBUILD_CFLAGS_KERNEL :=
$(obj)/vdso32.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32) FORCE
$(call if_changed,vdso_and_check)
diff --git a/arch/x86/realmode/rm/Makefile b/arch/x86/realmode/rm/Makefile
index a0fb39abc5c8..70bf0a26da91 100644
--- a/arch/x86/realmode/rm/Makefile
+++ b/arch/x86/realmode/rm/Makefile
@@ -67,3 +67,4 @@ KBUILD_CFLAGS := $(REALMODE_CFLAGS) -D_SETUP -D_WAKEUP \
-I$(srctree)/arch/x86/boot
KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
+KBUILD_CFLAGS_KERNEL :=
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 2b079f73820f..3a084ac77109 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -349,6 +349,7 @@
*(DATA_MAIN) \
*(.data..decrypted) \
*(.ref.data) \
+ *(.data.rel*) \
*(.data..shared_aligned) /* percpu related */ \
*(.data.unlikely) \
__start_once = .; \
--
2.46.0.792.g87dc391469-goog
On 9/25/24 08:01, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> As an intermediate step towards enabling PIE linking for the 64-bit x86
> kernel, enable PIE codegen for all objects that are linked into the
> kernel proper.
>
> This substantially reduces the number of relocations that need to be
> processed when booting a relocatable KASLR kernel.
>

This really seems like going completely backwards to me.

You are imposing a more restrictive code model on the kernel, optimizing
for boot time in a way that will exert a permanent cost on the running
kernel.

There is a *huge* difference between the kernel and user space here:

KERNEL MEMORY IS PERMANENTLY ALLOCATED, AND IS NEVER SHARED.

Dirtying user pages requires them to be unshared and dirty, which is
undesirable. Kernel pages are *always* unshared and dirty.

> It also brings us much closer to the ordinary PIE relocation model used
> for most of user space, which is therefore much better supported and
> less likely to create problems as we increase the range of compilers and
> linkers that need to be supported.

We have been resisting *for ages* making the kernel worse to accommodate
broken compilers. We don't "need" to support more compilers -- we need
the compilers to support us. We have working compilers; any new compiler
that wants to play should be expected to work correctly.

	-hpa
Hi Peter,

Thanks for taking a look.

On Tue, 1 Oct 2024 at 23:13, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 9/25/24 08:01, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > As an intermediate step towards enabling PIE linking for the 64-bit x86
> > kernel, enable PIE codegen for all objects that are linked into the
> > kernel proper.
> >
> > This substantially reduces the number of relocations that need to be
> > processed when booting a relocatable KASLR kernel.
> >
>
> This really seems like going completely backwards to me.
>
> You are imposing a more restrictive code model on the kernel, optimizing
> for boot time in a way that will exert a permanent cost on the running
> kernel.
>

Fair point about the boot time. This is not the only concern, though,
and arguably the least important one.

As I responded to Andi before, it is also about using a code model and
relocation model that matches the reality of how the code is executed:
- the early C code runs from the 1:1 mapping, and needs special hacks
  to accommodate this
- KASLR runs the kernel from a different virtual address than the one
  we told the linker about

> There is a *huge* difference between the kernel and user space here:
>
> KERNEL MEMORY IS PERMANENTLY ALLOCATED, AND IS NEVER SHARED.
>

No need to shout.

> Dirtying user pages requires them to be unshared and dirty, which is
> undesirable. Kernel pages are *always* unshared and dirty.
>

I guess you are referring to the use of a GOT? That is a valid
concern, but it does not apply here. With hidden visibility and
compiler command line options like -mdirect-extern-access, all emitted
symbol references are direct.
Disallowing text relocations could be trivially enabled with this
series if desired, and actually helps avoid the tricky bugs we keep
fixing in the early startup code that executes from the 1:1 mapping
(the C code in .head.text)

So it mostly comes down to minor differences in addressing modes,
e.g.,

  movq $sym, %reg

actually uses more bytes than

  leaq sym(%rip), %reg

whereas

  movq sym, %reg

and

  movq sym(%rip), %reg

are the same length.

OTOH, indexing a statically allocated global array like

  movl array(,%reg1,4), %reg2

will be converted into

  leaq array(%rip), %reg2
  movl (%reg2,%reg1,4), %reg2

and is therefore less efficient in terms of code footprint. But in
general, the x86_64 ISA and psABI are quite flexible in this regard,
and extrapolating from past experiences with PIC code on i386 is not
really justified here.

As Andi also pointed out, what ultimately matters is the performance,
as well as code size where it impacts performance, through the I-cache
footprint. I'll do some testing before reposting, and maybe not bother
if the impact is negative.

> > It also brings us much closer to the ordinary PIE relocation model used
> > for most of user space, which is therefore much better supported and
> > less likely to create problems as we increase the range of compilers and
> > linkers that need to be supported.
>
> We have been resisting *for ages* making the kernel worse to accommodate
> broken compilers. We don't "need" to support more compilers -- we need
> the compilers to support us. We have working compilers; any new compiler
> that wants to play should be expected to work correctly.
>

We are in a much better place now than we were before in that regard,
which is actually how this effort came about: instead of lying to the
compiler, and maintaining our own pile of scripts and relocation
tools, we can just do what other arches are doing in Linux, and let
the toolchain do it for us.
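[Editorial aside: the indexed-array case discussed above can be reproduced with a small C function. This is a hedged illustration only -- the exact instructions emitted depend on compiler version and flags, and the asm in the comments is typical output, not a guarantee.]

```c
#include <assert.h>

/* A statically allocated global array, as in the example above. */
static int array[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };

/* Under -mcmodel=kernel a compiler can typically index the array in a
 * single instruction using an absolute 32-bit displacement:
 *
 *     movl   array(,%rdi,4), %eax
 *
 * Under -fpie the address must first be formed RIP-relatively, so the
 * same access usually takes two instructions:
 *
 *     leaq   array(%rip), %rax
 *     movl   (%rax,%rdi,4), %eax
 *
 * (Illustrative codegen only; compare with `gcc -O2 -S` output under
 * both sets of flags to see what your compiler actually does.)
 */
int read_elem(int i)
{
	return array[i];
}
```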
On Wed, 2 Oct 2024 at 08:31, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> I guess you are referring to the use of a GOT? That is a valid
> concern, but it does not apply here. With hidden visibility and
> compiler command line options like -mdirect-extern-access, all emitted
> symbol references are direct.

I absolutely hate GOT entries. We definitely shouldn't ever do
anything that causes them on x86-64.

I'd much rather just do boot-time relocation, and I don't think the
"we run code at a different location than we told the linker" is an
argument against it.

Please, let's make sure we never have any of the global offset table horror.

Yes, yes, you can't avoid them on other architectures.

That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
feel are a complete no-brainer and should be done regardless of any
other code generation issues.

Let's not do relocation for no good reason.

               Linus
On Wed, 2 Oct 2024 at 22:02, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 2 Oct 2024 at 08:31, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > I guess you are referring to the use of a GOT? That is a valid
> > concern, but it does not apply here. With hidden visibility and
> > compiler command line options like -mdirect-extern-access, all emitted
> > symbol references are direct.
>
> I absolutely hate GOT entries. We definitely shouldn't ever do
> anything that causes them on x86-64.
>
> I'd much rather just do boot-time relocation, and I don't think the
> "we run code at a different location than we told the linker" is an
> argument against it.
>
> Please, let's make sure we never have any of the global offset table horror.
>
> Yes, yes, you can't avoid them on other architectures.
>

GCC/binutils never needs them. GCC 13 and older will emit some
horrible indirect fentry hook calls via the GOT, but those are NOPed
out by objtool anyway, so that is easily fixable.

Clang/LLD is slightly trickier, because Clang emits relaxable GOTPCREL
relocations, but LLD doesn't update the relocations emitted via
--emit-relocs, so the relocs tool gets confused. This is one of the
reasons I had for proposing to simply switch to PIE linking, and let
the linker deal with all of that. Note that Clang may emit these even
when not generating PIC/PIE code at all.

So this is the reason I wanted to add support for GOTPCREL relocations
in the relocs tool - it is really quite trivial to do, and makes our
jobs slightly easier when dealing with these compiler quirks.

The alternative would be to teach objtool how to relax

  movq foo@GOTPCREL(%rip)

into

  leaq foo(%rip)

- these are GOTPCREL relaxations described in the x86_64 psABI for
ELF, which is why compilers are assuming more and more that emitting
these is fine even without -fpic, given that the linker is supposed to
elide them if possible.
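[Editorial aside: the relaxation described above operates at the instruction-encoding level. `movq foo@GOTPCREL(%rip), %reg` uses opcode 0x8b; the psABI allows a linker (or, as proposed, objtool) to rewrite it to `leaq foo(%rip), %reg` (opcode 0x8d), retargeting the displacement from the GOT slot to the symbol itself. A minimal sketch of the opcode rewrite follows -- `relax_gotpcrel` is a hypothetical helper written for this note, not objtool code, and the displacement fixup is omitted.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the x86_64 psABI GOTPCREL relaxation: rewrite
 *
 *   48 8b /r   movq foo@GOTPCREL(%rip), %reg   (load address from GOT slot)
 * into
 *   48 8d /r   leaq foo(%rip), %reg            (compute address directly)
 *
 * The 4-byte displacement must separately be retargeted from the GOT
 * entry to the symbol; that part is not modelled here.
 * Returns 1 if the instruction was relaxed, 0 if it did not match.
 */
static int relax_gotpcrel(uint8_t *insn, size_t len)
{
	/* Expect: REX.W prefix, MOV r64,r/m64 opcode, and a ModRM byte
	 * selecting RIP-relative addressing (mod == 00, r/m == 101). */
	if (len < 7 || insn[0] != 0x48 || insn[1] != 0x8b ||
	    (insn[2] & 0xc7) != 0x05)
		return 0;

	insn[1] = 0x8d;	/* MOV -> LEA */
	return 1;
}
```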
Note that there are 1 or 2 cases in the asm code where it is actually
quite useful to refer to the address of foo as 'foo@GOTPCREL(%rip)' in
instructions that take memory operands, but those individual cases are
easily converted to something else if even having a GOT with just 2
entries is a dealbreaker for you.

> That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
> feel are a complete no-brainer and should be done regardless of any
> other code generation issues.
>

Yes, this is the primary reason I ended up looking into this in the
first place. Earlier this year, we ended up having to introduce
RIP_REL_REF() to emit those RIP-relative references explicitly, in
order to prevent the C code that is called via the early 1:1 mapping
from exploding. The amount of C code called in that manner has been
growing steadily over time with the introduction of 5-level paging and
SEV-SNP and TDX support, which need to play all kinds of tricks before
the normal kernel mappings are created.

Compiling with -fpie and linking with --pie -z text produces an
executable that is guaranteed to have only RIP-relative references in
the .text segment, removing the need for RIP_REL_REF entirely (it
already does nothing when __pic__ is #define'd).
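[Editorial aside: why absolute references "explode" under the 1:1 mapping can be modelled without any kernel code. A link-time-absolute pointer is only usable after adding the delta between the link address and the address the code actually runs at, whereas a reference computed from a runtime-known base needs no fixup. The toy model below uses made-up addresses and helper names purely for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the problem RIP_REL_REF() addresses.  The kernel is
 * linked to run at LINK_BASE, but early C code executes through the
 * identity (1:1) mapping at PHYS_BASE, before the page tables that
 * make LINK_BASE valid exist.  "Memory" is modelled as an array
 * indexed by physical address; all constants are invented. */

#define LINK_BASE 0x1000000UL	/* made-up link-time virtual base */
#define PHYS_BASE 0x200UL	/* made-up physical load address */
#define MEM_SIZE  0x400UL

static uint8_t memory[MEM_SIZE] = { [PHYS_BASE + 0x10] = 42 };

/* An absolute (link-time) address must be rebased by hand before the
 * normal mappings exist -- the fixup that RIP-relative addressing
 * makes unnecessary. */
static uint8_t *fixup_absolute(uintptr_t link_addr)
{
	return &memory[link_addr - LINK_BASE + PHYS_BASE];
}

/* A position-independent reference: an offset from a base discovered
 * at run time, valid wherever the image ends up. */
static uint8_t *pc_relative(uint8_t *runtime_base, uintptr_t offset)
{
	return runtime_base + offset;
}
```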
On 10/3/24 04:13, Ard Biesheuvel wrote:
>
>> That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
>> feel are a complete no-brainer and should be done regardless of any
>> other code generation issues.
>
> Yes, this is the primary reason I ended up looking into this in the
> first place. Earlier this year, we ended up having to introduce
> RIP_REL_REF() to emit those RIP-relative references explicitly, in
> order to prevent the C code that is called via the early 1:1 mapping
> from exploding. The amount of C code called in that manner has been
> growing steadily over time with the introduction of 5-level paging and
> SEV-SNP and TDX support, which need to play all kinds of tricks before
> the normal kernel mappings are created.
>

movq $sym to leaq sym(%rip), which you said ought to be smaller (and in
reality appears to be the same size, 7 bytes), seems like a no-brainer
and can be treated as a code quality issue -- in other words, file bug
reports against gcc and clang.

> Compiling with -fpie and linking with --pie -z text produces an
> executable that is guaranteed to have only RIP-relative references in
> the .text segment, removing the need for RIP_REL_REF entirely (it
> already does nothing when __pic__ is #define'd).

But -fpie has a considerable cost; specifically when we have indexed
references, as in that case the base pointer needs to be manifest in a
register, *and* it takes up a register slot in the EA, which may end up
converting one instruction into three.

Now, the "kernel" memory model is defined in the ABI document, but there
is nothing that prevents us from making updates to it if we need to;
e.g. the statement that movq $sym can be used is undesirable, of course.

	-hpa
On Fri, Oct 4, 2024 at 11:06 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/3/24 04:13, Ard Biesheuvel wrote:
> >
> >> That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
> >> feel are a complete no-brainer and should be done regardless of any
> >> other code generation issues.
> >
> > Yes, this is the primary reason I ended up looking into this in the
> > first place. Earlier this year, we ended up having to introduce
> > RIP_REL_REF() to emit those RIP-relative references explicitly, in
> > order to prevent the C code that is called via the early 1:1 mapping
> > from exploding. The amount of C code called in that manner has been
> > growing steadily over time with the introduction of 5-level paging and
> > SEV-SNP and TDX support, which need to play all kinds of tricks before
> > the normal kernel mappings are created.
> >
>
> movq $sym to leaq sym(%rip) which you said ought to be smaller (and in
> reality appears to be the same size, 7 bytes) seems like a no-brainer
> and can be treated as a code quality issue -- in other words, file bug
> reports against gcc and clang.

It is the kernel assembly source that should be converted to
rip-relative form; gcc (and probably clang) have nothing to do with it.

Uros.
On 10/5/24 01:31, Uros Bizjak wrote:
>>
>> movq $sym to leaq sym(%rip) which you said ought to be smaller (and in
>> reality appears to be the same size, 7 bytes) seems like a no-brainer
>> and can be treated as a code quality issue -- in other words, file bug
>> reports against gcc and clang.
>
> It is the kernel assembly source that should be converted to
> rip-relative form; gcc (and probably clang) have nothing to do with it.
>

Sadly, that is not correct; neither gcc nor clang uses lea:

	-hpa

gcc version 14.2.1 20240912 (Red Hat 14.2.1-3) (GCC)

hpa@tazenda:/tmp$ cat foo.c
int foobar;

int *where_is_foobar(void)
{
	return &foobar;
}

hpa@tazenda:/tmp$ gcc -mcmodel=kernel -O2 -c -o foo.o foo.c
hpa@tazenda:/tmp$ objdump -dr foo.o

foo.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <where_is_foobar>:
   0:	48 c7 c0 00 00 00 00 	mov    $0x0,%rax
			3: R_X86_64_32S	foobar
   7:	c3                   	ret

clang version 18.1.8 (Fedora 18.1.8-1.fc40)

hpa@tazenda:/tmp$ clang -mcmodel=kernel -O2 -c -o foo.o foo.c
hpa@tazenda:/tmp$ objdump -dr foo.o

foo.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <where_is_foobar>:
   0:	48 c7 c0 00 00 00 00 	mov    $0x0,%rax
			3: R_X86_64_32S	foobar
   7:	c3                   	ret
On Sun, Oct 6, 2024 at 1:37 AM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/5/24 01:31, Uros Bizjak wrote:
> >>
> >> movq $sym to leaq sym(%rip) which you said ought to be smaller (and in
> >> reality appears to be the same size, 7 bytes) seems like a no-brainer
> >> and can be treated as a code quality issue -- in other words, file bug
> >> reports against gcc and clang.
> >
> > It is the kernel assembly source that should be converted to
> > rip-relative form; gcc (and probably clang) have nothing to do with it.
> >
>
> Sadly, that is not correct; neither gcc nor clang uses lea:
>
> -hpa
>
>
> gcc version 14.2.1 20240912 (Red Hat 14.2.1-3) (GCC)
>
> hpa@tazenda:/tmp$ cat foo.c
> int foobar;
>
> int *where_is_foobar(void)
> {
>     return &foobar;
> }
>
> hpa@tazenda:/tmp$ gcc -mcmodel=kernel -O2 -c -o foo.o foo.c

Indeed, but my reply was in the context of -fpie, which guarantees
RIP-relative access. IOW, the compiler will always generate sym(%rip)
with -fpie, but (obviously) can't change assembly code in the kernel
when PIE is requested.

Otherwise, MOV $immediate, %reg is faster when PIE is not required,
which is the case with -mcmodel=kernel. IIRC, LEA with %rip had some
performance issues, which may not be the case anymore with newer
processors.

Due to the non-negligible impact of PIE, perhaps some kind of
CONFIG_PIE config definition should be introduced, so the assembly
code would be able to choose the optimal asm sequence for the PIE and
non-PIE cases?

Uros.
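[Editorial aside: a sketch of what such a config switch could look like from the assembly side. `CONFIG_X86_PIE` and `LOAD_SYM_ADDR` are invented names for this note, not existing kernel symbols; the idea is only that a helper macro expands to the RIP-relative form when PIE is in effect and to the position-dependent form otherwise, so each call site stays optimal in both builds.]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: choose the instruction template for loading a
 * symbol's address at preprocessing time.  Under PIE only the
 * RIP-relative form is valid; without PIE the position-dependent
 * mov-immediate form can be used.  (In real inline asm the templates
 * would need %%rip; plain strings are used here for illustration.) */
#ifdef CONFIG_X86_PIE
#define LOAD_SYM_ADDR(sym)	"leaq " sym "(%rip), %rax"
#else
#define LOAD_SYM_ADDR(sym)	"movq $" sym ", %rax"
#endif
```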
...
> Due to the non-negligible impact of PIE, perhaps some kind of
> CONFIG_PIE config definition should be introduced, so the assembly
> code would be able to choose optimal asm sequence when PIE and non-PIE
> is requested?

I wouldn't have thought that performance mattered in the asm code
that runs during startup?

While x86-64 code (ignoring data references) is pretty much always
position independent, the same isn't true of all architectures.
Some (at least Nios-II) only have absolute call instructions.
So you can't really move to pic code globally.

You'd also want 'bad' pic code that contained some fixups that needed
the code patching. (Which you really don't want for a shared library.)
Otherwise you get an extra instruction for non-trivial data accesses.

Thinking....
Doesn't the code generated for -fpic assume that the dynamic loader
has processed the relocations before it is run?
But the kernel startup code is running before they can have been done?
So even if that C code were 'pic' it could still contain things that
are invalid (probably arrays of pointers?).
So you lose one set of bugs and gain another.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Sun, Oct 6, 2024 at 8:01 PM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > Due to the non-negligible impact of PIE, perhaps some kind of
> > CONFIG_PIE config definition should be introduced, so the assembly
> > code would be able to choose optimal asm sequence when PIE and non-PIE
> > is requested?
>
> I wouldn't have thought that performance mattered in the asm code
> that runs during startup?

No, not the code that runs only once, where performance impact can be
tolerated.

This one:

https://lore.kernel.org/lkml/20240925150059.3955569-44-ardb+git@google.com/

Uros.
On October 6, 2024 12:17:40 PM PDT, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Sun, Oct 6, 2024 at 8:01 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > ...
> > > Due to the non-negligible impact of PIE, perhaps some kind of
> > > CONFIG_PIE config definition should be introduced, so the assembly
> > > code would be able to choose optimal asm sequence when PIE and non-PIE
> > > is requested?
> >
> > I wouldn't have thought that performance mattered in the asm code
> > that runs during startup?
>
> No, not the code that runs only once, where performance impact can be
> tolerated.
>
> This one:
>
> https://lore.kernel.org/lkml/20240925150059.3955569-44-ardb+git@google.com/
>
> Uros.

Yeah, running the kernel proper as PIE seems like a lose all around.

The decompressor, ELF stub, etc, are of course a different matter
entirely (and at least the latter can't rely on the small or kernel
memory models anyway.)
On Sat, 5 Oct 2024 at 16:37, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Sadly, that is not correct; neither gcc nor clang uses lea:

Looking around, this may be intentional. At least according to Agner,
several cores do better at "mov immediate" compared to "lea".

Eg a RIP-relative LEA on Zen 2 gets a throughput of two per cycle, but
a "MOV r,i" gets four. That got fixed in Zen 3 and later, but
apparently Intel had similar issues (Ivy Bridge: 1 LEA per cycle, vs 3
"mov i,r". Haswell is 1:4).

Of course, Agner's tables are good, but not necessarily always the
whole story. There are other instruction tables on the internet (eg
uops.info) with possibly more info.

And in reality, I would expect it to be a complete non-issue with any
OoO engine and real code, because you are very seldom ALU limited
particularly when there aren't any data dependencies.

But a RIP-relative LEA does seem to put a *bit* more pressure on the
core resources, so the compilers may be right to pick a "mov".

            Linus
On Sun, Oct 6, 2024 at 2:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 5 Oct 2024 at 16:37, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > Sadly, that is not correct; neither gcc nor clang uses lea:
>
> Looking around, this may be intentional. At least according to Agner,
> several cores do better at "mov immediate" compared to "lea".
>
> Eg a RIP-relative LEA on Zen 2 gets a throughput of two per cycle, but
> a "MOV r,i" gets four. That got fixed in Zen 3 and later, but
> apparently Intel had similar issues (Ivy Bridge: 1 LEA per cycle, vs 3
> "mov i,r". Haswell is 1:4).

Yes, this is the case. I just missed your reply when replying to
Peter's mail with a not so precise answer.

Uros.
Ard Biesheuvel <ardb+git@google.com> writes:
> This substantially reduces the number of relocations that need to be
> processed when booting a relocatable KASLR kernel.
>
> Before (size in bytes of the reloc table):
>
> 797372 arch/x86/boot/compressed/vmlinux.relocs
>
> After:
>
> 400252 arch/x86/boot/compressed/vmlinux.relocs

I don't know why anybody would care about the size of the relocation
table?

What matters is what it does to general performance. Traditionally
even on x86-64 PIC/PIE has a cost, and the kernel model was intended
to avoid that.

From my perspective this patch kit doesn't fix a real problem, it's
all risk of performance regression with no gain.

-Andi
On Wed, 25 Sept 2024 at 23:09, Andi Kleen <ak@linux.intel.com> wrote:
>
> Ard Biesheuvel <ardb+git@google.com> writes:
> > This substantially reduces the number of relocations that need to be
> > processed when booting a relocatable KASLR kernel.
> >
> > Before (size in bytes of the reloc table):
> >
> > 797372 arch/x86/boot/compressed/vmlinux.relocs
> >
> > After:
> >
> > 400252 arch/x86/boot/compressed/vmlinux.relocs
>
> I don't know why anybody would care about the size of the relocation table?
>

Fair point.

> What matters is what it does to general performance.
>
> Traditionally even on x86-64 PIC/PIE has a cost and the kernel model
> was intended to avoid that.
>

Is the x86_64 kernel C model specified anywhere, to your knowledge?

> From my perspective this patch kit doesn't fix a real problem,
> it's all risk of performance regression with no gain.
>

It's all in the cover letter and the commit logs so I won't rehash it
here, but I understand that your priorities may be different from mine.

I'll provide some numbers about the impact on code size. Are there any
other performance related aspects that you think might be impacted by
the use of position independent code generation?
On Wed, Sep 25, 2024 at 11:23:39PM +0200, Ard Biesheuvel wrote:
> > What matters is what it does to general performance.
> >
> > Traditionally even on x86-64 PIC/PIE has a cost and the kernel model
> > was intended to avoid that.
> >
>
> Is the x86_64 kernel C model specified anywhere, to your knowledge?

The basics are in the ABI. Maybe some of the details of TLS / stack
protector are missing (I guess that could be fixed, adding HJ)

Some of the motivation was also in early papers like
https://www.ucw.cz/~hubicka/papers/amd64/amd64.html

I'm copying Honza Hubicka who did the original work.

> > From my perspective this patch kit doesn't fix a real problem,
> > it's all risk of performance regression with no gain.
> >
>
> It's all in the cover letter and the commit logs so I won't rehash it
> here, but I understand that your priorities may be different from
> mine.

It sounded fairly nebulous to me. If Linux wanted to support a third
tool chain and it didn't support the kernel model yet it would be
somehow easier. Apart from the kernel model likely being one of the
minor issues in such an endeavour, I don't see a third tool chain
other than gcc and llvm anywhere on the horizon?

> I'll provide some numbers about the impact on code size. Are there any
> other performance related aspects that you think might be impacted by
> the use of position independent code generation?

Code size isn't a sufficient metric either.

Linux sometimes goes to great lengths for small gains; for example
there was a huge effort to avoid frame pointers, even though it's a
small percentage delta. PIC could well be larger than frame pointers.

You need to run it with some real workloads, e.g. some of the kernel
oriented workloads in 0day or phoronix, and see if there are
performance regressions.

Unfortunately, for an intrusive change like this, this might also vary
for different CPUs, so it may need some more coverage.

-Andi
On Thu, 26 Sept 2024 at 10:48, Andi Kleen <ak@linux.intel.com> wrote:
>
> On Wed, Sep 25, 2024 at 11:23:39PM +0200, Ard Biesheuvel wrote:
> > > What matters is what it does to general performance.
> > >
> > > Traditionally even on x86-64 PIC/PIE has a cost and the kernel model
> > > was intended to avoid that.
> > >
> >
> > Is the x86_64 kernel C model specified anywhere, to your knowledge?
>
> The basics are in the ABI. Maybe some of the details of TLS / stack
> protector are missing (I guess that could be fixed, adding HJ)
>
> Some of the motivation was also in early papers like
> https://www.ucw.cz/~hubicka/papers/amd64/amd64.html
>
> I'm copying Honza Hubicka who did the original work.
>

Thanks. So the psABI has

  Kernel code model: The kernel of an operating system is usually
  rather small but runs in the negative half of the address space.

plus an explanation on the ranges of symbolic references.

The problem here is that it is inherently a position dependent code
model, where 'the virtual address of code executed is known at link
time' as per the psABI. We currently violate that assumption in two
different ways:

- the decompressor applies KASLR by using the static ELF relocations
  (which are intended for consumption by the linker, not the loader)
  and uses them to relocate the executable to its randomized virtual
  address; those static relocations can go out of sync with the actual
  code when relaxations are applied;

- some of the startup code is now written in C, but is called via the
  1:1 mapping; absolute symbol references don't work in that context,
  and we rely on faith and a whole pile of hacks to ensure that this
  does not break.

> > > From my perspective this patch kit doesn't fix a real problem,
> > > it's all risk of performance regression with no gain.
> > >
> >
> > It's all in the cover letter and the commit logs so I won't rehash it
> > here, but I understand that your priorities may be different from
> > mine.
>
> It sounded fairly nebulous to me.
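[Editorial aside: the decompressor-side processing discussed above boils down to walking a table of recorded offsets and adding the KASLR displacement to the value stored at each one. The real vmlinux.relocs format is more involved -- it distinguishes 32-bit, inverse 32-bit, and 64-bit relocations -- so the sketch below, with its invented helper names, models only the 64-bit case on a plain byte buffer.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of boot-time relocation as done in the decompressor:
 * for every recorded offset, add the load/link delta to the 64-bit
 * value stored there.  32-bit and inverse relocations are omitted. */
static void apply_relocs_64(uint8_t *image, const uint32_t *offsets,
			    size_t count, uint64_t delta)
{
	for (size_t i = 0; i < count; i++) {
		uint64_t val;

		memcpy(&val, image + offsets[i], sizeof(val));
		val += delta;
		memcpy(image + offsets[i], &val, sizeof(val));
	}
}

/* Tiny demo image: one absolute pointer at offset 8, slid by a
 * made-up KASLR delta of 0x200000. */
static uint64_t demo(void)
{
	uint8_t image[32] = { 0 };
	uint64_t link_addr = 0x1000000;		/* link-time address */
	uint32_t offsets[] = { 8 };
	uint64_t out;

	memcpy(image + 8, &link_addr, sizeof(link_addr));
	apply_relocs_64(image, offsets, 1, 0x200000);
	memcpy(&out, image + 8, sizeof(out));
	return out;
}
```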
> If Linux wanted to support a third tool chain and it didn't support
> the kernel model yet it would be somehow easier. Apart from the kernel
> model likely being one of the minor issues in such an endeavour, I
> don't see a third tool chain other than gcc and llvm anywhere on the
> horizon?
>

I was referring to Rust, which is llvm based but will also have a GCC
based alternative (gcc-rs) in the future. I am aware that these will
most likely reuse most of the existing backends, where these concerns
have the most impact, but they still need wider consideration than
they used to in the past.

So the tl;dr for the rationale behind this series is that it is better
to use a code model and a relocation model that a) matches the reality
of how our code operates, and b) deviates as little as possible from
how code is generally constructed for user space.

On top of that, it would be better to use relocation metadata that is
intended for consumption by a runtime loader, rather than rolling our
own based on a relocation format that is intended for consumption by
the build time linker.

> > I'll provide some numbers about the impact on code size. Are there any
> > other performance related aspects that you think might be impacted by
> > the use of position independent code generation?
>
> Code size isn't a sufficient metric either.
>
> Linux sometimes goes to great length for small gains, for example
> there was a huge effort to avoid frame pointers, even though it's a
> small percentage delta. PIC could well be larger than frame pointers.
>
> You need to run it with some real workloads, e.g. some of the kernel
> oriented workloads in 0day or phoronix, and see if there are
> performance regressions.
>
> Unfortunately for an intrusive change like this this might also vary for
> different CPUs, so may need some more coverage.
>

I'll spend some time to look into this in more detail (and get some
help internally at Google to measure the real-world impact)