[RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references

Ard Biesheuvel posted 2 patches 1 year ago
[RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ard Biesheuvel 1 year ago
From: Ard Biesheuvel <ardb@kernel.org>

Compiler emitted absolute references are often section-relative, as the
objects in question sometimes don't even exist in the C code (e.g., jump
tables) or have static linkage.

Enhance the diagnostic that is printed when detecting absolute
references in .head.text, but printing the addend of the symbol
reference, and the location in vmlinux where the reference can be found.

So instead of printing

  Absolute reference to symbol '.rodata' detected not permitted in .head.text

and failing the build, print the below but only as a warning.

  Absolute reference to symbol '.rodata+0x180' detected in .head.text (0xffffffff820cb4ba).
  This kernel may might not boot.

Not failing the build also works around the issue that the file vmlinux
will be deleted by make when an error occurs, which is not very helpful
in trying to narrow down the problem.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/tools/relocs.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index e937be979ec8..134cf5cfe7bd 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -901,9 +901,10 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
 		}
 
 		if (headtext) {
-			die("Absolute reference to symbol '%s' not permitted in .head.text\n",
-			    symname);
-			break;
+			fprintf(stderr,
+				"Absolute reference to symbol '%s+0x%lx' detected in .head.text (0x%lx).\n"
+				"This kernel might not boot.\n",
+				symname, rel->r_addend, offset);
 		}
 
 		/*
-- 
2.48.1.262.g85cc9f2d1e-goog
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Linus Torvalds 1 year ago
On Mon, 27 Jan 2025 at 03:43, Ard Biesheuvel <ardb+git@google.com> wrote:
>
>   Absolute reference to symbol '.rodata+0x180' detected in .head.text (0xffffffff820cb4ba).

Do we have any symbol name lookup logic anywhere?

Because it would be much more helpful if it also printed out the
source symbol and target symbol name, not just the offsets in the
section..

Yes, with the offsets - and the lack of fatal error, so that the
vmlinux file stays around - you can now much more easily do some gdb
or objdump magic to then pinpoint what's up, but it still ends up
being very much a "you _really_ need to understand what's up" kind of
thing.

                  Linus
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ard Biesheuvel 1 year ago
On Mon, 27 Jan 2025 at 17:57, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 27 Jan 2025 at 03:43, Ard Biesheuvel <ardb+git@google.com> wrote:
> >
> >   Absolute reference to symbol '.rodata+0x180' detected in .head.text (0xffffffff820cb4ba).
>
> Do we have any symbol name lookup logic anywhere?
>

I can look into that. In this particular case, though, there is no
symbol to look up as it is a anonymous jump table generated by the
compiler. And the function name would be inaccurate too, as
snp_cpuid_postprocess() got inlined into its caller. But I guess with
the right DWARF data, at least the call site could be narrowed down a
bit better.
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ingo Molnar 1 year ago
* Ard Biesheuvel <ardb@kernel.org> wrote:

> On Mon, 27 Jan 2025 at 17:57, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Mon, 27 Jan 2025 at 03:43, Ard Biesheuvel <ardb+git@google.com> wrote:
> > >
> > >   Absolute reference to symbol '.rodata+0x180' detected in .head.text (0xffffffff820cb4ba).
> >
> > Do we have any symbol name lookup logic anywhere?
> >
> 
> I can look into that. In this particular case, though, there is no
> symbol to look up as it is a anonymous jump table generated by the
> compiler. And the function name would be inaccurate too, as
> snp_cpuid_postprocess() got inlined into its caller. But I guess with
> the right DWARF data, at least the call site could be narrowed down a
> bit better.

So patch #2 is now upstream, but should I apply this diagnostic patch 
as-is, or will there be a -v2?

Thanks,

	Ingo
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ard Biesheuvel 1 year ago
On Mon, 3 Feb 2025 at 10:40, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > On Mon, 27 Jan 2025 at 17:57, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Mon, 27 Jan 2025 at 03:43, Ard Biesheuvel <ardb+git@google.com> wrote:
> > > >
> > > >   Absolute reference to symbol '.rodata+0x180' detected in .head.text (0xffffffff820cb4ba).
> > >
> > > Do we have any symbol name lookup logic anywhere?
> > >
> >
> > I can look into that. In this particular case, though, there is no
> > symbol to look up as it is a anonymous jump table generated by the
> > compiler. And the function name would be inaccurate too, as
> > snp_cpuid_postprocess() got inlined into its caller. But I guess with
> > the right DWARF data, at least the call site could be narrowed down a
> > bit better.
>
> So patch #2 is now upstream, but should I apply this diagnostic patch
> as-is, or will there be a -v2?
>

I'm looking into this. But give the points above, I'm reaching the
conclusion that producing a better diagnostic based solely on vmlinux
(which may be built without debug info) is intractable, and not even
the DWARF metadata will describe a compiler generated jump table using
a named ELF symbol.

So I am also looking into isolating the startup code like I did for
arm64 (and which has been adopted by RISC-V as well), but this is
rather hairy on x86 so it will take some time. But once that lands,
this diagnostic can be removed.

So I will leave it up to you to decide whether to merge this
improvement for now, or revert the diagnostic as you suggested before.
This code has already identified some issues that were subsequently
fixed, so it has already served its purpose.

As for the related changes: .head.text can be removed in its entirety
once this is all done, so no need to revert the linker script changes
at this point. (The startup code that only executes at boot on the BSP
can move into .init.text where it belongs)
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ingo Molnar 11 months, 2 weeks ago
* Ard Biesheuvel <ardb@kernel.org> wrote:

> On Mon, 3 Feb 2025 at 10:40, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > > On Mon, 27 Jan 2025 at 17:57, Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Mon, 27 Jan 2025 at 03:43, Ard Biesheuvel <ardb+git@google.com> wrote:
> > > > >
> > > > >   Absolute reference to symbol '.rodata+0x180' detected in .head.text (0xffffffff820cb4ba).
> > > >
> > > > Do we have any symbol name lookup logic anywhere?
> > > >
> > >
> > > I can look into that. In this particular case, though, there is no
> > > symbol to look up as it is a anonymous jump table generated by the
> > > compiler. And the function name would be inaccurate too, as
> > > snp_cpuid_postprocess() got inlined into its caller. But I guess with
> > > the right DWARF data, at least the call site could be narrowed down a
> > > bit better.
> >
> > So patch #2 is now upstream, but should I apply this diagnostic patch
> > as-is, or will there be a -v2?
> >
> 
> I'm looking into this. But give the points above, I'm reaching the
> conclusion that producing a better diagnostic based solely on vmlinux
> (which may be built without debug info) is intractable, and not even
> the DWARF metadata will describe a compiler generated jump table using
> a named ELF symbol.
> 
> So I am also looking into isolating the startup code like I did for
> arm64 (and which has been adopted by RISC-V as well), but this is
> rather hairy on x86 so it will take some time. But once that lands,
> this diagnostic can be removed.
> 
> So I will leave it up to you to decide whether to merge this
> improvement for now, or revert the diagnostic as you suggested before.
> This code has already identified some issues that were subsequently
> fixed, so it has already served its purpose.

So after another 2 weeks there's been no new upstream regressions I'm 
aware of, so - knock on wood - it seems we can leave the die() in 
place?

But could we perhaps make it more debuggable, should it trigger - such 
as not removing the relevant object file and improving the message? 
I.e. make the build failure experience Linus had somewhat more 
palatable...

Thanks,

	Ingo
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ingo Molnar 11 months, 2 weeks ago
* Ingo Molnar <mingo@kernel.org> wrote:

> So after another 2 weeks there's been no new upstream regressions I'm 
> aware of, so - knock on wood - it seems we can leave the die() in 
> place?
> 
> But could we perhaps make it more debuggable, should it trigger - 
> such as not removing the relevant object file and improving the 
> message? I.e. make the build failure experience Linus had somewhat 
> more palatable...

For example, the new message is far better, even when combined with a 
die() build failure:

-                       die("Absolute reference to symbol '%s' not permitted in .head.text\n",
-                           symname);
-                       break;
+                       fprintf(stderr,
+                               "Absolute reference to symbol '%s+0x%lx' detected in .head.text (0x%lx).\n"
+                               "This kernel might not boot.\n",
+                               symname, rel->r_addend, offset);

as it points out that the underlying bug might result in an unbootable 
kernel image. So the user at least knows what the pain is about ...

Thanks,

	Ingo
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ard Biesheuvel 11 months, 2 weeks ago
On Sat, 22 Feb 2025 at 13:03, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > So after another 2 weeks there's been no new upstream regressions I'm
> > aware of, so - knock on wood - it seems we can leave the die() in
> > place?
> >
> > But could we perhaps make it more debuggable, should it trigger -
> > such as not removing the relevant object file and improving the
> > message? I.e. make the build failure experience Linus had somewhat
> > more palatable...
>
> For example, the new message is far better, even when combined with a
> die() build failure:
>
> -                       die("Absolute reference to symbol '%s' not permitted in .head.text\n",
> -                           symname);
> -                       break;
> +                       fprintf(stderr,
> +                               "Absolute reference to symbol '%s+0x%lx' detected in .head.text (0x%lx).\n"
> +                               "This kernel might not boot.\n",
> +                               symname, rel->r_addend, offset);
>
> as it points out that the underlying bug might result in an unbootable
> kernel image. So the user at least knows what the pain is about ...
>

Ultimately, it is the die() that results in vmlinux to be deleted. And
this is actually a result of the slightly dubious way the
Makefile.postlink logic works: usually, artifacts are created once by
the Makefile rule that defines how they are built, and if that rule
fails, no output is produced but the input is preserved. In the
vmlinux case, the file is modified by a separate rule that executes
Makefile.postlink in an entirely separate make invocation, which
splits off the static ELF relocations, using vmlinux both as input and
output.

I can have a stab at fixing that instead. That way, we can use the
improved diagnostic message, and leave the die() in place without it
resulting in vmlinux to be deleted.
Re: [RFC PATCH 1/2] x86/relocs: Improve diagnostic for rejected absolute references
Posted by Ingo Molnar 11 months, 2 weeks ago
* Ard Biesheuvel <ardb@kernel.org> wrote:

> On Sat, 22 Feb 2025 at 13:03, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> > > So after another 2 weeks there's been no new upstream regressions I'm
> > > aware of, so - knock on wood - it seems we can leave the die() in
> > > place?
> > >
> > > But could we perhaps make it more debuggable, should it trigger -
> > > such as not removing the relevant object file and improving the
> > > message? I.e. make the build failure experience Linus had somewhat
> > > more palatable...
> >
> > For example, the new message is far better, even when combined with a
> > die() build failure:
> >
> > -                       die("Absolute reference to symbol '%s' not permitted in .head.text\n",
> > -                           symname);
> > -                       break;
> > +                       fprintf(stderr,
> > +                               "Absolute reference to symbol '%s+0x%lx' detected in .head.text (0x%lx).\n"
> > +                               "This kernel might not boot.\n",
> > +                               symname, rel->r_addend, offset);
> >
> > as it points out that the underlying bug might result in an unbootable
> > kernel image. So the user at least knows what the pain is about ...
> >
> 
> Ultimately, it is the die() that results in vmlinux to be deleted. And
> this is actually a result of the slightly dubious way the
> Makefile.postlink logic works: usually, artifacts are created once by
> the Makefile rule that defines how they are built, and if that rule
> fails, no output is produced but the input is preserved. In the
> vmlinux case, the file is modified by a separate rule that executes
> Makefile.postlink in an entirely separate make invocation, which
> splits off the static ELF relocations, using vmlinux both as input and
> output.
> 
> I can have a stab at fixing that instead. That way, we can use the
> improved diagnostic message, and leave the die() in place without it
> resulting in vmlinux to be deleted.

This sounds like the right approach to me too!

Thanks,

	Ingo