In a large enough fleet of computers, it is common to have a few bad CPUs.
Those can often be identified by seeing that some commonly run kernel code,
which runs fine everywhere else, keeps crashing on the same CPU core on one
particular bad system.
However, the failure modes in CPUs that have gone bad over the years are
often oddly specific, and the only bad behavior seen might be segfaults
in programs like bash, python, or various system daemons that run fine
everywhere else.
Add a printk() to show_signal_msg() to print the CPU, core, and socket
at segfault time. This is not perfect, since the task might get rescheduled
on another CPU between when the fault hit, and when the message is printed,
but in practice this has been good enough to help us identify several bad
CPU cores.
segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
Signed-off-by: Rik van Riel <riel@surriel.com>
CC: Dave Jones <dsj@fb.com>
---
arch/x86/mm/fault.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fad8faa29d04..a9b93a7816f9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
+ /* This is a racy snapshot, but it's better than nothing. */
+ int cpu = READ_ONCE(raw_smp_processor_id());
if (!unhandled_signal(tsk, SIGSEGV))
return;
@@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
print_vma_addr(KERN_CONT " in ", regs->ip);
+ /*
+ * Dump the likely CPU where the fatal segfault happened.
+ * This can help identify faulty hardware.
+ */
+ printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
+ topology_core_id(cpu), topology_physical_package_id(cpu));
+
+
printk(KERN_CONT "\n");
show_opcodes(regs, loglvl);
--
2.37.1
On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
>
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
And what happens when someone is looking at this, the CPU information is
wrong because we got rescheduled but...
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
> arch/x86/mm/fault.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, struct task_struct *tsk)
> {
> const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> + /* This is a racy snapshot, but it's better than nothing. */
... someone is missing this important tidbit here that the CPU info
above is unreliable?
Someone is sent on a wild goose chase.
Can't you read out the CPU number before interrupts are enabled and hand
it down for printing?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, 2022-08-05 at 12:08 +0200, Borislav Petkov wrote:
> On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> > Add a printk() to show_signal_msg() to print the CPU, core, and socket
> > at segfault time. This is not perfect, since the task might get rescheduled
> > on another CPU between when the fault hit, and when the message is printed,
> > but in practice this has been good enough to help us identify several bad
> > CPU cores.
> >
> > segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> And what happens when someone is looking at this, the CPU information is
> wrong because we got rescheduled but...
>
> ... someone is missing this important tidbit here that the CPU info
> above is unreliable?
>
> Someone is sent on a wild goose chase.
>
We have been using this patch for the past year, and it does
not appear to be an issue in practice.

When a faulty CPU core causes tasks to segfault, we typically
see >95% of the errors printed for the CPU core that is bad,
and only a few on other CPUs. Those other CPU cores tend to be
inside the same physical chip too, so they get replaced at the
same time.

CPU failures tend to be oddly specific, often leading to dozens
(or hundreds) of segfaults in bash or python for some particular
script, while the main workload on the system continues running
without issues.

Having a small percentage of the segfaults show up on cores
other than the broken one does not cause issues with detection
or diagnosis.

> Can't you read out the CPU number before interrupts are enabled and
> hand it down for printing?
>
We could, but then we would be reading the CPU number on every
page fault, just in case it's a segfault.

That does not seem like a worthwhile tradeoff, given how much
of a hot path page faults are, and how rare segfaults are.

Furthermore, we still have the possibilities that a thread on
one CPU crashes because another, faulty CPU scribbled over its
data, or that a thread is preempted and migrated while still in
userspace, in between deciding on a bad address and accessing it.

I'm willing to be convinced otherwise, but staying out of the
hot path for something so rare seems like a typical thing to do?

--
All Rights Reversed.
On Fri, Aug 05, 2022 at 08:53:46AM -0400, Rik van Riel wrote:
> Having a small percentage of the segfaults show up on
> cores other than the broken one does not cause issues with
> detection or diagnosis.
I'm sorry but I'm not buying any of this: this should either be 100%
correct or it can stay on your kernels.
> We could, but then we would be reading the CPU number
> on every page fault, just in case it's a segfault.
>
> That does not seem like a worthwhile tradeoff, given
> how much of a hot path page faults are, and how rare
> segfaults are.
Oh wow, a whopping single instruction:
movl %gs:cpu_number(%rip), %eax # cpu_number, pfo_val__
What tradeoff?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Rik van Riel <riel@surriel.com> wrote:
> In a large enough fleet of computers, it is common to have a few bad CPUs.
> Those can often be identified by seeing that some commonly run kernel code,
> which runs fine everywhere else, keeps crashing on the same CPU core on one
> particular bad system.
>
> However, the failure modes in CPUs that have gone bad over the years are
> often oddly specific, and the only bad behavior seen might be segfaults
> in programs like bash, python, or various system daemons that run fine
> everywhere else.
>
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
>
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
> arch/x86/mm/fault.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, struct task_struct *tsk)
> {
> const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> + /* This is a racy snapshot, but it's better than nothing. */
> + int cpu = READ_ONCE(raw_smp_processor_id());
>
> if (!unhandled_signal(tsk, SIGSEGV))
> return;
> @@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>
> print_vma_addr(KERN_CONT " in ", regs->ip);
>
> + /*
> + * Dump the likely CPU where the fatal segfault happened.
> + * This can help identify faulty hardware.
> + */
> + printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
> + topology_core_id(cpu), topology_physical_package_id(cpu));
LGTM, applying this to tip:x86/mm unless someone objects.
I've added the tidbit to the changelog that this only gets printed if
show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is enabled -
which is off by default. So your patch expands upon a default-off debug
printout in essence - where utility maximization is OK.
Thanks,
Ingo
On Thu, 2022-08-04 at 22:17 +0200, Ingo Molnar wrote:
>
> I've added the tidbit to the changelog that this only gets printed if
> show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is

The show_unhandled_signals variable seems to be controlled through
/proc/sys/debug/exception-trace, and be on by default.

Enabling print-fatal-signals makes the kernel a lot more chatty,
and does not offer the same value in my experiments :)

> enabled -
> which is off by default. So your patch expands upon a default-off
> debug printout in essence - where utility maximization is OK.
>
> Thanks,
>
> Ingo
>
--
All Rights Reversed.
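(For anyone checking locally: assuming a standard Linux /proc mount, the two knobs being discussed here can be inspected like this.)

```shell
# show_unhandled_signals is exposed as debug/exception-trace
# and defaults to 1 (on) on most distributions:
cat /proc/sys/debug/exception-trace

# print-fatal-signals is a separate knob, off by default:
cat /proc/sys/kernel/print-fatal-signals

# re-enable the segfault messages if they were turned off:
# echo 1 | sudo tee /proc/sys/debug/exception-trace
```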
* Rik van Riel <riel@surriel.com> wrote:
> The show_unhandled_signals variable seems to be controlled through
> /proc/sys/debug/exception-trace, and be on by default.
Indeed ... then this too should be pointed out in the changelog.
Plus the patch doesn't build on top of the latest upstream kernel,
x86-defconfig:
CC arch/x86/mm/fault.o
In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
from ./include/linux/compiler.h:248,
from ./arch/x86/include/asm/current.h:5,
from ./include/linux/sched.h:12,
from arch/x86/mm/fault.c:7:
arch/x86/mm/fault.c: In function ‘show_signal_msg’:
./include/asm-generic/rwonce.h:44:71: error: lvalue required as unary ‘&’ operand
44 | #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
| ^
./include/asm-generic/rwonce.h:50:9: note: in expansion of macro ‘__READ_ONCE’
50 | __READ_ONCE(x); \
| ^~~~~~~~~~~
arch/x86/mm/fault.c:773:19: note: in expansion of macro ‘READ_ONCE’
773 | int cpu = READ_ONCE(raw_smp_processor_id());
| ^~~~~~~~~
make[2]: *** [scripts/Makefile.build:249: arch/x86/mm/fault.o] Error 1
Thanks,
Ingo
On Fri, 2022-08-05 at 09:59 +0200, Ingo Molnar wrote:
>
> * Rik van Riel <riel@surriel.com> wrote:
>
> > The show_unhandled_signals variable seems to be controlled through
> > /proc/sys/debug/exception-trace, and be on by default.
>
> Indeed ... then this too should be pointed out in the changelog.
>
Will do.

> Plus the patch doesn't build on top of the latest upstream kernel,
> x86-defconfig:

Urgh, let me fix that too. I'm not sure what happened there, I might
have done the kernel build in a different directory than where the
code was :(

v3 coming soon.

--
All Rights Reversed.