In a large enough fleet of computers, it is common to have a few bad CPUs.
Those can often be identified by seeing that some commonly run kernel code,
which runs fine everywhere else, keeps crashing on the same CPU core on one
particular bad system.
However, the failure modes in CPUs that have gone bad over the years are
often oddly specific, and the only bad behavior seen might be segfaults
in programs like bash, python, or various system daemons that run fine
everywhere else.
Add a printk() to show_signal_msg() to print the CPU, core, and socket
at segfault time. This is not perfect, since the task might get rescheduled
on another CPU between when the fault hit, and when the message is printed,
but in practice this has been good enough to help us identify several bad
CPU cores.
segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
Signed-off-by: Rik van Riel <riel@surriel.com>
CC: Dave Jones <dsj@fb.com>
---
arch/x86/mm/fault.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fad8faa29d04..a9b93a7816f9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
+ /* This is a racy snapshot, but it's better than nothing. */
+ int cpu = READ_ONCE(raw_smp_processor_id());
if (!unhandled_signal(tsk, SIGSEGV))
return;
@@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
print_vma_addr(KERN_CONT " in ", regs->ip);
+ /*
+ * Dump the likely CPU where the fatal segfault happened.
+ * This can help identify faulty hardware.
+ */
+ printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
+ topology_core_id(cpu), topology_physical_package_id(cpu));
+
+
printk(KERN_CONT "\n");
show_opcodes(regs, loglvl);
--
2.37.1
On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
>
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
And what happens when someone is looking at this, the CPU information is
wrong because we got rescheduled but...
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
> arch/x86/mm/fault.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, struct task_struct *tsk)
> {
> const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> + /* This is a racy snapshot, but it's better than nothing. */
... someone is missing this important tidbit here that the CPU info
above is unreliable?
Someone is sent on a wild goose chase.
Can't you read out the CPU number before interrupts are enabled and hand
it down for printing?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, 2022-08-05 at 12:08 +0200, Borislav Petkov wrote:
> On Thu, Aug 04, 2022 at 03:54:50PM -0400, Rik van Riel wrote:
> > Add a printk() to show_signal_msg() to print the CPU, core, and socket
> > at segfault time. This is not perfect, since the task might get rescheduled
> > on another CPU between when the fault hit, and when the message is printed,
> > but in practice this has been good enough to help us identify several bad
> > CPU cores.
> >
> > segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> And what happens when someone is looking at this, the CPU information is
> wrong because we got rescheduled but...
>
> ... someone is missing this important tidbit here that the CPU info
> above is unreliable?
>
> Someone is sent on a wild goose chase.
>
We have been using this patch for the past year, and it does
not appear to be an issue in practice.

When a faulty CPU core causes tasks to segfault, we typically
see >95% of the errors printed for the CPU core that is bad,
and only a few on other CPUs. Those other CPU cores tend to be
inside the same physical chip too, so they get replaced at the
same time.

CPU failures tend to be oddly specific, often leading to dozens
(or hundreds) of segfaults in bash or python for some particular
script, while the main workload on the system continues running
without issues.

Having a small percentage of the segfaults show up on cores
other than the broken one does not cause issues with detection
or diagnosis.

> Can't you read out the CPU number before interrupts are enabled and
> hand it down for printing?
>
We could, but then we would be reading the CPU number on every
page fault, just in case it's a segfault.

That does not seem like a worthwhile tradeoff, given how much
of a hot path page faults are, and how rare segfaults are.

Furthermore, we still have the possibilities that a thread on
one CPU crashes because another, faulty CPU scribbled over its
data, or that a thread is preempted and migrated while still in
userspace, in between deciding on a bad address and accessing it.

I'm willing to be convinced otherwise, but staying out of the
hot path for something so rare seems like a typical thing to do?

--
All Rights Reversed.
On Fri, Aug 05, 2022 at 08:53:46AM -0400, Rik van Riel wrote:
> Having a small percentage of the segfaults show up on
> cores other than the broken one does not cause issues with
> detection or diagnosis.
I'm sorry but I'm not buying any of this: this should either be 100%
correct or it can stay on your kernels.
> We could, but then we would be reading the CPU number
> on every page fault, just in case it's a segfault.
>
> That does not seem like a worthwhile tradeoff, given
> how much of a hot path page faults are, and how rare
> segfaults are.
Oh wow, a whopping single instruction:
movl %gs:cpu_number(%rip), %eax # cpu_number, pfo_val__
What tradeoff?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Rik van Riel <riel@surriel.com> wrote:
> In a large enough fleet of computers, it is common to have a few bad CPUs.
> Those can often be identified by seeing that some commonly run kernel code,
> which runs fine everywhere else, keeps crashing on the same CPU core on one
> particular bad system.
>
> However, the failure modes in CPUs that have gone bad over the years are
> often oddly specific, and the only bad behavior seen might be segfaults
> in programs like bash, python, or various system daemons that run fine
> everywhere else.
>
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
>
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
> arch/x86/mm/fault.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, struct task_struct *tsk)
> {
> const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> + /* This is a racy snapshot, but it's better than nothing. */
> + int cpu = READ_ONCE(raw_smp_processor_id());
>
> if (!unhandled_signal(tsk, SIGSEGV))
> return;
> @@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>
> print_vma_addr(KERN_CONT " in ", regs->ip);
>
> + /*
> + * Dump the likely CPU where the fatal segfault happened.
> + * This can help identify faulty hardware.
> + */
> + printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
> + topology_core_id(cpu), topology_physical_package_id(cpu));
LGTM, applying this to tip:x86/mm unless someone objects.
I've added the tidbit to the changelog that this only gets printed if
show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is enabled -
which is off by default. So your patch expands upon a default-off debug
printout in essence - where utility maximization is OK.
Thanks,
Ingo
On Thu, 2022-08-04 at 22:17 +0200, Ingo Molnar wrote:
>
> I've added the tidbit to the changelog that this only gets printed if
> show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is

The show_unhandled_signals variable seems to be controlled through
/proc/sys/debug/exception-trace, and be on by default.

Enabling print-fatal-signals makes the kernel a lot more chatty,
and does not offer the same value in my experiments :)

> enabled -
> which is off by default. So your patch expands upon a default-off
> debug printout in essence - where utility maximization is OK.
>
> Thanks,
>
> Ingo
>
--
All Rights Reversed.
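(For anyone checking locally: assuming a standard Linux /proc mount, the two knobs being discussed here can be inspected like this.)

```shell
# show_unhandled_signals is exposed as debug/exception-trace
# and defaults to 1 (on) on most distributions:
cat /proc/sys/debug/exception-trace

# print-fatal-signals is a separate knob, off by default:
cat /proc/sys/kernel/print-fatal-signals

# re-enable the segfault messages if they were turned off:
# echo 1 | sudo tee /proc/sys/debug/exception-trace
```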
* Rik van Riel <riel@surriel.com> wrote:
> The show_unhandled_signals variable seems to be controlled through
> /proc/sys/debug/exception-trace, and be on by default.
Indeed ... then this too should be pointed out in the changelog.
Plus the patch doesn't build on top of the latest upstream kernel,
x86-defconfig:
CC arch/x86/mm/fault.o
In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
from ./include/linux/compiler.h:248,
from ./arch/x86/include/asm/current.h:5,
from ./include/linux/sched.h:12,
from arch/x86/mm/fault.c:7:
arch/x86/mm/fault.c: In function ‘show_signal_msg’:
./include/asm-generic/rwonce.h:44:71: error: lvalue required as unary ‘&’ operand
44 | #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
| ^
./include/asm-generic/rwonce.h:50:9: note: in expansion of macro ‘__READ_ONCE’
50 | __READ_ONCE(x); \
| ^~~~~~~~~~~
arch/x86/mm/fault.c:773:19: note: in expansion of macro ‘READ_ONCE’
773 | int cpu = READ_ONCE(raw_smp_processor_id());
| ^~~~~~~~~
make[2]: *** [scripts/Makefile.build:249: arch/x86/mm/fault.o] Error 1
Thanks,
Ingo
On Fri, 2022-08-05 at 09:59 +0200, Ingo Molnar wrote:
>
> * Rik van Riel <riel@surriel.com> wrote:
>
> > The show_unhandled_signals variable seems to be controlled through
> > /proc/sys/debug/exception-trace, and be on by default.
>
> Indeed ... then this too should be pointed out in the changelog.
>
Will do.

> Plus the patch doesn't build on top of the latest upstream kernel,
> x86-defconfig:

Urgh, let me fix that too. I'm not sure what happened there, I might
have done the kernel build in a different directory than where the
code was :(

v3 coming soon.

--
All Rights Reversed.