[v2] RE: [PATCH v2 0/2] Update mce_record tracepoint

RE: [PATCH v2 0/2] Update mce_record tracepoint

Posted by Luck, Tony 2 years ago

> > You've spent enough time with Ashok and Thomas tweaking the Linux
> > microcode driver to know that going back to the machine the next day
> > to ask about microcode version has a bunch of ways to get a wrong
> > answer.
>
> Huh, what does that have to do with this?

If deployment of a microcode update across a fleet always went
flawlessly, life would be simpler. But things can fail. And maybe the
failure wasn't noticed. Perhaps a node was rebooting when the
sysadmin pushed the update to the fleet and so missed the
deployment. Perhaps one core was already acting weird and
the microcode update didn't get applied to that core.

> IIUC, if someone changes something on the system, whether that is
> updating microcode or swapping a harddrive or swapping memory or
> whatever, that invalidates the errors reported, pretty much.
>
> You can't put it all in the trace record, you just can't.

Swapping a hard drive, or hot plugging a NIC isn't very likely
to correlate with an error reported by the CPU in machine
check banks. But microcode can be (and has been) the issue
in enough cases that knowing the version at the time of the
error matters.

You seemed to agree with this argument when the microcode
field was added to "struct mce" back in 2018

fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")

Is it so very different to add this to a trace record so that rasdaemon
can have feature parity with mcelog(8)?

-Tony

Re: [PATCH v2 0/2] Update mce_record tracepoint

Posted by Borislav Petkov 2 years ago

On Fri, Jan 26, 2024 at 07:15:50PM +0000, Luck, Tony wrote:
> If deployment of a microcode update across a fleet always went
> flawlessly, life would be simpler. But things can fail. And maybe the
> failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
> pushed the update to the fleet and so missed the deployment. Perhaps
> one core was already acting weird and the microcode update didn't get
> applied to that core.

Yes, and you go collect data from that box. You will have to anyway to
figure out why the microcode didn't update.

> Swapping a hard drive, or hot plugging a NIC isn't very likely
> to correlate with an error reported by the CPU in machine
> check banks.

Ofc it is - coherent probe timeoutting due to problematic insertion
could be reported with a MCE, and so on and so on.

> Is it so very different to add this to a trace record so that rasdaemon
> can have feature parity with mcelog(8)?

I knew you were gonna say that. When someone decides that it is
a splendid idea to add more stuff to struct mce then said someone would
want it in the tracepoint too.

And then we're back to my original question: 

"And where does it end? Stick full dmesg in the tracepoint too?"

Where do you draw the line in the sand and say, no more, especially
static, fields bloating the trace record should be added and from then
on, you should go collect the info from that box. Something which you're
supposed to do anyway.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

RE: [PATCH v2 0/2] Update mce_record tracepoint

Posted by Luck, Tony 2 years ago

> > Is it so very different to add this to a trace record so that rasdaemon
> > can have feature parity with mcelog(8)?
>
> I knew you were gonna say that. When someone decides that it is
> a splendid idea to add more stuff to struct mce then said someone would
> want it in the tracepoint too.
>
> And then we're back to my original question:
>
> "And where does it end? Stick full dmesg in the tracepoint too?"
>
> Where do you draw the line in the sand and say, no more, especially
> static, fields bloating the trace record should be added and from then
> on, you should go collect the info from that box. Something which you're
> supposed to do anyway.

Every patch that adds new code or data structures adds to the kernel
memory footprint. Each should be considered on its merits. The basic
question being:

   "Is the new functionality worth the cost?"

Where does it end? It would end if Linus declared:

  "Linux is now complete. Stop sending patches".

I.e. it is never going to end.

If somebody posts a patch asking to add the full dmesg to a
tracepoint, I'll stand with you to say: "Not only no, but hell no".

So for Naik's two patches we have:

1) PPIN
Cost = 8 bytes.
Benefit: Emdeds a system identifier into the trace record so there
can be no ambiguity about which machine generated this error.
Also definitively indicates which socket on a multi-socket system.

2) MICROCODE
Cost = 4 bytes
Benefit: Certainty about the microcode version active on the core
at the time the error was detected.

RAS = Reliability, Availability, Serviceability

These changes fall into the serviceability bucket. They make it
easier to diagnose what went wrong.


-Tony

Re: [PATCH v2 0/2] Update mce_record tracepoint

Posted by Borislav Petkov 2 years ago

On Fri, Jan 26, 2024 at 08:49:03PM +0000, Luck, Tony wrote:
> Every patch that adds new code or data structures adds to the kernel
> memory footprint. Each should be considered on its merits. The basic
> question being:
> 
>    "Is the new functionality worth the cost?"
> 
> Where does it end? It would end if Linus declared:
> 
>   "Linux is now complete. Stop sending patches".
> 
> I.e. it is never going to end.

No, it's not that - it is the merit thing which determines.

> 1) PPIN
> Cost = 8 bytes.
> Benefit: Emdeds a system identifier into the trace record so there
> can be no ambiguity about which machine generated this error.
> Also definitively indicates which socket on a multi-socket system.
> 
> 2) MICROCODE
> Cost = 4 bytes
> Benefit: Certainty about the microcode version active on the core
> at the time the error was detected.
> 
> RAS = Reliability, Availability, Serviceability
> 
> These changes fall into the serviceability bucket. They make it
> easier to diagnose what went wrong.

So does dmesg. Let's add it to the tracepoint...

But no, that's not the right question to ask.

It is rather: which bits of information are very relevant to an error
record and which are transient enough so that they cannot be gathered
from a system by other means or only gathered in a difficult way, and
should be part of that record.

The PPIN is not transient but you have to go map ->extcpu to the PPIN so
adding it to the tracepoint is purely a convenience thing. More or less.

The microcode revision thing I still don't buy but it is already there
so whateva...

So we'd need a rule hammered out and put there in a prominent place so
that it is clear what goes into struct mce and what not.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

RE: [PATCH v2 0/2] Update mce_record tracepoint

Posted by Luck, Tony 2 years ago

> But no, that's not the right question to ask.
>
> It is rather: which bits of information are very relevant to an error
> record and which are transient enough so that they cannot be gathered
> from a system by other means or only gathered in a difficult way, and
> should be part of that record.
>
> The PPIN is not transient but you have to go map ->extcpu to the PPIN so
> adding it to the tracepoint is purely a convenience thing. More or less.
>
> The microcode revision thing I still don't buy but it is already there
> so whateva...
>
> So we'd need a rule hammered out and put there in a prominent place so
> that it is clear what goes into struct mce and what not.

My personal evaluation of the value of these two additions to the trace record:

PPIN: Nice to have. People that send stuff to me are terrible about providing surrounding
details. The record already includes CPUID(1).EAX ... so I can at least skip the step of
asking them which CPU family/model/stepping they were using). But PPIN
can be recovered (so long as the submitter kept good records about which system
generated the record).

MICROCODE: Must have. Microcode version can be changed at run time. Going
back to the system to check later may not give the correct answer to what was active
at the time of the error. Especially for an error reported while a microcode update is
waling across the CPUs poking the MSR on each in turn.

-Tony

Re: [PATCH v2 0/2] Update mce_record tracepoint

Posted by Borislav Petkov 2 years ago

On Fri, Jan 26, 2024 at 10:01:29PM +0000, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using). But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is waling across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode
  update schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk
to data center operators more but this sounds like microcode update
failing is such a common thing to happen so that we *absolutely* *must*
capture the microcode revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily.

Or maybe people should script around it and keep retrying, dunno.

In my world, microcode update just works in the vast majority of the
cases and if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision,
I look at /proc/cpuinfo.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette