[v1] RE: [RFC 00/10] Introduce In Field Scan driver

RE: [RFC 00/10] Introduce In Field Scan driver

Posted by Luck, Tony 4 years, 3 months ago

>> This seems a novel use of uevent ... is it OK, or is is abuse?
>
> Don't create "novel" uses of uevents.  They are there to express a
> change in state of a device so that userspace can then go and do
> something with that information.  If that pattern fits here, wonderful.

Maybe Dan will chime in here to better explain his idea. I think for
the case where the core test fails, there is a good match with uevent.
The device (one CPU core) has changed state from "working" to
"untrustworthy". Userspace can do things like: take the logical CPUs
on that core offline, initiate a service call, or in a VMM cluster environment
migrate work to a different node.

> I doubt you can report "test results" via a uevent in a way that the
> current uevent states and messages would properly convey, but hey, maybe
> I'm wrong.

But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
isn't a state change.  But it seems like a poor interface if there is no feedback
that the test was run. Using different methods to report pass/fail/incomplete
also seems user hostile.

> good luck!
Thanks ... we may need it :-)

-Tony

Re: [RFC 00/10] Introduce In Field Scan driver

Posted by Greg KH 4 years, 3 months ago

On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> >> This seems a novel use of uevent ... is it OK, or is is abuse?
> >
> > Don't create "novel" uses of uevents.  They are there to express a
> > change in state of a device so that userspace can then go and do
> > something with that information.  If that pattern fits here, wonderful.
> 
> Maybe Dan will chime in here to better explain his idea. I think for
> the case where the core test fails, there is a good match with uevent.
> The device (one CPU core) has changed state from "working" to
> "untrustworthy". Userspace can do things like: take the logical CPUs
> on that core offline, initiate a service call, or in a VMM cluster environment
> migrate work to a different node.

Again, I have no idea what you are doing at all with this driver, nor
what you want to do with it.

Start over please.

What is the hardware you have to support?

What is the expectation from userspace with regards to using the
hardware?

> > I doubt you can report "test results" via a uevent in a way that the
> > current uevent states and messages would properly convey, but hey, maybe
> > I'm wrong.
> 
> But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
> isn't a state change.  But it seems like a poor interface if there is no feedback
> that the test was run. Using different methods to report pass/fail/incomplete
> also seems user hostile.

We have an in-kernel "test" framework.  Yes, it's for kernel code, but
why not extend that to also include hardware tests?

thanks,

greg k-h

Re: [RFC 00/10] Introduce In Field Scan driver

Posted by Dan Williams 4 years, 3 months ago

On Tue, Mar 15, 2022 at 8:27 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> > >> This seems a novel use of uevent ... is it OK, or is is abuse?
> > >
> > > Don't create "novel" uses of uevents.  They are there to express a
> > > change in state of a device so that userspace can then go and do
> > > something with that information.  If that pattern fits here, wonderful.
> >
> > Maybe Dan will chime in here to better explain his idea. I think for
> > the case where the core test fails, there is a good match with uevent.
> > The device (one CPU core) has changed state from "working" to
> > "untrustworthy". Userspace can do things like: take the logical CPUs
> > on that core offline, initiate a service call, or in a VMM cluster environment
> > migrate work to a different node.
>
> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.
>
> What is the hardware you have to support?
>
> What is the expectation from userspace with regards to using the
> hardware?

Here is what I have learned about this driver since engaging on this
patch set. Cores go bad at run time. Datacenters can detect them at
scale. When I worked at Facebook there was an epic story of debugging
random user login failures that resulted in the discovery of a
marginal lot-number of CPUs in a certain cluster. In that case the
crypto instructions on a few cores of those CPUs gave wrong answers.
Whether that was an electromigration effect, or just a marginal bin of
CPUs, the only detection method was A-B testing different clusters of
CPUs to isolate the differences.

This driver takes advantage of a CPU feature to inject a diagnostic
test similar to what can be done via JTAG to validate the
functionality of a given core on a CPU at a low level. The diagnostic
is run periodically since some failures may be sensitive to thermals
while other failures may be be related to the lifetime of the CPU. The
result of the diagnostic is "here are 1 or more cores that may
miscalculate, stop using them and replace the CPU".

At a base level the ABI need only be something that conveys "core X
failed its last diagnostic". All the other details are just extra, and
in my opinion can be dropped save for maybe "core X was unable to run
the diagnostic".

The thought process that got me from the proposal on the table "extend
/sys/devices/system/cpu with per-cpu result state and other details"
to "emit uevents on each test completion" were the following:
-The complexity and maintenance burden of dynamically extending
/sys/devices/system/cpu: Given that you identified a reference
counting issue, I wondered why this was trying to use
/sys/devices/system/cpu in the first instance.

- The result of the test is an event that kicks off remediation
actions: When this fails a tech is paged to replace the CPU and in the
meantime the system can either be taken offline, or if some of the
cores are still good the workloads can be moved off of the bad cores
to keep some capacity online until the replacement can be made.

- KOBJ_CHANGE uevents are already deployed in NVME for AEN
(Asynchronous Event Notifications): If the results of the test were
conveyed only in sysfs then there would be a program that would scrape
sysfs and turn around and fire an event for the downstream remediation
actions. Uevent cuts to the chase and lets udev rule policy log,
notify, and/or take pre-emptive CPU offline action. The CPU state has
changed after a test run. It has either changed to a failed CPU, or it
has changed to one that has recently asserted its health.

> > > I doubt you can report "test results" via a uevent in a way that the
> > > current uevent states and messages would properly convey, but hey, maybe
> > > I'm wrong.
> >
> > But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
> > isn't a state change.  But it seems like a poor interface if there is no feedback
> > that the test was run. Using different methods to report pass/fail/incomplete
> > also seems user hostile.
>
> We have an in-kernel "test" framework.  Yes, it's for kernel code, but
> why not extend that to also include hardware tests?

This is where my head was at when starting out with this, but this is
more of an asynchronous error reporting mechanism like machine check,
or PCIe AER, than a test. The only difference being that the error in
this case is only reported by first requesting an error check. So it
is more similar to something like a background patrol scrub that seeks
out latent ECC errors in memory.

Re: [RFC 00/10] Introduce In Field Scan driver

Posted by Dan Williams 4 years, 3 months ago

On Tue, Mar 15, 2022 at 9:04 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Mar 15, 2022 at 8:27 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> > > >> This seems a novel use of uevent ... is it OK, or is is abuse?
> > > >
> > > > Don't create "novel" uses of uevents.  They are there to express a
> > > > change in state of a device so that userspace can then go and do
> > > > something with that information.  If that pattern fits here, wonderful.
> > >
> > > Maybe Dan will chime in here to better explain his idea. I think for
> > > the case where the core test fails, there is a good match with uevent.
> > > The device (one CPU core) has changed state from "working" to
> > > "untrustworthy". Userspace can do things like: take the logical CPUs
> > > on that core offline, initiate a service call, or in a VMM cluster environment
> > > migrate work to a different node.
> >
> > Again, I have no idea what you are doing at all with this driver, nor
> > what you want to do with it.
> >
> > Start over please.
> >
> > What is the hardware you have to support?
> >
> > What is the expectation from userspace with regards to using the
> > hardware?
>
> Here is what I have learned about this driver since engaging on this
> patch set. Cores go bad at run time. Datacenters can detect them at
> scale.

Tony pointed me to this video if you have not seen it:

https://www.youtube.com/watch?v=QMF3rqhjYuM

RE: [RFC 00/10] Introduce In Field Scan driver

Posted by Luck, Tony 4 years, 3 months ago

> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.

TL;DR is that silicon ages and some things break that don't have parity/ECC checks.
So systems start behaving erratically. If you are lucky they crash. If you are less lucky
they give incorrect results.

There's a paper (and even a movie 11 minutes) that describe the research by
Google on this.
https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf  
(https://www.youtube.com/watch?v=QMF3rqhjYuM)    

> What is the hardware you have to support?

Feature first available in Sapphire Rapids (Xeon: coming later this year)

> What is the expectation from userspace with regards to using the
> hardware?

Expectation from users is that they can run these tests frequently (many times
per day) to catch silicon that has developed faults quickly and take action to
isolate the cores that have issues.

On HT enabled systems both threads that share a core need to be put into
test mode together. The current version of tests takes around 50 milli-seconds
(so for many workloads doesn't need much prep ... those with high sensitivity
to latency would need to do some additional userspace task binding to make
sure those workloads were moved to another core while the h/w test runs).

There are three outcomes from running a test:

1) The test passes all stages.
2) The test did not complete (for a variety of reasons, e.g. power states)
3) The test indicates failure. Recommendation is to run one more time in case
    the failure was transient .. e.g. cause by a neutron/alpha strike.

-Tony

Re: [RFC 00/10] Introduce In Field Scan driver

Posted by Greg KH 4 years, 3 months ago

On Tue, Mar 15, 2022 at 04:10:59PM +0000, Luck, Tony wrote:
> > Again, I have no idea what you are doing at all with this driver, nor
> > what you want to do with it.
> >
> > Start over please.
> 
> TL;DR is that silicon ages and some things break that don't have parity/ECC checks.
> So systems start behaving erratically. If you are lucky they crash. If you are less lucky
> they give incorrect results.
> 
> There's a paper (and even a movie 11 minutes) that describe the research by
> Google on this.
> https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf  
> (https://www.youtube.com/watch?v=QMF3rqhjYuM)    

Both you and Dan are assuming that I actually care about this hardware
and driver enough to read a presentation or watch a video about it.
Sorry, but that's not happening :)

I'm saying these questions as you all need to be asking yourself that,
and figuring out what the proper api is.  That's not my job here.  I was
just pointing out the problems in your original submission that you all
should have caught before sending it out...

good luck!

greg k-h