>> This seems a novel use of uevent ... is it OK, or is it abuse?
>
> Don't create "novel" uses of uevents.  They are there to express a
> change in state of a device so that userspace can then go and do
> something with that information.  If that pattern fits here, wonderful.

Maybe Dan will chime in here to better explain his idea. I think for
the case where the core test fails, there is a good match with uevent.
The device (one CPU core) has changed state from "working" to
"untrustworthy". Userspace can do things like: take the logical CPUs
on that core offline, initiate a service call, or in a VMM cluster environment
migrate work to a different node.

> I doubt you can report "test results" via a uevent in a way that the
> current uevent states and messages would properly convey, but hey, maybe
> I'm wrong.

But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
isn't a state change. But it seems like a poor interface if there is no feedback
that the test was run. Using different methods to report pass/fail/incomplete
also seems user hostile.

> good luck!

Thanks ... we may need it :-)

-Tony
On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> >> This seems a novel use of uevent ... is it OK, or is it abuse?
> >
> > Don't create "novel" uses of uevents.  They are there to express a
> > change in state of a device so that userspace can then go and do
> > something with that information.  If that pattern fits here, wonderful.
>
> Maybe Dan will chime in here to better explain his idea. I think for
> the case where the core test fails, there is a good match with uevent.
> The device (one CPU core) has changed state from "working" to
> "untrustworthy". Userspace can do things like: take the logical CPUs
> on that core offline, initiate a service call, or in a VMM cluster environment
> migrate work to a different node.

Again, I have no idea what you are doing at all with this driver, nor
what you want to do with it.

Start over please.

What is the hardware you have to support?

What is the expectation from userspace with regards to using the
hardware?

> > I doubt you can report "test results" via a uevent in a way that the
> > current uevent states and messages would properly convey, but hey, maybe
> > I'm wrong.
>
> But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
> isn't a state change. But it seems like a poor interface if there is no feedback
> that the test was run. Using different methods to report pass/fail/incomplete
> also seems user hostile.

We have an in-kernel "test" framework.  Yes, it's for kernel code, but
why not extend that to also include hardware tests?

thanks,

greg k-h
On Tue, Mar 15, 2022 at 8:27 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> > >> This seems a novel use of uevent ... is it OK, or is it abuse?
> > >
> > > Don't create "novel" uses of uevents.  They are there to express a
> > > change in state of a device so that userspace can then go and do
> > > something with that information.  If that pattern fits here, wonderful.
> >
> > Maybe Dan will chime in here to better explain his idea. I think for
> > the case where the core test fails, there is a good match with uevent.
> > The device (one CPU core) has changed state from "working" to
> > "untrustworthy". Userspace can do things like: take the logical CPUs
> > on that core offline, initiate a service call, or in a VMM cluster environment
> > migrate work to a different node.
>
> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.
>
> What is the hardware you have to support?
>
> What is the expectation from userspace with regards to using the
> hardware?

Here is what I have learned about this driver since engaging on this
patch set. Cores go bad at run time. Datacenters can detect them at
scale. When I worked at Facebook there was an epic story of debugging
random user login failures that resulted in the discovery of a
marginal lot-number of CPUs in a certain cluster. In that case the
crypto instructions on a few cores of those CPUs gave wrong answers.
Whether that was an electromigration effect, or just a marginal bin of
CPUs, the only detection method was A-B testing different clusters of
CPUs to isolate the differences.

This driver takes advantage of a CPU feature to inject a diagnostic
test similar to what can be done via JTAG to validate the
functionality of a given core on a CPU at a low level. The diagnostic
is run periodically since some failures may be sensitive to thermals
while other failures may be related to the lifetime of the CPU. The
result of the diagnostic is "here are 1 or more cores that may
miscalculate, stop using them and replace the CPU".

At a base level the ABI need only be something that conveys "core X
failed its last diagnostic". All the other details are just extra, and
in my opinion can be dropped save for maybe "core X was unable to run
the diagnostic".

The thought process that got me from the proposal on the table "extend
/sys/devices/system/cpu with per-cpu result state and other details"
to "emit uevents on each test completion" was the following:

- The complexity and maintenance burden of dynamically extending
  /sys/devices/system/cpu: Given that you identified a reference
  counting issue, I wondered why this was trying to use
  /sys/devices/system/cpu in the first instance.

- The result of the test is an event that kicks off remediation
  actions: When this fails a tech is paged to replace the CPU and in
  the meantime the system can either be taken offline, or if some of
  the cores are still good the workloads can be moved off of the bad
  cores to keep some capacity online until the replacement can be made.

- KOBJ_CHANGE uevents are already deployed in NVMe for AEN
  (Asynchronous Event Notifications): If the results of the test were
  conveyed only in sysfs then there would be a program that would
  scrape sysfs and turn around and fire an event for the downstream
  remediation actions. Uevent cuts to the chase and lets udev rule
  policy log, notify, and/or take pre-emptive CPU offline action.

The CPU state has changed after a test run. It has either changed to a
failed CPU, or it has changed to one that has recently asserted its
health.

> > > I doubt you can report "test results" via a uevent in a way that the
> > > current uevent states and messages would properly convey, but hey, maybe
> > > I'm wrong.
> >
> > But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
> > isn't a state change. But it seems like a poor interface if there is no feedback
> > that the test was run. Using different methods to report pass/fail/incomplete
> > also seems user hostile.
>
> We have an in-kernel "test" framework.  Yes, it's for kernel code, but
> why not extend that to also include hardware tests?

This is where my head was at when starting out with this, but this is
more of an asynchronous error reporting mechanism like machine check,
or PCIe AER, than a test. The only difference being that the error in
this case is only reported by first requesting an error check. So it
is more similar to something like a background patrol scrub that seeks
out latent ECC errors in memory.
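
[Editor's illustration, not part of the original message.] A minimal sketch of the userspace side of the uevent idea described above, assuming a handler that receives the raw KEY=VALUE payload of a "change" uevent. The property names CORE_ID and CORE_TEST_RESULT, the result strings, and the action names are hypothetical; nothing in the thread fixes an ABI. ACTION=change is the standard property carried by a KOBJ_CHANGE uevent.

```python
def parse_uevent(payload):
    """Split a NUL- or newline-separated KEY=VALUE payload into a dict.

    Items without '=' (such as the leading "change@/devpath" header of a
    netlink uevent message) are skipped.
    """
    fields = {}
    for item in payload.replace("\0", "\n").splitlines():
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def remediation(fields):
    """Map a parsed uevent to a list of policy actions (names are hypothetical)."""
    if fields.get("ACTION") != "change":
        return []
    result = fields.get("CORE_TEST_RESULT")  # hypothetical property name
    if result == "fail":
        # Core is now untrustworthy: stop using it and get the CPU replaced.
        return ["offline-core", "page-technician"]
    if result == "untested":
        # The diagnostic could not run; record that and try again later.
        return ["log", "reschedule-test"]
    if result == "pass":
        # Core has recently asserted its health; just record it.
        return ["log"]
    return []
```

With this shape, the policy stays in userspace (for example in a helper invoked from a udev rule), and the kernel only has to convey "core X failed its last diagnostic" or "core X was unable to run the diagnostic".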
On Tue, Mar 15, 2022 at 9:04 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Tue, Mar 15, 2022 at 8:27 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> > > >> This seems a novel use of uevent ... is it OK, or is it abuse?
> > > >
> > > > Don't create "novel" uses of uevents.  They are there to express a
> > > > change in state of a device so that userspace can then go and do
> > > > something with that information.  If that pattern fits here, wonderful.
> > >
> > > Maybe Dan will chime in here to better explain his idea. I think for
> > > the case where the core test fails, there is a good match with uevent.
> > > The device (one CPU core) has changed state from "working" to
> > > "untrustworthy". Userspace can do things like: take the logical CPUs
> > > on that core offline, initiate a service call, or in a VMM cluster environment
> > > migrate work to a different node.
> >
> > Again, I have no idea what you are doing at all with this driver, nor
> > what you want to do with it.
> >
> > Start over please.
> >
> > What is the hardware you have to support?
> >
> > What is the expectation from userspace with regards to using the
> > hardware?
>
> Here is what I have learned about this driver since engaging on this
> patch set. Cores go bad at run time. Datacenters can detect them at
> scale.

Tony pointed me to this video if you have not seen it:

https://www.youtube.com/watch?v=QMF3rqhjYuM
> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.
TL;DR is that silicon ages and some things break that don't have parity/ECC checks.
So systems start behaving erratically. If you are lucky they crash. If you are less lucky
they give incorrect results.
There's a paper (and even an 11-minute movie) that describes the research by
Google on this.
https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf
(https://www.youtube.com/watch?v=QMF3rqhjYuM)
> What is the hardware you have to support?
Feature first available in Sapphire Rapids (Xeon: coming later this year)
> What is the expectation from userspace with regards to using the
> hardware?
Expectation from users is that they can run these tests frequently (many times
per day) to catch silicon that has developed faults quickly and take action to
isolate the cores that have issues.
On HT enabled systems both threads that share a core need to be put into
test mode together. The current version of the tests takes around 50 milliseconds
(so for many workloads doesn't need much prep ... those with high sensitivity
to latency would need to do some additional userspace task binding to make
sure those workloads were moved to another core while the h/w test runs).
There are three outcomes from running a test:
1) The test passes all stages.
2) The test did not complete (for a variety of reasons, e.g. power states).
3) The test indicates failure. The recommendation is to run it one more time in case
the failure was transient, e.g. caused by a neutron/alpha strike.
-Tony
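
[Editor's illustration, not part of the original message.] The three outcomes and the "run one more time" recommendation can be sketched as a small retry policy. This is only a sketch under stated assumptions: `run_test` is a hypothetical callable standing in for whatever interface triggers the hardware test on one core, and treating the result of the rerun as final (including when the rerun does not complete) is an assumption, not something the thread specifies.

```python
from enum import Enum


class Outcome(Enum):
    PASS = 1        # the test passed all stages
    INCOMPLETE = 2  # the test did not complete (e.g. power states)
    FAIL = 3        # the test indicates failure


def classify_core(run_test, retries_on_fail=1):
    """Run the core test, rerunning after a failure in case it was transient.

    run_test is a hypothetical zero-argument callable returning an Outcome.
    The outcome of the last run is returned.
    """
    outcome = run_test()
    for _ in range(retries_on_fail):
        if outcome is not Outcome.FAIL:
            break
        # A single failure may be transient, so ask for one more run.
        outcome = run_test()
    return outcome
```

A caller would treat a returned `Outcome.FAIL` as grounds to isolate the core, and `Outcome.INCOMPLETE` as a reason to schedule the test again later.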
On Tue, Mar 15, 2022 at 04:10:59PM +0000, Luck, Tony wrote:
> > Again, I have no idea what you are doing at all with this driver, nor
> > what you want to do with it.
> >
> > Start over please.
>
> TL;DR is that silicon ages and some things break that don't have parity/ECC checks.
> So systems start behaving erratically. If you are lucky they crash. If you are less lucky
> they give incorrect results.
>
> There's a paper (and even an 11-minute movie) that describes the research by
> Google on this.
> https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf
> (https://www.youtube.com/watch?v=QMF3rqhjYuM)

Both you and Dan are assuming that I actually care about this hardware
and driver enough to read a presentation or watch a video about it.
Sorry, but that's not happening :)

I'm asking these questions because you all need to be asking yourselves
the same things, and figuring out what the proper API is.  That's not
my job here.  I was just pointing out the problems in your original
submission that you all should have caught before sending it out...

good luck!

greg k-h