[PATCH v2 0/2] New CMCI storm mitigation for Intel CPUs

Tony Luck posted 2 patches 4 years, 3 months ago
Posted by Tony Luck 4 years, 3 months ago
Two-part motivation:

1) Disabling CMCI globally is an overly big hammer

2) Intel signals some UNCORRECTED errors using CMCI (yes, turns
out that was a poorly chosen name given the later evolution of
the architecture). Since we don't want to miss those, the proposed
storm code just bumps the threshold to (almost) maximum to mitigate,
but not eliminate the storm. Note that the threshold only applies
to corrected errors.
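
To make the threshold idea in point 2 concrete, here is a minimal user-space sketch, not code from the patches. It assumes the Intel layout of the per-bank IA32_MCi_CTL2 register (corrected error count threshold in bits 14:0, CMCI enable in bit 30). The struct, the function names, and both threshold values are hypothetical; the thread does not state the actual numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed IA32_MCi_CTL2 layout: threshold in bits 14:0, CMCI enable in bit 30. */
#define CTL2_THRESHOLD_MASK 0x7fffULL
#define CTL2_CMCI_EN        (1ULL << 30)

/* Illustrative values only: interrupt on every corrected error normally,
 * and only after a near-maximum count while a storm is in progress. */
#define NORMAL_THRESHOLD    1ULL
#define STORM_THRESHOLD     32749ULL

/* Hypothetical per-bank state; ctl2 stands in for the bank's MSR. */
struct cmci_bank {
	uint64_t ctl2;
	bool in_storm;
};

/* Replace only the threshold field, leaving all other bits untouched. */
static uint64_t ctl2_with_threshold(uint64_t ctl2, uint64_t threshold)
{
	return (ctl2 & ~CTL2_THRESHOLD_MASK) | (threshold & CTL2_THRESHOLD_MASK);
}

/* Storm begins: raise the threshold, but keep CMCI enabled so that
 * uncorrected errors signaled through CMCI are still delivered. */
static void storm_begin(struct cmci_bank *b)
{
	b->ctl2 = ctl2_with_threshold(b->ctl2, STORM_THRESHOLD);
	b->in_storm = true;
}

/* Storm ends: restore the normal threshold for this bank only. */
static void storm_end(struct cmci_bank *b)
{
	b->ctl2 = ctl2_with_threshold(b->ctl2, NORMAL_THRESHOLD);
	b->in_storm = false;
}
```

The point of the sketch is that the enable bit is never cleared: the interrupt stays armed, and only the corrected-error count needed to trigger it changes.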

Patch 1 deletes the parts of the old storm code that are no
longer needed.

Patch 2 adds the new per-bank mitigation.

Smita: Unless Boris finds some more stuff for me to fix, this
version will be a better starting point to merge with your changes.

Changes since v1 (based on feedback from Boris)

- Spelling fixes in commit message
- Many more comments explaining what is going on
- Change name of function that does tracking
- Change names for #defines for storm BEGIN/END
- #define for high threshold in decimal, not hex

Tony Luck (2):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation

 arch/x86/kernel/cpu/mce/core.c     |  46 +++---
 arch/x86/kernel/cpu/mce/intel.c    | 241 ++++++++++++++---------------
 arch/x86/kernel/cpu/mce/internal.h |  10 +-
 3 files changed, 141 insertions(+), 156 deletions(-)


base-commit: ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2
-- 
2.35.1
Re: [PATCH v2 0/2] New CMCI storm mitigation for Intel CPUs
Posted by Borislav Petkov 4 years, 3 months ago
On Tue, Mar 15, 2022 at 11:15:07AM -0700, Tony Luck wrote:
> Smita: Unless Boris finds some more stuff for me to fix, this
> version will be a better starting point to merge with your changes.

Right, I'm wondering if AMD can use the same scheme so that abstracting
out the hw-specific accesses (MSR writes, etc) would be enough...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v2 0/2] New CMCI storm mitigation for Intel CPUs
Posted by Koralahalli Channabasappa, Smita 4 years, 3 months ago
On 3/15/22 1:34 PM, Borislav Petkov wrote:

> On Tue, Mar 15, 2022 at 11:15:07AM -0700, Tony Luck wrote:
>> Smita: Unless Boris finds some more stuff for me to fix, this
>> version will be a better starting point to merge with your changes.
> Right, I'm wondering if AMD can use the same scheme so that abstracting
> out the hw-specific accesses (MSR writes, etc) would be enough...

Thanks Tony.

Agreed. Most of this would apply to AMD's threshold interrupts too.

Will come up with a merged patch and move the storm handling to
mce/core.c and just keep the hw-specific accesses separate for
Intel and AMD in their respective files.
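
A rough user-space sketch of that split, under stated assumptions: none of these names exist in the kernel, the vendor hooks are stubs that only count calls, and how a storm is detected is left to the caller because the thread does not describe it.

```c
#include <stdbool.h>
#include <stddef.h>

#define STORM_NUM_BANKS 4	/* hypothetical bank count for the sketch */

/* Hypothetical hook table: the only vendor-specific piece. */
struct storm_hw_ops {
	const char *name;
	void (*set_storm_mode)(unsigned int bank, bool on);
};

/* Shared state and logic, standing in for what would live in core code. */
static bool bank_in_storm[STORM_NUM_BANKS];
static const struct storm_hw_ops *hw_ops;

/* Returns true if the bank changed state and the vendor hook was called. */
static bool storm_update(unsigned int bank, bool storm_detected)
{
	if (bank >= STORM_NUM_BANKS || hw_ops == NULL)
		return false;
	if (bank_in_storm[bank] == storm_detected)
		return false;	/* no transition, so no hardware access */

	bank_in_storm[bank] = storm_detected;
	hw_ops->set_storm_mode(bank, storm_detected);
	return true;
}

/* Vendor stubs: real code would write the vendor's threshold register. */
static unsigned int intel_writes, amd_writes;

static void intel_set_storm_mode(unsigned int bank, bool on)
{
	(void)bank;
	(void)on;
	intel_writes++;
}

static void amd_set_storm_mode(unsigned int bank, bool on)
{
	(void)bank;
	(void)on;
	amd_writes++;
}

static const struct storm_hw_ops intel_ops = {
	.name = "intel",
	.set_storm_mode = intel_set_storm_mode,
};

static const struct storm_hw_ops amd_ops = {
	.name = "amd",
	.set_storm_mode = amd_set_storm_mode,
};
```

With this shape, the per-bank bookkeeping is written once, and each vendor file only supplies the register access behind set_storm_mode().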

Thanks
Smita.