From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>

Extend the logic of handling CMCI storms to AMD threshold interrupts.

Rely on the similar approach as of Intel's CMCI to mitigate storms per CPU and
per bank. But, unlike CMCI, do not set thresholds and reduce interrupt rate on
a storm. Rather, disable the interrupt on the corresponding CPU and bank.
Re-enable back the interrupts if enough consecutive polls of the bank show no
corrected errors (30, as programmed by Intel).

Turning off the threshold interrupts would be a better solution on AMD systems
as other error severities will still be handled even if the threshold
interrupts are disabled.

Also, AMD systems currently allow banks to be managed by both polling and
interrupts. So don't modify the polling banks set after a storm ends.

[Tony: Small tweak because mce_handle_storm() isn't a pointer now]
[Yazen: Rebase and simplify]

Stable backport notes:
1. Currently, when a Machine check interrupt storm is detected, the bank's
corresponding bit in mce_poll_banks per-CPU variable is cleared by
cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or
encountered after the storm subsides are not logged since polling on that
bank has been disabled. Polling banks set on AMD systems should not be
modified when a storm subsides.

2. This patch is a snippet from the CMCI storm handling patch (link below)
that has been accepted into tip for v6.19. While backporting the patch
would have been the preferred way, the same cannot be undertaken since
it's part of a larger set. As such, this fix will be temporary. When the
original patch and its set is integrated into stable, this patch should be
reverted.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
This is somewhat of a new scenario for me. Not really sure about the
procedure. Hence, haven't modified the commit message and removed the
tags. If required, will rework both.
Also, while this issue can be encountered on AMD systems using v6.8 and
later stable kernels, we would specifically prefer for this fix to be
backported to v6.12 since it's LTS.
---
arch/x86/kernel/cpu/mce/threshold.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index f4a007616468..61eaa1774931 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -85,7 +85,8 @@ void cmci_storm_end(unsigned int bank)
 {
 	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
 
-	__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+	if (!mce_flags.amd_threshold)
+		__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
 	storm->banks[bank].history = 0;
 	storm->banks[bank].in_storm_mode = false;
 
base-commit: 8b690556d8fe074b4f9835075050fba3fb180e93
--
2.43.0
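
For readers following the thread, the effect of the hunk can be modeled in user space. This is a minimal sketch and not kernel code: struct cpu_model, model_storm_end(), model_track_poll() and bank_polled_after_storm() are hypothetical names, the plain 64-bit mask only stands in for the per-CPU mce_poll_banks bitmap, and the shift-register history used to count the 30 clean polls is an assumption about how such tracking could be done. Only the conditional clear mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BANKS       64
#define STORM_END_POLLS 30	/* the "30" quoted in the commit message */

struct bank_state {
	uint64_t history;	/* bit 0 = latest poll; 1 = corrected error seen */
	bool in_storm_mode;
};

struct cpu_model {
	uint64_t poll_banks;	/* stand-in for the per-CPU mce_poll_banks */
	bool amd_threshold;	/* stand-in for mce_flags.amd_threshold */
	struct bank_state banks[MAX_BANKS];
};

/* Models the patched cmci_storm_end(): the poll bit survives on AMD. */
static void model_storm_end(struct cpu_model *cpu, unsigned int bank)
{
	assert(bank < MAX_BANKS);

	if (!cpu->amd_threshold)
		cpu->poll_banks &= ~(1ULL << bank);
	cpu->banks[bank].history = 0;
	cpu->banks[bank].in_storm_mode = false;
}

/* Records one poll result and ends the storm after 30 clean polls in a row. */
static void model_track_poll(struct cpu_model *cpu, unsigned int bank,
			     bool error_seen)
{
	const uint64_t recent = (1ULL << STORM_END_POLLS) - 1;
	struct bank_state *b;

	assert(bank < MAX_BANKS);
	b = &cpu->banks[bank];

	b->history = (b->history << 1) | (error_seen ? 1ULL : 0ULL);
	if (b->in_storm_mode && !(b->history & recent))
		model_storm_end(cpu, bank);
}

/*
 * Puts a bank into storm mode, feeds it 30 clean polls, and reports
 * whether the bank is still in the polling set afterwards.
 */
static bool bank_polled_after_storm(bool amd_threshold, unsigned int bank)
{
	struct cpu_model cpu = { .amd_threshold = amd_threshold };

	assert(bank < MAX_BANKS);

	cpu.poll_banks |= 1ULL << bank;		/* bank is polled during the storm */
	cpu.banks[bank].in_storm_mode = true;
	cpu.banks[bank].history = 1;		/* last poll of the storm saw an error */

	for (int i = 0; i < STORM_END_POLLS; i++)
		model_track_poll(&cpu, bank, false);

	return (cpu.poll_banks & (1ULL << bank)) != 0;
}
```

In this model, bank_polled_after_storm(false, 3) reports that the bank has left the polling set, which matches the behavior before the patch on every system, while bank_polled_after_storm(true, 3) reports that the bank is still polled, which is the behavior the hunk is meant to give AMD systems.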
On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>
> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>
> Rely on the similar approach as of Intel's CMCI to mitigate storms per CPU and
> per bank. But, unlike CMCI, do not set thresholds and reduce interrupt rate on
> a storm. Rather, disable the interrupt on the corresponding CPU and bank.
> Re-enable back the interrupts if enough consecutive polls of the bank show no
> corrected errors (30, as programmed by Intel).
>
> Turning off the threshold interrupts would be a better solution on AMD systems
> as other error severities will still be handled even if the threshold
> interrupts are disabled.
>
> Also, AMD systems currently allow banks to be managed by both polling and
> interrupts. So don't modify the polling banks set after a storm ends.
>
> [Tony: Small tweak because mce_handle_storm() isn't a pointer now]
> [Yazen: Rebase and simplify]
>
> Stable backport notes:
> 1. Currently, when a Machine check interrupt storm is detected, the bank's
> corresponding bit in mce_poll_banks per-CPU variable is cleared by
> cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or
> encountered after the storm subsides are not logged since polling on that
> bank has been disabled. Polling banks set on AMD systems should not be
> modified when a storm subsides.
>
> 2. This patch is a snippet from the CMCI storm handling patch (link below)
> that has been accepted into tip for v6.19. While backporting the patch
> would have been the preferred way, the same cannot be undertaken since
> it's part of a larger set. As such, this fix will be temporary. When the
> original patch and its set is integrated into stable, this patch should be
> reverted.
>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
> This is somewhat of a new scenario for me. Not really sure about the
> procedure. Hence, haven't modified the commit message and removed the
> tags. If required, will rework both.
> Also, while this issue can be encountered on AMD systems using v6.8 and
> later stable kernels, we would specifically prefer for this fix to be
> backported to v6.12 since it's LTS.

What is the git commit id of this change in Linus's tree?
On 11/21/2025 00:53, Greg KH wrote:
> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>>
>> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>>
>> Rely on the similar approach as of Intel's CMCI to mitigate storms per CPU and
>> per bank. But, unlike CMCI, do not set thresholds and reduce interrupt rate on
>> a storm. Rather, disable the interrupt on the corresponding CPU and bank.
>> Re-enable back the interrupts if enough consecutive polls of the bank show no
>> corrected errors (30, as programmed by Intel).
>>
>> Turning off the threshold interrupts would be a better solution on AMD systems
>> as other error severities will still be handled even if the threshold
>> interrupts are disabled.
>>
>> Also, AMD systems currently allow banks to be managed by both polling and
>> interrupts. So don't modify the polling banks set after a storm ends.
>>
>> [Tony: Small tweak because mce_handle_storm() isn't a pointer now]
>> [Yazen: Rebase and simplify]
>>
>> Stable backport notes:
>> 1. Currently, when a Machine check interrupt storm is detected, the bank's
>> corresponding bit in mce_poll_banks per-CPU variable is cleared by
>> cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or
>> encountered after the storm subsides are not logged since polling on that
>> bank has been disabled. Polling banks set on AMD systems should not be
>> modified when a storm subsides.
>>
>> 2. This patch is a snippet from the CMCI storm handling patch (link below)
>> that has been accepted into tip for v6.19. While backporting the patch
>> would have been the preferred way, the same cannot be undertaken since
>> it's part of a larger set. As such, this fix will be temporary. When the
>> original patch and its set is integrated into stable, this patch should be
>> reverted.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
>> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> ---
>> This is somewhat of a new scenario for me. Not really sure about the
>> procedure. Hence, haven't modified the commit message and removed the
>> tags. If required, will rework both.
>> Also, while this issue can be encountered on AMD systems using v6.8 and
>> later stable kernels, we would specifically prefer for this fix to be
>> backported to v6.12 since it's LTS.
>
> What is the git commit id of this change in Linus's tree?

I think it has not yet been merged into mainline's master branch.
This commit was recently accepted into the tip (5th November).
Following is its commit ID: a5834a5458aa004866e7da402c6bc2dfe2f3737e
Link: https://lore.kernel.org/all/176243356968.2601451.11559805061162819633.tip-bot2@tip-bot2/

Do I need to send another version with this commit ID mentioned in the
commit message?

--
Thanks,
Avadhut Naik
On Fri, Nov 21, 2025 at 01:04:47AM -0600, Naik, Avadhut wrote:
>
>
> On 11/21/2025 00:53, Greg KH wrote:
> > On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
> >> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> >>
> >> Extend the logic of handling CMCI storms to AMD threshold interrupts.
> >>
> >> Rely on the similar approach as of Intel's CMCI to mitigate storms per CPU and
> >> per bank. But, unlike CMCI, do not set thresholds and reduce interrupt rate on
> >> a storm. Rather, disable the interrupt on the corresponding CPU and bank.
> >> Re-enable back the interrupts if enough consecutive polls of the bank show no
> >> corrected errors (30, as programmed by Intel).
> >>
> >> Turning off the threshold interrupts would be a better solution on AMD systems
> >> as other error severities will still be handled even if the threshold
> >> interrupts are disabled.
> >>
> >> Also, AMD systems currently allow banks to be managed by both polling and
> >> interrupts. So don't modify the polling banks set after a storm ends.
> >>
> >> [Tony: Small tweak because mce_handle_storm() isn't a pointer now]
> >> [Yazen: Rebase and simplify]
> >>
> >> Stable backport notes:
> >> 1. Currently, when a Machine check interrupt storm is detected, the bank's
> >> corresponding bit in mce_poll_banks per-CPU variable is cleared by
> >> cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or
> >> encountered after the storm subsides are not logged since polling on that
> >> bank has been disabled. Polling banks set on AMD systems should not be
> >> modified when a storm subsides.
> >>
> >> 2. This patch is a snippet from the CMCI storm handling patch (link below)
> >> that has been accepted into tip for v6.19. While backporting the patch
> >> would have been the preferred way, the same cannot be undertaken since
> >> it's part of a larger set. As such, this fix will be temporary. When the
> >> original patch and its set is integrated into stable, this patch should be
> >> reverted.
> >>
> >> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> >> Signed-off-by: Tony Luck <tony.luck@intel.com>
> >> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> >> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
> >> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> >> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
> >> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> >> ---
> >> This is somewhat of a new scenario for me. Not really sure about the
> >> procedure. Hence, haven't modified the commit message and removed the
> >> tags. If required, will rework both.
> >> Also, while this issue can be encountered on AMD systems using v6.8 and
> >> later stable kernels, we would specifically prefer for this fix to be
> >> backported to v6.12 since it's LTS.
> >
> > What is the git commit id of this change in Linus's tree?
>
> I think it has not yet been merged into mainline's master branch.
> This commit was recently accepted into the tip (5th November).

Then there's nothing we can do about this in the stable tree, please
read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for all about this.

thanks,

greg k-h
On Fri, Nov 21, 2025 at 08:09:21AM +0100, Greg KH wrote:
> > I think it has not yet been merged into mainline's master branch.
> > This commit was recently accepted into the tip (5th November).
>
> Then there's nothing we can do about this in the stable tree, please

Yeah, it took me a while to understand what the issue is when Avadhut was
explaining it to me offlist:

So the hunk at the beginning of this thread is needed as a fix for stable
because when they inject a lot of errors back-to-back, after the error storm
detection recovers, they cannot log any errors anymore - see the explanation
in the first patch.

So what we'll do here:

@Avadhut, you take that hunk, pls, and create a separate patch with commit
message explaining everything, blablabla, cc:stable, the whole shebang.

That patch goes upstream and to stable.

The rest of the original

a5834a5458aa ("x86/mce: Handle AMD threshold interrupt storms")

you then redo on top of this one and send it too.

I'll zap a5834a5458aa from the lineup for now so that you can split it.

Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
You need to put here
"Commit <sha1> upstream."
> Extend the logic of handling CMCI storms to AMD threshold interrupts.
...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 11/20/2025 15:53, Borislav Petkov wrote:
> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>
> You need to put here
>
> "Commit <sha1> upstream."
>
Will add that.

Also, does this need to have a Fixes tag?
Didn't add one here as the original patch committed to tip didn't have one.

>> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>
> ...
>

--
Thanks,
Avadhut Naik
On Thu, Nov 20, 2025 at 07:59:57PM -0600, Naik, Avadhut wrote:
>
>
> On 11/20/2025 15:53, Borislav Petkov wrote:
> > On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
> >> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> >
> > You need to put here
> >
> > "Commit <sha1> upstream."
> >
> Will add that.
>
> Also, does this need to have a Fixes tag?
> Didn't add one here as the original patch committed to tip didn't have one.

Then there's no need.