[RFC] AMD VM crashing on deferred memory error injection

William Roche posted 1 patch 11 hours ago
[RFC] AMD VM crashing on deferred memory error injection
Posted by William Roche 11 hours ago
Hello,

I'd like to bring to your attention a consequence of the integration of
this set of commits early into the 6.19 kernel:

   2025-11-04 14:55 [PATCH v8 0/8] AMD MCA interrupts rework
  
https://lore.kernel.org/all/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com/

Yazen Ghannam (7):
       x86/mce: Unify AMD THR handler with MCA Polling
       x86/mce: Unify AMD DFR handler with MCA Polling
       x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
       x86/mce/amd: Support SMCA Corrected Error Interrupt
       x86/mce/amd: Remove redundant reset_block()
       x86/mce/amd: Define threshold restart function for banks
       x86/mce: Save and use APEI corrected threshold limit


An AMD Qemu VM running this kernel is no longer able to deal with the
injection of a deferred memory error, and crashes with:

[  333.420854] mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
[  333.428105] Call Trace:
[  333.429566]  <IRQ>
[  333.430745]  amd_clear_bank+0x6e/0x70
[  333.432828]  machine_check_poll+0x228/0x2e0
[  333.435068]  ? __pfx_mce_timer_fn+0x10/0x10
[  333.437241]  mce_timer_fn+0xb1/0x130
[  333.438966]  ? __pfx_mce_timer_fn+0x10/0x10
[  333.441380]  call_timer_fn+0x26/0x120
[  333.443518]  __run_timers+0x202/0x290
[  333.445763]  run_timer_softirq+0x49/0x100
[  333.447908]  handle_softirqs+0xeb/0x2c0
[  333.449863]  __irq_exit_rcu+0xda/0x100
[  333.452065]  sysvec_apic_timer_interrupt+0x71/0x90
[  333.454846]  </IRQ>
[  333.456192]  <TASK>
[  333.457520]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  333.460355] RIP: 0010:pv_native_safe_halt+0xf/0x20
[  333.463203] Code: 20 d0 e9 5f 99 e6 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 ee 18 00 fb f4 <e9> 37 990
[  333.472816] RSP: 0018:ffffffff83403e78 EFLAGS: 00000246
[  333.475848] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  333.479481] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  333.483492] RBP: ffffffff83412980 R08: 0000000000000000 R09: 0000000000000000
[  333.487503] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  333.491482] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000947d0
[  333.495258]  default_idle+0x9/0x30
[  333.497283]  default_idle_call+0x28/0x100
[  333.499641]  cpuidle_idle_call+0x12e/0x180
[  333.502087]  do_idle+0x77/0xb0
[  333.503914]  cpu_startup_entry+0x29/0x30
[  333.506337]  rest_init+0xcc/0xd0
[  333.508296]  start_kernel+0x4df/0x4e0
[  333.510491]  x86_64_start_reservations+0x32/0x40
[  333.513101]  x86_64_start_kernel+0xce/0xd0
[  333.515433]  common_startup_64+0x13e/0x141
[  333.517920]  </TASK>
[  333.519468] Kernel panic - not syncing: MCA architectural violation!


The problem appeared with the addition of clearing MCA_DESTAT for all
deferred errors in the amd_clear_bank() function by this kernel commit:

     7cb735d7c0cb  x86/mce: Unify AMD DFR handler with MCA Polling

+       /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
+       if (m->status & MCI_STATUS_DEFERRED)
+               mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
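
As a side note, the MSR address in the trace can be mapped back to a bank
number. The sketch below assumes the MCA_DESTAT layout from
arch/x86/include/asm/msr-index.h (MC0 at 0xc0002008, one bank every 0x10
MSRs); SMCA_BANK_STRIDE and smca_destat_bank() are names made up for this
illustration, and the helper does not check the result against the number
of banks actually present.

```c
#include <stdint.h>

/* Believed to match arch/x86/include/asm/msr-index.h. */
#define MSR_AMD64_SMCA_MC0_DESTAT 0xc0002008u
/* Hypothetical name for the distance between two banks' MSR blocks. */
#define SMCA_BANK_STRIDE          0x10u

/*
 * Return the bank index whose MCA_DESTAT register lives at 'msr',
 * or -1 if 'msr' does not fall on an MCA_DESTAT slot.
 */
static int smca_destat_bank(uint32_t msr)
{
	uint32_t off;

	if (msr < MSR_AMD64_SMCA_MC0_DESTAT)
		return -1;

	off = msr - MSR_AMD64_SMCA_MC0_DESTAT;
	if (off % SMCA_BANK_STRIDE)
		return -1;

	return (int)(off / SMCA_BANK_STRIDE);
}
```

With the address from the trace, smca_destat_bank(0xc0002098) evaluates
to 9, so the rejected write targets MCA_DESTAT of bank 9.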


The QEMU implementation of MCE injection for deferred errors on AMD
relies on machine_check_poll() picking up these errors, as indicated in
this QEMU change:
     4b77512b2782  i386: Fix MCE support for AMD hosts
https://lore.kernel.org/qemu-devel/20240603193622.47156-2-john.allen@amd.com/


When a QEMU process receives the SIGBUS information from the host, it
generates a virtual MCE to be dealt with by machine_check_poll() in the
VM kernel. But clearing MCA_DESTAT does not seem to be allowed and
triggers an exception, which looks like a mismatch between the kernel
and the AMD SMCA contract (?)

So should we consider that the QEMU platform has to allow this write, or
is the kernel missing guards around clearing this MCA bank after
injected UEs on this platform?


FYI, to reproduce the problem:
. I used a QEMU Standard PC q35:

qemu-system-x86_64 --version
QEMU emulator version 10.2.50 (v10.2.0-1085-gcd5a79dc98)
Copyright (c) 2003-2026 Fabrice Bellard and the QEMU Project developers

qemu-system-x86_64 -smp 4 -m 20G -enable-kvm -cpu host -usb \
	-device usb-tablet -serial mon:stdio -M q35 \
	-nic user,model=e1000,hostfwd=tcp::60022-:22 -nographic \
	-drive file=disk.qcow2,cache=none

. Inject an error into this VM running a 6.19.0-rc1 or more recent kernel.
 From the host:
# modprobe hwpoison-inject
# echo <pfn> > /sys/kernel/debug/hwpoison/corrupt-pfn

Wait 5 minutes until the deferred error is handled by the VM kernel, and
the VM then crashes with the above stack trace...


. But removing the reset of MCA_DESTAT in the kernel amd_clear_bank()
function or adding this simple test makes the system work again as
before:


diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index d9f9ee7db5c8..86b3070fbb40 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -860,7 +860,7 @@ void amd_clear_bank(struct mce *m)
         amd_reset_thr_limit(m->bank);

         /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
-       if (m->status & MCI_STATUS_DEFERRED)
+       if (m->status & MCI_STATUS_DEFERRED && !(m->status & MCI_STATUS_POISON))
                 mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);

         /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
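
To make the effect of the changed test in the diff above easier to see,
here is a small standalone model of the two conditions. The bit positions
are those I believe arch/x86/include/asm/mce.h uses; the two helper names
are invented for this illustration. It assumes the injected status has
both DEFERRED and POISON set, which is what the test implies but which I
have not re-checked against the QEMU source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Believed to match arch/x86/include/asm/mce.h. */
#define MCI_STATUS_POISON   (1ULL << 43)
#define MCI_STATUS_DEFERRED (1ULL << 44)

/* Condition added by 7cb735d7c0cb: write MCA_DESTAT for any deferred error. */
static bool destat_write_current(uint64_t status)
{
	return (status & MCI_STATUS_DEFERRED) != 0;
}

/* Condition from the diff above: skip the write when POISON is also set. */
static bool destat_write_with_poison_test(uint64_t status)
{
	return (status & MCI_STATUS_DEFERRED) && !(status & MCI_STATUS_POISON);
}
```

Note that the second condition is not specific to VMs: it would also skip
the MCA_DESTAT write for a deferred error with POISON set on bare metal.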



According to me, this small kernel fix relies too much on a Qemu AMD
specific implementation detail.

Would you have a more appropriate fix to suggest please ?

Thanks in advance for your feedback.
William.
Re: [RFC] AMD VM crashing on deferred memory error injection
Posted by Yazen Ghannam 7 hours ago
On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:

[...]

> According to me, this small kernel fix relies too much on a Qemu AMD
> specific implementation detail.
> 
> Would you have a more appropriate fix to suggest please ?
> 
> Thanks in advance for your feedback.
> William.

Thanks William for the report and details.

Clearing "STATUS" registers is a normal part of MCA handling.

We seem to allow clearing the regular "MCi_STATUS" register. I assume
this gets trapped/ignored by the hypervisor.

I expect we need to do the same behavior for the "MCA_DESTAT" register.

I'll do some research here, but please do share any pointers you may
have.

Thanks,
Yazen
Re: [RFC] AMD VM crashing on deferred memory error injection
Posted by Yazen Ghannam 7 hours ago
On Mon, Feb 09, 2026 at 04:08:19PM -0500, Yazen Ghannam wrote:
> On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:

[...]

> > According to me, this small kernel fix relies too much on a Qemu AMD
> > specific implementation detail.
> > 
> > Would you have a more appropriate fix to suggest please ?
> > 
> > Thanks in advance for your feedback.
> > William.
> 
> Thanks William for the report and details.
> 
> Clearing "STATUS" registers is a normal part of MCA handling.
> 
> We seem to allow clearing the regular "MCi_STATUS" register. I assume
> this gets trapped/ignored by the hypervisor.
> 
> I expect we need to do the same behavior for the "MCA_DESTAT" register.
> 
> I'll do some research here, but please do share any pointers you may
> have.

Sorry for the rapid reply, but I think this is where we need an update.

Linux:
arch/x86/kvm/x86.c : set_msr_mce()

Please note the comment:
"All CPUs allow writing 0 to MCi_STATUS MSRs to clear the MSR."

We should include the MCA_DESTAT register range here.
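
Very roughly, and only as a standalone model rather than actual KVM code,
the policy could look like the sketch below. It mirrors the "writing 0 is
always allowed" rule from the comment quoted above; rejecting every
non-zero write is an assumption of this sketch, and is_smca_destat(),
model_destat_write() and SMCA_BANK_STRIDE are hypothetical names. The
real set_msr_mce() may need to treat host-initiated writes differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Believed to match arch/x86/include/asm/msr-index.h. */
#define MSR_AMD64_SMCA_MC0_DESTAT 0xc0002008u
/* Hypothetical name for the distance between two banks' MSR blocks. */
#define SMCA_BANK_STRIDE          0x10u

/* True if 'msr' is the MCA_DESTAT register of a bank below 'nr_banks'. */
static bool is_smca_destat(uint32_t msr, unsigned int nr_banks)
{
	uint32_t off;

	if (msr < MSR_AMD64_SMCA_MC0_DESTAT)
		return false;

	off = msr - MSR_AMD64_SMCA_MC0_DESTAT;
	return off % SMCA_BANK_STRIDE == 0 && off / SMCA_BANK_STRIDE < nr_banks;
}

/*
 * Model of a guest WRMSR filter for MCA_DESTAT.
 * Returns 0 if the write would be accepted, 1 if it would be rejected
 * (which the guest would see as a faulting WRMSR).
 */
static int model_destat_write(uint32_t msr, uint64_t data, unsigned int nr_banks)
{
	if (!is_smca_destat(msr, nr_banks))
		return 1;

	/* Assumption: only clearing (writing 0) is accepted from the guest. */
	return data == 0 ? 0 : 1;
}
```

With 10 banks, model_destat_write(0xc0002098, 0, 10) returns 0, which is
the write that currently faults in the trace.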

What do you think?

Thanks,
Yazen
Re: [RFC] AMD VM crashing on deferred memory error injection
Posted by Borislav Petkov 10 hours ago
On Mon, Feb 09, 2026 at 05:36:32PM +0100, William Roche wrote:
> An AMD Qemu VM running this kernel is no longer able to deal with the
> injection of a deferred memory error, and crashes with:
> 
> [  333.420854] mce: MSR access error: WRMSR to 0xc0002098 (tried to write
> 0x0000000000000000) at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
> [  333.428105] Call Trace:

Works as advertised - KVM is not allowing the MSR write.

This enablement is not meant for VM use. Why do we care about injecting hw
errors in a guest?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
RE: [RFC] AMD VM crashing on deferred memory error injection
Posted by Luck, Tony 10 hours ago
> This enablement is not meant for VM use. Why do we care about injecting hw
> errors in a guest?

The guest may be able to just kill a process and keep running.

-Tony
Re: [RFC] AMD VM crashing on deferred memory error injection
Posted by Borislav Petkov 10 hours ago
On Mon, Feb 09, 2026 at 05:38:58PM +0000, Luck, Tony wrote:
> > This enablement is not meant for VM use. Why do we care about injecting hw
> > errors in a guest?
> 
> The guest may be able to just kill a process and keep running.

I have heard about injecting errors into qemu/kvm perhaps a decade ago and
nothing ever since. Either it has been working perfectly since then or no one
cares until now.

So the guest "may" be able to do a lot of things - question is, do we support
it and how do we test for it in the future so that it doesn't break.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette