[PATCH v2 0/1] AMD VM crashing on deferred memory error injection

William Roche posted 1 patch 1 month, 1 week ago
There is a newer version of this series
Posted by William Roche 1 month, 1 week ago
From: William Roche <william.roche@oracle.com>

Thank you very much Yazen for your review and all the suggestions!

v2 changes:
- Commit title changed to:
  x86/mce/amd: Fix VM crash during deferred error handling
- Commit message now capitalizes QEMU and KVM and uses the imperative
  statement suggested by Yazen
- "CC stable" tag placed after "Signed-off-by"
  (The documentation asks for "the sign-off area" without more details)
- Blank line added to separate the SMCA code block from the update of
  MCA_STATUS.

 --

After the integration of the following commit:
	7cb735d7c0cb x86/mce: Unify AMD DFR handler with MCA Polling

AMD QEMU VMs started to crash when handling deferred memory error
injection, with a stack trace like:

mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000)
at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)

  amd_clear_bank+0x6e/0x70
  machine_check_poll+0x228/0x2e0
  ? __pfx_mce_timer_fn+0x10/0x10
  mce_timer_fn+0xb1/0x130
  ? __pfx_mce_timer_fn+0x10/0x10
  call_timer_fn+0x26/0x120
  __run_timers+0x202/0x290
  run_timer_softirq+0x49/0x100
  handle_softirqs+0xeb/0x2c0
  __irq_exit_rcu+0xda/0x100
  sysvec_apic_timer_interrupt+0x71/0x90
[...]
 Kernel panic - not syncing: MCA architectural violation!

See the discussion at:
https://lore.kernel.org/all/48d8e1c8-1eb9-49cc-8de8-78077f29c203@oracle.com/

We identified a problem with accesses to SMCA-specific registers on
non-SMCA platforms such as a QEMU/KVM machine.
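
As an illustration of the failure mode only (this is not the patch
itself), here is a minimal userspace sketch that decodes the faulting MSR
address from the trace and shows the kind of guard that avoids touching
SMCA registers without SMCA support. The constants are assumed to match
the SMCA MSR layout in arch/x86/include/asm/msr-index.h; the helper names
and the 64-bank limit are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed SMCA MSR layout: one 0x10-wide block of registers per bank. */
#define SMCA_MSR_BASE      0xc0002000u /* assumed: MCA_CTL of bank 0 */
#define SMCA_BANK_STRIDE   0x10u       /* assumed: per-bank stride */
#define SMCA_DESTAT_OFFSET 0x8u        /* assumed: MCA_DESTAT offset in a bank */
#define SMCA_MAX_BANKS     64u         /* hypothetical upper bound */

/* Hypothetical helper: does this MSR fall in the SMCA register range? */
static bool is_smca_msr(uint32_t msr)
{
	return msr >= SMCA_MSR_BASE &&
	       msr < SMCA_MSR_BASE + SMCA_MAX_BANKS * SMCA_BANK_STRIDE;
}

/* Hypothetical helper: bank index encoded in an SMCA MSR address. */
static unsigned int smca_msr_bank(uint32_t msr)
{
	assert(is_smca_msr(msr));
	return (msr - SMCA_MSR_BASE) / SMCA_BANK_STRIDE;
}

/* Hypothetical helper: register offset within the bank's block. */
static unsigned int smca_msr_offset(uint32_t msr)
{
	assert(is_smca_msr(msr));
	return (msr - SMCA_MSR_BASE) % SMCA_BANK_STRIDE;
}

/*
 * Hypothetical guard: SMCA registers may only be written when the CPU
 * (or the virtual CPU model) advertises SMCA; other MSRs are unaffected.
 */
static bool may_write_msr(bool cpu_has_smca, uint32_t msr)
{
	return cpu_has_smca || !is_smca_msr(msr);
}
```

Under these assumptions, the address 0xc0002098 from the trace decodes to
bank 9, offset 0x8, which would be that bank's MCA_DESTAT register, and
may_write_msr(false, 0xc0002098) rejects the write that faults in the
guest.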

This patch is checkpatch.pl clean.
The memory error injection unit test works fine with it.


William Roche (1):
  x86/mce/amd: Fix VM crash during deferred error handling

 arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

-- 
2.47.3
Re: [PATCH v2 0/1] AMD VM crashing on deferred memory error injection
Posted by William Roche 3 weeks ago
On 2/18/26 17:30, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> Thank you very much Yazen for your review and all the suggestions!
> 
> v2 changes:
> - Commit title changed to:
>    x86/mce/amd: Fix VM crash during deferred error handling
> - Commit message with capitalized QEMU and KVM as well as the imperative
>    statement suggested by Yazen
> - "CC stable" tag placed after "Signed-off-by"
>    (The documentation asks for "the sign-off area" without more details)
> - Blank line added to separate the SMCA code block from the update of
>    MCA_STATUS.
> 
>   --
> 
> After the integration of the following commit:
> 	7cb735d7c0cb x86/mce: Unify AMD DFR handler with MCA Polling
> 
> AMD QEMU VMs started to crash when handling deferred memory error
> injection, with a stack trace like:
> 
> mce: MSR access error: WRMSR to 0xc0002098 (tried to write 0x0000000000000000)
> at rIP: 0xffffffff8229894d (mce_wrmsrq+0x1d/0x60)
> 
>    amd_clear_bank+0x6e/0x70
>    machine_check_poll+0x228/0x2e0
>    ? __pfx_mce_timer_fn+0x10/0x10
>    mce_timer_fn+0xb1/0x130
>    ? __pfx_mce_timer_fn+0x10/0x10
>    call_timer_fn+0x26/0x120
>    __run_timers+0x202/0x290
>    run_timer_softirq+0x49/0x100
>    handle_softirqs+0xeb/0x2c0
>    __irq_exit_rcu+0xda/0x100
>    sysvec_apic_timer_interrupt+0x71/0x90
> [...]
>   Kernel panic - not syncing: MCA architectural violation!
> 
> See the discussion at:
> https://lore.kernel.org/all/48d8e1c8-1eb9-49cc-8de8-78077f29c203@oracle.com/
> 
> We identified a problem with SMCA specific registers access from
> non-SMCA platforms like a QEMU/KVM machine.
> 
> This patch is checkpatch.pl clean.
> Unit test of memory error injection works fine with it.
> 
> 
> William Roche (1):
>    x86/mce/amd: Fix VM crash during deferred error handling
> 
>   arch/x86/kernel/cpu/mce/amd.c | 17 +++++++++++------
>   1 file changed, 11 insertions(+), 6 deletions(-)
> 

Hello,

This fix has been reviewed by Yazen Ghannam. The code has been tested
with QEMU/KVM virtual machines on AMD platforms. The commit fixed here
(7cb735d7c0cb) is present in the stable branch linux-6.19.y.

Could you please let me know if anything is missing to integrate this fix?

Thanks in advance for your feedback,
William.