[PATCH v2 0/3] x86/crash: Fix double NMI shootdown bug

Posted by Sean Christopherson 3 years, 11 months ago

Fix a double NMI shootdown bug found and debugged by Guilherme, who did all
the hard work.  NMI shootdown is a one-time thing; the handler leaves NMIs
blocked and enters halt.  At best, a second (or third...) shootdown is an
expensive nop; at worst, it can hang the kernel and prevent kexec'ing into
a new kernel.  For example, prior to the hardening of
register_nmi_handler(), a double shootdown resulted in a double list_add(),
which is fatal when running with CONFIG_BUG_ON_DATA_CORRUPTION=y.
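
To illustrate why a double list_add() of the same entry is a problem, here
is a minimal userspace sketch of a circular doubly linked list.  The
sim_* names are hypothetical and this is not the kernel's list.h; it only
mimics the usual pointer updates of an insert-after-head, plus a checked
variant that refuses the duplicate insert.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a circular doubly linked list. */
struct sim_list_head {
    struct sim_list_head *next, *prev;
};

void sim_list_init(struct sim_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Unchecked insert of new_node right after head. */
void sim_list_add(struct sim_list_head *new_node, struct sim_list_head *head)
{
    struct sim_list_head *next = head->next;

    next->prev = new_node;
    new_node->next = next;   /* if next == new_node, this creates a self-loop */
    new_node->prev = head;
    head->next = new_node;
}

/* Checked insert: refuse to add an entry that is already first in the list. */
bool sim_list_add_checked(struct sim_list_head *new_node,
                          struct sim_list_head *head)
{
    if (new_node == head || head->next == new_node)
        return false;
    sim_list_add(new_node, head);
    return true;
}

/* Bounded walk: true if following ->next returns to head within limit steps. */
bool sim_list_is_sane(const struct sim_list_head *head, int limit)
{
    const struct sim_list_head *pos = head->next;

    for (int i = 0; i < limit; i++) {
        if (pos == head)
            return true;
        pos = pos->next;
    }
    return false;
}
```

Adding the same node twice with sim_list_add() leaves the node pointing at
itself, so a walk from the head never terminates; the checked variant
rejects the second add and the list stays intact.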

With the "right" kexec/kdump configuration, emergency_vmx_disable_all() can
be reached after kdump_nmi_shootdown_cpus(); these are currently the only
two users of nmi_shootdown_cpus().

To fix, move the disabling of virtualization into crash_nmi_callback(),
remove emergency_vmx_disable_all()'s callback, and do a shootdown for
emergency_vmx_disable_all() if and only if a shootdown hasn't yet occurred.
The only thing emergency_vmx_disable_all() cares about is disabling VMX/SVM
(obviously).  Since I can't envision a use case for an NMI shootdown that
doesn't want to disable virtualization, doing that in the core handler means
emergency_vmx_disable_all() only needs to ensure _a_ shootdown occurs; it
doesn't care when that shootdown happened or what callback may have run.
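
The approach described above can be modeled roughly as follows.  This is a
userspace sketch under stated assumptions, not the code in the patches:
every sim_* name is hypothetical, "sending NMIs" is modeled as a plain
loop over the other CPUs, and the per-CPU virtualization state is just an
array of flags.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIM_NR_CPUS 4   /* hypothetical CPU count; CPU 0 is the sender */

typedef void (*sim_shootdown_cb)(int cpu);

bool sim_crash_ipi_issued;           /* one-time latch: a shootdown happened */
sim_shootdown_cb sim_callback;       /* optional caller callback, may be NULL */
int sim_shootdowns_sent;             /* number of times NMIs were "sent" */
int sim_kdump_cb_runs;               /* number of kdump callback invocations */
bool sim_virt_enabled[SIM_NR_CPUS];  /* models VMX/SVM being on per CPU */

void sim_reset(void)
{
    sim_crash_ipi_issued = false;
    sim_callback = NULL;
    sim_shootdowns_sent = 0;
    sim_kdump_cb_runs = 0;
    for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
        sim_virt_enabled[cpu] = true;
}

/* Core handler: always disable virt, then run the callback if there is one. */
void sim_crash_nmi_callback(int cpu)
{
    sim_virt_enabled[cpu] = false;
    if (sim_callback)
        sim_callback(cpu);
    /* A real handler would leave NMIs blocked and halt here. */
}

/* Shootdown: set the callback and the latch before "sending" the NMIs. */
void sim_nmi_shootdown_cpus(sim_shootdown_cb cb)
{
    sim_callback = cb;
    sim_crash_ipi_issued = true;
    sim_shootdowns_sent++;
    for (int cpu = 1; cpu < SIM_NR_CPUS; cpu++)
        sim_crash_nmi_callback(cpu);
}

/* Stand-in for whatever kdump wants to do on each CPU. */
void sim_kdump_cb(int cpu)
{
    (void)cpu;
    sim_kdump_cb_runs++;
}

/* Emergency path: needs _a_ shootdown, so only send one if none occurred. */
void sim_emergency_disable_virt_all(void)
{
    sim_virt_enabled[0] = false;
    if (!sim_crash_ipi_issued)
        sim_nmi_shootdown_cpus(NULL);
}

bool sim_any_virt_enabled(void)
{
    for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
        if (sim_virt_enabled[cpu])
            return true;
    return false;
}
```

In this model, calling sim_nmi_shootdown_cpus(sim_kdump_cb) followed by
sim_emergency_disable_virt_all() sends exactly one shootdown, and
virtualization ends up disabled everywhere regardless of which path ran
first.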

Patch 2 is a related bug fix found while exploring ideas for patch 1.
Patch 3 is a cleanup to try to prevent future "fixed VMX but not SVM"
style bugs.

Guilherme and Vitaly, I dropped your Tested-by and Reviewed-by tags
since the relevant patches changed a decent amount.

v2:
  - Use a NULL handler and crash_ipi_issued instead of a magic nop
    handler. [tglx]
  - Add comments to call out that modifying the existing handler
    once the NMI is sent may cause explosions.
  - Add a patch to clean up cpu_emergency_vmxoff().

v1: https://lore.kernel.org/all/20220511234332.3654455-1-seanjc@google.com

Sean Christopherson (3):
  x86/crash: Disable virt in core NMI crash handler to avoid double
    shootdown
  x86/reboot: Disable virtualization in an emergency if SVM is supported
  x86/virt: Fold __cpu_emergency_vmxoff() into its sole caller

 arch/x86/include/asm/reboot.h  |  1 +
 arch/x86/include/asm/virtext.h | 14 +-----
 arch/x86/kernel/crash.c        | 16 +-----
 arch/x86/kernel/reboot.c       | 89 +++++++++++++++++++++++++---------
 4 files changed, 69 insertions(+), 51 deletions(-)


base-commit: a7fed5c0431dbfa707037848830f980e0f93cfb3
-- 
2.36.0.550.gb090851708-goog

Re: [PATCH v2 0/3] x86/crash: Fix double NMI shootdown bug

Posted by Guilherme G. Piccoli 3 years, 7 months ago

On 17/05/2022 21:16, Sean Christopherson wrote:
> Fix a double NMI shootdown bug found and debugged by Guilherme, who did all
> the hard work.  NMI shootdown is a one-time thing; the handler leaves NMIs
> blocked and enters halt.  At best, a second (or third...) shootdown is an
> expensive nop, at worst it can hang the kernel and prevent kexec'ing into
> a new kernel, e.g. prior to the hardening of register_nmi_handler(), a
> double shootdown resulted in a double list_add(), which is fatal when running
> with CONFIG_BUG_ON_DATA_CORRUPTION=y.
> 
> With the "right" kexec/kdump configuration, emergency_vmx_disable_all() can
> be reached after kdump_nmi_shootdown_cpus() (currently the only two users
> of nmi_shootdown_cpus()).
> 
> To fix, move the disabling of virtualization into crash_nmi_callback(),
> remove emergency_vmx_disable_all()'s callback, and do a shootdown for
> emergency_vmx_disable_all() if and only if a shootdown hasn't yet occurred.
> The only thing emergency_vmx_disable_all() cares about is disabling VMX/SVM
> (obviously), and since I can't envision a use case for an NMI shootdown that
> doesn't want to disable virtualization, doing that in the core handler means
> emergency_vmx_disable_all() only needs to ensure _a_ shootdown occurs, it
> doesn't care when that shootdown happened or what callback may have run.
> 
> Patch 2 is a related bug fix found while exploring ideas for patch 1.
> Patch 3 is a cleanup to try to prevent future "fixed VMX but not SVM"
> style bugs.
> 
> Guilherme and Vitaly, I dropped your Tested-by and Reviewed-by tags
> since the relevant patches changed a decent amount.
> 
> v2:
>   - Use a NULL handler and crash_ipi_issued instead of a magic nop
>     handler. [tglx]
>   - Add comments to call out that modifying the existing handler
>     once the NMI is sent may cause explosions.
>   - Add a patch to cleanup cpu_emergency_vmxoff().
> 
> v1: https://lore.kernel.org/all/20220511234332.3654455-1-seanjc@google.com
> 
> Sean Christopherson (3):
>   x86/crash: Disable virt in core NMI crash handler to avoid double
>     shootdown
>   x86/reboot: Disable virtualization in an emergency if SVM is supported
>   x86/virt: Fold __cpu_emergency_vmxoff() into its sole caller
> 
>  arch/x86/include/asm/reboot.h  |  1 +
>  arch/x86/include/asm/virtext.h | 14 +-----
>  arch/x86/kernel/crash.c        | 16 +-----
>  arch/x86/kernel/reboot.c       | 89 +++++++++++++++++++++++++---------
>  4 files changed, 69 insertions(+), 51 deletions(-)
> 
> 
> base-commit: a7fed5c0431dbfa707037848830f980e0f93cfb3

Hi folks, monthly ping!
Any news on this fix series? I just checked, and it still applies cleanly.

Thanks,


Guilherme

Re: [PATCH v2 0/3] x86/crash: Fix double NMI shootdown bug

Posted by Guilherme G. Piccoli 3 years, 8 months ago

On 17/05/2022 21:16, Sean Christopherson wrote:
> Fix a double NMI shootdown bug found and debugged by Guilherme, who did all
> the hard work.  NMI shootdown is a one-time thing; the handler leaves NMIs
> blocked and enters halt.  At best, a second (or third...) shootdown is an
> expensive nop, at worst it can hang the kernel and prevent kexec'ing into
> a new kernel, e.g. prior to the hardening of register_nmi_handler(), a
> double shootdown resulted in a double list_add(), which is fatal when running
> with CONFIG_BUG_ON_DATA_CORRUPTION=y.
> 
> With the "right" kexec/kdump configuration, emergency_vmx_disable_all() can
> be reached after kdump_nmi_shootdown_cpus() (currently the only two users
> of nmi_shootdown_cpus()).
> 
> To fix, move the disabling of virtualization into crash_nmi_callback(),
> remove emergency_vmx_disable_all()'s callback, and do a shootdown for
> emergency_vmx_disable_all() if and only if a shootdown hasn't yet occurred.
> The only thing emergency_vmx_disable_all() cares about is disabling VMX/SVM
> (obviously), and since I can't envision a use case for an NMI shootdown that
> doesn't want to disable virtualization, doing that in the core handler means
> emergency_vmx_disable_all() only needs to ensure _a_ shootdown occurs, it
> doesn't care when that shootdown happened or what callback may have run.
> 
> Patch 2 is a related bug fix found while exploring ideas for patch 1.
> Patch 3 is a cleanup to try to prevent future "fixed VMX but not SVM"
> style bugs.
> 
> Guilherme and Vitaly, I dropped your Tested-by and Reviewed-by tags
> since the relevant patches changed a decent amount.
> 
> v2:
>   - Use a NULL handler and crash_ipi_issued instead of a magic nop
>     handler. [tglx]
>   - Add comments to call out that modifying the existing handler
>     once the NMI is sent may cause explosions.
>   - Add a patch to cleanup cpu_emergency_vmxoff().
> 
> v1: https://lore.kernel.org/all/20220511234332.3654455-1-seanjc@google.com
> 
> Sean Christopherson (3):
>   x86/crash: Disable virt in core NMI crash handler to avoid double
>     shootdown
>   x86/reboot: Disable virtualization in an emergency if SVM is supported
>   x86/virt: Fold __cpu_emergency_vmxoff() into its sole caller
> 
>  arch/x86/include/asm/reboot.h  |  1 +
>  arch/x86/include/asm/virtext.h | 14 +-----
>  arch/x86/kernel/crash.c        | 16 +-----
>  arch/x86/kernel/reboot.c       | 89 +++++++++++++++++++++++++---------
>  4 files changed, 69 insertions(+), 51 deletions(-)
> 
> 
> base-commit: a7fed5c0431dbfa707037848830f980e0f93cfb3


Hey folks, is there any news about this series?
I saw a ping from Sean some time ago, so this is a re-ping.

Thanks,


Guilherme

Re: [PATCH v2 0/3] x86/crash: Fix double NMI shootdown bug

Posted by Sean Christopherson 3 years, 10 months ago

On Wed, May 18, 2022, Sean Christopherson wrote:
> Fix a double NMI shootdown bug found and debugged by Guilherme, who did all
> the hard work.  NMI shootdown is a one-time thing; the handler leaves NMIs
> blocked and enters halt.  At best, a second (or third...) shootdown is an
> expensive nop, at worst it can hang the kernel and prevent kexec'ing into
> a new kernel, e.g. prior to the hardening of register_nmi_handler(), a
> double shootdown resulted in a double list_add(), which is fatal when running
> with CONFIG_BUG_ON_DATA_CORRUPTION=y.

...

> Sean Christopherson (3):
>   x86/crash: Disable virt in core NMI crash handler to avoid double
>     shootdown
>   x86/reboot: Disable virtualization in an emergency if SVM is supported
>   x86/virt: Fold __cpu_emergency_vmxoff() into its sole caller
> 
>  arch/x86/include/asm/reboot.h  |  1 +
>  arch/x86/include/asm/virtext.h | 14 +-----
>  arch/x86/kernel/crash.c        | 16 +-----
>  arch/x86/kernel/reboot.c       | 89 +++++++++++++++++++++++++---------
>  4 files changed, 69 insertions(+), 51 deletions(-)

Ping!  Still applies cleanly.

Re: [PATCH v2 0/3] x86/crash: Fix double NMI shootdown bug

Posted by Guilherme G. Piccoli 3 years, 11 months ago

Hi Sean / Thomas, was this merged anywhere? I just checked and it seems it
didn't reach mainline. Patch 01 is especially relevant, IMHO.

Thanks in advance,


Guilherme