Ensure the shadow VMCS cache is evicted during an emergency reboot to
prevent potential memory corruption if the cache is evicted after reboot.
This issue was identified through code inspection, as __loaded_vmcs_clear()
flushes both the normal VMCS and the shadow VMCS.
Avoid checking the "launched" state during an emergency reboot, unlike the
behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
can interfere with operations like copy_shadow_to_vmcs12(), where shadow
VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
right after the VMCS load, the shadow VMCSes will be active but the
"launched" state may not be set.
Signed-off-by: Chao Gao <chao.gao@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b70ed72c1783..dccd1c9939b8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -769,8 +769,11 @@ void vmx_emergency_disable_virtualization_cpu(void)
return;
list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
- loaded_vmcss_on_cpu_link)
+ loaded_vmcss_on_cpu_link) {
vmcs_clear(v->vmcs);
+ if (v->shadow_vmcs)
+ vmcs_clear(v->shadow_vmcs);
+ }
kvm_cpu_vmxoff();
}
--
2.46.1
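To illustrate the "launched" point in the changelog, here is a hedged sketch of
the nested VMX copy path (not the upstream nested.c code verbatim;
copy_fields_to_vmcs12() is a stand-in name for the field-copy loop). The shadow
VMCS is made active with VMPTRLD, and no "launched" bookkeeping is done for it:

static void copy_shadow_to_vmcs12_sketch(struct vcpu_vmx *vmx)
{
	struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs;

	vmcs_load(shadow_vmcs);		/* VMPTRLD: the shadow VMCS is now active */

	/* <-- a reboot NMI here sees an active shadow VMCS, "launched" == 0 */

	copy_fields_to_vmcs12(vmx);	/* read the shadowed fields into vmcs12 */

	vmcs_clear(shadow_vmcs);	/* VMCLEAR: the shadow VMCS is inactive again */
	vmcs_load(vmx->loaded_vmcs->vmcs);
}

An NMI in the marked window finds the shadow VMCS active with "launched" still
clear, which is why the emergency path clears v->shadow_vmcs unconditionally
rather than keying off "launched".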
On Mon, 24 Mar 2025 22:08:48 +0800, Chao Gao wrote:
> Ensure the shadow VMCS cache is evicted during an emergency reboot to
> prevent potential memory corruption if the cache is evicted after reboot.
>
> This issue was identified through code inspection, as __loaded_vmcs_clear()
> flushes both the normal VMCS and the shadow VMCS.
>
> Avoid checking the "launched" state during an emergency reboot, unlike the
> behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
> can interfere with operations like copy_shadow_to_vmcs12(), where shadow
> VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
> right after the VMCS load, the shadow VMCSes will be active but the
> "launched" state may not be set.
>
> [...]
Applied to kvm-x86 vmx. I tagged it for stable, but it's not urgent (I'm 99%
certain it will never cause problems), so I figure I'd give it a full cycle in
-next.
[1/1] KVM: VMX: Flush shadow VMCS on emergency reboot
https://github.com/kvm-x86/linux/commit/a0ee1d5faff1
--
https://github.com/kvm-x86/linux/tree/next
On Mon, Mar 24, 2025, Chao Gao wrote:
> Ensure the shadow VMCS cache is evicted during an emergency reboot to
> prevent potential memory corruption if the cache is evicted after reboot.
I don't suppose Intel would want to go on record and state what CPUs would actually
be affected by this bug. My understanding is that Intel has never shipped a CPU
that caches shadow VMCS state.
On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
that's going to be the architectural behavior going forward, will that behavior
be enumerated to software? Regardless of whether there's software enumeration,
I would like to have the emergency disable path depend on that behavior. In part
to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
this bug.
If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
tagged for stable.
> This issue was identified through code inspection, as __loaded_vmcs_clear()
> flushes both the normal VMCS and the shadow VMCS.
>
> Avoid checking the "launched" state during an emergency reboot, unlike the
> behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
> can interfere with operations like copy_shadow_to_vmcs12(), where shadow
> VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
> right after the VMCS load, the shadow VMCSes will be active but the
> "launched" state may not be set.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b70ed72c1783..dccd1c9939b8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -769,8 +769,11 @@ void vmx_emergency_disable_virtualization_cpu(void)
> return;
>
> list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
> - loaded_vmcss_on_cpu_link)
> + loaded_vmcss_on_cpu_link) {
> vmcs_clear(v->vmcs);
> + if (v->shadow_vmcs)
> + vmcs_clear(v->shadow_vmcs);
> + }
>
> kvm_cpu_vmxoff();
> }
> --
> 2.46.1
>
Trimmed Cc: to lists, as this is basically off-topic, but I thought you might
be amused :-)
On Thu, Apr 10, 2025, Sean Christopherson wrote:
> On Mon, Mar 24, 2025, Chao Gao wrote:
> > Ensure the shadow VMCS cache is evicted during an emergency reboot to
> > prevent potential memory corruption if the cache is evicted after reboot.
>
> I don't suppose Intel would want to go on record and state what CPUs would actually
> be affected by this bug. My understanding is that Intel has never shipped a CPU
> that caches shadow VMCS state.
>
> On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
> that's going to be the architectural behavior going forward, will that behavior
> be enumerated to software? Regardless of whether there's software enumeration,
> I would like to have the emergency disable path depend on that behavior. In part
> to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
> this bug.
Apparently I completely purged it from my memory, but while poking through an
internal branch related to moving VMXON out of KVM, I came across this:
--
Author: Sean Christopherson <seanjc@google.com>
AuthorDate: Wed Jan 17 16:19:28 2024 -0800
Commit: Sean Christopherson <seanjc@google.com>
CommitDate: Fri Jan 26 13:16:31 2024 -0800
KVM: VMX: VMCLEAR loaded shadow VMCSes on kexec()
Add a helper to VMCLEAR _all_ loaded VMCSes in a loaded_vmcs pair, and use
it when doing VMCLEAR before kexec() after a crash to fix a (likely benign)
bug where KVM neglects to VMCLEAR loaded shadow VMCSes. The bug is likely
benign as existing Intel CPUs don't insert shadow VMCSes into the VMCS
cache, i.e. shadow VMCSes can't be evicted since they're never cached, and
thus won't clobber memory in the new kernel.
--
At least my reaction was more or less the same both times?
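As a rough sketch, the helper that internal commit describes would look
something like the below (the name is an assumption); it is what the diff in
this thread effectively open-codes in the emergency-disable loop:

static void loaded_vmcs_clear_all(struct loaded_vmcs *loaded_vmcs)
{
	vmcs_clear(loaded_vmcs->vmcs);

	/* A shadow VMCS is allocated only when shadow VMCS support is in use. */
	if (loaded_vmcs->shadow_vmcs)
		vmcs_clear(loaded_vmcs->shadow_vmcs);
}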
> If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
> caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
> tagged for stable.
>
> > This issue was identified through code inspection, as __loaded_vmcs_clear()
> > flushes both the normal VMCS and the shadow VMCS.
> >
> > Avoid checking the "launched" state during an emergency reboot, unlike the
> > behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
> > can interfere with operations like copy_shadow_to_vmcs12(), where shadow
> > VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
> > right after the VMCS load, the shadow VMCSes will be active but the
> > "launched" state may not be set.
> >
> > Signed-off-by: Chao Gao <chao.gao@intel.com>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index b70ed72c1783..dccd1c9939b8 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -769,8 +769,11 @@ void vmx_emergency_disable_virtualization_cpu(void)
> > return;
> >
> > list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
> > - loaded_vmcss_on_cpu_link)
> > + loaded_vmcss_on_cpu_link) {
> > vmcs_clear(v->vmcs);
> > + if (v->shadow_vmcs)
> > + vmcs_clear(v->shadow_vmcs);
> > + }
> >
> > kvm_cpu_vmxoff();
> > }
> > --
> > 2.46.1
> >
Sean Christopherson wrote:
> Trimmed Cc: to lists, as this is basically off-topic, but I thought you might
> be amused :-)
>
> On Thu, Apr 10, 2025, Sean Christopherson wrote:
> > On Mon, Mar 24, 2025, Chao Gao wrote:
> > > Ensure the shadow VMCS cache is evicted during an emergency reboot to
> > > prevent potential memory corruption if the cache is evicted after reboot.
> >
> > I don't suppose Intel would want to go on record and state what CPUs would actually
> > be affected by this bug. My understanding is that Intel has never shipped a CPU
> > that caches shadow VMCS state.
> >
> > On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
> > that's going to be the architectural behavior going forward, will that behavior
> > be enumerated to software? Regardless of whether there's software enumeration,
> > I would like to have the emergency disable path depend on that behavior. In part
> > to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
> > this bug.
>
> Apparently I completely purged it from my memory, but while poking through an
> internal branch related to moving VMXON out of KVM, I came across this:
Hey, while we're on the topic of being off topic about that branch, I
mentioned to Chao that I was looking to post patches for a simple VMX
module. He pointed me here to say "I think Sean is already working on
it". This posting would be to save you the trouble of untangling that
branch, at least in the near term.
Here's the work-in-progress shortlog:
Chao Gao (2):
x86/virt/tdx: Move TDX per-CPU initialization into tdx_enable()
coco/tdx-host: Introduce a "tdx_host" device
Dan Williams (5):
x86/virt/tdx: Drop KVM_INTEL dependency
x86/reboot: Convert cpu_emergency_disable_virtualization() to notifier chain
x86/reboot: Introduce CONFIG_VIRT_X86
KVM: VMX: Rename vmx_init() to kvm_vmx_init()
KVM: VMX: Move VMX from kvm_intel.ko to vmx.ko
I can put the finishing touches on it and send it out against
kvm-x86/next after -rc1 is out, or hold off if your extraction is going
ok.
> --
> Author: Sean Christopherson <seanjc@google.com>
> AuthorDate: Wed Jan 17 16:19:28 2024 -0800
> Commit: Sean Christopherson <seanjc@google.com>
> CommitDate: Fri Jan 26 13:16:31 2024 -0800
>
> KVM: VMX: VMCLEAR loaded shadow VMCSes on kexec()
>
> Add a helper to VMCLEAR _all_ loaded VMCSes in a loaded_vmcs pair, and use
> it when doing VMCLEAR before kexec() after a crash to fix a (likely benign)
> bug where KVM neglects to VMCLEAR loaded shadow VMCSes. The bug is likely
> benign as existing Intel CPUs don't insert shadow VMCSes into the VMCS
> cache, i.e. shadow VMCSes can't be evicted since they're never cached, and
> thus won't clobber memory in the new kernel.
>
> --
At least my approach to the VMX module leaves VMCS clearing with KVM.
Main goal is to not regress any sequencing around VMX.
On Thu, Oct 09, 2025, dan.j.williams@intel.com wrote:
> Hey, while we on the topic of being off topic about that branch, I
> mentioned to Chao that I was looking to post patches for a simple VMX
> module. He pointed me here to say "I think Sean is already working on
> it". This posting would be to save you the trouble of untangling that
> branch, at least in the near term.
>
> Here's the work-in-progress shortlog:
>
> Chao Gao (2):
> x86/virt/tdx: Move TDX per-CPU initialization into tdx_enable()
> coco/tdx-host: Introduce a "tdx_host" device
>
> Dan Williams (5):
> x86/virt/tdx: Drop KVM_INTEL dependency
> x86/reboot: Convert cpu_emergency_disable_virtualization() to notifier chain
I don't actually think we want a notifier chain, because the order matters. I
think what we want instead is a daisy-chain of notifiers (I swear I'm not trying
to troll y'all). If we go with a standard notifier chain, there will be a hidden
dependency on KVM's notifier running first (to do VMCLEAR) and what you appear to
be calling vmx.ko's notifier running second.
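To make the hidden dependency concrete, a hedged illustration with invented
names and elided bodies; nothing in a plain notifier chain encodes that the
first callback must run before the second, and VMCLEAR #UDs once the CPU has
left VMX operation:

static int kvm_emergency_virt_cb(struct notifier_block *nb, unsigned long action,
				 void *data)
{
	/* Must run first: VMCLEAR every loaded (and shadow) VMCS on this CPU. */
	return NOTIFY_OK;
}

static int vmx_emergency_virt_cb(struct notifier_block *nb, unsigned long action,
				 void *data)
{
	/* Must run last: VMXOFF to leave VMX operation. */
	return NOTIFY_OK;
}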
> x86/reboot: Introduce CONFIG_VIRT_X86
> KVM: VMX: Rename vmx_init() to kvm_vmx_init()
> KVM: VMX: Move VMX from kvm_intel.ko to vmx.ko
Spoiler for the wall of text below, but I don't think it makes sense to have a
vmx.ko, or an intermediate module of any kind.
> I can put the finishing touches on it and send it out against
> kvm-x86/next after -rc1 is out, or hold off if your extraction is going
> ok.
It went ok, but it's dead in the water.
> At least my approach to the VMX module leaves VMCS clearing with KVM.
Yes, I came to this conclusion as well.
TL;DR: I don't think we should pursue my idea (read: pipe dream) of extracting
all core virtualization functionality to allow multiple hypervisors. It's
not _that_ hard (my "not 6 months of work" statement still holds true),
but it raises a number of policy-related questions and would add non-trivial
complexity, for no immediate benefit (and potentially no benefit even in the
long run).
For all intents and purposes, moving the VMCS tracking to a central module commits
to fully supporting multiple in-kernel hypervisors. After extracting/rebasing
almost all of the internal code, I realized that fully supporting another hypervisor
without actually having another hypervisor in-hand (and that we want to support
upstream), is an absurd amount of churn and complexity for no practical benefit.
For the internal implementation, we had punted on a few key annoyances because they
were way out-of-scope for our use case. Sadly, it's not even the hardware side of
things that adds the most complexity, it's all the paravirt crud. Specifically,
KVM's support for various Hyper-V features, e.g. Hyper-V's Enlightened VMCS (eVMCS),
used when KVM is running as a VM on Hyper-V, or when KVM is emulating Hyper-V.
Whether or not to use eVMCS is in some ways a system wide policy decision (currently
it's a KVM module param), because the control state is per CPU. Our internal approach
was to limit support for multiple KVM modules to bare metal kernel configurations,
but that doesn't really fly when trying to make a truly generic hypervisor framework.
I'm pretty sure we could solve that by having KVM purge the eVMCS pointer when a
vCPU is scheduled out (to avoid another hypervisor unintentionally using KVM's eVMCS),
but that's a whole new pile of complexity, especially if we tried to have proactive
clearing limited to eVMCS (currently, KVM lazily does VMCLEAR via an IPI shootdown
if a VMCS needs to be migrated to a different pCPU; it's easy enough to have KVM
eagerly do VMCLEAR (on sched out), but it's not obvious that it'd be a performance
win).
There are a few other things of a similar nature, where it's not _too_ hard to
massage some piece of functionality into common infrastructure, but as I was
mentally working through them, I kept thinking "this isn't going to end well".
When we were specifically targeting multiple KVM instances, it wasn't anywhere
near as scary. Because while each KVM instance could obviously run different
code, the general architecture and wants of each active hypervisor would be the
same.
E.g. whether or not to enable SNP's new ciphertext hiding features, and how many
ASIDs to reserve for SNP+ VMs, is another system wide policy decision that is a
non-issue when the goal is purely to be able to run different versions of the same
hypervisor. But it's less clear cut how that should be handled when the "other"
hypervisor is an unknown, faceless entity.
Ditto for the "obvious" things like the user-return MSR framework. That "needs"
to be system wide because the current value in hardware is per-CPU, but in quotes
because it presupposes that another hypervisor would even want to play the same
games.
So, while it's definitely doable to separate the core virtualization bits from
KVM, I think the added complexity (in code and in the mental model) would make
it all _extremely_ premature infrastructure.
> Main goal is to not regress any sequencing around VMX.
Ya, that requires some care.
A big takeaway/revelation for me after going through the above exercise of
trying and failing (well, giving up) to extract _all_ system-wide virtualization
functionality is that not only is trying to generically support other hypervisors
undesirable, if we commit to NOT doing that, then we really can special case the
low level VMXON + EFER.SVME logic.
What I mean by that is we don't need a fully generic, standalone module to provide
VMXON+VMXOFF, we only need common tracking. KVM is the only module user, and TDX
is a super special snowflake that already needs to provide a pile of its own
functionality. KVM (obviously) already has the necessary hooks to handle things
like CPU bringup/teardown and reboot, and adding them to the TDX subsystem code is
easy enough.
And more importantly, except for VMXON+VMXOFF, the _only_ requirements TDX brings
to the table are one-shot, unidirectional things (e.g. TDH_SYS_LP_INIT). I.e.
there's no need for multi-user notifier chains, because the core code can blast
VMXOFF in an emergency, and clean reboots can be easily handled through syscore.
I've got patches that appear to work (writing the code didn't take long at all
once I pieced together the idea). I'll post an RFC and we can go from there.
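For reference, a rough sketch of that shape (all names are assumptions, not the
RFC itself): a per-CPU flag tracks whether VMXON has been executed, a syscore
shutdown hook covers clean reboot, and the emergency path can invoke the same
helper from NMI context after KVM has done its VMCLEARs:

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/syscore_ops.h>

static DEFINE_PER_CPU(bool, virt_enabled);

static void x86_virt_disable_local(void)
{
	if (!this_cpu_read(virt_enabled))
		return;

	asm volatile("vmxoff" ::: "cc");	/* leave VMX operation */
	this_cpu_write(virt_enabled, false);
}

static void x86_virt_shutdown(void)
{
	/* Clean reboot: one-shot and unidirectional, no notifier chain needed. */
	x86_virt_disable_local();
}

static struct syscore_ops x86_virt_syscore_ops = {
	.shutdown	= x86_virt_shutdown,
};

static int __init x86_virt_init(void)
{
	register_syscore_ops(&x86_virt_syscore_ops);
	return 0;
}
core_initcall(x86_virt_init);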
On Wed, Oct 08, 2025 at 04:01:07PM -0700, Sean Christopherson wrote:
>Trimmed Cc: to lists, as this is basically off-topic, but I thought you might
>be amused :-)
>
>On Thu, Apr 10, 2025, Sean Christopherson wrote:
>> On Mon, Mar 24, 2025, Chao Gao wrote:
>> > Ensure the shadow VMCS cache is evicted during an emergency reboot to
>> > prevent potential memory corruption if the cache is evicted after reboot.
>>
>> I don't suppose Intel would want to go on record and state what CPUs would actually
>> be affected by this bug. My understanding is that Intel has never shipped a CPU
>> that caches shadow VMCS state.
Yes. Shadow VMCSs are never cached. But this is an implementation detail.
Per SDM, software is required to VMCLEAR a shadow VMCS that was made active
to be forward compatible.
>>
>> On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
>> that's going to be the architectural behavior going forward, will that behavior
>> be enumerated to software? Regardless of whether there's software enumeration,
>> I would like to have the emergency disable path depend on that behavior. In part
>> to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
>> this bug.
Yes. The current implementation is that CPUs with SEAM support flush _all_
VMCS caches on VMXOFF. But the architectural behavior is trending toward
CPUs that enumerate IA32_VMX_PROCBASED_CTRLS3[5] as 1 flushing _SEAM_ VMCS
caches on VMXOFF.
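If that enumeration materializes, a check could look like the hedged sketch
below (assuming the bit lands in what the kernel calls
MSR_IA32_VMX_PROCBASED_CTLS3; this is illustration only, not documented
behavior):

static bool vmxoff_flushes_seam_vmcs_cache(void)
{
	u64 ctls3;

	/* Tertiary processor-based controls capability MSR, bit 5. */
	if (rdmsrl_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &ctls3))
		return false;

	return ctls3 & BIT_ULL(5);
}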
>
>Apparently I completely purged it from my memory, but while poking through an
>internal branch related to moving VMXON out of KVM, I came across this:
>
>--
>Author: Sean Christopherson <seanjc@google.com>
>AuthorDate: Wed Jan 17 16:19:28 2024 -0800
>Commit: Sean Christopherson <seanjc@google.com>
>CommitDate: Fri Jan 26 13:16:31 2024 -0800
>
> KVM: VMX: VMCLEAR loaded shadow VMCSes on kexec()
>
> Add a helper to VMCLEAR _all_ loaded VMCSes in a loaded_vmcs pair, and use
> it when doing VMCLEAR before kexec() after a crash to fix a (likely benign)
> bug where KVM neglects to VMCLEAR loaded shadow VMCSes. The bug is likely
> benign as existing Intel CPUs don't insert shadow VMCSes into the VMCS
> cache, i.e. shadow VMCSes can't be evicted since they're never cached, and
> thus won't clobber memory in the new kernel.
>
>--
>
>At least my reaction was more or less the same both times?
>
>> If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
>> caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
>> tagged for stable.
>>
>> > This issue was identified through code inspection, as __loaded_vmcs_clear()
>> > flushes both the normal VMCS and the shadow VMCS.
>> >
>> > Avoid checking the "launched" state during an emergency reboot, unlike the
>> > behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
>> > can interfere with operations like copy_shadow_to_vmcs12(), where shadow
>> > VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
>> > right after the VMCS load, the shadow VMCSes will be active but the
>> > "launched" state may not be set.
>> >
>> > Signed-off-by: Chao Gao <chao.gao@intel.com>
>> > ---
>> > arch/x86/kvm/vmx/vmx.c | 5 ++++-
>> > 1 file changed, 4 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> > index b70ed72c1783..dccd1c9939b8 100644
>> > --- a/arch/x86/kvm/vmx/vmx.c
>> > +++ b/arch/x86/kvm/vmx/vmx.c
>> > @@ -769,8 +769,11 @@ void vmx_emergency_disable_virtualization_cpu(void)
>> > return;
>> >
>> > list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
>> > - loaded_vmcss_on_cpu_link)
>> > + loaded_vmcss_on_cpu_link) {
>> > vmcs_clear(v->vmcs);
>> > + if (v->shadow_vmcs)
>> > + vmcs_clear(v->shadow_vmcs);
>> > + }
>> >
>> > kvm_cpu_vmxoff();
>> > }
>> > --
>> > 2.46.1
>> >
On Thu, Apr 10, 2025 at 02:55:29PM -0700, Sean Christopherson wrote:
>On Mon, Mar 24, 2025, Chao Gao wrote:
>> Ensure the shadow VMCS cache is evicted during an emergency reboot to
>> prevent potential memory corruption if the cache is evicted after reboot.
>
>I don't suppose Intel would want to go on record and state what CPUs would actually
>be affected by this bug. My understanding is that Intel has never shipped a CPU
>that caches shadow VMCS state.
I am not sure. Would you like me to check internally?
However, SDM Chapter 26.11 includes a footnote stating:
"
As noted in Section 26.1, execution of the VMPTRLD instruction makes a VMCS
active. In addition, VM entry makes active any shadow VMCS referenced by the
VMCS link pointer in the current VMCS. If a shadow VMCS is made active by VM
entry, it is necessary to execute VMCLEAR for that VMCS before allowing that
VMCS to become active on another logical processor.
"
To me, this suggests that shadow VMCS may be cached, and software shouldn't
assume the CPU won't cache it. But, I don't know if this is the reality or
if the statement is merely for hardware implementation flexibility.
>
>On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
Actually this behavior is not publicly documented.
>that's going to be the architectural behavior going forward, will that behavior
>be enumerated to software? Regardless of whether there's software enumeration,
>I would like to have the emergency disable path depend on that behavior. In part
>to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
>this bug.
I don't understand how we can gain confidence that SEAM VMCSes won't screw
over kdump.
If a VMM wants to leverage the VMXOFF behavior, software enumeration
might be needed for nested virtualization. Using CPU F/M/S (SPR+) to
enumerate a behavior could be problematic for virtualization. Right?
>
>If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
>caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
>tagged for stable.
Agreed.
Sean, I am not clear whether you intend to fix this issue and, if so, how.
Could you clarify?
>
>> This issue was identified through code inspection, as __loaded_vmcs_clear()
>> flushes both the normal VMCS and the shadow VMCS.
>>
>> Avoid checking the "launched" state during an emergency reboot, unlike the
>> behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
>> can interfere with operations like copy_shadow_to_vmcs12(), where shadow
>> VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
>> right after the VMCS load, the shadow VMCSes will be active but the
>> "launched" state may not be set.
>>
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index b70ed72c1783..dccd1c9939b8 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -769,8 +769,11 @@ void vmx_emergency_disable_virtualization_cpu(void)
>> return;
>>
>> list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
>> - loaded_vmcss_on_cpu_link)
>> + loaded_vmcss_on_cpu_link) {
>> vmcs_clear(v->vmcs);
>> + if (v->shadow_vmcs)
>> + vmcs_clear(v->shadow_vmcs);
>> + }
>>
>> kvm_cpu_vmxoff();
>> }
>> --
>> 2.46.1
>>
On Fri, Apr 11, 2025, Chao Gao wrote:
> On Thu, Apr 10, 2025 at 02:55:29PM -0700, Sean Christopherson wrote:
> >On Mon, Mar 24, 2025, Chao Gao wrote:
> >> Ensure the shadow VMCS cache is evicted during an emergency reboot to
> >> prevent potential memory corruption if the cache is evicted after reboot.
> >
> >I don't suppose Intel would want to go on record and state what CPUs would actually
> >be affected by this bug. My understanding is that Intel has never shipped a CPU
> >that caches shadow VMCS state.
>
> I am not sure. Would you like me to check internally?

Eh, if it's easy, it'd be nice to have, but don't put much effort into it. I'm
probably being too cute in hesitating about sending this to stable@. The risk
really shouldn't be high.

> However, SDM Chapter 26.11 includes a footnote stating:
> "
> As noted in Section 26.1, execution of the VMPTRLD instruction makes a VMCS
> active. In addition, VM entry makes active any shadow VMCS referenced by the
> VMCS link pointer in the current VMCS. If a shadow VMCS is made active by VM
> entry, it is necessary to execute VMCLEAR for that VMCS before allowing that
> VMCS to become active on another logical processor.
> "
>
> To me, this suggests that shadow VMCS may be cached, and software shouldn't
> assume the CPU won't cache it. But, I don't know if this is the reality or
> if the statement is merely for hardware implementation flexibility.
>
> >
> >On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
>
> Actually this behavior is not publicly documented.

Well shoot. That should probably be remedied. Even if the behavior is guaranteed
only on CPUs that support SEAM, _that_ detail should be documented. I'm not
holding my breath on Intel allowing third party code in SEAM, but the mode _is_
documented in the SDM, and so IMO, the SDM should also document how things like
clearing the VMCS cache are supposed to work when there are VMCSes that "untrusted"
software may not be able to access.

> >that's going to be the architectural behavior going forward, will that behavior
> >be enumerated to software? Regardless of whether there's software enumeration,
> >I would like to have the emergency disable path depend on that behavior. In part
> >to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
> >this bug.
>
> I don't understand how we can gain confidence that SEAM VMCSes won't screw
> over kdump.

If KVM relies on VMXOFF to purge the VMCS cache, then it gives a measure of
confidence that running TDX VMs won't leave behind SEAM VMCSes in the cache. KVM
can't easily clear SEAM VMCSs, but IIRC, the memory can be "forcefully" reclaimed
by paving over it with MOVDIR64B, at which point having VMCS cache entries for
the memory would be problematic.

> If a VMM wants to leverage the VMXOFF behavior, software enumeration
> might be needed for nested virtualization. Using CPU F/M/S (SPR+) to
> enumerate a behavior could be problematic for virtualization. Right?

Yeah, F/M/S is a bad idea. Architecturally, I think the behavior needs to be
tied to support for SEAM. Is there a safe-ish way to probe for SEAM support,
without having to glean it from MSR_IA32_MKTME_KEYID_PARTITIONING?

> >If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
> >caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
> >tagged for stable.
>
> Agreed.
>
> Sean, I am not clear whether you intend to fix this issue and, if so, how.
> Could you clarify?

Oh, I definitely plan on taking this patch, I'm just undecided on whether or not
to tag it for stable@.
On Fri, Apr 11, 2025 at 09:57:55AM -0700, Sean Christopherson wrote:
>On Fri, Apr 11, 2025, Chao Gao wrote:
>> On Thu, Apr 10, 2025 at 02:55:29PM -0700, Sean Christopherson wrote:
>> >On Mon, Mar 24, 2025, Chao Gao wrote:
>> >> Ensure the shadow VMCS cache is evicted during an emergency reboot to
>> >> prevent potential memory corruption if the cache is evicted after reboot.
>> >
>> >I don't suppose Intel would want to go on record and state what CPUs would actually
>> >be affected by this bug. My understanding is that Intel has never shipped a CPU
>> >that caches shadow VMCS state.
>>
>> I am not sure. Would you like me to check internally?
>
>Eh, if it's easy, it'd be nice to have, but don't put much effort into it. I'm
>probably being too cute in hesitating about sending this to stable@. The risk
>really shouldn't be high.

I've raised this question internally and will get back once I have an answer.

>
>> However, SDM Chapter 26.11 includes a footnote stating:
>> "
>> As noted in Section 26.1, execution of the VMPTRLD instruction makes a VMCS
>> active. In addition, VM entry makes active any shadow VMCS referenced by the
>> VMCS link pointer in the current VMCS. If a shadow VMCS is made active by VM
>> entry, it is necessary to execute VMCLEAR for that VMCS before allowing that
>> VMCS to become active on another logical processor.
>> "
>>
>> To me, this suggests that shadow VMCS may be cached, and software shouldn't
>> assume the CPU won't cache it. But, I don't know if this is the reality or
>> if the statement is merely for hardware implementation flexibility.
>>
>> >
>> >On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
>>
>> Actually this behavior is not publicly documented.
>
>Well shoot. That should probably be remedied. Even if the behavior is guaranteed
>only on CPUs that support SEAM, _that_ detail should be documented. I'm not
>holding my breath on Intel allowing third party code in SEAM, but the mode _is_
>documented in the SDM, and so IMO, the SDM should also document how things like
>clearing the VMCS cache are supposed to work when there are VMCSes that "untrusted"
>software may not be able to access.

I'm also inquiring whether all VMCSs are flushed or just SEAM VMCSs, and whether
this behavior can be made public.

A related topic is why KVM is flushing VMCSs. I haven't found any explicit
statement in the SDM indicating that the flush is necessary.

SDM chapter 26.11 mentions:

If a logical processor leaves VMX operation, any VMCSs active on that logical
processor may be corrupted (see below). To prevent such corruption of a VMCS
that may be used either after a return to VMX operation or on another logical
processor, software should execute VMCLEAR for that VMCS before executing the
VMXOFF instruction or removing power from the processor (e.g., as part of a
transition to the S3 and S4 power states).

To me, the issue appears to be VMCS corruption after leaving VMX operation and
the flush is necessary only if you intend to use the VMCS after re-entering VMX
operation.

From previous KVM commits, I find two different reasons for flushing VMCSs:

- Ensure VMCSs in vmcore aren't corrupted [1]
- Prevent the writeback of VMCS memory on its eviction from overwriting random
  memory in the new kernel [2]

The first reason makes sense and aligns with the SDM. However, the second lacks
explicit support from the SDM, suggesting either a gap in the SDM or simply our
misinterpretation. So, I will take this opportunity to seek clarification.

[1]: https://lore.kernel.org/kvm/50C0BB90.1080804@gmail.com/
[2]: https://lore.kernel.org/kvm/20200321193751.24985-2-sean.j.christopherson@intel.com/

>
>> >that's going to be the architectural behavior going forward, will that behavior
>> >be enumerated to software? Regardless of whether there's software enumeration,
>> >I would like to have the emergency disable path depend on that behavior. In part
>> >to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
>> >this bug.
>>
>> I don't understand how we can gain confidence that SEAM VMCSes won't screw
>> over kdump.
>
>If KVM relies on VMXOFF to purge the VMCS cache, then it gives a measure of
>confidence that running TDX VMs won't leave behind SEAM VMCSes in the cache. KVM
>can't easily clear SEAM VMCSs, but IIRC, the memory can be "forcefully" reclaimed
>by paving over it with MOVDIR64B, at which point having VMCS cache entries for
>the memory would be problematic.
>
>> If a VMM wants to leverage the VMXOFF behavior, software enumeration
>> might be needed for nested virtualization. Using CPU F/M/S (SPR+) to
>> enumerate a behavior could be problematic for virtualization. Right?
>
>Yeah, F/M/S is a bad idea. Architecturally, I think the behavior needs to be
>tied to support for SEAM. Is there a safe-ish way to probe for SEAM support,
>without having to glean it from MSR_IA32_MKTME_KEYID_PARTITIONING?
>
>> >If all past CPUs never cache shadow VMCS state, and all future CPUs flush the
>> >caches on VMXOFF, then this is a glorified NOP, and thus probably shouldn't be
>> >tagged for stable.
>>
>> Agreed.
>>
>> Sean, I am not clear whether you intend to fix this issue and, if so, how.
>> Could you clarify?
>
>Oh, I definitely plan on taking this patch, I'm just undecided on whether or not
>to tag it for stable@.

Thanks.
On Mon, Apr 14, 2025, Chao Gao wrote:
> A related topic is why KVM is flushing VMCSs. I haven't found any explicit
> statement in the SDM indicating that the flush is necessary.
>
> SDM chapter 26.11 mentions:
>
> If a logical processor leaves VMX operation, any VMCSs active on that logical
> processor may be corrupted (see below). To prevent such corruption of a VMCS
> that may be used either after a return to VMX operation or on another logical
> processor, software should execute VMCLEAR for that VMCS before executing the
> VMXOFF instruction or removing power from the processor (e.g., as part of a
> transition to the S3 and S4 power states).
>
> To me, the issue appears to be VMCS corruption after leaving VMX operation and
> the flush is necessary only if you intend to use the VMCS after re-entering VMX
> operation.

The problem is that if the CPU flushes a VMCS from the cache at a later time, for
any reason, then the CPU will write back data to main memory. The issue isn't
reusing the VMCS, it's reusing the underlying memory.
On Mon, Apr 14, 2025 at 06:03:44PM -0700, Sean Christopherson wrote:
>On Mon, Apr 14, 2025, Chao Gao wrote:
>> A related topic is why KVM is flushing VMCSs. I haven't found any explicit
>> statement in the SDM indicating that the flush is necessary.
>>
>> SDM chapter 26.11 mentions:
>>
>> If a logical processor leaves VMX operation, any VMCSs active on that logical
>> processor may be corrupted (see below). To prevent such corruption of a VMCS
>> that may be used either after a return to VMX operation or on another logical
>> processor, software should execute VMCLEAR for that VMCS before executing the
>> VMXOFF instruction or removing power from the processor (e.g., as part of a
>> transition to the S3 and S4 power states).
>>
>> To me, the issue appears to be VMCS corruption after leaving VMX operation and
>> the flush is necessary only if you intend to use the VMCS after re-entering VMX
>> operation.
>
>The problem is that if the CPU flushes a VMCS from the cache at a later time, for
>any reason, then the CPU will write back data to main memory. The issue isn't
>reusing the VMCS, it's reusing the underlying memory.

Yes, I understand the concern about reusing memory. I would like the SDM to
explicitly state that the CPU may flush a VMCS from the cache at any time after
leaving VMX operation, and that software should flush a VMCS before leaving VMX
operation if its memory will be reused for other purposes.
On Fri, 2025-04-11 at 09:57 -0700, Sean Christopherson wrote:
> > >
> > > On a very related topic, doesn't SPR+ now flush the VMCS caches on VMXOFF? If
> >
> > Actually this behavior is not publicly documented.
>
> Well shoot. That should probably be remedied. Even if the behavior is guaranteed
> only on CPUs that support SEAM, _that_ detail should be documented. I'm not
> holding my breath on Intel allowing third party code in SEAM, but the mode _is_
> documented in the SDM, and so IMO, the SDM should also document how things like
> clearing the VMCS cache are supposed to work when there are VMCSes that "untrusted"
> software may not be able to access.
>
> > > that's going to be the architectural behavior going forward, will that behavior
> > > be enumerated to software? Regardless of whether there's software enumeration,
> > > I would like to have the emergency disable path depend on that behavior. In part
> > > to gain confidence that SEAM VMCSes won't screw over kdump, but also in light of
> > > this bug.
> >
> > I don't understand how we can gain confidence that SEAM VMCSes won't screw
> > over kdump.
>
> If KVM relies on VMXOFF to purge the VMCS cache, then it gives a measure of
> confidence that running TDX VMs won't leave behind SEAM VMCSes in the cache. KVM
> can't easily clear SEAM VMCSs, but IIRC, the memory can be "forcefully" reclaimed
> by paving over it with MOVDIR64B, at which point having VMCS cache entries for
> the memory would be problematic.

I am not sure why we need to use MOVDIR64B to clear SEAM VMCSes in kdump?

Regardless of whether we do that or not, IIUC there is no harm even if we still
have SEAM VMCS cache entries in kdump: They are associated with TDX private
KeyID(s). Sudden write back of them doesn't matter because when reading them
from /proc/vmcore they are garbage anyway. And IIUC no machine check (due to TD
bit mismatch) will happen either.
On 4/12/2025 12:57 AM, Sean Christopherson wrote:
>> If a VMM wants to leverage the VMXOFF behavior, software enumeration
>> might be needed for nested virtualization. Using CPU F/M/S (SPR+) to
>> enumerate a behavior could be problematic for virtualization. Right?
> Yeah, F/M/S is a bad idea. Architecturally, I think the behavior needs to be
> tied to support for SEAM. Is there a safe-ish way to probe for SEAM support,
> without having to glean it from MSR_IA32_MKTME_KEYID_PARTITIONING?

Yes, bit 15 of MSR_IA32_MTRRCAP.
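A hedged sketch of such a probe (the helper name and the MTRR_CAP_SEAMRR define
are made up for illustration; MSR_MTRRcap is the kernel's name for
IA32_MTRRCAP):

#define MTRR_CAP_SEAMRR		BIT(15)

static bool cpu_has_seamrr(void)
{
	u64 mtrrcap;

	if (rdmsrl_safe(MSR_MTRRcap, &mtrrcap))
		return false;

	return mtrrcap & MTRR_CAP_SEAMRR;
}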
On Mon, 2025-03-24 at 22:08 +0800, Chao Gao wrote:
> Ensure the shadow VMCS cache is evicted during an emergency reboot to
> prevent potential memory corruption if the cache is evicted after reboot.
>
> This issue was identified through code inspection, as __loaded_vmcs_clear()
> flushes both the normal VMCS and the shadow VMCS.
>
> Avoid checking the "launched" state during an emergency reboot, unlike the
> behavior in __loaded_vmcs_clear(). This is important because reboot NMIs
> can interfere with operations like copy_shadow_to_vmcs12(), where shadow
> VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur
> right after the VMCS load, the shadow VMCSes will be active but the
> "launched" state may not be set.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b70ed72c1783..dccd1c9939b8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -769,8 +769,11 @@ void vmx_emergency_disable_virtualization_cpu(void)
> return;
>
> list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
> - loaded_vmcss_on_cpu_link)
> + loaded_vmcss_on_cpu_link) {
> vmcs_clear(v->vmcs);
> + if (v->shadow_vmcs)
> + vmcs_clear(v->shadow_vmcs);
> + }
>
> kvm_cpu_vmxoff();
> }