[PATCH v3 2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset
Posted by Maxim Levitsky 1 month, 3 weeks ago

Reset the segment cache after segment initialization in vmx_vcpu_reset()
to avoid caching stale, uninitialized data.

In particular, the following scenario is possible when full preemption
is enabled:

- The vCPU is just created, and the vCPU thread is preempted before
SS.AR_BYTES is written in vmx_vcpu_reset().

- During preemption, kvm_arch_vcpu_in_kernel() is called. It reads SS's
AR bytes to determine whether the vCPU was running in kernel mode.

That call caches a value of 0 for SS.AR_BYTES. When the vCPU thread is
eventually scheduled back in, it writes the correct SS.AR_BYTES value to
the VMCS, but the cached value remains stale and can be read, e.g. via
KVM_GET_SREGS.

Usually this is not a problem because the VMX segment cache is reset on
each vCPU run. However, if userspace (e.g. the KVM selftests) reads the
segment registers just after the vCPU was created, modifies some of
them, and writes the others back unchanged (SS.AR_BYTES among them), the
stale value makes it into the VMCS and later leads to a VM entry failure
due to an incorrect SS segment type.
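
To make the window concrete (a simplified sketch of the call chain; the
exact interleaving is illustrative):

  vCPU thread                          preemption path
  -----------                          ---------------
  vmx_vcpu_reset()
    vmx_segment_cache_clear()
    kvm_register_mark_available()
                                       kvm_arch_vcpu_in_kernel()
                                         vmx_get_cpl()
                                           vmx_read_guest_seg_ar(SS)
                                             /* reads SS.AR_BYTES == 0
                                                from the VMCS, caches 0 */
    seg_setup(VCPU_SREG_SS)
      /* VMCS now correct, but the
         cached 0 is never dropped */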

Fix this by moving the vmx_segment_cache_clear() call to be after the
segments are initialized.

Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
getting stale data during the segment setup, and that issue will
be addressed later.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fa9f307d9b18..d43bb755e15c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx->hv_deadline_tsc = -1;
 	kvm_set_cr8(vcpu, 0);
 
-	vmx_segment_cache_clear(vmx);
-	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
-
 	seg_setup(VCPU_SREG_CS);
 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
@@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
+	vmx_segment_cache_clear(vmx);
+	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
+
 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
-- 
2.26.3
Re: [PATCH v3 2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset
Posted by Chao Gao 1 month, 1 week ago
On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:

KVM: VMX:

>Reset the segment cache after segment initialization in vmx_vcpu_reset()
>to avoid caching stale, uninitialized data.
>
>In particular, the following scenario is possible when full preemption
>is enabled:
>
>- The vCPU is just created, and the vCPU thread is preempted before
>SS.AR_BYTES is written in vmx_vcpu_reset().
>
>- During preemption, kvm_arch_vcpu_in_kernel() is called. It reads SS's
>AR bytes to determine whether the vCPU was running in kernel mode.
>
>That call caches a value of 0 for SS.AR_BYTES. When the vCPU thread is
>eventually scheduled back in, it writes the correct SS.AR_BYTES value to
>the VMCS, but the cached value remains stale and can be read, e.g. via
>KVM_GET_SREGS.
>
>Usually this is not a problem because the VMX segment cache is reset on
>each vCPU run. However, if userspace (e.g. the KVM selftests) reads the
>segment registers just after the vCPU was created, modifies some of
>them, and writes the others back unchanged (SS.AR_BYTES among them), the
>stale value makes it into the VMCS and later leads to a VM entry failure
>due to an incorrect SS segment type.

I looked into the same issue last week, which was reported by someone
internally.

>
>Fix this by moving the vmx_segment_cache_clear() call to be after the
>segments are initialized.
>
>Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
>getting stale data during the segment setup, and that issue will
>be addressed later.
>
>Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

Do you need a Fixes tag and/or Cc: stable?

>---
> arch/x86/kvm/vmx/vmx.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index fa9f307d9b18..d43bb755e15c 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	vmx->hv_deadline_tsc = -1;
> 	kvm_set_cr8(vcpu, 0);
> 
>-	vmx_segment_cache_clear(vmx);
>-	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
>-
> 	seg_setup(VCPU_SREG_CS);
> 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
>@@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	vmcs_writel(GUEST_IDTR_BASE, 0);
> 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> 
>+	vmx_segment_cache_clear(vmx);
>+	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);

vmx_segment_cache_clear() is called at a few other sites. I think at least the
call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
after a write to it. If the write was preempted after the cache was cleared but
before the new value was written into the VMCS, QEMU would find that SS.AR held
a stale value.
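
The window would look roughly like this (abridged from __vmx_set_segment();
the comment marks my assumption about where preemption bites):

	void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
			       int seg)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];

		vmx_segment_cache_clear(vmx);	/* cache dropped here */

		/*
		 * <-- if the task is preempted here, kvm_arch_vcpu_in_kernel()
		 * re-populates the cache from the VMCS, which still holds the
		 * OLD AR bytes ...
		 */

		vmcs_write16(sf->selector, var->selector);
		vmcs_writel(sf->base, var->base);
		vmcs_write32(sf->limit, var->limit);
		/* ... and the write below doesn't invalidate that stale entry. */
		vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
	}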

>+
> 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
> 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
> 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
>-- 
>2.26.3
>
>
Re: [PATCH v3 2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset
Posted by Sean Christopherson 1 month, 1 week ago
On Tue, Aug 06, 2024, Chao Gao wrote:
> On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> >Fix this by moving the vmx_segment_cache_clear() call to be after the
> >segments are initialized.
> >
> >Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> >getting stale data during the segment setup, and that issue will
> >be addressed later.
> >
> >Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Do you need a Fixes tag and/or Cc: stable?

Heh, it's an old one

  Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")

> 
> >---
> > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index fa9f307d9b18..d43bb755e15c 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > 	vmx->hv_deadline_tsc = -1;
> > 	kvm_set_cr8(vcpu, 0);
> > 
> >-	vmx_segment_cache_clear(vmx);
> >-	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> >-
> > 	seg_setup(VCPU_SREG_CS);
> > 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> >@@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > 
> >+	vmx_segment_cache_clear(vmx);
> >+	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> 
> vmx_segment_cache_clear() is called at a few other sites. I think at least the
> call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> after a write to it. If the write was preempted after the cache was cleared but
> before the new value was written into the VMCS, QEMU would find that SS.AR held
> a stale value.

Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
wrong, but it's obviously incomplete, and will be unnecessary if the preemption
issue is resolved.

[*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com
Re: [PATCH v3 2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset
Posted by Maxim Levitsky 1 week ago
On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> On Tue, Aug 06, 2024, Chao Gao wrote:
> > On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> > > Fix this by moving the vmx_segment_cache_clear() call to be after the
> > > segments are initialized.
> > > 
> > > Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> > > getting stale data during the segment setup, and that issue will
> > > be addressed later.
> > > 
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > Do you need a Fixes tag and/or Cc: stable?
> 
> Heh, it's an old one
> 
>   Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")
> 
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index fa9f307d9b18..d43bb755e15c 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > 	vmx->hv_deadline_tsc = -1;
> > > 	kvm_set_cr8(vcpu, 0);
> > > 
> > > -	vmx_segment_cache_clear(vmx);
> > > -	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > -
> > > 	seg_setup(VCPU_SREG_CS);
> > > 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > > 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > 
> > > +	vmx_segment_cache_clear(vmx);
> > > +	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > 
> > vmx_segment_cache_clear() is called at a few other sites. I think at least the
> > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > after a write to it. If the write was preempted after the cache was cleared but
> > before the new value was written into the VMCS, QEMU would find that SS.AR held
> > a stale value.
> 
> Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
> wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> issue is resolved.

Hi,

I was thinking of keeping it simple, since the issue is mostly theoretical after this fix,
but I'll give this another try.

Best regards,
	Maxim Levitsky

> 
> [*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com
>
Re: [PATCH v3 2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset
Posted by Maxim Levitsky 6 days, 18 hours ago
On Mon, 2024-09-09 at 15:11 -0400, Maxim Levitsky wrote:
> On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> > On Tue, Aug 06, 2024, Chao Gao wrote:
> > > On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> > > > Fix this by moving the vmx_segment_cache_clear() call to be after the
> > > > segments are initialized.
> > > > 
> > > > Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> > > > getting stale data during the segment setup, and that issue will
> > > > be addressed later.
> > > > 
> > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > Do you need a Fixes tag and/or Cc: stable?
> > 
> > Heh, it's an old one
> > 
> >   Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")
> > 
> > > > ---
> > > > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index fa9f307d9b18..d43bb755e15c 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > 	vmx->hv_deadline_tsc = -1;
> > > > 	kvm_set_cr8(vcpu, 0);
> > > > 
> > > > -	vmx_segment_cache_clear(vmx);
> > > > -	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > > -
> > > > 	seg_setup(VCPU_SREG_CS);
> > > > 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > > > 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> > > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > 	vmcs_writel(GUEST_IDTR_BASE, 0);
> > > > 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > > 
> > > > +	vmx_segment_cache_clear(vmx);
> > > > +	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > 
> > > vmx_segment_cache_clear() is called at a few other sites. I think at least the
> > > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > > after a write to it. If the write was preempted after the cache was cleared but
> > > before the new value was written into the VMCS, QEMU would find that SS.AR held
> > > a stale value.
> > 
> > Ya, I thought the plan was to go for a more complete fix[*]?  This change isn't
> > wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> > issue is resolved.
> 
> Hi,
> 
> I was thinking of keeping it simple, since the issue is mostly theoretical after this fix,
> but I'll give this another try.
> 
> Best regards,
> 	Maxim Levitsky
> 
> > [*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@redhat.com
> > 

Hi,

This is what I am thinking, after going over this issue again:

Pre-populating the cache and/or adding 'exited_in_kernel' will waste
VMREADs on *each* VM exit; I worry that this is just not worth it for
the mostly theoretical issue that we have.


Since the segment and register caches only optimize the case of reading
the same field two or more times, I suspect that always reading these
fields is worse performance-wise than removing the segment cache
altogether and just re-reading the fields on each access.
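
For reference, the read path (simplified from vmx.c): the cache only saves
a VMREAD when the same field is read again before the next invalidation:

	static u32 vmx_read_guest_seg_ar(struct vcpu_vmx *vmx, unsigned seg)
	{
		u32 *p = &vmx->segment_cache.seg[seg].ar;

		/* VMREAD only if the field wasn't read since the last clear. */
		if (!vmx_segment_cache_test_set(vmx, seg, SEG_FIELD_AR))
			*p = vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);
		return *p;
	}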


Finally, all 3 places that read the segment cache only access a single
piece of data (SS.AR or RIP), so it doesn't really matter whether they
see the old or the new value.

If userspace changes SS's AR bytes out of the blue and we then take a
preemption event, the old value is, as you say, in theory the correct
one, but in practice it really doesn't matter.

So IMHO, just ensuring that we invalidate the segment cache right after
any changes is the simplest solution.

In addition, I can add a warning to kvm_register_is_available() and
vmx_segment_cache_test_set() that checks that only SS.AR and RIP are
read from interrupt context, so that if someone attempts to read more
fields in the future, this issue can be re-evaluated.
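
Something along these lines for the segment-cache side (a sketch only;
the in_task() check is a placeholder for however we end up detecting
the preemption/interrupt path):

	static bool vmx_segment_cache_test_set(struct vcpu_vmx *vmx, unsigned seg,
					       unsigned field)
	{
		bool ret;
		u32 mask = 1 << (seg * SEG_FIELD_NR + field);

		/*
		 * Only SS.AR is expected to be read through the cache outside
		 * of task context; warn so that any new reader forces this
		 * analysis to be redone.
		 */
		WARN_ON_ONCE(!in_task() &&
			     !(seg == VCPU_SREG_SS && field == SEG_FIELD_AR));

		if (!kvm_register_is_available(&vmx->vcpu, VCPU_EXREG_SEGMENTS)) {
			kvm_register_mark_available(&vmx->vcpu, VCPU_EXREG_SEGMENTS);
			vmx->segment_cache.bitmask = 0;
		}
		ret = vmx->segment_cache.bitmask & mask;
		vmx->segment_cache.bitmask |= mask;
		return ret;
	}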

Best regards,
	Maxim Levitsky