[PATCH v2 9/9] KVM: x86: never write to memory from kvm_vcpu_check_block

Paolo Bonzini posted 9 patches 3 years, 8 months ago
There is a newer version of this series
Posted by Paolo Bonzini 3 years, 8 months ago
kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
it cannot sleep.  Writing to guest memory is therefore forbidden, but it
can happen on AMD processors if kvm_check_nested_events() causes a vmexit.

Fortunately, all events that are caught by kvm_check_nested_events() are
also recognized by kvm_vcpu_has_events() through vendor callbacks such as
kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
remove the call and postpone the actual processing to vcpu_block().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5e9358ea112b..9226fd536783 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10639,6 +10639,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 			return 1;
 	}
 
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * Evaluate nested events before exiting the halted state.
+		 * This allows the halt state to be recorded properly in
+		 * the VMCS12's activity state field (AMD does not have
+		 * a similar field and a vmexit always causes a spurious
+		 * wakeup from HLT).
+		 */
+		kvm_check_nested_events(vcpu);
+	}
+
 	if (kvm_apic_accept_events(vcpu) < 0)
 		return 0;
 	switch(vcpu->arch.mp_state) {
@@ -10662,9 +10673,6 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 
 static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu))
-		kvm_check_nested_events(vcpu);
-
 	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
 		!vcpu->arch.apf.halted);
 }
-- 
2.31.1
Re: [PATCH v2 9/9] KVM: x86: never write to memory from kvm_vcpu_check_block
Posted by Sean Christopherson 3 years, 6 months ago
On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
> it cannot sleep.  Writing to guest memory is therefore forbidden, but it
> can happen on AMD processors if kvm_check_nested_events() causes a vmexit.
> 
> Fortunately, all events that are caught by kvm_check_nested_events() are
> also recognized by kvm_vcpu_has_events() through vendor callbacks such as
> kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
> remove the call and postpone the actual processing to vcpu_block().
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5e9358ea112b..9226fd536783 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10639,6 +10639,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  			return 1;
>  	}
>  
> +	if (is_guest_mode(vcpu)) {
> +		/*
> +		 * Evaluate nested events before exiting the halted state.
> +		 * This allows the halt state to be recorded properly in
> +		 * the VMCS12's activity state field (AMD does not have
> +		 * a similar field and a vmexit always causes a spurious
> +		 * wakeup from HLT).
> +		 */
> +		kvm_check_nested_events(vcpu);

Continuing the conversation with myself, this "needs" to check the return of
kvm_check_nested_events().  In quotes because KVM just signals "internal error"
if vmx_complete_nested_posted_interrupt() fails, i.e. the VM is likely dead anyways,
but again it's odd that the return of kvm_apic_accept_events() is checked but the
direct call to kvm_check_nested_events() is not.
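
A minimal userspace sketch of what checking the return could look like. The struct, the stubbed helpers and the name vcpu_block_sketch() are hypothetical stand-ins, not kernel code, and the -EBUSY handling is an assumption borrowed from how kvm_apic_accept_events() treats that value:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the KVM types and helpers. */
struct kvm_vcpu {
	bool guest_mode;
	int nested_events_ret;	/* what the stubbed nested check returns */
	int apic_events_ret;	/* what the stubbed APIC check returns */
};

static bool is_guest_mode(struct kvm_vcpu *vcpu)
{
	return vcpu->guest_mode;
}

static int kvm_check_nested_events(struct kvm_vcpu *vcpu)
{
	return vcpu->nested_events_ret;
}

static int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
{
	return vcpu->apic_events_ret;
}

/* Returns 1 to keep running the vCPU, 0 to bail out to the caller. */
static int vcpu_block_sketch(struct kvm_vcpu *vcpu)
{
	assert(vcpu != NULL);

	if (is_guest_mode(vcpu)) {
		int r = kvm_check_nested_events(vcpu);

		/* Assumption: -EBUSY means "retry later", other errors are fatal. */
		if (r < 0 && r != -EBUSY)
			return 0;
	}

	if (kvm_apic_accept_events(vcpu) < 0)
		return 0;

	return 1;
}
```

With this shape, a hard failure from either call is handled the same way, which is the symmetry the comment above is asking for.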

> +	}
> +
>  	if (kvm_apic_accept_events(vcpu) < 0)
>  		return 0;
>  	switch(vcpu->arch.mp_state) {
> @@ -10662,9 +10673,6 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  
>  static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
>  {
> -	if (is_guest_mode(vcpu))
> -		kvm_check_nested_events(vcpu);
> -
>  	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
>  		!vcpu->arch.apf.halted);
>  }
> -- 
> 2.31.1
>
Re: [PATCH v2 9/9] KVM: x86: never write to memory from kvm_vcpu_check_block
Posted by Sean Christopherson 3 years, 6 months ago
On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
> it cannot sleep.  Writing to guest memory is therefore forbidden, but it
> can happen on AMD processors if kvm_check_nested_events() causes a vmexit.
> 
> Fortunately, all events that are caught by kvm_check_nested_events() are
> also recognized by kvm_vcpu_has_events() through vendor callbacks such as
> kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
> remove the call and postpone the actual processing to vcpu_block().
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5e9358ea112b..9226fd536783 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10639,6 +10639,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  			return 1;
>  	}
>  
> +	if (is_guest_mode(vcpu)) {
> +		/*
> +		 * Evaluate nested events before exiting the halted state.
> +		 * This allows the halt state to be recorded properly in
> +		 * the VMCS12's activity state field (AMD does not have
> +		 * a similar field and a vmexit always causes a spurious
> +		 * wakeup from HLT).
> +		 */
> +		kvm_check_nested_events(vcpu);
> +	}
> +
>  	if (kvm_apic_accept_events(vcpu) < 0)
>  		return 0;

Oof, this ends up yielding a really confusing code sequence.  kvm_apic_accept_events()
has its own kvm_check_nested_events(), but has code to snapshot pending INITs/SIPIs
_before_ the call.  Unpacked, KVM ends up with:

	if (is_guest_mode(vcpu))
		kvm_check_nested_events(vcpu);

	/*
	 * Read pending events before calling the check_events
	 * callback.
	 */
	pe = smp_load_acquire(&apic->pending_events);
	if (!pe)
		return 0;

	if (is_guest_mode(vcpu)) {
		r = kvm_check_nested_events(vcpu);
		if (r < 0)
			return r == -EBUSY ? 0 : r;
	}

	if (kvm_vcpu_latch_init(vcpu)) {
		WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
		if (test_bit(KVM_APIC_SIPI, &pe))
			clear_bit(KVM_APIC_SIPI, &apic->pending_events);
		return 0;
	}

	if (test_bit(KVM_APIC_INIT, &pe)) {
		clear_bit(KVM_APIC_INIT, &apic->pending_events);
		kvm_vcpu_reset(vcpu, true);
		if (kvm_vcpu_is_bsp(apic->vcpu))
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		else
			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
	}
	if (test_bit(KVM_APIC_SIPI, &pe)) {
		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
			/* evaluate pending_events before reading the vector */
			smp_rmb();
			sipi_vector = apic->sipi_vector;
			static_call(kvm_x86_vcpu_deliver_sipi_vector)(vcpu, sipi_vector);
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		}
	}

which on the surface makes this code look broken, e.g. if kvm_check_nested_events()
_needs_ to be after the pending_events snapshot is taken, why is it safe to add a
kvm_check_nested_events() call immediately before the snapshot?

In reality, it's just a bunch of noise because the pending events snapshot is
completely unnecessary and subtly relies on INIT+SIPI being blocked after VM-Exit
on VMX (and SVM, but it's more important for VMX).

In particular, testing "pe" after VM-Exit is nonsensical.  On VMX, events are consumed
if they trigger VM-Exit, i.e. processing INIT/SIPI is flat out wrong if the INIT/SIPI
was the direct cause of VM-Exit.  On SVM, events are left pending, so any pending
INIT/SIPI will still be there.

The VMX code works because kvm_vcpu_latch_init(), a.k.a. "is INIT blocked", is
always true after VM-Exit since INIT is always blocked in VMX root mode.  Ditto for
the conditional clearing of SIPI; the CPU can't be in wait-for-SIPI immediately
after VM-Exit and so dropping SIPIs ends up being architecturally ok.

I'll add a patch to drop the snapshot code, assuming I'm not missing something even
more subtle...
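
A rough userspace model of the snapshot-free flow, reading the live pending bits instead of a pre-read copy. This is only a sketch under stated assumptions: struct vcpu_model, the bit names and accept_events_no_snapshot() are hypothetical, the vCPU reset and SIPI vector delivery are omitted, and it is not the patch referred to above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two pending-event bits. */
#define APIC_INIT (1u << 0)
#define APIC_SIPI (1u << 1)

enum mp_state { MP_RUNNABLE, MP_INIT_RECEIVED };

struct vcpu_model {
	unsigned int pending_events;
	bool init_latched;	/* models kvm_vcpu_latch_init() */
	bool is_bsp;
	enum mp_state mp_state;
};

/* Process INIT/SIPI from the live bits; no snapshot is taken beforehand. */
static void accept_events_no_snapshot(struct vcpu_model *v)
{
	assert(v != NULL);

	if (v->init_latched) {
		/* INIT stays pending while blocked; a pending SIPI is dropped. */
		v->pending_events &= ~APIC_SIPI;
		return;
	}

	if (v->pending_events & APIC_INIT) {
		v->pending_events &= ~APIC_INIT;
		v->mp_state = v->is_bsp ? MP_RUNNABLE : MP_INIT_RECEIVED;
	}

	if (v->pending_events & APIC_SIPI) {
		v->pending_events &= ~APIC_SIPI;
		/* SIPI only has an effect while waiting for it. */
		if (v->mp_state == MP_INIT_RECEIVED)
			v->mp_state = MP_RUNNABLE;
	}
}
```

In this model the latched case never touches INIT, which mirrors the argument that INIT is always blocked right after VM-Exit on VMX.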
Re: [PATCH v2 9/9] KVM: x86: never write to memory from kvm_vcpu_check_block
Posted by Sean Christopherson 3 years, 7 months ago
On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
> it cannot sleep.  Writing to guest memory is therefore forbidden, but it
> can happen on AMD processors if kvm_check_nested_events() causes a vmexit.
> 
> Fortunately, all events that are caught by kvm_check_nested_events() are
> also recognized by kvm_vcpu_has_events() through vendor callbacks such as
> kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
> remove the call and postpone the actual processing to vcpu_block().
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5e9358ea112b..9226fd536783 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10639,6 +10639,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  			return 1;
>  	}
>  
> +	if (is_guest_mode(vcpu)) {
> +		/*
> +		 * Evaluate nested events before exiting the halted state.
> +		 * This allows the halt state to be recorded properly in
> +		 * the VMCS12's activity state field (AMD does not have
> +		 * a similar field and a vmexit always causes a spurious
> +		 * wakeup from HLT).
> +		 */
> +		kvm_check_nested_events(vcpu);

Formatting nit, I'd prefer the block comment go above the if-statement, that way
we avoid debating whether or not the technically-unnecessary braces align with
kernel/KVM style, and it doesn't have to wrap as aggressively.

And s/vmexit/VM-Exit while I'm nitpicking.

	/*
	 * Evaluate nested events before exiting the halted state.  This allows
	 * the halt state to be recorded properly in the VMCS12's activity
	 * state field (AMD does not have a similar field and a VM-Exit always
	 * causes a spurious wakeup from HLT).
	 */
	if (is_guest_mode(vcpu))
		kvm_check_nested_events(vcpu);

Side topic, the AMD behavior is a bug report waiting to happen.  I know of at least
one customer failure that was root caused to a KVM bug where KVM caused a spurious
wakeup.  To be fair, the guest workload was being stupid (execute HLT on vCPU and
then effectively unmap its code by doing kexec), but it's still an unpleasant gap :-(
Re: [PATCH v2 9/9] KVM: x86: never write to memory from kvm_vcpu_check_block
Posted by Maxim Levitsky 3 years, 7 months ago
On Tue, 2022-08-16 at 23:45 +0000, Sean Christopherson wrote:
> On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> > kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
> > it cannot sleep.  Writing to guest memory is therefore forbidden, but it
> > can happen on AMD processors if kvm_check_nested_events() causes a vmexit.
> > 
> > Fortunately, all events that are caught by kvm_check_nested_events() are
> > also recognized by kvm_vcpu_has_events() through vendor callbacks such as
> > kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
> > remove the call and postpone the actual processing to vcpu_block().
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  arch/x86/kvm/x86.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5e9358ea112b..9226fd536783 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10639,6 +10639,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> >                         return 1;
> >         }
> >  
> > +       if (is_guest_mode(vcpu)) {
> > +               /*
> > +                * Evaluate nested events before exiting the halted state.
> > +                * This allows the halt state to be recorded properly in
> > +                * the VMCS12's activity state field (AMD does not have
> > +                * a similar field and a vmexit always causes a spurious
> > +                * wakeup from HLT).
> > +                */

I assume the comment refers to the fact that a nested_vmx_vmexit() caused by an
event on the HLT instruction triggers an update of 'vmcs12->guest_activity_state',
so it should be done before we update 'vcpu->arch.mp_state'.


> > +               kvm_check_nested_events(vcpu);
> 
> Formatting nit, I'd prefer the block comment go above the if-statement, that way
> we avoid debating whether or not the technically-unnecessary braces align with
> kernel/KVM style, and it doesn't have to wrap as aggressively.
> 
> And s/vmexit/VM-Exit while I'm nitpicking.
> 
>         /*
>          * Evaluate nested events before exiting the halted state.  This allows
>          * the halt state to be recorded properly in the VMCS12's activity
>          * state field (AMD does not have a similar field and a VM-Exit always
>          * causes a spurious wakeup from HLT).
>          */
>         if (is_guest_mode(vcpu))
>                 kvm_check_nested_events(vcpu);
> 
> Side topic, the AMD behavior is a bug report waiting to happen.  I know of at least
> one customer failure that was root caused to a KVM bug where KVM caused a spurious
> wakeup.  To be fair, the guest workload was being stupid (execute HLT on vCPU and
> then effectively unmap its code by doing kexec), but it's still an unpleasant gap :-(

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
