[PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block

Posted by Paolo Bonzini 3 years, 8 months ago
The return value of kvm_vcpu_block will be repurposed soon to return
the state of KVM_REQ_UNHALT.  In preparation for that, get rid of the
current return value.  It is only used by kvm_vcpu_halt to decide whether
the call resulted in a wait, but the same effect can be obtained with
a single round of polling.

No functional change intended, apart from practically indistinguishable
changes to the polling behavior.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      | 45 +++++++++++++++++-----------------------
 2 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c480b1821e1..e7bd48d15db8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1339,7 +1339,7 @@ void kvm_sigset_activate(struct kvm_vcpu *vcpu);
 void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
 
 void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
-bool kvm_vcpu_block(struct kvm_vcpu *vcpu);
+void kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
 bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 515dfe9d3bcf..1f049c1d01b4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3429,10 +3429,9 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
  * pending.  This is mostly used when halting a vCPU, but may also be used
  * directly for other vCPU non-runnable states, e.g. x86's Wait-For-SIPI.
  */
-bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
+void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 {
 	struct rcuwait *wait = kvm_arch_vcpu_get_wait(vcpu);
-	bool waited = false;
 
 	vcpu->stat.generic.blocking = 1;
 
@@ -3447,7 +3446,6 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		if (kvm_vcpu_check_block(vcpu) < 0)
 			break;
 
-		waited = true;
 		schedule();
 	}
 
@@ -3457,8 +3455,6 @@ bool kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	preempt_enable();
 
 	vcpu->stat.generic.blocking = 0;
-
-	return waited;
 }
 
 static inline void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
@@ -3493,35 +3489,32 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
 {
 	bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
 	bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
-	ktime_t start, cur, poll_end;
+	ktime_t start, cur, poll_end, stop;
 	bool waited = false;
 	u64 halt_ns;
 
 	start = cur = poll_end = ktime_get();
-	if (do_halt_poll) {
-		ktime_t stop = ktime_add_ns(start, vcpu->halt_poll_ns);
+	stop = do_halt_poll ? start : ktime_add_ns(start, vcpu->halt_poll_ns);
 
-		do {
-			/*
-			 * This sets KVM_REQ_UNHALT if an interrupt
-			 * arrives.
-			 */
-			if (kvm_vcpu_check_block(vcpu) < 0)
-				goto out;
-			cpu_relax();
-			poll_end = cur = ktime_get();
-		} while (kvm_vcpu_can_poll(cur, stop));
-	}
+	do {
+		/*
+		 * This sets KVM_REQ_UNHALT if an interrupt
+		 * arrives.
+		 */
+		if (kvm_vcpu_check_block(vcpu) < 0)
+			goto out;
+		cpu_relax();
+		poll_end = cur = ktime_get();
+	} while (kvm_vcpu_can_poll(cur, stop));
 
-	waited = kvm_vcpu_block(vcpu);
+	waited = true;
+	kvm_vcpu_block(vcpu);
 
 	cur = ktime_get();
-	if (waited) {
-		vcpu->stat.generic.halt_wait_ns +=
-			ktime_to_ns(cur) - ktime_to_ns(poll_end);
-		KVM_STATS_LOG_HIST_UPDATE(vcpu->stat.generic.halt_wait_hist,
-				ktime_to_ns(cur) - ktime_to_ns(poll_end));
-	}
+	vcpu->stat.generic.halt_wait_ns +=
+		ktime_to_ns(cur) - ktime_to_ns(poll_end);
+	KVM_STATS_LOG_HIST_UPDATE(vcpu->stat.generic.halt_wait_hist,
+				  ktime_to_ns(cur) - ktime_to_ns(poll_end));
 out:
 	/* The total time the vCPU was "halted", including polling time. */
 	halt_ns = ktime_to_ns(cur) - ktime_to_ns(start);
-- 
2.31.1
Re: [PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block
Posted by Sean Christopherson 3 years, 7 months ago
On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> The return value of kvm_vcpu_block will be repurposed soon to return

kvm_vcpu_block()

> the state of KVM_REQ_UNHALT.  In preparation for that, get rid of the
> current return value.  It is only used by kvm_vcpu_halt to decide whether

kvm_vcpu_halt()

> the call resulted in a wait, but the same effect can be obtained with
> a single round of polling.
> 
> No functional change intended, apart from practically indistinguishable
> changes to the polling behavior.

This "breaks" update_halt_poll_stats().  At the very least, it breaks the comment
that effectively says "waited" is set if and only if schedule() is called.

	/*
	 * Note, halt-polling is considered successful so long as the vCPU was
	 * never actually scheduled out, i.e. even if the wake event arrived
	 * after of the halt-polling loop itself, but before the full wait.
	 */
	if (do_halt_poll)
		update_halt_poll_stats(vcpu, start, poll_end, !waited);

> @@ -3493,35 +3489,32 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
>  {
>  	bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
>  	bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
> -	ktime_t start, cur, poll_end;
> +	ktime_t start, cur, poll_end, stop;
>  	bool waited = false;
>  	u64 halt_ns;
>  
>  	start = cur = poll_end = ktime_get();
> -	if (do_halt_poll) {
> -		ktime_t stop = ktime_add_ns(start, vcpu->halt_poll_ns);
> +	stop = do_halt_poll ? start : ktime_add_ns(start, vcpu->halt_poll_ns);

This is inverted, the loop should terminate after the first iteration (stop==start)
if halt-polling is _not_ allowed, i.e. do_halt_poll is false.

Overall, I don't like this patch.  I don't necessarily hate it, but I definitely
don't like it.

Isn't freeing up the return from kvm_vcpu_check_block() unnecessary?  Can't we
just do:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f11b505cbee..ccb9f8bdeb18 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10633,7 +10633,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
                if (hv_timer)
                        kvm_lapic_switch_to_hv_timer(vcpu);

-               if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
+               if (!kvm_arch_vcpu_runnable(vcpu))
                        return 1;
        }


which IMO is more intuitive and doesn't require reworking halt-polling (again).

I don't see why KVM cares if a vCPU becomes runnable after detecting that something
else caused kvm_vcpu_check_block() to bail.  E.g. a pending signal doesn't invalidate
a pending vCPU event, and either way KVM is bailing from the run loop.
Re: [PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block
Posted by Paolo Bonzini 3 years, 7 months ago
On 8/17/22 01:34, Sean Christopherson wrote:
> Isn't freeing up the return from kvm_vcpu_check_block() unnecessary?  Can't we
> just do:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9f11b505cbee..ccb9f8bdeb18 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10633,7 +10633,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>                  if (hv_timer)
>                          kvm_lapic_switch_to_hv_timer(vcpu);
> 
> -               if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
> +               if (!kvm_arch_vcpu_runnable(vcpu))
>                          return 1;
>          }
> 
> 
> which IMO is more intuitive and doesn't require reworking halt-polling (again).

This was my first idea indeed.  However I didn't like calling 
kvm_arch_vcpu_runnable() again and "did it schedule()" seemed to be a 
less interesting result from kvm_vcpu_block() (and in fact 
kvm_vcpu_halt() does not bother passing it up the return chain).

Paolo
Re: [PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block
Posted by Sean Christopherson 3 years, 7 months ago
On Wed, Aug 17, 2022, Paolo Bonzini wrote:
> On 8/17/22 01:34, Sean Christopherson wrote:
> > Isn't freeing up the return from kvm_vcpu_check_block() unnecessary?  Can't we
> > just do:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9f11b505cbee..ccb9f8bdeb18 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10633,7 +10633,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> >                  if (hv_timer)
> >                          kvm_lapic_switch_to_hv_timer(vcpu);
> > 
> > -               if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
> > +               if (!kvm_arch_vcpu_runnable(vcpu))
> >                          return 1;
> >          }
> > 
> > 
> > which IMO is more intuitive and doesn't require reworking halt-polling (again).
> 
> This was my first idea indeed.  However I didn't like calling
> kvm_arch_vcpu_runnable() again and "did it schedule()" seemed to be a less
> interesting result from kvm_vcpu_block() (and in fact kvm_vcpu_halt() does
> not bother passing it up the return chain).

The flip side of calling kvm_arch_vcpu_runnable() again is that KVM will immediately
wake the vCPU if it becomes runnable after kvm_vcpu_check_block().  The edge cases
where the vCPU becomes runnable late are unlikely to truly matter in practice, but
on the other hand manually re-checking kvm_arch_vcpu_runnable() means KVM gets both
cases "right" (waited=true iff vCPU actually waited, vCPU awakened ASAP), whereas
squishing the information into the return of kvm_vcpu_check_block() means KVM gets
both cases "wrong" (waited=true even if schedule() was never called, vCPU left in
a non-running state even though it's runnable).

My only hesitation with calling kvm_arch_vcpu_runnable() again is that it could be
problematic if KVM somehow managed to consume the event that caused kvm_vcpu_has_events()
to return true, but I don't see how that could happen without it being a KVM bug.

Ah, and if we do go with the explicit check, it should come with a comment to call
out that KVM needs to return up the stack, e.g.

		/*
		 * If KVM stopped blocking the vCPU but the vCPU is still not
		 * runnable, then there is a pending host event of some kind,
		 * e.g. a pending signal.  Return up the stack to service the
		 * pending event without changing the vCPU's activity state.
		 */
		if (!kvm_arch_vcpu_runnable(vcpu))
			return 1;


so that we don't end up combining the checks into:

	while (!kvm_arch_vcpu_runnable(vcpu)) {
		/*
		 * Switch to the software timer before halt-polling/blocking as
		 * the guest's timer may be a break event for the vCPU, and the
		 * hypervisor timer runs only when the CPU is in guest mode.
		 * Switch before halt-polling so that KVM recognizes an expired
		 * timer before blocking.
		 */
		hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
		if (hv_timer)
			kvm_lapic_switch_to_sw_timer(vcpu);

		kvm_vcpu_srcu_read_unlock(vcpu);
		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
			kvm_vcpu_halt(vcpu);
		else
			kvm_vcpu_block(vcpu);
		kvm_vcpu_srcu_read_lock(vcpu);

		if (hv_timer)
			kvm_lapic_switch_to_hv_timer(vcpu);
	}

which looks sane, but would mean KVM never bounces out to handle whatever event
needs handling.

Side topic, usage of kvm_apic_accept_events() appears to be broken (though nothing
can trigger the bug).  If kvm_apic_accept_events() were to return an -errno, then
kvm_arch_vcpu_ioctl_run() would return '0' to userspace without updating
vcpu->run->exit_reason.  I think an easy fix is to drop the return value entirely
and then WARN if kvm_check_nested_events() returns something other than -EBUSY.

	if (is_guest_mode(vcpu)) {
		r = kvm_check_nested_events(vcpu);
		if (r < 0) {
			WARN_ON_ONCE(r != -EBUSY);
			return;
		}
Re: [PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block
Posted by Sean Christopherson 3 years, 6 months ago
On Wed, Aug 17, 2022, Sean Christopherson wrote:
> Side topic, usage of kvm_apic_accept_events() appears to be broken (though nothing
> can trigger the bug).  If kvm_apic_accept_events() were to return an -errno, then
> kvm_arch_vcpu_ioctl_run() would return '0' to userspace without updating
> vcpu->run->exit_reason.  I think an easy fix is to drop the return value entirely
> and then WARN if kvm_check_nested_events() returns something other than -EBUSY.
> 
> 	if (is_guest_mode(vcpu)) {
> 		r = kvm_check_nested_events(vcpu);
> 		if (r < 0) {
> 			WARN_ON_ONCE(r != -EBUSY);
> 			return;
> 		}

For posterity, I was wrong.  Way down the stack, vmx_complete_nested_posted_interrupt()
can return -ENXIO after filling vcpu->run->exit_reason via kvm_handle_memory_failure().
That's the entire reason why negative values from kvm_check_nested_events() and
kvm_apic_accept_events() are morphed to '0', i.e. to "exit to userspace".
Re: [PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block
Posted by Paolo Bonzini 3 years, 7 months ago
On 8/17/22 18:41, Sean Christopherson wrote:
> On Wed, Aug 17, 2022, Paolo Bonzini wrote:
>> On 8/17/22 01:34, Sean Christopherson wrote:
>>> Isn't freeing up the return from kvm_vcpu_check_block() unnecessary?  Can't we
>>> just do:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 9f11b505cbee..ccb9f8bdeb18 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -10633,7 +10633,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>>>                   if (hv_timer)
>>>                           kvm_lapic_switch_to_hv_timer(vcpu);
>>>
>>> -               if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
>>> +               if (!kvm_arch_vcpu_runnable(vcpu))
>>>                           return 1;
>>>           }
>>>
>>>
>>> which IMO is more intuitive and doesn't require reworking halt-polling (again).
>>
>> This was my first idea indeed.  However I didn't like calling
>> kvm_arch_vcpu_runnable() again and "did it schedule()" seemed to be a less
>> interesting result from kvm_vcpu_block() (and in fact kvm_vcpu_halt() does
>> not bother passing it up the return chain).
> 
> The flip side of calling kvm_arch_vcpu_runnable() again is that KVM will immediately
> wake the vCPU if it becomes runnable after kvm_vcpu_check_block().  The edge cases
> where the vCPU becomes runnable late are unlikely to truly matter in practice, but
> on the other hand manually re-checking kvm_arch_vcpu_runnable() means KVM gets both
> cases "right" (waited=true iff vCPU actually waited, vCPU awakened ASAP), whereas
> squishing the information into the return of kvm_vcpu_check_block() means KVM gets
> both cases "wrong" (waited=true even if schedule() was never called, vCPU left in
> a non-running state even though it's runnable).
> 
> My only hesitation with calling kvm_arch_vcpu_runnable() again is that it could be
> problematic if KVM somehow managed to consume the event that caused kvm_vcpu_has_events()
> to return true, but I don't see how that could happen without it being a KVM bug.

No, I agree that it cannot happen, and especially so after getting rid 
of the kvm_check_nested_events() call in kvm_arch_vcpu_runnable().

I'll reorder the patches and apply your suggestion.

Paolo
Re: [PATCH v2 2/9] KVM: x86: remove return value of kvm_vcpu_block
Posted by Maxim Levitsky 3 years, 7 months ago
On Tue, 2022-08-16 at 23:34 +0000, Sean Christopherson wrote:
> On Thu, Aug 11, 2022, Paolo Bonzini wrote:
> > The return value of kvm_vcpu_block will be repurposed soon to return
> 
> kvm_vcpu_block()
> 
> > the state of KVM_REQ_UNHALT.  In preparation for that, get rid of the
> > current return value.  It is only used by kvm_vcpu_halt to decide whether
> 
> kvm_vcpu_halt()
> 
> > the call resulted in a wait, but the same effect can be obtained with
> > a single round of polling.
> > 
> > No functional change intended, apart from practically indistinguishable
> > changes to the polling behavior.
> 
> This "breaks" update_halt_poll_stats().  At the very least, it breaks the comment
> that effectively says "waited" is set if and only if schedule() is called.
> 
>         /*
>          * Note, halt-polling is considered successful so long as the vCPU was
>          * never actually scheduled out, i.e. even if the wake event arrived
>          * after of the halt-polling loop itself, but before the full wait.
>          */
>         if (do_halt_poll)
>                 update_halt_poll_stats(vcpu, start, poll_end, !waited);
> 
> > @@ -3493,35 +3489,32 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu)
> >  {
> >         bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
> >         bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;
> > -       ktime_t start, cur, poll_end;
> > +       ktime_t start, cur, poll_end, stop;
> >         bool waited = false;
> >         u64 halt_ns;
> >  
> >         start = cur = poll_end = ktime_get();
> > -       if (do_halt_poll) {
> > -               ktime_t stop = ktime_add_ns(start, vcpu->halt_poll_ns);
> > +       stop = do_halt_poll ? start : ktime_add_ns(start, vcpu->halt_poll_ns);
> 
> This is inverted, the loop should terminate after the first iteration (stop==start)
> if halt-polling is _not_ allowed, i.e. do_halt_poll is false.
> 
> Overall, I don't like this patch.  I don't necessarily hate it, but I definitely
> don't like it.

Roughly the same opinion here.

> 
> Isn't freeing up the return from kvm_vcpu_check_block() unnecessary?  Can't we
> just do:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9f11b505cbee..ccb9f8bdeb18 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10633,7 +10633,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>                 if (hv_timer)
>                         kvm_lapic_switch_to_hv_timer(vcpu);
> 
> -               if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
> +               if (!kvm_arch_vcpu_runnable(vcpu))


kvm_vcpu_block() returns when kvm_vcpu_check_block() returns a negative
number, which can also happen due to a pending APIC timer or signal;
however, KVM_REQ_UNHALT is set only when kvm_arch_vcpu_runnable() == true,
so the above does make sense.


Best regards,
 Maxim Levitsky


>                         return 1;
>         }
> 
> 
> which IMO is more intuitive and doesn't require reworking halt-polling (again).
> 
> I don't see why KVM cares if a vCPU becomes runnable after detecting that something
> else caused kvm_vcpu_check_block() to bail.  E.g. a pending signal doesn't invalidate
> a pending vCPU event, and either way KVM is bailing from the run loop.
>