[PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled

Sean Christopherson posted 1 patch 3 weeks, 1 day ago
arch/x86/kvm/lapic.c | 9 +++++++++
1 file changed, 9 insertions(+)
[PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled
Posted by Sean Christopherson 3 weeks, 1 day ago
Explicitly set apic->irr_pending after stuffing the vIRR when userspace
sets APIC state and APICv is disabled, otherwise KVM will skip scanning
the vIRR in subsequent calls to apic_find_highest_irr(), and ultimately
fail to inject the interrupt until another interrupt happens to be added
to the vIRR.

Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
be true if APICv is enabled, because not all vIRR updates will be visible
to KVM.

Note, irr_pending is intentionally not updated in kvm_apic_update_apicv(),
because when APICv is being inhibited/disabled, KVM needs to keep the flag
set until the next emulated EOI so that KVM will correctly handle any
in-flight updates to the vIRR from hardware.  But when setting APIC state,
neither the VM nor the VMM can assume specific ordering between an update
from hardware and overwriting all state in kvm_apic_set_state(), thus KVM
can safely clear irr_pending if the vIRR is empty.

Reported-by: Yong He <zhuangel570@gmail.com>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 65412640cfc7..deb73aea2c06 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3086,6 +3086,15 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
 		kvm_x86_call(hwapic_irr_update)(vcpu,
 						apic_find_highest_irr(apic));
 		kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
+	} else {
+		/*
+		 * Note, kvm_apic_update_apicv() is responsible for updating
+		 * isr_count and highest_isr_cache.  irr_pending is somewhat
+		 * special because it mustn't be cleared when APICv is disabled
+		 * at runtime, and only state restore can cause an IRR bit to
+		 * be set without also refreshing irr_pending.
+		 */
+		apic->irr_pending = apic_search_irr(apic) != -1;
 	}
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	if (ioapic_in_kernel(vcpu->kvm))

base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
-- 
2.47.0.163.g1226f6d8fa-goog
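
Editorial sketch of the failure mode described in the changelog above. This is a toy user-space model, not KVM code: every toy_* name is hypothetical, the IRR is reduced to a flat array of eight 32-bit words, and the `fixed` parameter only exists to contrast the behavior before and after the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_WORDS 8	/* 256 vectors, 32 per word (simplified layout) */

/* Toy stand-in for the emulated local APIC; not struct kvm_lapic. */
struct toy_lapic {
	uint32_t irr[NR_WORDS];	/* one bit per vector */
	bool irr_pending;	/* cached hint: "the IRR may be non-empty" */
	bool apicv_active;
};

/* Scan the whole IRR; return the highest pending vector, or -1 if empty. */
static int toy_search_irr(const struct toy_lapic *apic)
{
	for (int w = NR_WORDS - 1; w >= 0; w--) {
		for (int b = 31; b >= 0; b--) {
			if (apic->irr[w] & (UINT32_C(1) << b))
				return w * 32 + b;
		}
	}
	return -1;
}

/* The cached flag short-circuits the scan, so a stale "false" hides IRQs. */
static int toy_find_highest_irr(const struct toy_lapic *apic)
{
	if (!apic->irr_pending)
		return -1;
	return toy_search_irr(apic);
}

/*
 * Model of userspace overwriting APIC state.  With fixed == false and
 * APICv disabled, irr_pending is left untouched (the reported bug).
 * With fixed == true, it is recomputed from the new IRR (the patch).
 */
static void toy_set_state(struct toy_lapic *apic,
			  const uint32_t new_irr[NR_WORDS], bool fixed)
{
	memcpy(apic->irr, new_irr, sizeof(apic->irr));

	if (apic->apicv_active)
		apic->irr_pending = true;
	else if (fixed)
		apic->irr_pending = toy_search_irr(apic) != -1;
}
```

In this model, restoring an IRR with vector 35 set into a freshly zeroed toy_lapic with `fixed == false` leaves toy_find_highest_irr() returning -1; the same restore with `fixed == true` makes it return 35.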
Re: [PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled
Posted by Sean Christopherson 2 weeks, 5 days ago
+Maxim

On Fri, Nov 01, 2024, Sean Christopherson wrote:
> Explicitly set apic->irr_pending after stuffing the vIRR when userspace
> sets APIC state and APICv is disabled, otherwise KVM will skip scanning
> the vIRR in subsequent calls to apic_find_highest_irr(), and ultimately
> fail to inject the interrupt until another interrupt happens to be added
> to the vIRR.
> 
> Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
> be true if APICv is enabled, because not all vIRR updates will be visible
> to KVM.
> 
> Note, irr_pending is intentionally not updated in kvm_apic_update_apicv(),
> because when APICv is being inhibited/disabled, KVM needs to keep the flag
> set until the next emulated EOI so that KVM will correctly handle any
> in-flight updates to the vIRR from hardware.  But when setting APIC state,
> neither the VM nor the VMM can assume specific ordering between an update
> from hardware and overwriting all state in kvm_apic_set_state(), thus KVM
> can safely clear irr_pending if the vIRR is empty.
> 
> Reported-by: Yong He <zhuangel570@gmail.com>
> Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/lapic.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 65412640cfc7..deb73aea2c06 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -3086,6 +3086,15 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
>  		kvm_x86_call(hwapic_irr_update)(vcpu,
>  						apic_find_highest_irr(apic));
>  		kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
> +	} else {
> +		/*
> +		 * Note, kvm_apic_update_apicv() is responsible for updating
> +		 * isr_count and highest_isr_cache.  irr_pending is somewhat
> +		 * special because it mustn't be cleared when APICv is disabled
> +		 * at runtime, and only state restore can cause an IRR bit to
> +		 * be set without also refreshing irr_pending.
> +		 */
> +		apic->irr_pending = apic_search_irr(apic) != -1;

I did a bit more archaeology in order to give this a Fixes tag (and a Cc: stable),
and found two interesting evolutions of this code.

The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch
irr_pending in kvm_apic_update_apicv when inhibiting it"), which, as the shortlog
suggests, deleted code that updated irr_pending.

Before that commit, kvm_apic_update_apicv() did more or less what I am proposing
here, with the obvious difference that the proposed fix is specific to
kvm_apic_set_state().

        struct kvm_lapic *apic = vcpu->arch.apic;

        if (vcpu->arch.apicv_active) {
                /* irr_pending is always true when apicv is activated. */
                apic->irr_pending = true;
                apic->isr_count = 1;
        } else {
                apic->irr_pending = (apic_search_irr(apic) != -1);
                apic->isr_count = count_vectors(apic->regs + APIC_ISR);
        }

And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm:
lapic: Introduce APICv update helper function").  Prior to 97a71c444a14, KVM
unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed
that the new virtual APIC state could have a pending IRQ (which isn't a terrible
assumption).

Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
and because there are plenty of memory barriers in between, but leaving irr_pending
set is basically hacking around bad ordering, which I _think_ can be fixed by:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..85d330b56c7e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
                goto out;
 
        apic->apicv_active = activate;
-       kvm_apic_update_apicv(vcpu);
        kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
+       kvm_apic_update_apicv(vcpu);
 
        /*
         * When APICv gets disabled, we may still have injected interrupts

So, while searching the IRR is technically sufficient to fix the bug, I'm leaning
*very* strongly toward fixing this bug by unconditionally setting irr_pending
to true in kvm_apic_update_apicv(), with a FIXME to call out what KVM should be
doing.  And then address that FIXME in a future series (I have a rather massive
pile of fixes and cleanups that are closely related, so there will be ample
opportunity).

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 1 Nov 2024 12:35:32 -0700
Subject: [PATCH] KVM: x86: Unconditionally set irr_pending when updating APICv
 state

TODO: writeme

Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it")
Cc: stable@vger.kernel.org
Reported-by: Yong He <zhuangel570@gmail.com>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2098dc689088..95c6beb8ce27 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
-	if (apic->apicv_active) {
-		/* irr_pending is always true when apicv is activated. */
-		apic->irr_pending = true;
+	/*
+	 * When APICv is enabled, KVM must always search the IRR for a pending
+	 * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU
+	 * isn't running.  If APICv is disabled, KVM _should_ search the IRR
+	 * for a pending IRQ.  But KVM currently doesn't ensure *all* hardware,
+	 * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching
+	 * the IRR at this time could race with IRQ delivery from hardware that
+	 * still sees APICv as being enabled.
+	 *
+	 * FIXME: Ensure other vCPUs and devices observe the change in APICv
+	 *        state prior to updating KVM's metadata caches, so that KVM
+	 *        can safely search the IRR and set irr_pending accordingly.
+	 */
+	apic->irr_pending = true;
+
+	if (apic->apicv_active)
 		apic->isr_count = 1;
-	} else {
-		/*
-		 * Don't clear irr_pending, searching the IRR can race with
-		 * updates from the CPU as APICv is still active from hardware's
-		 * perspective.  The flag will be cleared as appropriate when
-		 * KVM injects the interrupt.
-		 */
+	else
 		apic->isr_count = count_vectors(apic->regs + APIC_ISR);
-	}
+
 	apic->highest_isr_cache = -1;
 }
 

base-commit: 8fe4fefefa1b9ea01557d454699c20fdf709e890
--
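
Editorial sketch of why unconditionally setting irr_pending, as the patch above does, errs on the safe side. This is a toy model with hypothetical sim_* names; it assumes (as the removed comment in the diff states) that the flag is cleared as appropriate when KVM injects the interrupt, and it models that step as "clear the vector, then rescan". A spuriously true flag costs one extra scan of an empty IRR, whereas a spuriously false flag would hide a bit that hardware set behind KVM's back.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_WORDS 8	/* 256 vectors, 32 per word (simplified layout) */

/* Toy stand-in for the emulated local APIC; not struct kvm_lapic. */
struct sim_lapic {
	uint32_t irr[NR_WORDS];
	bool irr_pending;
	unsigned int scans;	/* how many full IRR scans were performed */
};

/* Full scan: highest pending vector, or -1 if the IRR is empty. */
static int sim_search_irr(struct sim_lapic *apic)
{
	apic->scans++;
	for (int w = NR_WORDS - 1; w >= 0; w--) {
		for (int b = 31; b >= 0; b--) {
			if (apic->irr[w] & (UINT32_C(1) << b))
				return w * 32 + b;
		}
	}
	return -1;
}

/* Only scans when the hint says there might be something to find. */
static int sim_find_highest_irr(struct sim_lapic *apic)
{
	if (!apic->irr_pending)
		return -1;
	return sim_search_irr(apic);
}

/* Models the APICv state update in the patch: always assume "pending". */
static void sim_update_apicv(struct sim_lapic *apic)
{
	apic->irr_pending = true;
}

/* Models injection with APICv disabled: clear the vector, then rescan. */
static void sim_clear_irr(struct sim_lapic *apic, int vec)
{
	apic->irr_pending = false;
	apic->irr[vec / 32] &= ~(UINT32_C(1) << (vec % 32));
	if (sim_search_irr(apic) != -1)
		apic->irr_pending = true;
}
```

In this model, a bit written directly into `irr` after sim_update_apicv() (standing in for an in-flight hardware update) is still found by sim_find_highest_irr(), and the flag only drops to false once sim_clear_irr() has emptied the IRR.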
Re: [PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled
Posted by Chao Gao 2 weeks, 4 days ago
>Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
>papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
>as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
>irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
>and because there are plenty of memory barriers in between, but leaving irr_pending
>set is basically hacking around bad ordering, which I _think_ can be fixed by:
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 83fe0a78146f..85d330b56c7e 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
>                goto out;
> 
>        apic->apicv_active = activate;
>-       kvm_apic_update_apicv(vcpu);
>        kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
>+       kvm_apic_update_apicv(vcpu);

I may be missing something important. How does this change ensure CPUs and devices see
APICv as disabled (and thus won't manipulate the vCPU's IRR)? Other CPUs, when
performing IPI virtualization, just look up the PID_table, while the IOMMU looks up
the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change either of them.
Re: [PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled
Posted by Sean Christopherson 2 weeks, 4 days ago
On Wed, Nov 06, 2024, Chao Gao wrote:
> >Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
> >papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
> >as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
> >irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
> >and because there are plenty of memory barriers in between, but leaving irr_pending
> >set is basically hacking around bad ordering, which I _think_ can be fixed by:
> >
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index 83fe0a78146f..85d330b56c7e 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> >                goto out;
> > 
> >        apic->apicv_active = activate;
> >-       kvm_apic_update_apicv(vcpu);
> >        kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
> >+       kvm_apic_update_apicv(vcpu);
> 
> I may be missing something important. How does this change ensure CPUs and devices see
> APICv as disabled (and thus won't manipulate the vCPU's IRR)? Other CPUs, when
> performing IPI virtualization, just look up the PID_table, while the IOMMU looks up
> the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change either of them.

That's true for Intel, which is a bug (one of many in this area).  AMD does update both.  The
failure Maxim was addressing was on AMD (AVIC), which has many more scenarios where
it needs to be inhibited/disabled.
Re: [PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled
Posted by Chao Gao 2 weeks, 3 days ago
On Wed, Nov 06, 2024 at 05:54:19AM -0800, Sean Christopherson wrote:
>On Wed, Nov 06, 2024, Chao Gao wrote:
>> >Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
>> >papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
>> >as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
>> >irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
>> >and because there are plenty of memory barriers in between, but leaving irr_pending
>> >set is basically hacking around bad ordering, which I _think_ can be fixed by:
>> >
>> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> >index 83fe0a78146f..85d330b56c7e 100644
>> >--- a/arch/x86/kvm/x86.c
>> >+++ b/arch/x86/kvm/x86.c
>> >@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
>> >                goto out;
>> > 
>> >        apic->apicv_active = activate;
>> >-       kvm_apic_update_apicv(vcpu);
>> >        kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
>> >+       kvm_apic_update_apicv(vcpu);
>> 
>> I may be missing something important. How does this change ensure CPUs and devices see
>> APICv as disabled (and thus won't manipulate the vCPU's IRR)? Other CPUs, when
>> performing IPI virtualization, just look up the PID_table, while the IOMMU looks up
>> the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change either of them.
>
>That's true for Intel, which is a bug (one of many in this area).  AMD does update both.  The
>failure Maxim was addressing was on AMD (AVIC), which has many more scenarios where
>it needs to be inhibited/disabled.

Yes, indeed. Actually, the commit below already fixes the bug for Intel. The
approach just isn't to make other CPUs and devices see APICv as disabled. Instead,
KVM picks up all pending IRQs (in the PIR) before VM-entry and cancels VM-entry if needed.

  commit 7e1901f6c86c896acff6609e0176f93f756d8b2a
  Author: Paolo Bonzini <pbonzini@redhat.com>
  Date:   Mon Nov 22 19:43:09 2021 -0500

      KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled

      If APICv is disabled for this vCPU, assigned devices may still attempt to
      post interrupts.  In that case, we need to cancel the vmentry and deliver
      the interrupt with KVM_REQ_EVENT.  Extend the existing code that handles
      injection of L1 interrupts into L2 to cover this case as well.

      vmx_hwapic_irr_update is only called when APICv is active so it would be
      confusing to add a check for vcpu->arch.apicv_active in there.  Instead,
      just use vmx_set_rvi directly in vmx_sync_pir_to_irr.
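
Editorial sketch of the approach described in the quoted commit: before VM-entry, move any posted-interrupt requests into the IRR and, if APICv is disabled and something was found, cancel the entry so the interrupt is delivered through the normal event path. This is a toy user-space model, not the VMX code: the pir_* names and the `event_requested` field (a stand-in for KVM_REQ_EVENT) are hypothetical, and the case where APICv is active is reduced to "entry proceeds".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_WORDS 8	/* 256 vectors, 32 per word (simplified layout) */

/* Toy vCPU; not struct kvm_vcpu or the real posted-interrupt descriptor. */
struct pir_vcpu {
	_Atomic uint32_t pir[NR_WORDS];	/* posted requests, set by "hardware" */
	uint32_t irr[NR_WORDS];		/* software-visible IRR */
	bool apicv_active;
	bool event_requested;		/* stand-in for KVM_REQ_EVENT */
};

/* Atomically drain the PIR into the IRR; return true if any bit moved. */
static bool pir_sync_to_irr(struct pir_vcpu *v)
{
	bool got_irq = false;

	for (int w = 0; w < NR_WORDS; w++) {
		uint32_t bits = atomic_exchange(&v->pir[w], 0);

		if (bits) {
			v->irr[w] |= bits;
			got_irq = true;
		}
	}
	return got_irq;
}

/*
 * Return true if VM-entry may proceed.  With APICv disabled, a newly
 * harvested interrupt cancels the entry and requests event processing
 * so that the interrupt is injected by software.
 */
static bool pir_prepare_vm_entry(struct pir_vcpu *v)
{
	bool got_irq = pir_sync_to_irr(v);

	if (got_irq && !v->apicv_active) {
		v->event_requested = true;
		return false;
	}
	return true;
}
```

In this model, a device that posts vector 41 with atomic_fetch_or() on `pir[1]` while APICv is disabled causes the next pir_prepare_vm_entry() to return false with `event_requested` set; a second call, with nothing newly posted, returns true.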
Re: [PATCH] KVM: x86: Update irr_pending when setting APIC state with APICv disabled
Posted by Sean Christopherson 2 weeks, 3 days ago
On Thu, Nov 07, 2024, Chao Gao wrote:
> On Wed, Nov 06, 2024 at 05:54:19AM -0800, Sean Christopherson wrote:
> >On Wed, Nov 06, 2024, Chao Gao wrote:
> >> >Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
> >> >papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
> >> >as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
> >> >irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
> >> >and because there are plenty of memory barriers in between, but leaving irr_pending
> >> >set is basically hacking around bad ordering, which I _think_ can be fixed by:
> >> >
> >> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> >index 83fe0a78146f..85d330b56c7e 100644
> >> >--- a/arch/x86/kvm/x86.c
> >> >+++ b/arch/x86/kvm/x86.c
> >> >@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> >> >                goto out;
> >> > 
> >> >        apic->apicv_active = activate;
> >> >-       kvm_apic_update_apicv(vcpu);
> >> >        kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
> >> >+       kvm_apic_update_apicv(vcpu);
> >> 
> >> I may be missing something important. How does this change ensure CPUs and devices see
> >> APICv as disabled (and thus won't manipulate the vCPU's IRR)? Other CPUs, when
> >> performing IPI virtualization, just look up the PID_table, while the IOMMU looks up
> >> the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change either of them.
> >
> >That's true for Intel, which is a bug (one of many in this area).  AMD does update both.  The
> >failure Maxim was addressing was on AMD (AVIC), which has many more scenarios where
> >it needs to be inhibited/disabled.
> 
> Yes, indeed. Actually, the commit below already fixes the bug for Intel. The
> approach just isn't to make other CPUs and devices see APICv as disabled. Instead,
> KVM picks up all pending IRQs (in the PIR) before VM-entry and cancels VM-entry if needed.
> 
>   commit 7e1901f6c86c896acff6609e0176f93f756d8b2a
>   Author: Paolo Bonzini <pbonzini@redhat.com>
>   Date:   Mon Nov 22 19:43:09 2021 -0500
>
>       KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
>
>       If APICv is disabled for this vCPU, assigned devices may still attempt to
>       post interrupts.  In that case, we need to cancel the vmentry and deliver
>       the interrupt with KVM_REQ_EVENT.  Extend the existing code that handles
>       injection of L1 interrupts into L2 to cover this case as well.
>
>       vmx_hwapic_irr_update is only called when APICv is active so it would be
>       confusing to add a check for vcpu->arch.apicv_active in there.  Instead,
>       just use vmx_set_rvi directly in vmx_sync_pir_to_irr.

Ah, right, and that approach works because the posted interrupt notification IRQ
is guaranteed to cause a VM-Exit, and KVM keeps the destination CPU in the PID
up-to-date even if APICv is inhibited.

But on AMD, the GA log interrupt is per-IOMMU and so isn't affined to the CPU on
which the vCPU that generated that log entry is running, i.e. won't force an exit
on the destination.  Oh, and the vCPU's entry in the IPI virtualization table
needs to be marked as not-running so that the sender is forced to exit and kick
the target.

In theory, kicking the target vCPU in avic_ga_log_notifier() would allow keeping
the associated IRTEs in guest/posted mode.  I'm mildly curious if that would
yield better or worse performance/latency than going through the per-IRQ handler.