[PATCH v2] KVM: x86: Unconditionally set irr_pending when updating APICv state

Sean Christopherson posted 1 patch 2 weeks, 4 days ago
arch/x86/kvm/lapic.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
[PATCH v2] KVM: x86: Unconditionally set irr_pending when updating APICv state
Posted by Sean Christopherson 2 weeks, 4 days ago
Always set irr_pending (to true) when updating APICv status to fix a bug
where KVM fails to set irr_pending when userspace sets APIC state and
APICv is disabled, which ultimate results in KVM failing to inject the
pending interrupt(s) that userspace stuffed into the vIRR, until another
interrupt happens to be emulated by KVM.

Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
be true if APICv is enabled, because not all vIRR updates will be visible
to KVM.

Hit the bug with a big hammer, even though strictly speaking KVM can scan
the vIRR and set/clear irr_pending as appropriate for this specific case.
The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't
touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as
the shortlog suggests, deleted code that updated irr_pending.

Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with
with the crucial difference that kvm_apic_update_apicv() did the scan even
when APICv was being *disabled*, e.g. due to an AVIC inhibition.

        struct kvm_lapic *apic = vcpu->arch.apic;

        if (vcpu->arch.apicv_active) {
                /* irr_pending is always true when apicv is activated. */
                apic->irr_pending = true;
                apic->isr_count = 1;
        } else {
                apic->irr_pending = (apic_search_irr(apic) != -1);
                apic->isr_count = count_vectors(apic->regs + APIC_ISR);
        }

And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78
("kvm: lapic: Introduce APICv update helper function"), prior to which KVM
unconditionally set irr_pending to true in kvm_apic_set_state(), i.e.
assumed that the new virtual APIC state could have a pending IRQ.

Furthermore, in addition to introducing this issue, commit 755c2bf87860
also papered over the underlying bug: KVM doesn't ensure CPUs and devices
see APICv as disabled prior to searching the IRR.  Waiting until KVM
emulates an EOI to update irr_pending "works", but only because KVM won't
emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of
memory barriers in between.  I.e. leaving irr_pending set is basically
hacking around bad ordering.

So, effectively revert to the pre-b26a695a1d78 behavior for state restore,
even though it's sub-optimal if no IRQs are pending, in order to provide a
minimal fix, but leave behind a FIXME to document the ugliness.  With luck,
the ordering issue will be fixed and the mess will be cleaned up in the
not-too-distant future.

Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reported-by: Yong He <zhuangel570@gmail.com>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

v2: Go with a big hammer fix, and plan on scanning the vIRR for all cases
    once the ordering bug has been resolved, i.e. once KVM guarantees the
    scan happens after CPUs and devices see the new APICv state.

v1: https://lore.kernel.org/all/20241101193532.1817004-1-seanjc@google.com

 arch/x86/kvm/lapic.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 65412640cfc7..e470061b744a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
-	if (apic->apicv_active) {
-		/* irr_pending is always true when apicv is activated. */
-		apic->irr_pending = true;
+	/*
+	 * When APICv is enabled, KVM must always search the IRR for a pending
+	 * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU
+	 * isn't running.  If APICv is disabled, KVM _should_ search the IRR
+	 * for a pending IRQ.  But KVM currently doesn't ensure *all* hardware,
+	 * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching
+	 * the IRR at this time could race with IRQ delivery from hardware that
+	 * still sees APICv as being enabled.
+	 *
+	 * FIXME: Ensure other vCPUs and devices observe the change in APICv
+	 *        state prior to updating KVM's metadata caches, so that KVM
+	 *        can safely search the IRR and set irr_pending accordingly.
+	 */
+	apic->irr_pending = true;
+
+	if (apic->apicv_active)
 		apic->isr_count = 1;
-	} else {
-		/*
-		 * Don't clear irr_pending, searching the IRR can race with
-		 * updates from the CPU as APICv is still active from hardware's
-		 * perspective.  The flag will be cleared as appropriate when
-		 * KVM injects the interrupt.
-		 */
+	else
 		apic->isr_count = count_vectors(apic->regs + APIC_ISR);
-	}
+
 	apic->highest_isr_cache = -1;
 }
 

base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
-- 
2.47.0.199.ga7371fff76-goog
Re: [PATCH v2] KVM: x86: Unconditionally set irr_pending when updating APICv state
Posted by Paolo Bonzini 2 weeks, 2 days ago
Queued, thanks.

Paolo
Re: [PATCH v2] KVM: x86: Unconditionally set irr_pending when updating APICv state
Posted by Chao Gao 2 weeks, 3 days ago
On Tue, Nov 05, 2024 at 05:51:35PM -0800, Sean Christopherson wrote:
>Always set irr_pending (to true) when updating APICv status to fix a bug
>where KVM fails to set irr_pending when userspace sets APIC state and
>APICv is disabled, which ultimate results in KVM failing to inject the
>pending interrupt(s) that userspace stuffed into the vIRR, until another
>interrupt happens to be emulated by KVM.
>
>Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
>be true if APICv is enabled, because not all vIRR updates will be visible
>to KVM.
>
>Hit the bug with a big hammer, even though strictly speaking KVM can scan
>the vIRR and set/clear irr_pending as appropriate for this specific case.
>The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't
>touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as
>the shortlog suggests, deleted code that updated irr_pending.
>
>Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with
>with the crucial difference that kvm_apic_update_apicv() did the scan even
>when APICv was being *disabled*, e.g. due to an AVIC inhibition.
>
>        struct kvm_lapic *apic = vcpu->arch.apic;
>
>        if (vcpu->arch.apicv_active) {
>                /* irr_pending is always true when apicv is activated. */
>                apic->irr_pending = true;
>                apic->isr_count = 1;
>        } else {
>                apic->irr_pending = (apic_search_irr(apic) != -1);
>                apic->isr_count = count_vectors(apic->regs + APIC_ISR);
>        }
>
>And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78
>("kvm: lapic: Introduce APICv update helper function"), prior to which KVM
>unconditionally set irr_pending to true in kvm_apic_set_state(), i.e.
>assumed that the new virtual APIC state could have a pending IRQ.
>
>Furthermore, in addition to introducing this issue, commit 755c2bf87860
>also papered over the underlying bug: KVM doesn't ensure CPUs and devices
>see APICv as disabled prior to searching the IRR.  Waiting until KVM
>emulates an EOI to update irr_pending "works", but only because KVM won't
>emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of
>memory barriers in between.  I.e. leaving irr_pending set is basically
>hacking around bad ordering.
>
>So, effectively revert to the pre-b26a695a1d78 behavior for state restore,
>even though it's sub-optimal if no IRQs are pending, in order to provide a
>minimal fix, but leave behind a FIXME to document the ugliness.  With luck,
>the ordering issue will be fixed and the mess will be cleaned up in the
>not-too-distant future.
>
>Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it")
>Cc: stable@vger.kernel.org
>Cc: Maxim Levitsky <mlevitsk@redhat.com>
>Reported-by: Yong He <zhuangel570@gmail.com>
>Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
>Signed-off-by: Sean Christopherson <seanjc@google.com>
>---
>
>v2: Go with a big hammer fix, and plan on scanning the vIRR for all cases
>    once the ordering bug has been resolved, i.e. once KVM guarantees the
>    scan happens after CPUs and devices see the new APICv state.
>
>v1: https://lore.kernel.org/all/20241101193532.1817004-1-seanjc@google.com
>
> arch/x86/kvm/lapic.c | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
>diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>index 65412640cfc7..e470061b744a 100644
>--- a/arch/x86/kvm/lapic.c
>+++ b/arch/x86/kvm/lapic.c
>@@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
> 
>-	if (apic->apicv_active) {
>-		/* irr_pending is always true when apicv is activated. */
>-		apic->irr_pending = true;
>+	/*
>+	 * When APICv is enabled, KVM must always search the IRR for a pending
>+	 * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU
>+	 * isn't running.  If APICv is disabled, KVM _should_ search the IRR
>+	 * for a pending IRQ.  But KVM currently doesn't ensure *all* hardware,
>+	 * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching
>+	 * the IRR at this time could race with IRQ delivery from hardware that
>+	 * still sees APICv as being enabled.
>+	 *
>+	 * FIXME: Ensure other vCPUs and devices observe the change in APICv
>+	 *        state prior to updating KVM's metadata caches, so that KVM
>+	 *        can safely search the IRR and set irr_pending accordingly.
>+	 */
>+	apic->irr_pending = true;

Should irr_pending be cleared after the first search of IRR that finds no
pending IRQ, i.e., in apic_find_highest_irr() when !apic->apicv_active?

Otherwise, irr_pending will be out of sync until the arrival of an interrupt.
Not sure if we want to avoid the unnecessary performance overhead of repeatedly
searching IRR.
Re: [PATCH v2] KVM: x86: Unconditionally set irr_pending when updating APICv state
Posted by Sean Christopherson 2 weeks, 3 days ago
On Wed, Nov 06, 2024, Chao Gao wrote:
> On Tue, Nov 05, 2024 at 05:51:35PM -0800, Sean Christopherson wrote:
> > arch/x86/kvm/lapic.c | 29 ++++++++++++++++++-----------
> > 1 file changed, 18 insertions(+), 11 deletions(-)
> >
> >diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >index 65412640cfc7..e470061b744a 100644
> >--- a/arch/x86/kvm/lapic.c
> >+++ b/arch/x86/kvm/lapic.c
> >@@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
> > {
> > 	struct kvm_lapic *apic = vcpu->arch.apic;
> > 
> >-	if (apic->apicv_active) {
> >-		/* irr_pending is always true when apicv is activated. */
> >-		apic->irr_pending = true;
> >+	/*
> >+	 * When APICv is enabled, KVM must always search the IRR for a pending
> >+	 * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU
> >+	 * isn't running.  If APICv is disabled, KVM _should_ search the IRR
> >+	 * for a pending IRQ.  But KVM currently doesn't ensure *all* hardware,
> >+	 * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching
> >+	 * the IRR at this time could race with IRQ delivery from hardware that
> >+	 * still sees APICv as being enabled.
> >+	 *
> >+	 * FIXME: Ensure other vCPUs and devices observe the change in APICv
> >+	 *        state prior to updating KVM's metadata caches, so that KVM
> >+	 *        can safely search the IRR and set irr_pending accordingly.
> >+	 */
> >+	apic->irr_pending = true;
> 
> Should irr_pending be cleared after the first search of IRR that finds no
> pending IRQ, i.e., in apic_find_highest_irr() when !apic->apicv_active?

Definitely not in apic_find_highest_irr(), because there are ordering issues on
SMP systems.  Huh, and apic_clear_irr() and kvm_lapic_set_irr() are buggy; they
subtly rely on atomic accesses to provide ordering, but really should have memory
barriers of some kinda (off the top of my head, I'm not entirey sure what barrier
is appropriate in apic_clear_irr()).

And definitely not in the context of this bug fix, because that performance flaw
exists in multiple other scenarios.

> Otherwise, irr_pending will be out of sync until the arrival of an interrupt.
> Not sure if we want to avoid the unnecessary performance overhead of repeatedly
> searching IRR.

IMO, it's not worth the risk.  That overhead effectively exists at all times on
modern Intel CPUs, as APICv is highly likely to be enabled, i.e. irr_pending will
always be true.

Hrm, but looking at this again, I'm not sure I like v2.  Ah, it's not the code I
don't like, it's the name of the helper that's flawed.  kvm_apic_update_apicv()
makes it seem like it's only releveant when enable_apicv=true, but that's not the
case.  E.g. doing apic->irr_pending = enable_apicv would be broken.

But that flaws exists even before this patch, because kvm_apic_set_state() and
kvm_lapic_reset() rely on it to update apic->isr_count (which needs to be cleared
on INIT for the latter).

I'll send a (not for 6.12) patch to rename it, e.g. to kvm_apic_update_sw_caches()
or something.