Explicitly set apic->irr_pending after stuffing the vIRR when userspace
sets APIC state and APICv is disabled, otherwise KVM will skip scanning
the vIRR in subsequent calls to apic_find_highest_irr(), and ultimately
fail to inject the interrupt until another interrupt happens to be added
to the vIRR.
Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
be true if APICv is enabled, because not all vIRR updates will be visible
to KVM.
Note, irr_pending is intentionally not updated in kvm_apic_update_apicv(),
because when APICv is being inhibited/disabled, KVM needs to keep the flag
set until the next emulated EOI so that KVM will correctly handle any
in-flight updates to the vIRR from hardware. But when setting APIC state,
neither the VM nor the VMM can assume specific ordering between an update
from hardware and overwriting all state in kvm_apic_set_state(), thus KVM
can safely clear irr_pending if the vIRR is empty.
Reported-by: Yong He <zhuangel570@gmail.com>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/lapic.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 65412640cfc7..deb73aea2c06 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3086,6 +3086,15 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
 		kvm_x86_call(hwapic_irr_update)(vcpu,
 						apic_find_highest_irr(apic));
 		kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
+	} else {
+		/*
+		 * Note, kvm_apic_update_apicv() is responsible for updating
+		 * isr_count and highest_isr_cache.  irr_pending is somewhat
+		 * special because it mustn't be cleared when APICv is disabled
+		 * at runtime, and only state restore can cause an IRR bit to
+		 * be set without also refreshing irr_pending.
+		 */
+		apic->irr_pending = apic_search_irr(apic) != -1;
 	}
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	if (ioapic_in_kernel(vcpu->kvm))
base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6
--
2.47.0.163.g1226f6d8fa-goog
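
The failure mode described in the changelog can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `struct toy_apic` and every `toy_*` function are hypothetical stand-ins invented for illustration, not KVM's actual data structures or helpers. The model keeps a bitmap IRR plus an `irr_pending` hint, lets the fast path skip the scan when the hint is clear, and shows that a wholesale state restore must refresh the hint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_VECTORS 256

/* Hypothetical stand-in for the in-kernel APIC state; not KVM's layout. */
struct toy_apic {
	uint32_t irr[TOY_NR_VECTORS / 32];	/* one bit per vector */
	bool irr_pending;			/* hint: "the IRR may be non-empty" */
};

/* Full scan of the IRR; returns the highest set vector, or -1 if empty. */
int toy_search_irr(const struct toy_apic *apic)
{
	for (int vec = TOY_NR_VECTORS - 1; vec >= 0; vec--) {
		if (apic->irr[vec / 32] & (UINT32_C(1) << (vec % 32)))
			return vec;
	}
	return -1;
}

/* Fast path: trust the hint and skip the scan entirely when it is clear. */
int toy_find_highest_irr(const struct toy_apic *apic)
{
	if (!apic->irr_pending)
		return -1;
	return toy_search_irr(apic);
}

/* Normal delivery: the IRR bit and the hint are updated together. */
void toy_set_irr(struct toy_apic *apic, int vec)
{
	assert(vec >= 0 && vec < TOY_NR_VECTORS);
	apic->irr[vec / 32] |= UINT32_C(1) << (vec % 32);
	apic->irr_pending = true;
}

/*
 * State restore: overwrite the whole IRR.  With refresh_hint == false this
 * models the buggy behavior (hint left stale); with true it models the fix.
 */
void toy_set_state(struct toy_apic *apic, const uint32_t *new_irr,
		   bool refresh_hint)
{
	memcpy(apic->irr, new_irr, sizeof(apic->irr));
	if (refresh_hint)
		apic->irr_pending = toy_search_irr(apic) != -1;
}
```

In this model, restoring an IRR that has a bit set while leaving the hint stale makes `toy_find_highest_irr()` report nothing pending until some later `toy_set_irr()` call happens to set the hint again, which mirrors the "until another interrupt happens to be added to the vIRR" symptom.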
+Maxim

On Fri, Nov 01, 2024, Sean Christopherson wrote:
> Explicitly set apic->irr_pending after stuffing the vIRR when userspace
> sets APIC state and APICv is disabled, otherwise KVM will skip scanning
> the vIRR in subsequent calls to apic_find_highest_irr(), and ultimately
> fail to inject the interrupt until another interrupt happens to be added
> to the vIRR.
>
> Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to
> be true if APICv is enabled, because not all vIRR updates will be visible
> to KVM.
>
> Note, irr_pending is intentionally not updated in kvm_apic_update_apicv(),
> because when APICv is being inhibited/disabled, KVM needs to keep the flag
> set until the next emulated EOI so that KVM will correctly handle any
> in-flight updates to the vIRR from hardware.  But when setting APIC state,
> neither the VM nor the VMM can assume specific ordering between an update
> from hardware and overwriting all state in kvm_apic_set_state(), thus KVM
> can safely clear irr_pending if the vIRR is empty.
>
> Reported-by: Yong He <zhuangel570@gmail.com>
> Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/lapic.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 65412640cfc7..deb73aea2c06 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -3086,6 +3086,15 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
>  		kvm_x86_call(hwapic_irr_update)(vcpu,
>  						apic_find_highest_irr(apic));
>  		kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
> +	} else {
> +		/*
> +		 * Note, kvm_apic_update_apicv() is responsible for updating
> +		 * isr_count and highest_isr_cache.  irr_pending is somewhat
> +		 * special because it mustn't be cleared when APICv is disabled
> +		 * at runtime, and only state restore can cause an IRR bit to
> +		 * be set without also refreshing irr_pending.
> +		 */
> +		apic->irr_pending = apic_search_irr(apic) != -1;

I did a bit more archaeology in order to give this a Fixes tag (and a Cc: stable),
and found two interesting evolutions of this code.

The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch
irr_pending in kvm_apic_update_apicv when inhibiting it"), which, as the shortlog
suggests, deleted code that updated irr_pending.

Before that commit, kvm_apic_update_apicv() did more or less what I am proposing
here, with the obvious difference that the proposed fix is specific to
kvm_apic_set_state().

	struct kvm_lapic *apic = vcpu->arch.apic;

	if (vcpu->arch.apicv_active) {
		/* irr_pending is always true when apicv is activated. */
		apic->irr_pending = true;
		apic->isr_count = 1;
	} else {
		apic->irr_pending = (apic_search_irr(apic) != -1);
		apic->isr_count = count_vectors(apic->regs + APIC_ISR);
	}

And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm:
lapic: Introduce APICv update helper function").  Prior to 97a71c444a14, KVM
unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed
that the new virtual APIC state could have a pending IRQ (which isn't a terrible
assumption).

Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
and because there are plenty of memory barriers in between, but leaving irr_pending
set is basically hacking around bad ordering, which I _think_ can be fixed by:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..85d330b56c7e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 		goto out;
 
 	apic->apicv_active = activate;
-	kvm_apic_update_apicv(vcpu);
 	kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
+	kvm_apic_update_apicv(vcpu);
 
 	/*
 	 * When APICv gets disabled, we may still have injected interrupts

So, while searching the IRR is technically sufficient to fix the bug, I'm leaning
*very* strongly toward fixing this bug by unconditionally setting irr_pending to
true in kvm_apic_update_apicv(), with a FIXME to call out what KVM should be doing.
And then address that FIXME in a future series (I have a rather massive pile of
fixes and cleanups that are closely related, so there will be ample opportunity).

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 1 Nov 2024 12:35:32 -0700
Subject: [PATCH] KVM: x86: Unconditionally set irr_pending when updating APICv state

TODO: writeme

Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it")
Cc: stable@vger.kernel.org
Reported-by: Yong He <zhuangel570@gmail.com>
Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2098dc689088..95c6beb8ce27 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2629,19 +2629,26 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
-	if (apic->apicv_active) {
-		/* irr_pending is always true when apicv is activated. */
-		apic->irr_pending = true;
+	/*
+	 * When APICv is enabled, KVM must always search the IRR for a pending
+	 * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU
+	 * isn't running.  If APICv is disabled, KVM _should_ search the IRR
+	 * for a pending IRQ.  But KVM currently doesn't ensure *all* hardware,
+	 * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching
+	 * the IRR at this time could race with IRQ delivery from hardware that
+	 * still sees APICv as being enabled.
+	 *
+	 * FIXME: Ensure other vCPUs and devices observe the change in APICv
+	 *        state prior to updating KVM's metadata caches, so that KVM
+	 *        can safely search the IRR and set irr_pending accordingly.
+	 */
+	apic->irr_pending = true;
+
+	if (apic->apicv_active)
 		apic->isr_count = 1;
-	} else {
-		/*
-		 * Don't clear irr_pending, searching the IRR can race with
-		 * updates from the CPU as APICv is still active from hardware's
-		 * perspective.  The flag will be cleared as appropriate when
-		 * KVM injects the interrupt.
-		 */
+	else
 		apic->isr_count = count_vectors(apic->regs + APIC_ISR);
-	}
+
 	apic->highest_isr_cache = -1;
 }
 
base-commit: 8fe4fefefa1b9ea01557d454699c20fdf709e890
--

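
The ordering argument in this reply can be walked through with a deterministic, single-threaded model. It is a rough sketch only: `struct toy_vcpu`, the `toy_*` functions, and the single `hw_sees_apicv` flag are hypothetical simplifications (real hardware state lives in several places, and real races are concurrent). The model compares "search the IRR" against "always set irr_pending" for both call orders when a late hardware post lands in the middle of an APICv disable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy vCPU; 64 vectors only, to keep the model small. */
struct toy_vcpu {
	uint64_t irr;		/* toy virtual IRR */
	bool irr_pending;	/* software hint that the IRR may be non-empty */
	bool hw_sees_apicv;	/* what "hardware" currently believes */
};

enum toy_policy { TOY_SEARCH_IRR, TOY_ALWAYS_PENDING };

/*
 * Deliver a vector.  While hardware still sees APICv as enabled, it sets
 * the IRR bit behind software's back; otherwise delivery goes through the
 * software path, which also sets the hint.
 */
void toy_deliver(struct toy_vcpu *v, int vec)
{
	assert(vec >= 0 && vec < 64);
	v->irr |= UINT64_C(1) << vec;
	if (!v->hw_sees_apicv)
		v->irr_pending = true;
}

/* Stand-in for updating the software metadata when APICv is disabled. */
void toy_update_apicv_metadata(struct toy_vcpu *v, enum toy_policy policy)
{
	if (policy == TOY_ALWAYS_PENDING)
		v->irr_pending = true;
	else
		v->irr_pending = v->irr != 0;
}

/* Stand-in for making hardware observe that APICv is now disabled. */
void toy_refresh_exec_ctrl(struct toy_vcpu *v)
{
	v->hw_sees_apicv = false;
}

/*
 * Run one APICv-disable sequence with a late post in the middle.
 * Returns true if the posted interrupt is still discoverable afterwards.
 */
bool toy_run(bool metadata_first, enum toy_policy policy)
{
	struct toy_vcpu v = { .irr = 0, .irr_pending = true, .hw_sees_apicv = true };

	if (metadata_first) {
		toy_update_apicv_metadata(&v, policy);
		toy_deliver(&v, 33);	/* hardware still thinks APICv is on */
		toy_refresh_exec_ctrl(&v);
	} else {
		toy_refresh_exec_ctrl(&v);
		toy_deliver(&v, 33);	/* forced onto the software path */
		toy_update_apicv_metadata(&v, policy);
	}
	return v.irr_pending && v.irr != 0;
}
```

In this model only the combination "metadata first, search the IRR" loses the interrupt. Setting the hint unconditionally is safe in either order at the cost of one potentially wasted scan, which is the trade-off the patch above accepts.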
>Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
>papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
>as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
>irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
>and because there are plenty of memory barriers in between, but leaving irr_pending
>set is basically hacking around bad ordering, which I _think_ can be fixed by:
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 83fe0a78146f..85d330b56c7e 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> 		goto out;
>
> 	apic->apicv_active = activate;
>-	kvm_apic_update_apicv(vcpu);
> 	kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
>+	kvm_apic_update_apicv(vcpu);

I may be missing something important. How does this change ensure CPUs and devices see
APICv as disabled (thus won't manipulate the vCPU's IRR)? Other CPUs, when
performing IPI virtualization, just look up the PID_table while the IOMMU looks up
the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change any of them.

On Wed, Nov 06, 2024, Chao Gao wrote:
> >Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
> >papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
> >as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
> >irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
> >and because there are plenty of memory barriers in between, but leaving irr_pending
> >set is basically hacking around bad ordering, which I _think_ can be fixed by:
> >
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index 83fe0a78146f..85d330b56c7e 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> > 		goto out;
> >
> > 	apic->apicv_active = activate;
> >-	kvm_apic_update_apicv(vcpu);
> > 	kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
> >+	kvm_apic_update_apicv(vcpu);
>
> I may be missing something important. How does this change ensure CPUs and devices see
> APICv as disabled (thus won't manipulate the vCPU's IRR)? Other CPUs, when
> performing IPI virtualization, just look up the PID_table while the IOMMU looks up
> the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change any of them.

For Intel, which is a bug (one of many in this area).  AMD does update both.  The
failure Maxim was addressing was on AMD (AVIC), which has many more scenarios where
it needs to be inhibited/disabled.

On Wed, Nov 06, 2024 at 05:54:19AM -0800, Sean Christopherson wrote:
>On Wed, Nov 06, 2024, Chao Gao wrote:
>> >Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
>> >papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
>> >as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
>> >irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
>> >and because there are plenty of memory barriers in between, but leaving irr_pending
>> >set is basically hacking around bad ordering, which I _think_ can be fixed by:
>> >
>> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> >index 83fe0a78146f..85d330b56c7e 100644
>> >--- a/arch/x86/kvm/x86.c
>> >+++ b/arch/x86/kvm/x86.c
>> >@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
>> > 		goto out;
>> >
>> > 	apic->apicv_active = activate;
>> >-	kvm_apic_update_apicv(vcpu);
>> > 	kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
>> >+	kvm_apic_update_apicv(vcpu);
>>
>> I may be missing something important. How does this change ensure CPUs and devices see
>> APICv as disabled (thus won't manipulate the vCPU's IRR)? Other CPUs, when
>> performing IPI virtualization, just look up the PID_table while the IOMMU looks up
>> the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change any of them.
>
>For Intel, which is a bug (one of many in this area).  AMD does update both.  The
>failure Maxim was addressing was on AMD (AVIC), which has many more scenarios where
>it needs to be inhibited/disabled.

Yes indeed. Actually, the commit below already fixes the bug for Intel. The
approach just isn't to make other CPUs and devices see APICv as disabled. Instead,
it picks up all pending IRQs (in the PIR) before VM-entry and cancels VM-entry if
needed.

    commit 7e1901f6c86c896acff6609e0176f93f756d8b2a
    Author: Paolo Bonzini <pbonzini@redhat.com>
    Date:   Mon Nov 22 19:43:09 2021 -0500

        KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled

        If APICv is disabled for this vCPU, assigned devices may still attempt to
        post interrupts.  In that case, we need to cancel the vmentry and deliver
        the interrupt with KVM_REQ_EVENT.  Extend the existing code that handles
        injection of L1 interrupts into L2 to cover this case as well.

        vmx_hwapic_irr_update is only called when APICv is active so it would be
        confusing to add a check for vcpu->arch.apicv_active in there.  Instead,
        just use vmx_set_rvi directly in vmx_sync_pir_to_irr.

On Thu, Nov 07, 2024, Chao Gao wrote:
> On Wed, Nov 06, 2024 at 05:54:19AM -0800, Sean Christopherson wrote:
> >On Wed, Nov 06, 2024, Chao Gao wrote:
> >> >Furthermore, in addition to introducing this issue, commit 755c2bf87860 also
> >> >papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv
> >> >as disabled prior to searching the IRR.  Waiting until KVM emulates EOI to update
> >> >irr_pending works because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(),
> >> >and because there are plenty of memory barriers in between, but leaving irr_pending
> >> >set is basically hacking around bad ordering, which I _think_ can be fixed by:
> >> >
> >> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> >index 83fe0a78146f..85d330b56c7e 100644
> >> >--- a/arch/x86/kvm/x86.c
> >> >+++ b/arch/x86/kvm/x86.c
> >> >@@ -10548,8 +10548,8 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> >> > 		goto out;
> >> >
> >> > 	apic->apicv_active = activate;
> >> >-	kvm_apic_update_apicv(vcpu);
> >> > 	kvm_x86_call(refresh_apicv_exec_ctrl)(vcpu);
> >> >+	kvm_apic_update_apicv(vcpu);
> >>
> >> I may be missing something important. How does this change ensure CPUs and devices see
> >> APICv as disabled (thus won't manipulate the vCPU's IRR)? Other CPUs, when
> >> performing IPI virtualization, just look up the PID_table while the IOMMU looks up
> >> the IRTE table. ->refresh_apicv_exec_ctrl() doesn't change any of them.
> >
> >For Intel, which is a bug (one of many in this area).  AMD does update both.  The
> >failure Maxim was addressing was on AMD (AVIC), which has many more scenarios where
> >it needs to be inhibited/disabled.
>
> Yes indeed. Actually, the commit below already fixes the bug for Intel. The
> approach just isn't to make other CPUs and devices see APICv as disabled. Instead,
> it picks up all pending IRQs (in the PIR) before VM-entry and cancels VM-entry if
> needed.
>
>     commit 7e1901f6c86c896acff6609e0176f93f756d8b2a
>     Author: Paolo Bonzini <pbonzini@redhat.com>
>     Date:   Mon Nov 22 19:43:09 2021 -0500
>
>         KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
>
>         If APICv is disabled for this vCPU, assigned devices may still attempt to
>         post interrupts.  In that case, we need to cancel the vmentry and deliver
>         the interrupt with KVM_REQ_EVENT.  Extend the existing code that handles
>         injection of L1 interrupts into L2 to cover this case as well.
>
>         vmx_hwapic_irr_update is only called when APICv is active so it would be
>         confusing to add a check for vcpu->arch.apicv_active in there.  Instead,
>         just use vmx_set_rvi directly in vmx_sync_pir_to_irr.

Ah, right, and that approach works because the posted interrupt notification IRQ
is guaranteed to cause a VM-Exit, and KVM keeps the destination CPU in the PID
up-to-date even if APICv is inhibited.

But on AMD, the GA log interrupt is per-IOMMU and so isn't affined to the CPU on
which the vCPU that generated that log entry is running, i.e. won't force an exit
on the destination.  Oh, and the vCPU's entry in the IPI virtualization table
needs to be marked as not-running so that the sender is forced to exit and kick
the target.

In theory, kicking the target vCPU in avic_ga_log_notifier() would allow keeping
the associated IRTEs in guest/posted mode.  I'm mildly curious if that would yield
better or worse performance/latency than going through the per-IRQ handler.
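
The Intel-side approach discussed above (harvest anything posted to the PIR before VM-entry, and cancel the entry if APICv is disabled) can be sketched in a few lines. This is an illustrative model under loud assumptions: `struct toy_vcpu`, the `toy_*` functions, and the `event_requested` flag (a stand-in for KVM_REQ_EVENT) are all hypothetical, the bitmaps cover only 64 vectors, and the atomicity that real PIR harvesting needs is ignored because the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy vCPU; 64 vectors only, no concurrency. */
struct toy_vcpu {
	uint64_t pir;		/* toy posted-interrupt request bitmap */
	uint64_t irr;		/* toy virtual IRR */
	bool apicv_active;
	bool event_requested;	/* stand-in for KVM_REQ_EVENT */
};

/* Device/hardware side: post a vector into the PIR. */
void toy_post_interrupt(struct toy_vcpu *v, int vec)
{
	assert(vec >= 0 && vec < 64);
	v->pir |= UINT64_C(1) << vec;
}

/*
 * Move everything from the PIR into the IRR.  If something was posted while
 * APICv is disabled, hardware won't deliver it on its own, so ask for
 * software injection.  Returns the highest vector in the IRR, or -1.
 */
int toy_sync_pir_to_irr(struct toy_vcpu *v)
{
	bool got_posted = v->pir != 0;

	v->irr |= v->pir;
	v->pir = 0;

	if (got_posted && !v->apicv_active)
		v->event_requested = true;

	for (int vec = 63; vec >= 0; vec--) {
		if (v->irr & (UINT64_C(1) << vec))
			return vec;
	}
	return -1;
}

/* Returns true if VM-entry proceeds, false if it must be cancelled. */
bool toy_try_vm_entry(struct toy_vcpu *v)
{
	toy_sync_pir_to_irr(v);

	/* A pending injection request means: bail out, inject, then retry. */
	return !v->event_requested;
}
```

As noted in the reply, a scheme like this relies on the notification forcing an exit on the CPU that is running the target vCPU; the model does not attempt to capture the AMD GA log case, where that guarantee is missing.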