[PATCH 0/6] KVM: nVMX: Fix IPIv vs. nested posted interrupts

Sean Christopherson posted 6 patches 1 year, 1 month ago
There is a newer version of this series
arch/x86/include/asm/kvm_host.h |  2 +-
arch/x86/kvm/irq.c              |  6 ++---
arch/x86/kvm/lapic.c            | 12 +++++++--
arch/x86/kvm/lapic.h            |  2 +-
arch/x86/kvm/vmx/nested.c       | 43 ++++++++++++++++++++++++---------
arch/x86/kvm/vmx/vmx.h          |  2 +-
arch/x86/kvm/x86.c              |  2 +-
7 files changed, 49 insertions(+), 20 deletions(-)
[PATCH 0/6] KVM: nVMX: Fix IPIv vs. nested posted interrupts
Posted by Sean Christopherson 1 year, 1 month ago
Fix a bug where KVM injects L2's nested posted interrupt into L1 as a
nested VM-Exit instead of triggering PI processing.  The actual bug is
technically a generic nested posted interrupts problem, but due to the
way that KVM handles interrupt delivery, I'm 99.9% certain the issue is
limited to IPI virtualization being enabled.

Found by the nested posted interrupt KUT test on SPR.

If it weren't for an annoying TOCTOU bug waiting to happen, the fix would
be quite simple, e.g. it's really just:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f7dde74ff565..b07805daedf5 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4288,6 +4288,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
                        return -EBUSY;
                if (!nested_exit_on_intr(vcpu))
                        goto no_vmexit;
+
+               if (nested_cpu_has_posted_intr(get_vmcs12(vcpu)) &&
+                   kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
+                       vmx->nested.pi_pending = true;
+                       kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
+                       goto no_vmexit;
+               }
+
                nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
                return 0;
        }

Gory details in the last patch.

Sean Christopherson (6):
  KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection
    site
  KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no
    IRQ
  KVM: x86: Don't move VMX's nested PI notification vector from IRR to
    ISR
  KVM: nVMX: Track nested_vmx.posted_intr_nv as a signed int
  KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at
    VM-Enter
  KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit
    injection

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/irq.c              |  6 ++---
 arch/x86/kvm/lapic.c            | 12 +++++++--
 arch/x86/kvm/lapic.h            |  2 +-
 arch/x86/kvm/vmx/nested.c       | 43 ++++++++++++++++++++++++---------
 arch/x86/kvm/vmx/vmx.h          |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 7 files changed, 49 insertions(+), 20 deletions(-)


base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
2.45.2.1089.g2a221341d9-goog
Re: [PATCH 0/6] KVM: nVMX: Fix IPIv vs. nested posted interrupts
Posted by Chao Gao 1 year, 1 month ago
On Fri, Jul 19, 2024 at 05:01:32PM -0700, Sean Christopherson wrote:
>Fix a bug where KVM injects L2's nested posted interrupt into L1 as a
>nested VM-Exit instead of triggering PI processing.  The actual bug is
>technically a generic nested posted interrupts problem, but due to the
>way that KVM handles interrupt delivery, I'm 99.9% certain the issue is
>limited to IPI virtualization being enabled.

Theoretically VT-d posted interrupt can also trigger this issue.

The fix looks good to me. For the whole series:

Reviewed-by: Chao Gao <chao.gao@intel.com>
Re: [PATCH 0/6] KVM: nVMX: Fix IPIv vs. nested posted interrupts
Posted by Sean Christopherson 1 year, 1 month ago
On Mon, Jul 22, 2024, Chao Gao wrote:
> On Fri, Jul 19, 2024 at 05:01:32PM -0700, Sean Christopherson wrote:
> >Fix a bug where KVM injects L2's nested posted interrupt into L1 as a
> >nested VM-Exit instead of triggering PI processing.  The actual bug is
> >technically a generic nested posted interrupts problem, but due to the
> >way that KVM handles interrupt delivery, I'm 99.9% certain the issue is
> >limited to IPI virtualization being enabled.
> 
> Theoretically VT-d posted interrupt can also trigger this issue.

Hmm, yeah, I think you're right.  L1 could program an assigned device to _post_
an interrupt to L2's vector, via L1's PID.PIR.  Which would let the interrupt
into vIRR without KVM checking vmcs12's NV.  It seems unlikely L1 would do that,
but it definitely seems possible.

Unless I'm missing something (else), I'll update the changelog and Fixes: to make
it clear that IPI virtualization just makes the bug easier to hit, but that the
issue exists with nested assigned devices, too.
Re: [PATCH 0/6] KVM: nVMX: Fix IPIv vs. nested posted interrupts
Posted by Sean Christopherson 1 year ago
On Fri, 19 Jul 2024 17:01:32 -0700, Sean Christopherson wrote:
> Fix a bug where KVM injects L2's nested posted interrupt into L1 as a
> nested VM-Exit instead of triggering PI processing.  The actual bug is
> technically a generic nested posted interrupts problem, but due to the
> way that KVM handles interrupt delivery, I'm 99.9% certain the issue is
> limited to IPI virtualization being enabled.
> 
> Found by the nested posted interrupt KUT test on SPR.
> 
> [...]

Applied to kvm-x86 vmx, with a massaged changelog to clarify that this bug could
be hit even without IPI virtualization.

[1/6] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
      https://github.com/kvm-x86/linux/commit/6f373f4d941b
[2/6] KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQ
      https://github.com/kvm-x86/linux/commit/cb14e454add0
[3/6] KVM: x86: Don't move VMX's nested PI notification vector from IRR to ISR
      https://github.com/kvm-x86/linux/commit/f729851189d5
[4/6] KVM: nVMX: Track nested_vmx.posted_intr_nv as a signed int
      https://github.com/kvm-x86/linux/commit/ab9cbe044f83
[5/6] KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at VM-Enter
      https://github.com/kvm-x86/linux/commit/be02aa1e52d2
[6/6] KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection
      https://github.com/kvm-x86/linux/commit/44518120c4ca

--
https://github.com/kvm-x86/linux/tree/next
[PATCH 1/6] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Posted by Sean Christopherson 1 year, 1 month ago
Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from
nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one
and only path where KVM invokes nested_vmx_vmexit() with
EXIT_REASON_EXTERNAL_INTERRUPT.  A future fix will perform a last-minute
check on L2's nested posted interrupt notification vector, just before
injecting a nested VM-Exit.  To handle that scenario correctly, KVM needs
to get the interrupt _before_ injecting VM-Exit, as simply querying the
highest priority interrupt, via kvm_cpu_has_interrupt(), would result in
TOCTOU bug, as a new, higher priority interrupt could arrive between
kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().

Opportunistically convert the WARN_ON() to a WARN_ON_ONCE().  If KVM has
a bug that results in a false positive from kvm_cpu_has_interrupt(),
spamming dmesg won't help the situation.

Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as
they are always "wanted" by L0.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2392a7ef254d..b3e17635f7e3 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4284,11 +4284,26 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	if (kvm_cpu_has_interrupt(vcpu) && !vmx_interrupt_blocked(vcpu)) {
+		u32 exit_intr_info;
+
 		if (block_nested_events)
 			return -EBUSY;
 		if (!nested_exit_on_intr(vcpu))
 			goto no_vmexit;
-		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
+
+		if (nested_exit_intr_ack_set(vcpu)) {
+			int irq;
+
+			irq = kvm_cpu_get_interrupt(vcpu);
+			WARN_ON_ONCE(irq < 0);
+
+			exit_intr_info = INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq;
+		} else {
+			exit_intr_info = 0;
+		}
+
+		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
+				  exit_intr_info, 0);
 		return 0;
 	}
 
@@ -4969,14 +4984,6 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
 	if (likely(!vmx->fail)) {
-		if ((u16)vm_exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT &&
-		    nested_exit_intr_ack_set(vcpu)) {
-			int irq = kvm_cpu_get_interrupt(vcpu);
-			WARN_ON(irq < 0);
-			vmcs12->vm_exit_intr_info = irq |
-				INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
-		}
-
 		if (vm_exit_reason != -1)
 			trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason,
 						       vmcs12->exit_qualification,
-- 
2.45.2.1089.g2a221341d9-goog
Re: [PATCH 1/6] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Posted by Nathan Chancellor 12 months ago
Hi Sean,

On Fri, Jul 19, 2024 at 05:01:33PM -0700, Sean Christopherson wrote:
> Move the logic to get the to-be-acknowledge IRQ for a nested VM-Exit from
> nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one
> and only path where KVM invokes nested_vmx_vmexit() with
> EXIT_REASON_EXTERNAL_INTERRUPT.  A future fix will perform a last-minute
> check on L2's nested posted interrupt notification vector, just before
> injecting a nested VM-Exit.  To handle that scenario correctly, KVM needs
> to get the interrupt _before_ injecting VM-Exit, as simply querying the
> highest priority interrupt, via kvm_cpu_has_interrupt(), would result in
> TOCTOU bug, as a new, higher priority interrupt could arrive between
> kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().
> 
> Opportunistically convert the WARN_ON() to a WARN_ON_ONCE().  If KVM has
> a bug that results in a false positive from kvm_cpu_has_interrupt(),
> spamming dmesg won't help the situation.
> 
> Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as
> they are always "wanted" by L0.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2392a7ef254d..b3e17635f7e3 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4284,11 +4284,26 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  	}
>  
>  	if (kvm_cpu_has_interrupt(vcpu) && !vmx_interrupt_blocked(vcpu)) {
> +		u32 exit_intr_info;
> +
>  		if (block_nested_events)
>  			return -EBUSY;
>  		if (!nested_exit_on_intr(vcpu))
>  			goto no_vmexit;
> -		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
> +
> +		if (nested_exit_intr_ack_set(vcpu)) {
> +			int irq;
> +
> +			irq = kvm_cpu_get_interrupt(vcpu);
> +			WARN_ON_ONCE(irq < 0);
> +
> +			exit_intr_info = INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq;
> +		} else {
> +			exit_intr_info = 0;
> +		}
> +
> +		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
> +				  exit_intr_info, 0);
>  		return 0;
>  	}
>  
> @@ -4969,14 +4984,6 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>  
>  	if (likely(!vmx->fail)) {
> -		if ((u16)vm_exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT &&
> -		    nested_exit_intr_ack_set(vcpu)) {
> -			int irq = kvm_cpu_get_interrupt(vcpu);
> -			WARN_ON(irq < 0);
> -			vmcs12->vm_exit_intr_info = irq |
> -				INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
> -		}
> -
>  		if (vm_exit_reason != -1)
>  			trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason,
>  						       vmcs12->exit_qualification,
> -- 
> 2.45.2.1089.g2a221341d9-goog
> 

I bisected (log below) an issue with starting a nested guest that
appears on two of my newer Intel test machines (but not a somewhat old
laptop) when this change as commit 6f373f4d941b ("KVM: nVMX: Get
to-be-acknowledge IRQ for nested VM-Exit at injection site") in -next is
present in the host kernel.

I start a virtual machine with a full distribution using QEMU then start
a nested virtual machine using QEMU with the same kernel and a much
simpler Buildroot initrd, just to test the ability to run a nested
guest. After this change, starting a nested guest results in no output
from the nested guest and eventually the first guest restarts, sometimes
printing a lockup message that appears to be caused from qemu-system-x86

My command for the first guest on the host (in case it matters):

  $ qemu-system-x86_64 \
      -display none \
      -serial mon:stdio \
      -nic user,model=virtio-net-pci,hostfwd=tcp::8022-:22 \
      -object rng-random,filename=/dev/urandom,id=rng0 \
      -device virtio-rng-pci \
      -chardev socket,id=char0,path=$VM_FOLDER/x86_64/arch/vfsd.sock \
      -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=host \
      -object memory-backend-memfd,id=mem,share=on,size=16G \
      -numa node,memdev=mem \
      -m 16G \
      -device virtio-balloon \
      -smp 8 \
      -drive if=pflash,format=raw,file=$VM_FOLDER/x86_64/arch/efi.img,readonly=on \
      -drive if=pflash,format=raw,file=$VM_FOLDER/x86_64/arch/efi_vars.img \
      -cpu host \
      -enable-kvm \
      -M q35 \
      -drive if=virtio,format=qcow2,file=$VM_FOLDER/x86_64/arch/disk.img

My commands in the first guest to spawn the nested guest:

  $ cd $(mktemp -d)

  $ curl -LSs https://archive.archlinux.org/packages/l/linux/linux-6.10.8.arch1-1-x86_64.pkg.tar.zst | tar --zstd -xf -

  $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20230707-182910/x86_64-rootfs.cpio.zst | zstd -d >rootfs.cpio

  $ qemu-system-x86_64 \
      -display none \
      -nodefaults \
      -M q35 \
      -d unimp,guest_errors \
      -append 'console=ttyS0 earlycon=uart8250,io,0x3f8 loglevel=7' \
      -kernel usr/lib/modules/6.10.8-arch1-1/vmlinuz \
      -initrd rootfs.cpio \
      -cpu host \
      -enable-kvm \
      -m 512m \
      -smp 8 \
      -serial mon:stdio

If there is any additional information I can provide or patches I can
test, I am more than happy to do so.

Cheers,
Nathan

# bad: [6804f0edbe7747774e6ae60f20cec4ee3ad7c187] Add linux-next specific files for 20240903
# good: [67784a74e258a467225f0e68335df77acd67b7ab] Merge tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
git bisect start '6804f0edbe7747774e6ae60f20cec4ee3ad7c187' '67784a74e258a467225f0e68335df77acd67b7ab'
# good: [6b63f16410fa86f6a2e9f52c9cb52ba94c363f3e] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
git bisect good 6b63f16410fa86f6a2e9f52c9cb52ba94c363f3e
# good: [194eaf7dd63eef7cee974daeb4df01a3e6b144fe] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git
git bisect good 194eaf7dd63eef7cee974daeb4df01a3e6b144fe
# bad: [a8f65643f59dac67451d09ff298fa7f6e7917794] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git
git bisect bad a8f65643f59dac67451d09ff298fa7f6e7917794
# good: [f80eff5b9f33c4f8d86ba046f3ee54c4f2224454] Merge branch 'timers/drivers/next' of https://git.linaro.org/people/daniel.lezcano/linux.git
git bisect good f80eff5b9f33c4f8d86ba046f3ee54c4f2224454
# bad: [a93e40d038ccd17e6cf691e1b8fec8821998baf2] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git
git bisect bad a93e40d038ccd17e6cf691e1b8fec8821998baf2
# good: [500b6c92524183f97e3a3c8e6c56f8af69429ba4] Merge branch 'non-rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
git bisect good 500b6c92524183f97e3a3c8e6c56f8af69429ba4
# bad: [642613182efa0927c8bd4d4ef2c6b8350554b6ad] Merge branches 'fixes', 'generic', 'misc', 'mmu', 'pat_vmx_msrs', 'selftests', 'svm' and 'vmx'
git bisect bad 642613182efa0927c8bd4d4ef2c6b8350554b6ad
# good: [1876dd69dfe8c29e249070b0e4dc941fc15ac1e4] KVM: x86: Add fastpath handling of HLT VM-Exits
git bisect good 1876dd69dfe8c29e249070b0e4dc941fc15ac1e4
# bad: [44518120c4ca569cfb9c6199e94c312458dc1c07] KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection
git bisect bad 44518120c4ca569cfb9c6199e94c312458dc1c07
# good: [2ab637df5f68d4e0cfa9becd10f43400aee785b3] KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load()
git bisect good 2ab637df5f68d4e0cfa9becd10f43400aee785b3
# bad: [f729851189d5741e7d1059e250422611028657f8] KVM: x86: Don't move VMX's nested PI notification vector from IRR to ISR
git bisect bad f729851189d5741e7d1059e250422611028657f8
# bad: [cb14e454add0efc9bc461c1ad30d9c950c406fab] KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQ
git bisect bad cb14e454add0efc9bc461c1ad30d9c950c406fab
# bad: [6f373f4d941bf79f06e9abd4a827288ad1de6399] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
git bisect bad 6f373f4d941bf79f06e9abd4a827288ad1de6399
# first bad commit: [6f373f4d941bf79f06e9abd4a827288ad1de6399] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Re: [PATCH 1/6] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Posted by Sean Christopherson 12 months ago
On Wed, Sep 04, 2024, Nathan Chancellor wrote:
> I bisected (log below) an issue with starting a nested guest that
> appears on two of my newer Intel test machines (but not a somewhat old
> laptop) when this change as commit 6f373f4d941b ("KVM: nVMX: Get
> to-be-acknowledge IRQ for nested VM-Exit at injection site") in -next is
> present in the host kernel.
> 
> I start a virtual machine with a full distribution using QEMU then start
> a nested virtual machine using QEMU with the same kernel and a much
> simpler Buildroot initrd, just to test the ability to run a nested
> guest. After this change, starting a nested guest results in no output
> from the nested guest and eventually the first guest restarts, sometimes
> printing a lockup message that appears to be caused from qemu-system-x86

*sigh*

It's not you, it's me.

I just bisected hangs in my nested setup to this same commit.  Apparently, I
completely and utterly failed at testing.

There isn't that much going on here, so knock wood, getting a root cause shouldn't
be terribly difficult.
Re: [PATCH 1/6] KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection site
Posted by Sean Christopherson 12 months ago
On Wed, Sep 04, 2024, Sean Christopherson wrote:
> On Wed, Sep 04, 2024, Nathan Chancellor wrote:
> > I bisected (log below) an issue with starting a nested guest that
> > appears on two of my newer Intel test machines (but not a somewhat old
> > laptop) when this change as commit 6f373f4d941b ("KVM: nVMX: Get
> > to-be-acknowledge IRQ for nested VM-Exit at injection site") in -next is
> > present in the host kernel.
> > 
> > I start a virtual machine with a full distribution using QEMU then start
> > a nested virtual machine using QEMU with the same kernel and a much
> > simpler Buildroot initrd, just to test the ability to run a nested
> > guest. After this change, starting a nested guest results in no output
> > from the nested guest and eventually the first guest restarts, sometimes
> > printing a lockup message that appears to be caused from qemu-system-x86
> 
> *sigh*
> 
> It's not you, it's me.
> 
> I just bisected hangs in my nested setup to this same commit.  Apparently, I
> completely and utterly failed at testing.
> 
> There isn't that much going on here, so knock wood, getting a root cause shouldn't
> be terribly difficult.

Well fudge.  My attempt to avoid splitting kvm_get_apic_interrupt() and exposing
more lapic.c internals to nested VMX failed spectaculary.

Hiding down in apic_set_isr() is a call to hwapic_isr_update(), which updates
vmcs.GUEST_INTERRUPT_STATUS.SVI to mirror the highest vector in the virtual APIC's
ISR.  On a nested VM-Exit due to a IRQ, that update is supposed to hit vmcs01.
By moving the call to kvm_get_apic_interrupt() out of nested_vmx_vmexit(), that
update hits vmcs02 instead, and things go downhill from there.

The obvious/easy solution is to split kvm_get_apic_interrupt() so that nVMX can
find an interrupt, emulate nested VM-Exit or posted interrupt processing as
appropriate, and _then_ ACK the IRQ (if a VM-Exit was synthesized).  It's not
really any harder than what I did here, as above I just didn't want to split
kvm_get_apic_interrupt().  But I don't see any sane alternative, and in the end
it's not any worse than plumbing the notification vector into kvm_get_apic_interrupt();
either way, we're bleeding implementation details between common x86 code and
nVMX.

Luckily, this series is sitting at the top of `kvm-x86 vmx` (yay, topic branches!),
so I'll just drop the entire series and post a full v2.  Unless I botched this
new version too (haven't tested yet), I should get v2 posted tomorrow.

Sorry for pushing garbage, this should never have been posted, let alone gotten
applied to -next.
[PATCH 2/6] KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQ
Posted by Sean Christopherson 1 year, 1 month ago
In the should-be-impossible scenario that kvm_cpu_get_interrupt() doesn't
return a valid vector after checking kvm_cpu_has_interrupt(), skip VM-Exit
injection to reduce the probability of crashing/confusing L1.  Now that
KVM gets the IRQ _before_ calling nested_vmx_vmexit(), squashing the
VM-Exit injection is trivial since there are no actions that need to be
undone.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b3e17635f7e3..b042b70560f2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4295,7 +4295,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 			int irq;
 
 			irq = kvm_cpu_get_interrupt(vcpu);
-			WARN_ON_ONCE(irq < 0);
+			if (WARN_ON_ONCE(irq < 0))
+				goto no_vmexit;
 
 			exit_intr_info = INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq;
 		} else {
-- 
2.45.2.1089.g2a221341d9-goog
[PATCH 3/6] KVM: x86: Don't move VMX's nested PI notification vector from IRR to ISR
Posted by Sean Christopherson 1 year, 1 month ago
When getting an IRQ from the local APIC, don't move the vector to the ISR
and skip the PPR update if the found vector is the vCPU's nested posted
interrupt notification vector, i.e. if the IRQ should trigger posted
interrupt processing in L2 instead of being deliver to L1.

For now, pass in -1 from all callers and defer passing the actual nested
notification vector to a separate patch, as more prep work is needed.

Functionally, this should be a glorified nop, i.e. no true functional
change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/irq.c              |  6 +++---
 arch/x86/kvm/lapic.c            | 12 ++++++++++--
 arch/x86/kvm/lapic.h            |  2 +-
 arch/x86/kvm/vmx/nested.c       |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 6 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..b40703f05b27 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2251,7 +2251,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_extint(struct kvm_vcpu *v);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
-int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v, int nested_pi_nv);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 3d7eb11d0e45..69d04d80f143 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -135,13 +135,13 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)
 /*
  * Read pending interrupt vector and intack.
  */
-int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
+int kvm_cpu_get_interrupt(struct kvm_vcpu *v, int nested_pi_nv)
 {
 	int vector = kvm_cpu_get_extint(v);
 	if (vector != -1)
-		return vector;			/* PIC */
+		return vector;				/* PIC */
 
-	return kvm_get_apic_interrupt(v);	/* APIC */
+	return kvm_get_apic_interrupt(v, nested_pi_nv);	/* APIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a7172ba59ad2..c5c4473f50f6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2924,7 +2924,7 @@ void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
 	}
 }
 
-int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
+int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu, int nested_pi_nv)
 {
 	int vector = kvm_apic_has_interrupt(vcpu);
 	struct kvm_lapic *apic = vcpu->arch.apic;
@@ -2939,8 +2939,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 	 * on exit" mode.  Then we cannot inject the interrupt via RVI,
 	 * because the process would deliver it through the IDT.
 	 */
-
 	apic_clear_irr(vector, apic);
+
+	/*
+	 * If the vector is L2's posted interrupt notification vector, return
+	 * without moving the vector to the ISR, as notification interrupts
+	 * trigger processing in L2, i.e. aren't delivered to L1.
+	 */
+	if (vector == nested_pi_nv)
+		return vector;
+
 	if (kvm_hv_synic_auto_eoi_set(vcpu, vector)) {
 		/*
 		 * For auto-EOI interrupts, there might be another pending
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 7ef8ae73e82d..c8ff3bd2ce2c 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -89,7 +89,7 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu);
 
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
-int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
+int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu, int nested_pi_nv);
 int kvm_apic_accept_events(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b042b70560f2..7e0a944088eb 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4294,7 +4294,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		if (nested_exit_intr_ack_set(vcpu)) {
 			int irq;
 
-			irq = kvm_cpu_get_interrupt(vcpu);
+			irq = kvm_cpu_get_interrupt(vcpu, -1);
 			if (WARN_ON_ONCE(irq < 0))
 				goto no_vmexit;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index af6c8cf6a37a..4c14ea000e89 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10548,7 +10548,7 @@ static int kvm_check_and_inject_events(struct kvm_vcpu *vcpu,
 		if (r < 0)
 			goto out;
 		if (r) {
-			int irq = kvm_cpu_get_interrupt(vcpu);
+			int irq = kvm_cpu_get_interrupt(vcpu, -1);
 
 			if (!WARN_ON_ONCE(irq == -1)) {
 				kvm_queue_interrupt(vcpu, irq, false);
-- 
2.45.2.1089.g2a221341d9-goog
[PATCH 4/6] KVM: nVMX: Track nested_vmx.posted_intr_nv as a signed int
Posted by Sean Christopherson 1 year, 1 month ago
Track nested_vmx.posted_intr_nv as a signed 32-bit integer instead of an
unsigned 16-bit integer so that it can be passed to kvm_cpu_get_interrupt()
without relying on sign-extension to do the right thing when the vector is
invalid, i.e. when it's -1.

No true functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..dc0921bc4569 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -208,7 +208,7 @@ struct nested_vmx {
 
 	struct pi_desc *pi_desc;
 	bool pi_pending;
-	u16 posted_intr_nv;
+	int posted_intr_nv;
 
 	struct hrtimer preemption_timer;
 	u64 preemption_timer_deadline;
-- 
2.45.2.1089.g2a221341d9-goog
[PATCH 5/6] KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at VM-Enter
Posted by Sean Christopherson 1 year, 1 month ago
Explicitly set posted_intr_nv to -1 when emulating nested VM-Enter and
posted interrupts are disabled to make it clear that posted_intr_nv is
valid if and only if nested posted interrupts are enabled, and as a cheap
way to harden against KVM bugs.

Note, KVM initializes posted_intr_nv to -1 at vCPU creation and when
resets it to -1 when unloading vmcs12 and/or leaving nested mode, i.e.
this is not a bug fix (or at least, it's not intended to be a bug fix).

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 7e0a944088eb..40cf4839ca47 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2317,10 +2317,12 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 
 	/* Posted interrupts setting is only taken from vmcs12.  */
 	vmx->nested.pi_pending = false;
-	if (nested_cpu_has_posted_intr(vmcs12))
+	if (nested_cpu_has_posted_intr(vmcs12)) {
 		vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
-	else
+	} else {
+		vmx->nested.posted_intr_nv = -1;
 		exec_control &= ~PIN_BASED_POSTED_INTR;
+	}
 	pin_controls_set(vmx, exec_control);
 
 	/*
-- 
2.45.2.1089.g2a221341d9-goog
[PATCH 6/6] KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection
Posted by Sean Christopherson 1 year, 1 month ago
When synthensizing a nested VM-Exit due to an external interrupt, pend a
nested posted interrupt if the external interrupt vector matches L2's PI
notification vector, i.e. if the interrupt is a PI notification for L2.
This fixes a bug where KVM will incorrectly inject VM-Exit instead of
processing nested posted interrupt when IPI virtualization is enabled.

Per the SDM, detection of the notification vector doesn't occur until the
interrupt is acknowledge and deliver to the CPU core.

  If the external-interrupt exiting VM-execution control is 1, any unmasked
  external interrupt causes a VM exit (see Section 26.2). If the "process
  posted interrupts" VM-execution control is also 1, this behavior is
  changed and the processor handles an external interrupt as follows:

    1. The local APIC is acknowledged; this provides the processor core
       with an interrupt vector, called here the physical vector.
    2. If the physical vector equals the posted-interrupt notification
       vector, the logical processor continues to the next step. Otherwise,
       a VM exit occurs as it would normally due to an external interrupt;
       the vector is saved in the VM-exit interruption-information field.

For the most part, KVM has avoided problems because a PI NV for L2 that
arrives will L2 is active will be processed by hardware, and KVM checks
for a pending notification vector during nested VM-Enter.  Thus, to hit
the bug, the PI NV interrupt needs to sneak its way into L1's vIRR while
L2 is active.

Without IPI virtualization, the scenario is practically impossible to hit
as the ordering between vmx_deliver_posted_interrupt() and nested VM-Enter
effectively guarantees that either the sender will see the vCPU as being
in_guest_mode(), or the receiver will see the interrupt in its vIRR.

With IPI virtualization, the sending CPU effectively implements a rough
equivalent of vmx_deliver_posted_interrupt(), sans the nested PI NV check.
If the target vCPU has a valid PID, the CPU will send a PI NV interrupt
based on _L1's_ PID, as the sender's because IPIv table points at L1 PIDs.

  PIR := 32 bytes at PID_ADDR;
  // under lock
  PIR[V] := 1;
  store PIR at PID_ADDR;
  // release lock

  NotifyInfo := 8 bytes at PID_ADDR + 32;
  // under lock
  IF NotifyInfo.ON = 0 AND NotifyInfo.SN = 0; THEN
    NotifyInfo.ON := 1;
    SendNotify := 1;
  ELSE
    SendNotify := 0;
  FI;
  store NotifyInfo at PID_ADDR + 32;
  // release lock

  IF SendNotify = 1; THEN
    send an IPI specified by NotifyInfo.NDST and NotifyInfo.NV;
  FI;

As a result, the target vCPU ends up receiving an interrupt on KVM's
POSTED_INTR_VECTOR while L2 is running, with an interrupt in L1's PIR for
L2's nested PI NV.  The POSTED_INTR_VECTOR interrupt triggers a VM-Exit
from L2 to L0, KVM moves the interrupt from L1's PIR to vIRR, triggers a
KVM_REQ_EVENT prior to re-entry to L2, and calls vmx_check_nested_events(),
effectively bypassing all of KVM's "early" checks on nested PI NV.

Note, the Fixes tag is a bit of a lie, as the bug is technically a generic
nested posted interrupt issue.  However, as above, it's practically
impossible to hit the bug without IPI virtualization being enabled.

Cc: Chao Gao <chao.gao@intel.com>
Cc: Zeng Guang <guang.zeng@intel.com>
Cc: stable@vger.kernel.org
Fixes: d588bb9be1da ("KVM: VMX: enable IPI virtualization")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 40cf4839ca47..f1fe4d5a1ed8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4296,10 +4296,21 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		if (nested_exit_intr_ack_set(vcpu)) {
 			int irq;
 
-			irq = kvm_cpu_get_interrupt(vcpu, -1);
+			irq = kvm_cpu_get_interrupt(vcpu, vmx->nested.posted_intr_nv);
 			if (WARN_ON_ONCE(irq < 0))
 				goto no_vmexit;
 
+			/*
+			 * If the IRQ is L2's PI notification vector, process
+			 * posted interrupts instead of injecting VM-Exit, as
+			 * the detection/morphing architecturally occurs when
+			 * the IRQ is delivered to the CPU.  Note, enabling PI
+			 * requires ACK-on-exit.
+			 */
+			if (irq == vmx->nested.posted_intr_nv) {
+				vmx->nested.pi_pending = true;
+				goto no_vmexit;
+			}
 			exit_intr_info = INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq;
 		} else {
 			exit_intr_info = 0;
-- 
2.45.2.1089.g2a221341d9-goog
Re: [PATCH 6/6] KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection
Posted by Chao Gao 1 year, 1 month ago
On Fri, Jul 19, 2024 at 05:01:38PM -0700, Sean Christopherson wrote:
>When synthensizing a nested VM-Exit due to an external interrupt, pend a
>nested posted interrupt if the external interrupt vector matches L2's PI
>notification vector, i.e. if the interrupt is a PI notification for L2.
>This fixes a bug where KVM will incorrectly inject VM-Exit instead of
>processing nested posted interrupt when IPI virtualization is enabled.
>
>Per the SDM, detection of the notification vector doesn't occur until the
>interrupt is acknowledge and deliver to the CPU core.
>
>  If the external-interrupt exiting VM-execution control is 1, any unmasked
>  external interrupt causes a VM exit (see Section 26.2). If the "process
>  posted interrupts" VM-execution control is also 1, this behavior is
>  changed and the processor handles an external interrupt as follows:
>
>    1. The local APIC is acknowledged; this provides the processor core
>       with an interrupt vector, called here the physical vector.
>    2. If the physical vector equals the posted-interrupt notification
>       vector, the logical processor continues to the next step. Otherwise,
>       a VM exit occurs as it would normally due to an external interrupt;
>       the vector is saved in the VM-exit interruption-information field.
>
>For the most part, KVM has avoided problems because a PI NV for L2 that
>arrives will L2 is active will be processed by hardware, and KVM checks
>for a pending notification vector during nested VM-Enter.

With this series in place, I wonder if we can remove the check for a pending
notification vector during nested VM-Enter.

	/* Emulate processing of posted interrupts on VM-Enter. */
	if (nested_cpu_has_posted_intr(vmcs12) &&
	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
	}

I believe the check is arguably incorrect because:

1. nested_vmx_run() may set pi_pending and clear the IRR bit of the notification
vector, but this doesn't guarantee that vmx_complete_nested_posted_interrupt()
will be called later in vmx_check_nested_events(). This could lead to partial
posted interrupt processing, where the IRR bit is cleared but PIR isn't copied
into VIRR. This might confuse L1 since, from L1's perspective, posted interrupt
processing should be atomic. Per the SDM, the logical processor performs
posted-interrupt processing "in an uninterruptible manner".

2. The check doesn't respect event priority. For example, if a higher-priority
event (preemption timer exit or NMI-window exit) causes an immediate nested
VM-exit, the notification vector should remain pending after the nested VM-exit.
Re: [PATCH 6/6] KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injection
Posted by Sean Christopherson 1 year, 1 month ago
On Tue, Jul 23, 2024, Chao Gao wrote:
> On Fri, Jul 19, 2024 at 05:01:38PM -0700, Sean Christopherson wrote:
> >When synthensizing a nested VM-Exit due to an external interrupt, pend a
> >nested posted interrupt if the external interrupt vector matches L2's PI
> >notification vector, i.e. if the interrupt is a PI notification for L2.
> >This fixes a bug where KVM will incorrectly inject VM-Exit instead of
> >processing nested posted interrupt when IPI virtualization is enabled.
> >
> >Per the SDM, detection of the notification vector doesn't occur until the
> >interrupt is acknowledge and deliver to the CPU core.
> >
> >  If the external-interrupt exiting VM-execution control is 1, any unmasked
> >  external interrupt causes a VM exit (see Section 26.2). If the "process
> >  posted interrupts" VM-execution control is also 1, this behavior is
> >  changed and the processor handles an external interrupt as follows:
> >
> >    1. The local APIC is acknowledged; this provides the processor core
> >       with an interrupt vector, called here the physical vector.
> >    2. If the physical vector equals the posted-interrupt notification
> >       vector, the logical processor continues to the next step. Otherwise,
> >       a VM exit occurs as it would normally due to an external interrupt;
> >       the vector is saved in the VM-exit interruption-information field.
> >
> >For the most part, KVM has avoided problems because a PI NV for L2 that
> >arrives will L2 is active will be processed by hardware, and KVM checks
> >for a pending notification vector during nested VM-Enter.
> 
> With this series in place, I wonder if we can remove the check for a pending
> notification vector during nested VM-Enter.
> 
> 	/* Emulate processing of posted interrupts on VM-Enter. */
> 	if (nested_cpu_has_posted_intr(vmcs12) &&
> 	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
> 		vmx->nested.pi_pending = true;
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
> 	}
> 
> I believe the check is arguably incorrect because:
> 
> 1. nested_vmx_run() may set pi_pending and clear the IRR bit of the notification
> vector, but this doesn't guarantee that vmx_complete_nested_posted_interrupt()
> will be called later in vmx_check_nested_events(). This could lead to partial
> posted interrupt processing, where the IRR bit is cleared but PIR isn't copied
> into VIRR. This might confuse L1 since, from L1's perspective, posted interrupt
> processing should be atomic. Per the SDM, the logical processor performs
> posted-interrupt processing "in an uninterruptible manner".

vmx_deliver_nested_posted_interrupt() is also broken in this regard.  I don't see
a sane way to handle that though, at least not without completely losing the value
of posted interrupts.  Ooh, maybe we could call vmx_complete_nested_posted_interrupt()
from nested_vmx_vmexit()?  That is a little scary, but probably worth trying?

> 2. The check doesn't respect event priority. For example, if a higher-priority
> event (preemption timer exit or NMI-window exit) causes an immediate nested
> VM-exit, the notification vector should remain pending after the nested VM-exit.

Ah, right, because block_nested_events would be true due to the pending nested
VM-Enter, which would ensure KVM enters L2 and trips NMI/IRQ window exiting.

The downside is that removing that code would regress performance for the more
common case (no NMI/IRQ window), as KVM would need to complete the nested
VM-Enter before consuming the IRQ, i.e. would need to do a VM-Enter and force a
VM-Exit.  But as you say, that's the architecturally correct behavior.