[PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation

Sean Christopherson posted 62 patches 4 months ago
[PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
Posted by Sean Christopherson 4 months ago
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).

Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. would be a breaking change if KVM were to ever try and enable AVIC or
x2AVIC by default.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  9 ++++++++-
 arch/x86/kvm/svm/avic.c         | 14 ++++++++++++--
 arch/x86/kvm/svm/svm.h          |  3 ++-
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2a6ef1398da7..a9b709db7c59 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1314,6 +1314,12 @@ enum kvm_apicv_inhibit {
 	 */
 	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
 
+	/*
+	 * AVIC is disabled because the vCPU's APIC ID is beyond the max
+	 * supported by AVIC/x2AVIC, i.e. the vCPU is unaddressable.
+	 */
+	APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG,
+
 	NR_APICV_INHIBIT_REASONS,
 };
 
@@ -1332,7 +1338,8 @@ enum kvm_apicv_inhibit {
 	__APICV_INHIBIT_REASON(IRQWIN),			\
 	__APICV_INHIBIT_REASON(PIT_REINJ),		\
 	__APICV_INHIBIT_REASON(SEV),			\
-	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
+	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
+	__APICV_INHIBIT_REASON(PHYSICAL_ID_TOO_BIG)
 
 struct kvm_arch {
 	unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ab228872a19b..f0a74b102c57 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	int id = vcpu->vcpu_id;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	/*
+	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
+	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
+	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
+	 * avic_vcpu_load() expects to be called if and only if the vCPU has
+	 * fully initialized AVIC.
+	 */
 	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
-	    (id > X2AVIC_MAX_PHYSICAL_ID))
-		return -EINVAL;
+	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
+		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
+		vcpu->arch.apic->apicv_active = false;
+		return 0;
+	}
 
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1585288200f4..71e3c003580e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,7 +722,8 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
 	BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |	\
 	BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |	\
 	BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |	\
-	BIT(APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED)	\
+	BIT(APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED) |	\
+	BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG)	\
 )
 
 bool avic_hardware_setup(void);
-- 
2.50.0.rc1.591.g9c95f17f64-goog
Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
Posted by Naveen N Rao 3 months, 3 weeks ago
On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
> an ID that is too big, but otherwise allow vCPU creation to succeed.
> Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
> that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
> than 254 (AVIC) or 511 (x2AVIC).
> 
> Alternatively, KVM could advertise an accurate value depending on which
> AVIC mode is in use, but that wouldn't really solve the underlying problem,
> e.g. would be a breaking change if KVM were to ever try and enable AVIC or
> x2AVIC by default.
> 
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  9 ++++++++-
>  arch/x86/kvm/svm/avic.c         | 14 ++++++++++++--
>  arch/x86/kvm/svm/svm.h          |  3 ++-
>  3 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2a6ef1398da7..a9b709db7c59 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1314,6 +1314,12 @@ enum kvm_apicv_inhibit {
>  	 */
>  	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
>  
> +	/*
> +	 * AVIC is disabled because the vCPU's APIC ID is beyond the max
> +	 * supported by AVIC/x2AVIC, i.e. the vCPU is unaddressable.
> +	 */
> +	APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG,
> +
>  	NR_APICV_INHIBIT_REASONS,
>  };
>  
> @@ -1332,7 +1338,8 @@ enum kvm_apicv_inhibit {
>  	__APICV_INHIBIT_REASON(IRQWIN),			\
>  	__APICV_INHIBIT_REASON(PIT_REINJ),		\
>  	__APICV_INHIBIT_REASON(SEV),			\
> -	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
> +	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
> +	__APICV_INHIBIT_REASON(PHYSICAL_ID_TOO_BIG)
>  
>  struct kvm_arch {
>  	unsigned long n_used_mmu_pages;
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index ab228872a19b..f0a74b102c57 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>  	int id = vcpu->vcpu_id;
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> +	/*
> +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> +	 * fully initialized AVIC.
> +	 */
>  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> -		return -EINVAL;
> +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> +		vcpu->arch.apic->apicv_active = false;

This bothers me a bit. kvm_create_lapic() does this:
          if (enable_apicv) {
                  apic->apicv_active = true;
                  kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
	  }

But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
is going to be a no-op.

This does not look to be a big deal given that kvm_create_lapic() itself 
is called just a bit before svm_vcpu_create() (which calls the above 
function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
isn't that much done before apicv_active is toggled.

But, this made me wonder if introducing a kvm_x86_op to check and 
enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
overall. Maybe even have it be the initialization point for APICv: 
apicv_init(), so we can invoke avic_init_vcpu() right away?


- Naveen
Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
Posted by Sean Christopherson 3 months, 3 weeks ago
On Tue, Jun 17, 2025, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index ab228872a19b..f0a74b102c57 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> >  	int id = vcpu->vcpu_id;
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> >  
> > +	/*
> > +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> > +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> > +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> > +	 * fully initialized AVIC.
> > +	 */
> >  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> > -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> > -		return -EINVAL;
> > +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> > +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> > +		vcpu->arch.apic->apicv_active = false;
> 
> This bothers me a bit. kvm_create_lapic() does this:
>           if (enable_apicv) {
>                   apic->apicv_active = true;
>                   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> 	  }
> 
> But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
> is going to be a no-op.

That's fine, KVM_REQ_APICV_UPDATE is a nop in other scenarios, too.  I agree it's
not ideal, but this is a rather extreme edge case and a one-time slow path, so I
don't think it's worth doing anything special just to avoid KVM_REQ_APICV_UPDATE.

> This does not look to be a big deal given that kvm_create_lapic() itself 
> is called just a bit before svm_vcpu_create() (which calls the above 
> function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
> isn't that much done before apicv_active is toggled.
> 
> But, this made me wonder if introducing a kvm_x86_op to check and 
> enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
> overall

Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c is all
kinds of gross, but clearing apic->apicv_active in lapic.c to handle this particular
scenario is just as problematic, because then avic_init_backing_page() would need
to check kvm_vcpu_apicv_active() to determine whether or not to allocate the backing
page.  In a way, that's even worse, because setting apic->apicv_active by default
is purely an optimization, i.e. leaving it %false _should_ work as well, it would
just be suboptimal.  But if AVIC were to key off apic->apicv_active, that could
lead to KVM incorrectly skipping allocation of the AVIC backing page.

> Maybe even have it be the initialization point for APICv: 
> apicv_init(), so we can invoke avic_init_vcpu() right away?

I mostly like this idea (actually, I take that back; see below), but VMX throws
a big wrench in things.  Unlike SVM, VMX doesn't have a singular "enable APICv"
flag.  Rather, KVM considers "APICv" to be the combination of APIC-register
virtualization, virtual interrupt delivery, and posted interrupts.

Which is totally fine.  The big wrench is that the are other features that interact
with "APICv" and require allocations.  E.g.  the APIC access page isn't actually
tied to enable_apicv, it's tied to yet another VMX feature, "virtualize APIC
accesses" (not to be confused with APIC-register virtualization; don't blame me,
I didn't create the control knobs/names).

As a result, KVM may need to allocate the APIC access page (not to be confused
with the APIC *backing* page; again, don't blame me :-D) when enable_apicv=false,
and even more confusingly, NOT allocate the APIC access page when enable_apicv=true.

	if (cpu_need_virtualize_apic_accesses(vcpu)) {  <=== not an enable_apic check, *sigh*
		err = kvm_alloc_apic_access_page(vcpu->kvm);
		if (err)
			goto free_vmcs;
	}

And for both SVM and VMX, IPI virtualization adds another wrinkle, in that the
per-vCPU setup needs to fill an entry in a per-VM table.  And filling that entry
needs to happen at the very end of vCPU creation, e.g. so that the vCPU can't be
targeted until KVM is ready to run the vCPU.

Ouch.  And I'm pretty sure there's a use-after-free in the AVIC code.  If
svm_vcpu_alloc_msrpm() fails, the avic_physical_id_table[] will still have a pointer
to freed vAPIC page.

Thus, invoking avic_init_vcpu() "right away" is unfortunately flat out unsafe
(took me a while to realize that).

So while I 100% agree with your dislike of this patch, I think it's the least
awful band-aid, at least in the short term.

Longer term, I'd definitely be in favor of cleaning up the related flows, but I
think that's best done on top of this series, because I suspect it'll be somewhat
invasive.
Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
Posted by Naveen N Rao 3 months, 3 weeks ago
On Tue, Jun 17, 2025 at 09:10:10AM -0700, Sean Christopherson wrote:
> On Tue, Jun 17, 2025, Naveen N Rao wrote:
> > On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > > index ab228872a19b..f0a74b102c57 100644
> > > --- a/arch/x86/kvm/svm/avic.c
> > > +++ b/arch/x86/kvm/svm/avic.c
> > > @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> > >  	int id = vcpu->vcpu_id;
> > >  	struct vcpu_svm *svm = to_svm(vcpu);
> > >  
> > > +	/*
> > > +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > > +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> > > +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> > > +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> > > +	 * fully initialized AVIC.
> > > +	 */
> > >  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> > > -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> > > -		return -EINVAL;
> > > +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> > > +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> > > +		vcpu->arch.apic->apicv_active = false;
> > 
> > This bothers me a bit. kvm_create_lapic() does this:
> >           if (enable_apicv) {
> >                   apic->apicv_active = true;
> >                   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> > 	  }
> > 
> > But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
> > is going to be a no-op.
> 
> That's fine, KVM_REQ_APICV_UPDATE is a nop in other scenarios, too.  I agree it's
> not ideal, but this is a rather extreme edge case and a one-time slow path, so I
> don't think it's worth doing anything special just to avoid KVM_REQ_APICV_UPDATE.
> 
> > This does not look to be a big deal given that kvm_create_lapic() 
> > is called just a bit before svm_vcpu_create() (which calls the above 
> > function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
> > isn't that much done before apicv_active is toggled.
> > 
> > But, this made me wonder if introducing a kvm_x86_op to check and 
> > enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
> > overall
> 
> Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c is all
> kinds of gross, but clearing apic->apicv_active in lapic.c to handle this particular
> scenario is just as problematic, because then avic_init_backing_page() would need
> to check kvm_vcpu_apicv_active() to determine whether or not to allocate the backing
> page.  In a way, that's even worse, because setting apic->apicv_active by default
> is purely an optimization, i.e. leaving it %false _should_ work as well, it would
> just be suboptimal.  But if AVIC were to key off apic->apicv_active, 
> that could
> lead to KVM incorrectly skipping allocation of the AVIC backing page.

I understand your concern about key'ing off apic->apicv_active - that 
would definitely require a thorough audit and does add complexity to 
this.

However, as far as I can see, after your current series, we no longer 
maintain a pointer to the AVIC backing page, but instead rely on the 
lapic-allocated page.

Were you referring to the APIC access page though? That is behind 
kvm_apicv_activated() today, which looks to be problematic if there are 
inhibits set during vcpu_create() and if those can be unset later?  
Shouldn't we be allocating the apic access page unconditionally here?

> 
> > Maybe even have it be the initialization point for APICv: 
> > apicv_init(), so we can invoke avic_init_vcpu() right away?
> 
> I mostly like this idea (actually, I take that back; see below), but VMX throws
> a big wrench in things.  Unlike SVM, VMX doesn't have a singular "enable APICv"
> flag.  Rather, KVM considers "APICv" to be the combination of APIC-register
> virtualization, virtual interrupt delivery, and posted interrupts.
> 
> Which is totally fine.  The big wrench is that the are other features that interact
> with "APICv" and require allocations.  E.g.  the APIC access page isn't actually
> tied to enable_apicv, it's tied to yet another VMX feature, "virtualize APIC
> accesses" (not to be confused with APIC-register virtualization; don't blame me,
> I didn't create the control knobs/names).
> 
> As a result, KVM may need to allocate the APIC access page (not to be confused
> with the APIC *backing* page; again, don't blame me :-D) when enable_apicv=false,
> and even more confusingly, NOT allocate the APIC access page when enable_apicv=true.
> 
> 	if (cpu_need_virtualize_apic_accesses(vcpu)) {  <=== not an enable_apic check, *sigh*
> 		err = kvm_alloc_apic_access_page(vcpu->kvm);
> 		if (err)
> 			goto free_vmcs;
> 	}

Thanks for the background and the details there -- I am going to need 
some time to go unpack all of that :)

> 
> And for both SVM and VMX, IPI virtualization adds another wrinkle, in that the
> per-vCPU setup needs to fill an entry in a per-VM table.  And filling that entry
> needs to happen at the very end of vCPU creation, e.g. so that the vCPU can't be
> targeted until KVM is ready to run the vCPU.
> 
> Ouch.  And I'm pretty sure there's a use-after-free in the AVIC code.  If
> svm_vcpu_alloc_msrpm() fails, the avic_physical_id_table[] will still have a pointer
> to freed vAPIC page.

Oops.

> 
> Thus, invoking avic_init_vcpu() "right away" is unfortunately flat out unsafe
> (took me a while to realize that).
> 
> So while I 100% agree with your dislike of this patch, I think it's the least
> awful band-aid, at least in the short term.
> 
> Longer term, I'd definitely be in favor of cleaning up the related flows, but I
> think that's best done on top of this series, because I suspect it'll be somewhat
> invasive.

Sure, that makes sense. For this patch:
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>


Thanks,
Naveen
Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
Posted by Sean Christopherson 3 months, 3 weeks ago
On Wed, Jun 18, 2025, Naveen N Rao wrote:
> On Tue, Jun 17, 2025 at 09:10:10AM -0700, Sean Christopherson wrote:
> > Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c
> > is all kinds of gross, but clearing apic->apicv_active in lapic.c to handle
> > this particular scenario is just as problematic, because then
> > avic_init_backing_page() would need to check kvm_vcpu_apicv_active() to
> > determine whether or not to allocate the backing page.  In a way, that's
> > even worse, because setting apic->apicv_active by default is purely an
> > optimization, i.e. leaving it %false _should_ work as well, it would just
> > be suboptimal.  But if AVIC were to key off apic->apicv_active, that could
> > lead to KVM incorrectly skipping allocation of the AVIC backing page.
> 
> I understand your concern about key'ing off apic->apicv_active - that 
> would definitely require a thorough audit and does add complexity to 
> this.
> 
> However, as far as I can see, after your current series, we no longer 
> maintain a pointer to the AVIC backing page, but instead rely on the 
> lapic-allocated page.
> 
> Were you referring to the APIC access page though? 

Gah, yes.  I was hyper aware of the two things when typing up the response, and
still managed to screw up.  *sigh* :-)

> That is behind kvm_apicv_activated() today, which looks to be problematic if
> there are inhibits set during vcpu_create() and if those can be unset later?
> Shouldn't we be allocating the apic access page unconditionally here?

In theory, yes.  In practice, this guards against an unnecessary allocation for
SEV+ guests (see APICV_INHIBIT_REASON_SEV).

That said, I completely agree that checking kvm_apicv_activated() is weird and
sketchy.  Hopefully that can be cleaned up, too (but after this series).