[PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK

Posted by Kevin Cheng 2 weeks, 5 days ago
When KVM emulates an instruction for L2 and encounters a nested page
fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
injects an NPF to L1. However, the code incorrectly hardcodes
(1ULL << 32) for exit_info_1's upper bits when the original exit was
not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
occurred on a page table page, preventing L1 from correctly identifying
the cause of the fault.

Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
the fault occurs on the final GPA-to-HPA translation.

Widen error_code in struct x86_exception from u16 to u64 to accommodate
the PFERR_GUEST_* bits (bits 32 and 33).
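
For reference, the relevant bits are defined (in recent kernels, in
arch/x86/include/asm/kvm_host.h) as:

	#define PFERR_GUEST_FINAL_BIT	32
	#define PFERR_GUEST_PAGE_BIT	33
	#define PFERR_GUEST_FINAL_MASK	BIT_ULL(PFERR_GUEST_FINAL_BIT)
	#define PFERR_GUEST_PAGE_MASK	BIT_ULL(PFERR_GUEST_PAGE_BIT)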

Update nested_svm_inject_npf_exit() to use fault->error_code directly
instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
indicate a bug in the page fault handling code.

Signed-off-by: Kevin Cheng <chengkev@google.com>
---
 arch/x86/kvm/kvm_emulate.h     |  2 +-
 arch/x86/kvm/mmu/paging_tmpl.h | 22 ++++++++++------------
 arch/x86/kvm/svm/nested.c      | 11 +++++------
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index fb3dab4b5a53e..ff4f9b0a01ff7 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -22,7 +22,7 @@ enum x86_intercept_stage;
 struct x86_exception {
 	u8 vector;
 	bool error_code_valid;
-	u16 error_code;
+	u64 error_code;
 	bool nested_page_fault;
 	u64 address; /* cr2 or nested page fault gpa */
 	u8 async_page_fault;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 901cd2bd40b84..923179bfd5c74 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -379,18 +379,12 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(table_gfn),
 					     nested_access, &walker->fault);
 
-		/*
-		 * FIXME: This can happen if emulation (for of an INS/OUTS
-		 * instruction) triggers a nested page fault.  The exit
-		 * qualification / exit info field will incorrectly have
-		 * "guest page access" as the nested page fault's cause,
-		 * instead of "guest page structure access".  To fix this,
-		 * the x86_exception struct should be augmented with enough
-		 * information to fix the exit_qualification or exit_info_1
-		 * fields.
-		 */
-		if (unlikely(real_gpa == INVALID_GPA))
+		if (unlikely(real_gpa == INVALID_GPA)) {
+#if PTTYPE != PTTYPE_EPT
+			walker->fault.error_code |= PFERR_GUEST_PAGE_MASK;
+#endif
 			return 0;
+		}
 
 		slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(real_gpa));
 		if (!kvm_is_visible_memslot(slot))
@@ -446,8 +440,12 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 #endif
 
 	real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, &walker->fault);
-	if (real_gpa == INVALID_GPA)
+	if (real_gpa == INVALID_GPA) {
+#if PTTYPE != PTTYPE_EPT
+		walker->fault.error_code |= PFERR_GUEST_FINAL_MASK;
+#endif
 		return 0;
+	}
 
 	walker->gfn = real_gpa >> PAGE_SHIFT;
 
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd5..f8dfd5c333023 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
 	struct vmcb *vmcb = svm->vmcb;
 
 	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
-		/*
-		 * TODO: track the cause of the nested page fault, and
-		 * correctly fill in the high bits of exit_info_1.
-		 */
-		vmcb->control.exit_code = SVM_EXIT_NPF;
-		vmcb->control.exit_info_1 = (1ULL << 32);
+		vmcb->control.exit_info_1 = fault->error_code;
 		vmcb->control.exit_info_2 = fault->address;
 	}
 
+	vmcb->control.exit_code = SVM_EXIT_NPF;
 	vmcb->control.exit_info_1 &= ~0xffffffffULL;
 	vmcb->control.exit_info_1 |= fault->error_code;
 
+	WARN_ON_ONCE(!(vmcb->control.exit_info_1 &
+		       (PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)));
+
 	nested_svm_vmexit(svm);
 }
 
-- 
2.52.0.457.g6b5491de43-goog
Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK
Posted by Sean Christopherson 2 weeks, 4 days ago
On Wed, Jan 21, 2026, Kevin Cheng wrote:
> When KVM emulates an instruction for L2 and encounters a nested page
> fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> injects an NPF to L1. However, the code incorrectly hardcodes
> (1ULL << 32) for exit_info_1's upper bits when the original exit was
> not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> occurred on a page table page, preventing L1 from correctly identifying
> the cause of the fault.
> 
> Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> the fault occurs on the final GPA-to-HPA translation.
> 
> Widen error_code in struct x86_exception from u16 to u64 to accommodate
> the PFERR_GUEST_* bits (bits 32 and 33).

Please do this in a separate patch.  Intel CPUs straight up don't support 32-bit
error codes, let alone 64-bit error codes, so this seemingly innocuous change
needs to be accompanied by a lengthy changelog that effectively audits all usage
to "prove" this change is ok.

> Update nested_svm_inject_npf_exit() to use fault->error_code directly
> instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
> PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
> indicate a bug in the page fault handling code.
> 
> Signed-off-by: Kevin Cheng <chengkev@google.com>
> ---
>  arch/x86/kvm/kvm_emulate.h     |  2 +-
>  arch/x86/kvm/mmu/paging_tmpl.h | 22 ++++++++++------------
>  arch/x86/kvm/svm/nested.c      | 11 +++++------
>  3 files changed, 16 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> index fb3dab4b5a53e..ff4f9b0a01ff7 100644
> --- a/arch/x86/kvm/kvm_emulate.h
> +++ b/arch/x86/kvm/kvm_emulate.h
> @@ -22,7 +22,7 @@ enum x86_intercept_stage;
>  struct x86_exception {
>  	u8 vector;
>  	bool error_code_valid;
> -	u16 error_code;
> +	u64 error_code;
>  	bool nested_page_fault;
>  	u64 address; /* cr2 or nested page fault gpa */
>  	u8 async_page_fault;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 901cd2bd40b84..923179bfd5c74 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -379,18 +379,12 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  		real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(table_gfn),
>  					     nested_access, &walker->fault);
>  
> -		/*
> -		 * FIXME: This can happen if emulation (for of an INS/OUTS
> -		 * instruction) triggers a nested page fault.  The exit
> -		 * qualification / exit info field will incorrectly have
> -		 * "guest page access" as the nested page fault's cause,
> -		 * instead of "guest page structure access".  To fix this,
> -		 * the x86_exception struct should be augmented with enough
> -		 * information to fix the exit_qualification or exit_info_1
> -		 * fields.
> -		 */
> -		if (unlikely(real_gpa == INVALID_GPA))
> +		if (unlikely(real_gpa == INVALID_GPA)) {
> +#if PTTYPE != PTTYPE_EPT
> +			walker->fault.error_code |= PFERR_GUEST_PAGE_MASK;
> +#endif

Why exclude EPT?  EPT doesn't use the error code _verbatim_, but EPT shares the
concept/reporting of intermediate vs. final walks.
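
E.g. (a rough sketch, not tested) set the bit unconditionally and let the VMX
side translate the synthetic PFERR_GUEST_* flags into the equivalent exit
qualification bits when reflecting the fault to L1:

	if (unlikely(real_gpa == INVALID_GPA)) {
		/* Fault on an intermediate (paging-structure) access. */
		walker->fault.error_code |= PFERR_GUEST_PAGE_MASK;
		return 0;
	}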

>  			return 0;
> +		}
>  
>  		slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(real_gpa));
>  		if (!kvm_is_visible_memslot(slot))
> @@ -446,8 +440,12 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  #endif
>  
>  	real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, &walker->fault);
> -	if (real_gpa == INVALID_GPA)
> +	if (real_gpa == INVALID_GPA) {
> +#if PTTYPE != PTTYPE_EPT
> +		walker->fault.error_code |= PFERR_GUEST_FINAL_MASK;
> +#endif

Same thing here.

>  		return 0;
> +	}
>  
>  	walker->gfn = real_gpa >> PAGE_SHIFT;
>  
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd5..f8dfd5c333023 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
>  	struct vmcb *vmcb = svm->vmcb;
>  
>  	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> -		/*
> -		 * TODO: track the cause of the nested page fault, and
> -		 * correctly fill in the high bits of exit_info_1.
> -		 */
> -		vmcb->control.exit_code = SVM_EXIT_NPF;
> -		vmcb->control.exit_info_1 = (1ULL << 32);
> +		vmcb->control.exit_info_1 = fault->error_code;
>  		vmcb->control.exit_info_2 = fault->address;
>  	}
>  
> +	vmcb->control.exit_code = SVM_EXIT_NPF;
>  	vmcb->control.exit_info_1 &= ~0xffffffffULL;
>  	vmcb->control.exit_info_1 |= fault->error_code;

So... what happens when exit_info_1 already has PFERR_GUEST_PAGE_MASK, and then
@fault sets PFERR_GUEST_FINAL_MASK?  Presumably that can't/shouldn't happen, but
nothing in the changelog explains why such a scenario is impossible, and nothing
in the code hardens KVM against such goofs.

> +	WARN_ON_ONCE(!(vmcb->control.exit_info_1 &
> +		       (PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)));
> +
>  	nested_svm_vmexit(svm);
>  }
>  
Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK
Posted by Yosry Ahmed 2 weeks, 4 days ago
On Wed, Jan 21, 2026 at 02:07:56PM -0800, Sean Christopherson wrote:
> On Wed, Jan 21, 2026, Kevin Cheng wrote:
> > When KVM emulates an instruction for L2 and encounters a nested page
> > fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> > injects an NPF to L1. However, the code incorrectly hardcodes
> > (1ULL << 32) for exit_info_1's upper bits when the original exit was
> > not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> > occurred on a page table page, preventing L1 from correctly identifying
> > the cause of the fault.
> > 
> > Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> > occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> > the fault occurs on the final GPA-to-HPA translation.
> > 
> > Widen error_code in struct x86_exception from u16 to u64 to accommodate
> > the PFERR_GUEST_* bits (bits 32 and 33).
> 
> Please do this in a separate patch.  Intel CPUs straight up don't support 32-bit
> error codes, let alone 64-bit error codes, so this seemingly innocuous change
> needs to be accompanied by a lengthy changelog that effectively audits all usage
> to "prove" this change is ok.

Semi-jokingly, we can add error_code_hi to track the high bits and
side-step the problem for Intel (dejavu?).

> 
> > Update nested_svm_inject_npf_exit() to use fault->error_code directly
> > instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
> > PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
> > indicate a bug in the page fault handling code.
> > 
> > Signed-off-by: Kevin Cheng <chengkev@google.com>
[..]
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index de90b104a0dd5..f8dfd5c333023 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> >  	struct vmcb *vmcb = svm->vmcb;
> >  
> >  	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> > -		/*
> > -		 * TODO: track the cause of the nested page fault, and
> > -		 * correctly fill in the high bits of exit_info_1.
> > -		 */
> > -		vmcb->control.exit_code = SVM_EXIT_NPF;
> > -		vmcb->control.exit_info_1 = (1ULL << 32);
> > +		vmcb->control.exit_info_1 = fault->error_code;
> >  		vmcb->control.exit_info_2 = fault->address;
> >  	}
> >  
> > +	vmcb->control.exit_code = SVM_EXIT_NPF;
> >  	vmcb->control.exit_info_1 &= ~0xffffffffULL;
> >  	vmcb->control.exit_info_1 |= fault->error_code;
> 
> So... what happens when exit_info_1 already has PFERR_GUEST_PAGE_MASK, and then
> @fault sets PFERR_GUEST_FINAL_MASK?  Presumably that can't/shouldn't happen,
> but nothing in the changelog explains why such a scenario is
> impossible, and nothing in the code hardens KVM against such goofs.

I guess we can update the WARN below to check for that as well, and
fall back to the current behavior (set PFERR_GUEST_FINAL_MASK):

	fault_stage = vmcb->control.exit_info_1 &
			(PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK);
	if (WARN_ON_ONCE(fault_stage != PFERR_GUEST_FINAL_MASK &&
			 fault_stage != PFERR_GUEST_PAGE_MASK))
		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;

> 
> > +	WARN_ON_ONCE(!(vmcb->control.exit_info_1 &
> > +		       (PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)));
> > +
> >  	nested_svm_vmexit(svm);
> >  }
> >  
Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK
Posted by Sean Christopherson 1 week, 4 days ago
On Thu, Jan 22, 2026, Yosry Ahmed wrote:
> On Wed, Jan 21, 2026 at 02:07:56PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 21, 2026, Kevin Cheng wrote:
> > > When KVM emulates an instruction for L2 and encounters a nested page
> > > fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> > > injects an NPF to L1. However, the code incorrectly hardcodes
> > > (1ULL << 32) for exit_info_1's upper bits when the original exit was
> > > not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> > > occurred on a page table page, preventing L1 from correctly identifying
> > > the cause of the fault.
> > > 
> > > Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> > > occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> > > the fault occurs on the final GPA-to-HPA translation.
> > > 
> > > Widen error_code in struct x86_exception from u16 to u64 to accommodate
> > > the PFERR_GUEST_* bits (bits 32 and 33).
> > 
> > Please do this in a separate patch.  Intel CPUs straight up don't support 32-bit
> > error codes, let alone 64-bit error codes, so this seemingly innocuous change
> > needs to be accompanied by a lengthy changelog that effectively audits all usage
> > to "prove" this change is ok.
> 
> Semi-jokingly, we can add error_code_hi to track the high bits and
> side-step the problem for Intel (dejavu?).

Technically, it would require three fields: u16 error_code, u16 error_code_hi,
and u32 error_code_ultra_hi.  :-D
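
I.e., roughly (purely a sketch of the idea, to be clear):

	struct x86_exception {
		u8 vector;
		bool error_code_valid;
		u16 error_code;			/* bits 15:0, all hardware */
		u16 error_code_hi;		/* bits 31:16, legal on SVM only */
		u32 error_code_ultra_hi;	/* bits 63:32, synthetic PFERR_GUEST_* */
		bool nested_page_fault;
		u64 address; /* cr2 or nested page fault gpa */
		u8 async_page_fault;
	};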

Isolating the (ultra) hi flags is very tempting, but I worry that it would lead
to long term pain, e.g. because inevitably we'll forget to grab the hi flags at
some point.  I'd rather audit the current code and ensure that KVM truncates the
error code as needed.

VMX is probably a-ok, e.g. see commit eba9799b5a6e ("KVM: VMX: Drop bits 31:16
when shoving exception error code into VMCS").  I'd be more worried about SVM,
where it's legal to shove a 32-bit value into the error code, i.e. where KVM
might not have existing explicit truncation.
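
IIRC, the relevant hunk truncates the error code as it's written into the VMCS,
something along the lines of:

	/* VMX disallows setting bits 31:16 in the injected error code. */
	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)ex->error_code);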

> > > Update nested_svm_inject_npf_exit() to use fault->error_code directly
> > > instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
> > > PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
> > > indicate a bug in the page fault handling code.
> > > 
> > > Signed-off-by: Kevin Cheng <chengkev@google.com>
> [..]
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index de90b104a0dd5..f8dfd5c333023 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> > >  	struct vmcb *vmcb = svm->vmcb;
> > >  
> > >  	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> > > -		/*
> > > -		 * TODO: track the cause of the nested page fault, and
> > > -		 * correctly fill in the high bits of exit_info_1.
> > > -		 */
> > > -		vmcb->control.exit_code = SVM_EXIT_NPF;
> > > -		vmcb->control.exit_info_1 = (1ULL << 32);
> > > +		vmcb->control.exit_info_1 = fault->error_code;
> > >  		vmcb->control.exit_info_2 = fault->address;
> > >  	}
> > >  
> > > +	vmcb->control.exit_code = SVM_EXIT_NPF;
> > >  	vmcb->control.exit_info_1 &= ~0xffffffffULL;
> > >  	vmcb->control.exit_info_1 |= fault->error_code;
> > 
> > So... what happens when exit_info_1 already has PFERR_GUEST_PAGE_MASK, and then
> > @fault sets PFERR_GUEST_FINAL_MASK?  Presumably that can't/shouldn't happen,
> > but nothing in the changelog explains why such a scenario is
> > impossible, and nothing in the code hardens KVM against such goofs.
> 
> I guess we can update the WARN below to check for that as well, and
> fall back to the current behavior (set PFERR_GUEST_FINAL_MASK):
> 
> 	fault_stage = vmcb->control.exit_info_1 &
> 			(PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK);
> 	if (WARN_ON_ONCE(fault_stage != PFERR_GUEST_FINAL_MASK &&
> 			 fault_stage != PFERR_GUEST_PAGE_MASK))
> 		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;

Except that doesn't do the right thing if both bits are set.  And we can use
hweight64(), which is a single POPCNT on modern CPUs.

Might be easiest to add something like PFERR_GUEST_FAULT_STAGE_MASK, then do:

	/*
	 * All nested page faults should be annotated as occurring on the final
	 * translation *OR* the page walk.  Arbitrarily choose "final" if KVM
	 * is buggy and enumerated both or none.
	 */
	if (WARN_ON_ONCE(hweight64(vmcb->control.exit_info_1 &
				   PFERR_GUEST_FAULT_STAGE_MASK) != 1)) {
		vmcb->control.exit_info_1 &= ~PFERR_GUEST_FAULT_STAGE_MASK;	
		vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;
	}
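
where PFERR_GUEST_FAULT_STAGE_MASK would be something like:

	#define PFERR_GUEST_FAULT_STAGE_MASK \
		(PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK)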
Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK
Posted by Kevin Cheng 4 days, 21 hours ago
On Wed, Jan 28, 2026 at 10:48 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 22, 2026, Yosry Ahmed wrote:
> > On Wed, Jan 21, 2026 at 02:07:56PM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 21, 2026, Kevin Cheng wrote:
> > > > When KVM emulates an instruction for L2 and encounters a nested page
> > > > fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> > > > injects an NPF to L1. However, the code incorrectly hardcodes
> > > > (1ULL << 32) for exit_info_1's upper bits when the original exit was
> > > > not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> > > > occurred on a page table page, preventing L1 from correctly identifying
> > > > the cause of the fault.
> > > >
> > > > Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> > > > occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> > > > the fault occurs on the final GPA-to-HPA translation.
> > > >
> > > > Widen error_code in struct x86_exception from u16 to u64 to accommodate
> > > > the PFERR_GUEST_* bits (bits 32 and 33).
> > >
> > > Please do this in a separate patch.  Intel CPUs straight up don't support 32-bit
> > > error codes, let alone 64-bit error codes, so this seemingly innocuous change
> > > needs to be accompanied by a lengthy changelog that effectively audits all usage
> > > to "prove" this change is ok.
> >
> > Semi-jokingly, we can add error_code_hi to track the high bits and
> > side-step the problem for Intel (dejavu?).
>
> Technically, it would require three fields: u16 error_code, u16 error_code_hi,
> and u32 error_code_ultra_hi.  :-D
>
> Isolating the (ultra) hi flags is very tempting, but I worry that it would lead
> to long term pain, e.g. because inevitably we'll forget to grab the hi flags at
> some point.  I'd rather audit the current code and ensure that KVM truncates the
> error code as needed.
>
> VMX is probably a-ok, e.g. see commit eba9799b5a6e ("KVM: VMX: Drop bits 31:16
> when shoving exception error code into VMCS").  I'd be more worried about SVM,
> where it's legal to shove a 32-bit value into the error code, i.e. where KVM
> might not have existing explicit truncation.

As I understand it, Intel CPUs don't allow setting bits 31:16 of
the error code, but AMD CPUs do. The x86_exception error_code field
is currently a u16, so it is always truncated to 16 bits by default.
In that case, after widening the error code to 64 bits, do I have to
ensure that any usage of the error code that isn't for NPF truncates
it to 16 bits? Or do I just need to verify that all SVM usages of the
error_code for exceptions truncate the 64 bits down to 32 bits, and
all VMX usages truncate to 16 bits?

Just wanted to clarify, because the wording of that statement has me
thinking that maybe there is something wrong with 32-bit error codes
on SVM?

If the only usage of the widened field is NPF, wouldn't it be better
to go with an additional field like Yosry suggested (I see that VMX
has the added exit_qualification field in the struct)?

>
> > > > Update nested_svm_inject_npf_exit() to use fault->error_code directly
> > > > instead of hardcoding the upper bits. Also add a WARN_ON_ONCE if neither
> > > > PFERR_GUEST_FINAL_MASK nor PFERR_GUEST_PAGE_MASK is set, as this would
> > > > indicate a bug in the page fault handling code.
> > > >
> > > > Signed-off-by: Kevin Cheng <chengkev@google.com>
> > [..]
> > > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > > index de90b104a0dd5..f8dfd5c333023 100644
> > > > --- a/arch/x86/kvm/svm/nested.c
> > > > +++ b/arch/x86/kvm/svm/nested.c
> > > > @@ -40,18 +40,17 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
> > > >  	struct vmcb *vmcb = svm->vmcb;
> > > >  
> > > >  	if (vmcb->control.exit_code != SVM_EXIT_NPF) {
> > > > -		/*
> > > > -		 * TODO: track the cause of the nested page fault, and
> > > > -		 * correctly fill in the high bits of exit_info_1.
> > > > -		 */
> > > > -		vmcb->control.exit_code = SVM_EXIT_NPF;
> > > > -		vmcb->control.exit_info_1 = (1ULL << 32);
> > > > +		vmcb->control.exit_info_1 = fault->error_code;
> > > >  		vmcb->control.exit_info_2 = fault->address;
> > > >  	}
> > > > 
> > > > +	vmcb->control.exit_code = SVM_EXIT_NPF;
> > > >  	vmcb->control.exit_info_1 &= ~0xffffffffULL;
> > > >  	vmcb->control.exit_info_1 |= fault->error_code;
> > >
> > > So... what happens when exit_info_1 already has PFERR_GUEST_PAGE_MASK, and then
> > > @fault sets PFERR_GUEST_FINAL_MASK?  Presumably that can't/shouldn't happen,
> > > but nothing in the changelog explains why such a scenario is
> > > impossible, and nothing in the code hardens KVM against such goofs.
> >
> > I guess we can update the WARN below to check for that as well, and
> > fall back to the current behavior (set PFERR_GUEST_FINAL_MASK):
> >
> >       fault_stage = vmcb->control.exit_info_1 &
> >                       (PFERR_GUEST_FINAL_MASK | PFERR_GUEST_PAGE_MASK);
> >       if (WARN_ON_ONCE(fault_stage != PFERR_GUEST_FINAL_MASK &&
> >                        fault_stage != PFERR_GUEST_PAGE_MASK))
> >               vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;
>
> Except that doesn't do the right thing if both bits are set.  And we can use
> hweight64(), which is a single POPCNT on modern CPUs.
>
> Might be easiest to add something like PFERR_GUEST_FAULT_STAGE_MASK, then do:
>
>         /*
>          * All nested page faults should be annotated as occurring on the final
>          * translation *OR* the page walk.  Arbitrarily choose "final" if KVM
>          * is buggy and enumerated both or none.
>          */
>         if (WARN_ON_ONCE(hweight64(vmcb->control.exit_info_1 &
>                                    PFERR_GUEST_FAULT_STAGE_MASK) != 1)) {
>                 vmcb->control.exit_info_1 &= ~PFERR_GUEST_FAULT_STAGE_MASK;
>                 vmcb->control.exit_info_1 |= PFERR_GUEST_FINAL_MASK;
>         }
Re: [PATCH 1/3] KVM: SVM: Fix nested NPF injection to set PFERR_GUEST_{PAGE,FINAL}_MASK
Posted by Sean Christopherson 3 days, 13 hours ago
On Wed, Feb 04, 2026, Kevin Cheng wrote:
> On Wed, Jan 28, 2026 at 10:48 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 22, 2026, Yosry Ahmed wrote:
> > > On Wed, Jan 21, 2026 at 02:07:56PM -0800, Sean Christopherson wrote:
> > > > On Wed, Jan 21, 2026, Kevin Cheng wrote:
> > > > > When KVM emulates an instruction for L2 and encounters a nested page
> > > > > fault (e.g., during string I/O emulation), nested_svm_inject_npf_exit()
> > > > > injects an NPF to L1. However, the code incorrectly hardcodes
> > > > > (1ULL << 32) for exit_info_1's upper bits when the original exit was
> > > > > not an NPF. This always sets PFERR_GUEST_FINAL_MASK even when the fault
> > > > > occurred on a page table page, preventing L1 from correctly identifying
> > > > > the cause of the fault.
> > > > >
> > > > > Set PFERR_GUEST_PAGE_MASK in the error code when a nested page fault
> > > > > occurs during a guest page table walk, and PFERR_GUEST_FINAL_MASK when
> > > > > the fault occurs on the final GPA-to-HPA translation.
> > > > >
> > > > > Widen error_code in struct x86_exception from u16 to u64 to accommodate
> > > > > the PFERR_GUEST_* bits (bits 32 and 33).
> > > >
> > > > Please do this in a separate patch.  Intel CPUs straight up don't support 32-bit
> > > > error codes, let alone 64-bit error codes, so this seemingly innocuous change
> > > > needs to be accompanied by a lengthy changelog that effectively audits all usage
> > > > to "prove" this change is ok.
> > >
> > > Semi-jokingly, we can add error_code_hi to track the high bits and
> > > side-step the problem for Intel (dejavu?).
> >
> > Technically, it would require three fields: u16 error_code, u16 error_code_hi,
> > and u32 error_code_ultra_hi.  :-D
> >
> > Isolating the (ultra) hi flags is very tempting, but I worry that it would lead
> > to long term pain, e.g. because inevitably we'll forget to grab the hi flags at
> > some point.  I'd rather audit the current code and ensure that KVM truncates the
> > error code as needed.
> >
> > VMX is probably a-ok, e.g. see commit eba9799b5a6e ("KVM: VMX: Drop bits 31:16
> > when shoving exception error code into VMCS").  I'd be more worried about SVM,
> > where it's legal to shove a 32-bit value into the error code, i.e. where KVM
> > might not have existing explicit truncation.
> 
> As I understand it, Intel CPUs don't allow setting bits 31:16 of
> the error code, but AMD CPUs do.

Yep.

> The x86_exception error_code field is currently a u16, so it is always
> truncated to 16 bits by default. In that case, after widening the error code
> to 64 bits, do I have to ensure that any usage of the error code that isn't
> for NPF truncates it to 16 bits?
>
> Or do I just need to verify that all SVM usages of the error_code for
> exceptions truncate the 64 bits down to 32 bits, and all VMX usages truncate
> to 16 bits?

Hmm, good question.

I was going to say "the second one", but that's actually meaningless because
(a) "struct kvm_queued_exception" stores the error code as a u32, which it should,
and (b) event_inj_err is also a u32, i.e. it's impossible to shove a 64-bit error
code into hardware on SVM.
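
For reference, the fields in question (struct vmcb_control_area in
arch/x86/include/asm/svm.h):

	u32 event_inj;
	u32 event_inj_err;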

And thinking through this more, if there is _existing_ code that tries to set
bits > 15 in the error_code, then UBSAN would likely have detected the issue,
e.g. due to trying to OR in a "bad" value.

Aha!  A serendipitous quirk in this patch is that it does NOT change the local
error_code in FNAME(walk_addr_generic) from a u16 to a u64.

We should double down on that with a comment.  That'd give me enough confidence
that we aren't likely to break legacy shadow paging now or in the future.  E.g.
in a patch to change x86_exception.error_code to a u64, also do:

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 901cd2bd40b8..f1790aa9e391 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -317,6 +317,12 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
        const int write_fault = access & PFERR_WRITE_MASK;
        const int user_fault  = access & PFERR_USER_MASK;
        const int fetch_fault = access & PFERR_FETCH_MASK;
+       /*
+        * Note!  Track the error_code that's common to legacy shadow paging
+        * and NPT shadow paging as a u16 to guard against unintentionally
+        * setting any of bits 63:16.  Architecturally, the #PF error code is
+        * 32 bits, and Intel CPUs don't support setting bits 31:16.
+        */
        u16 errcode = 0;
        gpa_t real_gpa;
        gfn_t gfn;

> Just wanted to clarify, because the wording of that statement has me
> thinking that maybe there is something wrong with 32-bit error codes
> on SVM?

It's more that I am confident that either KVM already truncates the error code
on VMX, or that we'll notice *really* quickly, because failure to truncate an
error code will generate a VM-Fail.

On SVM, we could royally screw up an error code and it's entirely possible we
wouldn't notice until some random guest breaks in some weird way.

> If the only usage of the widened field is NPF, wouldn't it be better
> to go with an additional field like Yosry suggested (I see that VMX
> has the added exit_qualification field in the struct)?

No?  paging_tmpl.h is used to shadow all flavors of nested NPT as well as all
flavors of legacy paging.  And more importantly, unlike EPT, nested NPT isn't
otherwise special cased.  As shown by this patch, it _is_ possible to identify
nested NPT in select flows, and we could certainly do so more generically by
checking if the MMU is nested, but I'd prefer not to special case any particular
type of shadow paging without a strong reason to do so.

And more importantly, if we have two (or three) error code fields, then we need
to remember to pull data from all error code fields.  I.e. in an effort to avoid
introducing bugs, we would actually make it easier to introduce _other_ bugs.