[PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults

Sean Christopherson posted 16 patches 9 months ago
There is a newer version of this series
[PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Sean Christopherson 9 months ago
WARN and skip the emulated MMIO fastpath if a private, reserved page fault
is encountered, as private+reserved should be an impossible combination
(KVM should never create an MMIO SPTE for a private access).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd342ebd0809..9206cfa58feb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 		error_code |= PFERR_PRIVATE_ACCESS;
 
 	r = RET_PF_INVALID;
-	if (unlikely(error_code & PFERR_RSVD_MASK)) {
+	if (unlikely((error_code & PFERR_RSVD_MASK) &&
+		     !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
 		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
 		if (r == RET_PF_EMULATE)
 			goto emulate;
-- 
2.44.0.278.ge034bb2e1d-goog
Re: [PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Huang, Kai 8 months, 4 weeks ago

On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> WARN and skip the emulated MMIO fastpath if a private, reserved page fault
> is encountered, as private+reserved should be an impossible combination
> (KVM should never create an MMIO SPTE for a private access).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd342ebd0809..9206cfa58feb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>   		error_code |= PFERR_PRIVATE_ACCESS;
>   
>   	r = RET_PF_INVALID;
> -	if (unlikely(error_code & PFERR_RSVD_MASK)) {
> +	if (unlikely((error_code & PFERR_RSVD_MASK) &&
> +		     !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
>   		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
>   		if (r == RET_PF_EMULATE)
>   			goto emulate;

It seems this will make KVM continue to call kvm_mmu_do_page_fault() 
when such private+reserve error code actually happens (e.g., due to 
bug), because @r is still RET_PF_INVALID in such case.

Is it better to just return error, e.g., -EINVAL, and give up?
Re: [PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Sean Christopherson 8 months, 4 weeks ago
On Fri, Mar 01, 2024, Kai Huang wrote:
> 
> 
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > WARN and skip the emulated MMIO fastpath if a private, reserved page fault
> > is encountered, as private+reserved should be an impossible combination
> > (KVM should never create an MMIO SPTE for a private access).
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/mmu/mmu.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index bd342ebd0809..9206cfa58feb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> >   		error_code |= PFERR_PRIVATE_ACCESS;
> >   	r = RET_PF_INVALID;
> > -	if (unlikely(error_code & PFERR_RSVD_MASK)) {
> > +	if (unlikely((error_code & PFERR_RSVD_MASK) &&
> > +		     !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> >   		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
> >   		if (r == RET_PF_EMULATE)
> >   			goto emulate;
> 
> It seems this will make KVM continue to call kvm_mmu_do_page_fault() when
> such private+reserve error code actually happens (e.g., due to bug), because
> @r is still RET_PF_INVALID in such case.

Yep.

> Is it better to just return error, e.g., -EINVAL, and give up?

As long as there is no obvious/immediate danger to the host, no obvious way for
the "bad" behavior to cause data corruption for the guest, and continuing on has
a plausible chance of working, then KVM should generally try to continue on and
not terminate the VM.

E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
and treat the fault like a PRIVATE fault.  Hmm, but page_fault_handle_page_track()
would skip write tracking, which could theoretically cause data corruption, so I
guess arguably it would be safer to bail?

Anyone else have an opinion?  This type of bug should never escape development,
so I'm a-ok effectively killing the VM.  Unless someone has a good argument for
continuing on, I'll go with Kai's suggestion and squash this:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cedacb1b89c5..d796a162b2da 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
                error_code |= PFERR_PRIVATE_ACCESS;
 
        r = RET_PF_INVALID;
-       if (unlikely((error_code & PFERR_RSVD_MASK) &&
-                    !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
+       if (unlikely(error_code & PFERR_RSVD_MASK)) {
+               if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
+                       return -EFAULT;
+
                r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
                if (r == RET_PF_EMULATE)
                        goto emulate;
Re: [PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Huang, Kai 8 months, 4 weeks ago

On 1/03/2024 12:06 pm, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>>
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> WARN and skip the emulated MMIO fastpath if a private, reserved page fault
>>> is encountered, as private+reserved should be an impossible combination
>>> (KVM should never create an MMIO SPTE for a private access).
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>    arch/x86/kvm/mmu/mmu.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index bd342ebd0809..9206cfa58feb 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5866,7 +5866,8 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>>>    		error_code |= PFERR_PRIVATE_ACCESS;
>>>    	r = RET_PF_INVALID;
>>> -	if (unlikely(error_code & PFERR_RSVD_MASK)) {
>>> +	if (unlikely((error_code & PFERR_RSVD_MASK) &&
>>> +		     !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
>>>    		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
>>>    		if (r == RET_PF_EMULATE)
>>>    			goto emulate;
>>
>> It seems this will make KVM continue to call kvm_mmu_do_page_fault() when
>> such private+reserve error code actually happens (e.g., due to bug), because
>> @r is still RET_PF_INVALID in such case.
> 
> Yep.
> 
>> Is it better to just return error, e.g., -EINVAL, and give up?
> 
> As long as there is no obvious/immediate danger to the host, no obvious way for
> the "bad" behavior to cause data corruption for the guest, and continuing on has
> a plausible chance of working, then KVM should generally try to continue on and
> not terminate the VM.

Agreed.  But I think sometimes it is hard to tell whether there's any 
dangerous things waiting to happen, because that means we have to sanity 
check a lot of code, and when new patches arrive we need to keep that in 
mind too, which could be a nightmare in terms of maintenance.

> 
> E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
> and treat the fault like a PRIVATE fault.  Hmm, but page_fault_handle_page_track()
> would skip write tracking, which could theoretically cause data corruption, so I
> guess arguably it would be safer to bail?
> 
> Anyone else have an opinion?  This type of bug should never escape development,
> so I'm a-ok effectively killing the VM.  Unless someone has a good argument for
> continuing on, I'll go with Kai's suggestion and squash this:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cedacb1b89c5..d796a162b2da 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>                  error_code |= PFERR_PRIVATE_ACCESS;
>   
>          r = RET_PF_INVALID;
> -       if (unlikely((error_code & PFERR_RSVD_MASK) &&
> -                    !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> +       if (unlikely(error_code & PFERR_RSVD_MASK)) {
> +               if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
> +                       return -EFAULT;

-EFAULT is part of guest_memfd() memory fault ABI.  I didn't think over 
this thoroughly but do you want to return -EFAULT here?
Re: [PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Sean Christopherson 8 months, 3 weeks ago
On Fri, Mar 01, 2024, Kai Huang wrote:
> On 1/03/2024 12:06 pm, Sean Christopherson wrote:
> > E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
> > and treat the fault like a PRIVATE fault.  Hmm, but page_fault_handle_page_track()
> > would skip write tracking, which could theoretically cause data corruption, so I
> > guess arguably it would be safer to bail?
> > 
> > Anyone else have an opinion?  This type of bug should never escape development,
> > so I'm a-ok effectively killing the VM.  Unless someone has a good argument for
> > continuing on, I'll go with Kai's suggestion and squash this:
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index cedacb1b89c5..d796a162b2da 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> >                  error_code |= PFERR_PRIVATE_ACCESS;
> >          r = RET_PF_INVALID;
> > -       if (unlikely((error_code & PFERR_RSVD_MASK) &&
> > -                    !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
> > +       if (unlikely(error_code & PFERR_RSVD_MASK)) {
> > +               if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
> > +                       return -EFAULT;
> 
> -EFAULT is part of guest_memfd() memory fault ABI.  I didn't think over this
> thoroughly but do you want to return -EFAULT here?

Yes, I/we do.  There are many existing paths that can return -EFAULT from KVM_RUN
without setting run->exit_reason to KVM_EXIT_MEMORY_FAULT.  Userspace is responsible
for checking run->exit_reason on -EFAULT (and -EHWPOISON), i.e. must be prepared
to handle a "bare" -EFAULT, where for all intents and purposes "handle" means
"terminate the guest".

That's actually one of the reasons why KVM_EXIT_MEMORY_FAULT exists, it'd require
an absurd amount of work and churn in KVM to *safely* return useful information
on *all* -EFAULTs.  FWIW, I had hopes and dreams of actually doing exactly this,
but have long since abandoned those dreams.

In other words, KVM_EXIT_MEMORY_FAULT essentially communicates to userspace that
(a) userspace can likely fix whatever badness triggered the -EFAULT, and (b) that
KVM is in a state where fixing the underlying problem and resuming the guest is
safe, e.g. won't corrupt the guest (because KVM is in a half-baked state).
Re: [PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Huang, Kai 8 months, 3 weeks ago

On 5/03/2024 4:51 am, Sean Christopherson wrote:
> On Fri, Mar 01, 2024, Kai Huang wrote:
>> On 1/03/2024 12:06 pm, Sean Christopherson wrote:
>>> E.g. in this case, KVM will just skip various fast paths because of the RSVD flag,
>>> and treat the fault like a PRIVATE fault.  Hmm, but page_fault_handle_page_track()
>>> would skip write tracking, which could theoretically cause data corruption, so I
>>> guess arguably it would be safer to bail?
>>>
>>> Anyone else have an opinion?  This type of bug should never escape development,
>>> so I'm a-ok effectively killing the VM.  Unless someone has a good argument for
>>> continuing on, I'll go with Kai's suggestion and squash this:
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index cedacb1b89c5..d796a162b2da 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5892,8 +5892,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>>>                   error_code |= PFERR_PRIVATE_ACCESS;
>>>           r = RET_PF_INVALID;
>>> -       if (unlikely((error_code & PFERR_RSVD_MASK) &&
>>> -                    !WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))) {
>>> +       if (unlikely(error_code & PFERR_RSVD_MASK)) {
>>> +               if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
>>> +                       return -EFAULT;
>>
>> -EFAULT is part of guest_memfd() memory fault ABI.  I didn't think over this
>> thoroughly but do you want to return -EFAULT here?
> 
> Yes, I/we do.  There are many existing paths that can return -EFAULT from KVM_RUN
> without setting run->exit_reason to KVM_EXIT_MEMORY_FAULT.  Userspace is responsible
> for checking run->exit_reason on -EFAULT (and -EHWPOISON), i.e. must be prepared
> to handle a "bare" -EFAULT, where for all intents and purposes "handle" means
> "terminate the guest".

Right.

> 
> That's actually one of the reasons why KVM_EXIT_MEMORY_FAULT exists, it'd require
> an absurd amount of work and churn in KVM to *safely* return useful information
> on *all* -EFAULTs.  FWIW, I had hopes and dreams of actually doing exactly this,
> but have long since abandoned those dreams.

I am not sure whether we need to do that.  Perhaps it made you feel so 
after we changed to use -EFAULT to carry KVM_EXIT_MEMORY_FAULT. :-)

> 
> In other words, KVM_EXIT_MEMORY_FAULT essentially communicates to userspace that
> (a) userspace can likely fix whatever badness triggered the -EFAULT, and (b) that
> KVM is in a state where fixing the underlying problem and resuming the guest is
> safe, e.g. won't corrupt the guest (because KVM is in a half-baked state).
> 

Sure.  One small issue might be that, in a later code check, we actually 
return KVM_EXIT_MEMORY_FAULT when private fault hits RET_PF_EMULATION -- 
see your patch:

[PATCH 01/16] KVM: x86/mmu: Exit to userspace with -EFAULT if private 
fault hits emulation

So here if we just return -EFAULT w/o reporting KVM_EXIT_MEMORY_FAULT 
when private+reserved is hit, it seems there's a little bit 
inconsistency here.

But you may have concern of corrupting guest here as you mentioned above.
Re: [PATCH 08/16] KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults
Posted by Sean Christopherson 8 months, 3 weeks ago
On Wed, Mar 06, 2024, Kai Huang wrote:
> On 5/03/2024 4:51 am, Sean Christopherson wrote:
> > In other words, KVM_EXIT_MEMORY_FAULT essentially communicates to userspace that
> > (a) userspace can likely fix whatever badness triggered the -EFAULT, and (b) that
> > KVM is in a state where fixing the underlying problem and resuming the guest is
> > safe, e.g. won't corrupt the guest (because KVM is in a half-baked state).
> > 
> 
> Sure.  One small issue might be that, in a later code check, we actually
> return KVM_EXIT_MEMORY_FAULT when private fault hits RET_PF_EMULATION -- see
> your patch:
> 
> [PATCH 01/16] KVM: x86/mmu: Exit to userspace with -EFAULT if private fault
> hits emulation
> 
> So here if we just return -EFAULT w/o reporting KVM_EXIT_MEMORY_FAULT when
> private+reserved is hit, it seems there's a little bit inconsistency here.

It's intentionally inconsistent.  -EFAULT without KVM_EXIT_MEMORY_FAULT is
essentially KVM saying "something bad happened, and it can't be fixed", whereas
exiting with KVM_EXIT_MEMORY_FAULT says "there's an issue, but you may be able
to resolve it".

The ABI is a bit messy, e.g. in some ways it would be cleaner if KVM returned '0'.
But doing that in a backwards compatible way would have required a rather ugly
opt-in, and it would also make it more tedious to extend KVM_EXIT_MEMORY_FAULT,
e.g. pairing it with -EHWPOISON didn't require any new flags.