[v1] KVM: x86: Event/exception fixes and cleanups

[PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Sean Christopherson 4 years, 3 months ago

The main goal of this series is to fix KVM's longstanding bug of not
honoring L1's exception intercepts wants when handling an exception that
occurs during delivery of a different exception.  E.g. if L0 and L1 are
using shadow paging, and L2 hits a #PF, and then hits another #PF while
vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
so that the #PF is routed to L1, not injected into L2 as a #DF.

nVMX has hacked around the bug for years by overriding the #PF injector
for shadow paging to go straight to VM-Exit, and nSVM has started doing
the same.  The hacks mostly work, but they're incomplete, confusing, and
lead to other hacky code, e.g. bailing from the emulator because #PF
injection forced a VM-Exit and suddenly KVM is back in L1.

Everything leading up to that are related fixes and cleanups I encountered
along the way; some through code inspection, some through tests (I truly
thought this series was finished 10 commits and 3 days ago...).

Nothing in here is all that urgent; all bugs tagged for stable have been
around for multiple releases (years in most cases).

Sean Christopherson (21):
  KVM: x86: Return immediately from x86_emulate_instruction() on code
    #DB
  KVM: nVMX: Unconditionally purge queued/injected events on nested
    "exit"
  KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  KVM: x86: Don't check for code breakpoints when emulating on exception
  KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
  KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
  KVM: x86: Treat #DBs from the emulator as fault-like (code and
    DR7.GD=1)
  KVM: x86: Use DR7_GD macro instead of open coding check in emulator
  KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
  KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
  KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
  KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
  KVM: x86: Make kvm_queued_exception a properly named, visible struct
  KVM: x86: Formalize blocking of nested pending exceptions
  KVM: x86: Use kvm_queue_exception_e() to queue #DF
  KVM: x86: Hoist nested event checks above event injection logic
  KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
    VM-Exit
  KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
    behavior
  KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
  KVM: selftests: Add an x86-only test to verify nested exception
    queueing

 arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
 arch/x86/include/asm/kvm_host.h               |  33 +-
 arch/x86/kvm/emulate.c                        |   3 +-
 arch/x86/kvm/svm/nested.c                     | 100 ++---
 arch/x86/kvm/svm/svm.c                        |  18 +-
 arch/x86/kvm/vmx/nested.c                     | 322 +++++++++-----
 arch/x86/kvm/vmx/sgx.c                        |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  53 ++-
 arch/x86/kvm/x86.c                            | 409 ++++++++++++------
 arch/x86/kvm/x86.h                            |  10 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   5 +-
 .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
 .../kvm/x86_64/nested_exceptions_test.c       | 307 +++++++++++++
 15 files changed, 914 insertions(+), 403 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c


base-commit: 4a204f7895878363ca8211f50ec610408c8c70aa
-- 
2.35.1.723.g4982287a31-goog

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Maciej S. Szmigiero 4 years, 3 months ago

Hi Sean,

On 11.03.2022 04:27, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception.  E.g. if L0 and L1 are
> using shadow paging, and L2 hits a #PF, and then hits another #PF while
> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> so that the #PF is routed to L1, not injected into L2 as a #DF.
> 
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same.  The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.

Looks like we were working on similar KVM area recently [1].

It look like parts of our patch sets touch the same code.
Since your patch set is much bigger and comprehensive I will base mine on
top of yours once there are no more incoming review comments for your
patch set (in other words, once it is in its final form).

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/cover.1646944472.git.maciej.szmigiero@oracle.com/

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Maxim Levitsky 4 years, 3 months ago

On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception.  E.g. if L0 and L1 are
> using shadow paging, and L2 hits a #PF, and then hits another #PF while
> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> so that the #PF is routed to L1, not injected into L2 as a #DF.
> 
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same.  The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.
> 
> Everything leading up to that are related fixes and cleanups I encountered
> along the way; some through code inspection, some through tests (I truly
> thought this series was finished 10 commits and 3 days ago...).
> 
> Nothing in here is all that urgent; all bugs tagged for stable have been
> around for multiple releases (years in most cases).
> 
> Sean Christopherson (21):
>   KVM: x86: Return immediately from x86_emulate_instruction() on code
>     #DB
>   KVM: nVMX: Unconditionally purge queued/injected events on nested
>     "exit"
>   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
>   KVM: x86: Don't check for code breakpoints when emulating on exception
>   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
>   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
>   KVM: x86: Treat #DBs from the emulator as fault-like (code and
>     DR7.GD=1)
>   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
>   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
>   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
>   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
>   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
>   KVM: x86: Make kvm_queued_exception a properly named, visible struct
>   KVM: x86: Formalize blocking of nested pending exceptions
>   KVM: x86: Use kvm_queue_exception_e() to queue #DF
>   KVM: x86: Hoist nested event checks above event injection logic
>   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
>     VM-Exit
>   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
>   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
>     behavior
>   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
>   KVM: selftests: Add an x86-only test to verify nested exception
>     queueing
> 
>  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
>  arch/x86/include/asm/kvm_host.h               |  33 +-
>  arch/x86/kvm/emulate.c                        |   3 +-
>  arch/x86/kvm/svm/nested.c                     | 100 ++---
>  arch/x86/kvm/svm/svm.c                        |  18 +-
>  arch/x86/kvm/vmx/nested.c                     | 322 +++++++++-----
>  arch/x86/kvm/vmx/sgx.c                        |   2 +-
>  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
>  arch/x86/kvm/x86.c                            | 409 ++++++++++++------
>  arch/x86/kvm/x86.h                            |  10 +-
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/include/x86_64/svm_util.h   |   5 +-
>  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
>  .../kvm/x86_64/nested_exceptions_test.c       | 307 +++++++++++++
>  15 files changed, 914 insertions(+), 403 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> 
> 
> base-commit: 4a204f7895878363ca8211f50ec610408c8c70aa

I am just curious. Are you aware that I worked on this few months ago?
I am sure that you even reviewed some of my code back then.

If so, could you have had at least mentioned this and/or pinged me to continue
working on this instead of re-implementing it?

Best regards,
	Maxim Levitsky

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Sean Christopherson 4 years, 3 months ago

On Sun, Mar 13, 2022, Maxim Levitsky wrote:
> On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
> > The main goal of this series is to fix KVM's longstanding bug of not
> > honoring L1's exception intercepts wants when handling an exception that
> > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > 
> > nVMX has hacked around the bug for years by overriding the #PF injector
> > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > lead to other hacky code, e.g. bailing from the emulator because #PF
> > injection forced a VM-Exit and suddenly KVM is back in L1.
> > 
> > Everything leading up to that are related fixes and cleanups I encountered
> > along the way; some through code inspection, some through tests (I truly
> > thought this series was finished 10 commits and 3 days ago...).
> > 
> > Nothing in here is all that urgent; all bugs tagged for stable have been
> > around for multiple releases (years in most cases).
> > 
> I am just curious. Are you aware that I worked on this few months ago?

Ah, so that's why I had a feeling of deja vu when factoring out kvm_queued_exception.
I completely forgot about it :-/  In my defense, that was nearly a year ago[1][2], though
I suppose one could argue 11 == "a few" :-)

[1] https://lore.kernel.org/all/20210225154135.405125-1-mlevitsk@redhat.com
[2] https://lore.kernel.org/all/20210401143817.1030695-3-mlevitsk@redhat.com

> I am sure that you even reviewed some of my code back then.

Yep, now that I've found the threads I remember discussing the mechanics.

> If so, could you have had at least mentioned this and/or pinged me to continue
> working on this instead of re-implementing it?

I'm invoking Hanlon's razor[*]; I certainly didn't intended to stomp over your
work, I simply forgot.

As for the technical aspects, looking back at your series, I strongly considered
taking the same approach of splitting pending vs. injected (again, without any
recollection of your work).  I ultimately opted to go with the "immediated morph
to pending VM-Exit" approach as it allows KVM to do the right thing in almost every
case without requiring new ABI, and even if KVM screws up, e.g. queues multiple
pending exceptions.  It also neatly handles one-off things like async #PF in L2.

However, I hadn't considered your approach, which addresses the ABI conundrum by
processing pending=>injected immediately after handling the VM-Exit.  I can't think
of any reason that wouldn't work, but I really don't like splitting the event
priority logic, nor do I like having two event injection sites (getting rid of the
extra calls to kvm_check_nested_events() is still on my wish list).  If we could go
back in time, I would likely vote for properly tracking injected vs. pending, but
since we're mostly stuck with KVM's ABI, I prefer the "immediately morph to pending
VM-Exit" hack over the "immediately morph to 'injected' exception" hack.

[*] https://en.wikipedia.org/wiki/Hanlon%27s_razor

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Maciej S. Szmigiero 4 years, 2 months ago

On 24.03.2022 22:31, Sean Christopherson wrote:
> On Sun, Mar 13, 2022, Maxim Levitsky wrote:
>> On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
>>> The main goal of this series is to fix KVM's longstanding bug of not
>>> honoring L1's exception intercepts wants when handling an exception that
>>> occurs during delivery of a different exception.  E.g. if L0 and L1 are
>>> using shadow paging, and L2 hits a #PF, and then hits another #PF while
>>> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
>>> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
>>> so that the #PF is routed to L1, not injected into L2 as a #DF.
>>>
>>> nVMX has hacked around the bug for years by overriding the #PF injector
>>> for shadow paging to go straight to VM-Exit, and nSVM has started doing
>>> the same.  The hacks mostly work, but they're incomplete, confusing, and
>>> lead to other hacky code, e.g. bailing from the emulator because #PF
>>> injection forced a VM-Exit and suddenly KVM is back in L1.
>>>
>>> Everything leading up to that are related fixes and cleanups I encountered
>>> along the way; some through code inspection, some through tests (I truly
>>> thought this series was finished 10 commits and 3 days ago...).
>>>
>>> Nothing in here is all that urgent; all bugs tagged for stable have been
>>> around for multiple releases (years in most cases).
>>>
>> I am just curious. Are you aware that I worked on this few months ago?
> 
> Ah, so that's why I had a feeling of deja vu when factoring out kvm_queued_exception.
> I completely forgot about it :-/  In my defense, that was nearly a year ago[1][2], though
> I suppose one could argue 11 == "a few" :-)
> 
> [1] https://lore.kernel.org/all/20210225154135.405125-1-mlevitsk@redhat.com
> [2] https://lore.kernel.org/all/20210401143817.1030695-3-mlevitsk@redhat.com
> 
>> I am sure that you even reviewed some of my code back then.
> 
> Yep, now that I've found the threads I remember discussing the mechanics.
> 
>> If so, could you have had at least mentioned this and/or pinged me to continue
>> working on this instead of re-implementing it?
> 
> I'm invoking Hanlon's razor[*]; I certainly didn't intended to stomp over your
> work, I simply forgot.
> 
> As for the technical aspects, looking back at your series, I strongly considered
> taking the same approach of splitting pending vs. injected (again, without any
> recollection of your work).  I ultimately opted to go with the "immediated morph
> to pending VM-Exit" approach as it allows KVM to do the right thing in almost every
> case without requiring new ABI, and even if KVM screws up, e.g. queues multiple
> pending exceptions.  It also neatly handles one-off things like async #PF in L2.
> 
> However, I hadn't considered your approach, which addresses the ABI conundrum by
> processing pending=>injected immediately after handling the VM-Exit.  I can't think
> of any reason that wouldn't work, but I really don't like splitting the event
> priority logic, nor do I like having two event injection sites (getting rid of the
> extra calls to kvm_check_nested_events() is still on my wish list).  If we could go
> back in time, I would likely vote for properly tracking injected vs. pending, but
> since we're mostly stuck with KVM's ABI, I prefer the "immediately morph to pending
> VM-Exit" hack over the "immediately morph to 'injected' exception" hack.

So, what's the plan here: is your patch set Sean considered to supersede
Maxim's earlier proposed changes or will you post an updated patch set
incorporating at least some of them?

I am asking because I have a series that touches the same general area
of KVM [1] and would preferably have it based on the final form of the
event injection code to avoid unforeseen negative interactions between
these changes.

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/d04e096a-b12e-91e2-204e-b3643a62d705@maciej.szmigiero.name/

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Sean Christopherson 4 years, 2 months ago

On Fri, Mar 25, 2022, Maciej S. Szmigiero wrote:
> So, what's the plan here: is your patch set Sean considered to supersede
> Maxim's earlier proposed changes or will you post an updated patch set
> incorporating at least some of them?

Next step is to reach a consensus on how we want to solve the problem (or if we
can't reach consensus, until Paolo uses his special powers).  I definitely won't
post anything new until there's more conversation.

> I am asking because I have a series that touches the same general area
> of KVM [1] and would preferably have it based on the final form of the
> event injection code to avoid unforeseen negative interactions between
> these changes.

I don't think you need to do anything, at a glance your changes are orthogonal
even though they have similar themes.  Any conflicts should be minor.

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Maciej S. Szmigiero 4 years, 2 months ago

On 26.03.2022 00:02, Sean Christopherson wrote:
> On Fri, Mar 25, 2022, Maciej S. Szmigiero wrote:
>> So, what's the plan here: is your patch set Sean considered to supersede
>> Maxim's earlier proposed changes or will you post an updated patch set
>> incorporating at least some of them?
> 
> Next step is to reach a consensus on how we want to solve the problem (or if we
> can't reach consensus, until Paolo uses his special powers).  I definitely won't
> post anything new until there's more conversation.

Ack.

>> I am asking because I have a series that touches the same general area
>> of KVM [1] and would preferably have it based on the final form of the
>> event injection code to avoid unforeseen negative interactions between
>> these changes.
> 
> I don't think you need to do anything, at a glance your changes are orthogonal
> even though they have similar themes.  Any conflicts should be minor.

All right - I will try to keep an eye on the developments related to the
event injection code just to be sure there are no obvious conflicts here.

Thanks,
Maciej

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Maxim Levitsky 4 years, 2 months ago

On Thu, 2022-03-24 at 21:31 +0000, Sean Christopherson wrote:
> On Sun, Mar 13, 2022, Maxim Levitsky wrote:
> > On Fri, 2022-03-11 at 03:27 +0000, Sean Christopherson wrote:
> > > The main goal of this series is to fix KVM's longstanding bug of not
> > > honoring L1's exception intercepts wants when handling an exception that
> > > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > > 
> > > nVMX has hacked around the bug for years by overriding the #PF injector
> > > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > > lead to other hacky code, e.g. bailing from the emulator because #PF
> > > injection forced a VM-Exit and suddenly KVM is back in L1.
> > > 
> > > Everything leading up to that are related fixes and cleanups I encountered
> > > along the way; some through code inspection, some through tests (I truly
> > > thought this series was finished 10 commits and 3 days ago...).
> > > 
> > > Nothing in here is all that urgent; all bugs tagged for stable have been
> > > around for multiple releases (years in most cases).
> > > 
> > I am just curious. Are you aware that I worked on this few months ago?
> 
> Ah, so that's why I had a feeling of deja vu when factoring out kvm_queued_exception.
> I completely forgot about it :-/  In my defense, that was nearly a year ago[1][2], though
> I suppose one could argue 11 == "a few" :-)
> 
> [1] https://lore.kernel.org/all/20210225154135.405125-1-mlevitsk@redhat.com
> [2] https://lore.kernel.org/all/20210401143817.1030695-3-mlevitsk@redhat.com
> 
> > I am sure that you even reviewed some of my code back then.
> 
> Yep, now that I've found the threads I remember discussing the mechanics.
> 
> > If so, could you have had at least mentioned this and/or pinged me to continue
> > working on this instead of re-implementing it?
> 
> I'm invoking Hanlon's razor[*]; I certainly didn't intended to stomp over your
> work, I simply forgot.

Thank you very much for the explanation, and I am glad that it was a honest mistake.

Other than that I am actually very happy that you posted this patch series,
as this gives more chance that this long standing issue will be fixed,
and if your patches are better/simpler/less invasive to KVM and still address the issue, 
I fully support using them instead of mine.
 
Totally agree with you about your thoughts about splitting pending/injected exception,
I also can't say I liked my approach that much, for the same reasons you mentioned.
 
It is also the main reason I put the whole thing on the backlog lately, 
because I was feeling that I am changing too much of the KVM, 
for a relatively theoretical issue.
 
 
I will review your patches, compare them to mine, and check if you or I missed something.

PS:

Back then, I also did an extensive review on few cases when qemu injects exceptions itself,
which it does thankfully rarely. There are several (theoretical) issues there.
I don't remember those details, I need to refresh my memory.

AFAIK, qemu injects #MC sometimes when it gets it from the kernel in form of a signal,
if I recall this correctly, and it also reflects back #DB, when guest debug was enabled
(and that is the reason for some work I did in this area, like the KVM_GUESTDBG_BLOCKIRQ thing)

Qemu does this without considering nested and/or pending exception/etc.
It just kind of abuses the KVM_SET_VCPU_EVENTS for that.

Best regards,
	Maxim Levitsky


> 
> As for the technical aspects, looking back at your series, I strongly considered
> taking the same approach of splitting pending vs. injected (again, without any
> recollection of your work).  I ultimately opted to go with the "immediated morph
> to pending VM-Exit" approach as it allows KVM to do the right thing in almost every
> case without requiring new ABI, and even if KVM screws up, e.g. queues multiple
> pending exceptions.  It also neatly handles one-off things like async #PF in L2.
> 
> However, I hadn't considered your approach, which addresses the ABI conundrum by
> processing pending=>injected immediately after handling the VM-Exit.  I can't think
> of any reason that wouldn't work, but I really don't like splitting the event
> priority logic, nor do I like having two event injection sites (getting rid of the
> extra calls to kvm_check_nested_events() is still on my wish list).  If we could go
> back in time, I would likely vote for properly tracking injected vs. pending, but
> since we're mostly stuck with KVM's ABI, I prefer the "immediately morph to pending
> VM-Exit" hack over the "immediately morph to 'injected' exception" hack.
> 
> [*] https://en.wikipedia.org/wiki/Hanlon%27s_razor
>

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Sean Christopherson 4 years, 2 months ago

On Sun, Mar 27, 2022, Maxim Levitsky wrote:
> Other than that I am actually very happy that you posted this patch series,
> as this gives more chance that this long standing issue will be fixed,
> and if your patches are better/simpler/less invasive to KVM and still address the issue, 
> I fully support using them instead of mine.

I highly doubt they're simpler or less invasive, but I do hope that the approach
wil be easier to maintain.

> Totally agree with you about your thoughts about splitting pending/injected exception,
> I also can't say I liked my approach that much, for the same reasons you mentioned.
>  
> It is also the main reason I put the whole thing on the backlog lately, 
> because I was feeling that I am changing too much of the KVM, 
> for a relatively theoretical issue.
>  
>  
> I will review your patches, compare them to mine, and check if you or I missed something.
> 
> PS:
> 
> Back then, I also did an extensive review on few cases when qemu injects exceptions itself,
> which it does thankfully rarely. There are several (theoretical) issues there.
> I don't remember those details, I need to refresh my memory.
> 
> AFAIK, qemu injects #MC sometimes when it gets it from the kernel in form of a signal,
> if I recall this correctly, and it also reflects back #DB, when guest debug was enabled
> (and that is the reason for some work I did in this area, like the KVM_GUESTDBG_BLOCKIRQ thing)
> 
> Qemu does this without considering nested and/or pending exception/etc.
> It just kind of abuses the KVM_SET_VCPU_EVENTS for that.

I wouldn't call that abuse, the ioctl() isn't just for migration.  Not checking for
a pending exception is firmly a userspace bug and not something KVM should try to
fix.

For #DB, I suspect it's a non-issue.  The exit is synchronous, so unless userspace
is deferring the reflection, which would be architecturally wrong in and of itself,
there can never be another pending exception. 

For #MC, I think the correct behavior would be to defer the synthesized #MC if there's
a pending exception and resume the guest until the exception is injected.  E.g. if a
different task encounters the real #MC, the synthesized #MC will be fully asynchronous
and may be coincident with a pending exception that is unrelated to the #MC.  That
would require to userspace to enable KVM_CAP_EXCEPTION_PAYLOAD, otherwise userspace
won't be able to differentiate between a pending and injected exception, e.g. if the
#MC occurs during exception vectoring, userspace should override the injected exception
and synthesize #MC, otherwise it would likely soft hang the guest.

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Maxim Levitsky 4 years, 2 months ago

On Mon, 2022-03-28 at 17:50 +0000, Sean Christopherson wrote:
> On Sun, Mar 27, 2022, Maxim Levitsky wrote:
> > Other than that I am actually very happy that you posted this patch series,
> > as this gives more chance that this long standing issue will be fixed,
> > and if your patches are better/simpler/less invasive to KVM and still address the issue, 
> > I fully support using them instead of mine.
> 
> I highly doubt they're simpler or less invasive, but I do hope that the approach
> wil be easier to maintain.
>   
> > Totally agree with you about your thoughts about splitting pending/injected exception,
> > I also can't say I liked my approach that much, for the same reasons you mentioned.
> >  
> > It is also the main reason I put the whole thing on the backlog lately, 
> > because I was feeling that I am changing too much of the KVM, 
> > for a relatively theoretical issue.
> >  
> >  
> > I will review your patches, compare them to mine, and check if you or I missed something.
> > 
> > PS:
> > 
> > Back then, I also did an extensive review on few cases when qemu injects exceptions itself,
> > which it does thankfully rarely. There are several (theoretical) issues there.
> > I don't remember those details, I need to refresh my memory.
> > 
> > AFAIK, qemu injects #MC sometimes when it gets it from the kernel in form of a signal,
> > if I recall this correctly, and it also reflects back #DB, when guest debug was enabled
> > (and that is the reason for some work I did in this area, like the KVM_GUESTDBG_BLOCKIRQ thing)
> > 
> > Qemu does this without considering nested and/or pending exception/etc.
> > It just kind of abuses the KVM_SET_VCPU_EVENTS for that.
> 
> I wouldn't call that abuse, the ioctl() isn't just for migration.  Not checking for
> a pending exception is firmly a userspace bug and not something KVM should try to
> fix.

yes, but to make the right decision, the userspace has to know if there is a pending
exception, and if there is, then merge it (which might even involve triple fault),

On top of that it is possible that pending excpetion is not intercepted by L1,
but merged result is, so injecting the exception will cause nested VMexit,
which is something that is hard for userspace to model.

I think that the cleanest way to do this is to add new ioctl, KVM_INJECT_EXCEPTION,
which can do the right thing in the kernel, but I am not sure that it is worth it,
knowing that thankfully userspace doesn't inject exceptions much.



> 
> For #DB, I suspect it's a non-issue.  The exit is synchronous, so unless userspace
> is deferring the reflection, which would be architecturally wrong in and of itself,
> there can never be another pending exception. 
Could very be, but still there could be corner cases. Like what if you set data fetch
breakpoint on a IDT entry of some exception? I guess during delivery of that exception
there might be #DB, but I am not 100% expert on when and how #DB is generated, so
I can't be sure. Anyway #DB isn't a big deal because qemu only re-injects it when
guest debug is enabled and that is broken anyway, and does worse things like leaking EFLAGS.TF
to the guest stack and such.


> 
> For #MC, I think the correct behavior would be to defer the synthesized #MC if there's
> a pending exception and resume the guest until the exception is injected.  E.g. if a
> different task encounters the real #MC, the synthesized #MC will be fully asynchronous
> and may be coincident with a pending exception that is unrelated to the #MC.  That
> would require to userspace to enable KVM_CAP_EXCEPTION_PAYLOAD, otherwise userspace
> won't be able to differentiate between a pending and injected exception, e.g. if the
> #MC occurs during exception vectoring, userspace should override the injected exception
> and synthesize #MC, otherwise it would likely soft hang the guest.
Something like that, I don't remember all the details.

Best regards,
	Maxim Levitsky

>

Re: [PATCH 00/21] KVM: x86: Event/exception fixes and cleanups

Posted by Sean Christopherson 4 years, 2 months ago

On Tue, Mar 29, 2022, Maxim Levitsky wrote:
> On Mon, 2022-03-28 at 17:50 +0000, Sean Christopherson wrote:
> > I wouldn't call that abuse, the ioctl() isn't just for migration.  Not checking for
> > a pending exception is firmly a userspace bug and not something KVM should try to
> > fix.
> 
> yes, but to make the right decision, the userspace has to know if there is a pending
> exception, and if there is, then merge it (which might even involve triple fault),

There's no need for userspace to ever merge exceptions unless KVM supports either
exiting to userspace on an exception that can occur during exception delivery, or
userspace itself is emulating exception delivery.  Outside of debug scenarios, #PF
is likely the only exception that might ever be forwarded to userspace.  But in
those scenarios, userspace is almost always going to fix the #PF and resume the
guest.  If userspace doesn't fix the #PF, the guest is completely hosed because
its IDT will trigger #PF, i.e. it's headed to shutdown regardless of KVM's ABI.

VM introspection is the only use case I can think of that might possibly want to
emulate exception delivery in userspace, and VMI is a completely new set of APIs,
in no small part because supporting something like this in KVM would require far
more hooks than KVM provides.

> On top of that it is possible that pending excpetion is not intercepted by L1,
> but merged result is, so injecting the exception will cause nested VMexit,
> which is something that is hard for userspace to model.
> 
> I think that the cleanest way to do this is to add new ioctl, KVM_INJECT_EXCEPTION,
> which can do the right thing in the kernel, but I am not sure that it is worth it,
> knowing that thankfully userspace doesn't inject exceptions much.
> 
> > 
> > For #DB, I suspect it's a non-issue.  The exit is synchronous, so unless userspace
> > is deferring the reflection, which would be architecturally wrong in and of itself,
> > there can never be another pending exception. 
> Could very be, but still there could be corner cases. Like what if you set data fetch
> breakpoint on a IDT entry of some exception? I guess during delivery of that exception
> there might be #DB, but I am not 100% expert on when and how #DB is generated, so
> I can't be sure.

Data #DBs are trap-like.  The #DB will arrive after exception delivery completes,
i.e. will occur "on" the first instruction in the exception handler.