[PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups

Sean Christopherson posted 21 patches 3 years, 10 months ago
There is a newer version of this series
The main goal of this series is to fix KVM's longstanding bug of not
honoring the exception intercepts L1 wants when handling an exception that
occurs during delivery of a different exception.  E.g. if L0 and L1 are
using shadow paging, and L2 hits a #PF, and then hits another #PF while
vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
so that the #PF is routed to L1, not injected into L2 as a #DF.

nVMX has hacked around the bug for years by overriding the #PF injector
for shadow paging to go straight to VM-Exit, and nSVM has started doing
the same.  The hacks mostly work, but they're incomplete, confusing, and
lead to other hacky code, e.g. bailing from the emulator because #PF
injection forced a VM-Exit and suddenly KVM is back in L1.

Everything leading up to that consists of related fixes and cleanups I
encountered along the way, some found through code inspection, some through tests.

v2:
  - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
    overhaul.
    https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
  - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.

v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com

Sean Christopherson (21):
  KVM: nVMX: Unconditionally purge queued/injected events on nested
    "exit"
  KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  KVM: x86: Don't check for code breakpoints when emulating on exception
  KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
  KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
  KVM: x86: Treat #DBs from the emulator as fault-like (code and
    DR7.GD=1)
  KVM: x86: Use DR7_GD macro instead of open coding check in emulator
  KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
  KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
  KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
  KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
  KVM: x86: Make kvm_queued_exception a properly named, visible struct
  KVM: x86: Formalize blocking of nested pending exceptions
  KVM: x86: Use kvm_queue_exception_e() to queue #DF
  KVM: x86: Hoist nested event checks above event injection logic
  KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
    VM-Exit
  KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
  KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
    behavior
  KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
  KVM: selftests: Add an x86-only test to verify nested exception
    queueing

 arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
 arch/x86/include/asm/kvm_host.h               |  35 +-
 arch/x86/kvm/emulate.c                        |   3 +-
 arch/x86/kvm/svm/nested.c                     | 102 ++---
 arch/x86/kvm/svm/svm.c                        |  18 +-
 arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
 arch/x86/kvm/vmx/sgx.c                        |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  53 ++-
 arch/x86/kvm/x86.c                            | 404 +++++++++++-------
 arch/x86/kvm/x86.h                            |  11 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
 .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
 .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
 15 files changed, 886 insertions(+), 418 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c


base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
-- 
2.36.1.476.g0c4daa206d-goog
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 10 months ago
On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> [ cover letter and diffstat snipped ]

Next week I will review all of this patch series.

Best regards,
	Maxim Levitsky
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 10 months ago
On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> [ cover letter and diffstat snipped ]

Hi Sean and everyone!
 
 
Before I continue reviewing the patch series, I would like you to check if
I understand the monitor trap/pending debug exception/event injection
logic on VMX correctly. I have been looking at the spec for several hours and I still
have more questions than answers about it.
 
So let me state what I understand:
 
1. Event injection (aka eventinj in SVM terms):
 
  (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
 
  If I understand correctly, all event injection types, just like on SVM, simply inject;
  they never create something pending, and never drop the injection if the event is not
  allowed (e.g. if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
  for example if you try to inject type 0 (hardware interrupt) while EFLAGS.IF is 0,
  but I haven't checked this.
 
  All event injections happen right away and don't deliver any payload (like DR6), etc.
 
  Injection types 4/5/6 do the same as injection types 0/2/3, but in addition
  types 4/6 do a DPL check in the IDT, and these types can also advance RIP
  (using VM_ENTRY_INSTRUCTION_LEN) prior to pushing it onto the exception stack,
  to be consistent with the cases where these trap-like events are intercepted,
  where the interception happens at the start of the instruction despite the
  exceptions being trap-like.
 
 
2. #DB is the only trap-like exception that can be left pending for one more instruction
   if the MOV SS shadow is on (any other cases?).
   (AMD just ignores the whole thing, rightfully.)
 
   That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
   I understand that it will be written by the CPU in case we have a VM exit at the moment
   when a #DB is already pending but not yet delivered.
 
   That field can also be (sadly) used to "inject" a #DB into the guest: if the hypervisor
   sets it, this #DB will actually update DR6 and such, and might be delayed/lost.
 
 
3. Facts about MTF:
 
   * MTF as a feature is basically 'single-step the guest by generating MTF VM exits after each executed
     instruction', and is enabled in the primary execution controls.
 
   * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7;
     that has no connection to the 'feature', although usually this injection will be useful
     when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
 
   * An MTF event can be lost if a higher priority VM exit happens; this is why the SDM speaks of a
     'pending MTF', meaning that the MTF VM exit should happen unless something else prevents it
     and/or a higher priority VM exit overrides it.
 
   * An MTF event is raised (when the primary execution controls bit is enabled) when:
 
	- an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
	  (that is, the stack was switched, stuff was pushed onto the new exception stack, RIP was updated
	  to point at the handler). I am not 100% sure about this, but it seems to be what the SDM implies:
 
	  "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
	  26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
	  VM entry."
 
	- If an interrupt and/or a #DB exception happens prior to executing the first instruction of the guest,
	  then once again the MTF VM exit will happen on the first instruction of the exception/interrupt handler:
 
	  "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
	  (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
	  on the instruction boundary following delivery of the event (or any nested exception)."
 
	  That means that #DB has higher priority than MTF, but it is not specified whether that is the
	  fault-like or the trap-like #DB.
 
	- If an instruction causes an exception, then once again the MTF VM exit will happen on the first
	  instruction of the exception handler.
 
	- Otherwise, after an instruction (or REP iteration) retires.
 

If you have more facts about MTF and related stuff, and/or if I made a mistake in the above, I am all ears!

Best regards,
	Maxim Levitsky

Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 10 months ago
On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > [ cover letter and diffstat snipped ]
>
> [ questions about event injection, pending #DB and MTF snipped; see Maxim's mail above ]
> If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!

Here's a comprehensive spreadsheet on virtualizing MTF, compiled by
Peter Shier. (Just in case anyone is interested in *truly*
virtualizing the feature under KVM, rather than just setting a
VM-execution control bit in vmcs02 and calling it done.)

https://docs.google.com/spreadsheets/d/e/2PACX-1vQYP3PgY_JT42zQaR8uMp4U5LCey0qSlvMb80MLwjw-kkgfr31HqLSqAOGtdZ56aU2YdVTvfkruhuon/pubhtml
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 10 months ago
On Wed, 2022-06-29 at 08:53 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > [ cover letter and diffstat snipped ]
> > 
> > [ MTF questions snipped; see Maxim's mail above ]
> >      overrides it.
> > 
> >    * MTF event is raised (when the primary execution controls bit is enabled) when:
> > 
> >         - after an injected (vectored), aka eventinj/VM_ENTRY_INTR_INFO_FIELD, done updating the guest state
> >           (that is stack was switched, stuff was pushed to new exception stack, RIP updated to the handler)
> >           I am not 100% sure about this but this seems to be what PRM implies:
> > 
> >           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
> >           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
> >           VM entry."
> > 
> >         - If an interrupt and or #DB exception happens prior to executing first instruction of the guest,
> >           then once again MTF will happen on first instruction of the exception/interrupt handler
> > 
> >           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
> >           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
> >           on the instruction boundary following delivery of the event (or any nested exception)."
> > 
> >           That means that #DB has higher priority that MTF, but not specified if fault DB or trap DB
> > 
> >         - If instruction causes exception, once again, on first instruction of the exception handler MTF will happen.
> > 
> >         - Otherwise after an instruction (or REP iteration) retires.
> > 
> > 
> > If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!
> 
> Here's a comprehensive spreadsheet on virtualizing MTF, compiled by
> Peter Shier. (Just in case anyone is interested in *truly*
> virtualizing the feature under KVM, rather than just setting a
> VM-execution control bit in vmcs02 and calling it done.)
> 
> https://docs.google.com/spreadsheets/d/e/2PACX-1vQYP3PgY_JT42zQaR8uMp4U5LCey0qSlvMb80MLwjw-kkgfr31HqLSqAOGtdZ56aU2YdVTvfkruhuon/pubhtml

Neither can I access this document sadly :(

Best regards,
	Maxim Levitsky

> 


Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 10 months ago
On Thu, Jun 30, 2022 at 1:24 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> Neither can I access this document sadly :(
Try this one: https://docs.google.com/spreadsheets/d/1u6yjgj0Fshd31YKFJ524mwle7BhxB3yuEy9fhdSoh-0
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 10 months ago
On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > The main goal of this series is to fix KVM's longstanding bug of not
> > honoring L1's exception intercepts wants when handling an exception that
> > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > so that the #PF is routed to L1, not injected into L2 as a #DF.
> >
> > nVMX has hacked around the bug for years by overriding the #PF injector
> > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > lead to other hacky code, e.g. bailing from the emulator because #PF
> > injection forced a VM-Exit and suddenly KVM is back in L1.
> >
> > Everything leading up to that are related fixes and cleanups I encountered
> > along the way; some through code inspection, some through tests.
> >
> > v2:
> >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> >     overhaul.
> >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
> >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> >
> > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> >
> > Sean Christopherson (21):
> >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> >     "exit"
> >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> >   KVM: x86: Don't check for code breakpoints when emulating on exception
> >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> >     DR7.GD=1)
> >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> >   KVM: x86: Formalize blocking of nested pending exceptions
> >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> >   KVM: x86: Hoist nested event checks above event injection logic
> >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> >     VM-Exit
> >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> >     behavior
> >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> >   KVM: selftests: Add an x86-only test to verify nested exception
> >     queueing
> >
> >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> >  arch/x86/include/asm/kvm_host.h               |  35 +-
> >  arch/x86/kvm/emulate.c                        |   3 +-
> >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> >  arch/x86/kvm/svm/svm.c                        |  18 +-
> >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> >  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
> >  arch/x86/kvm/x86.h                            |  11 +-
> >  tools/testing/selftests/kvm/.gitignore        |   1 +
> >  tools/testing/selftests/kvm/Makefile          |   1 +
> >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> >  15 files changed, 886 insertions(+), 418 deletions(-)
> >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> >
> >
> > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
>
> Hi Sean and everyone!
>
>
> Before I continue reviewing the patch series, I would like you to check if
> I understand the monitor trap/pending debug exception/event injection
> logic on VMX correctly. I was looking at the spec for several hours and I still have more
> questions than answers about it.
>
> So let me state what I understand:
>
> 1. Event injection (aka eventinj in SVM terms):
>
>   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
>
>   If I understand correctly, all event injection types, just like on SVM, just inject,
>   and never create something pending, and/or drop the injection if the event is not
>   allowed (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry
>   (if, for example, you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0;
>   I haven't checked this).

The event is never just "dropped." If it is illegal to deliver the
event, VM-entry fails. See the second bullet under section 26.2.1.3:
VM-Entry Control Fields, in the SDM, volume 3.


>   All event injections happen right away, don't deliver any payload (like DR6), etc.

Correct.

>   Injection types 4/5/6 do the same as injection types 0/2/3, but in addition,
>   types 4/6 do a DPL check in the IDT, and these types can also advance the RIP prior
>   to pushing it to the exception stack, using VM_ENTRY_INSTRUCTION_LEN, to be consistent
>   with cases when these trap-like events are intercepted, where the interception happens
>   at the start of the instruction despite the exceptions being trap-like.

Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
at the start of the instruction. In early Intel VT-x parts, one could
not easily reinject an intercepted software interrupt or exception
using event injection, because VM-entry required a non-zero
instruction length, and the guest RIP had already advanced. On CPUs
that support a zero VM-entry instruction length (enumerated in
IA32_VMX_MISC), one can now reinject a software interrupt or
exception by setting the VM-entry instruction length to 0.

> 2. #DB is the only trap like exception that can be pending for one more instruction
>    if MOV SS shadow is on (any other cases?).

I believe that's it. I'm not entirely sure about RTM, though.

>    (AMD just ignores the whole thing, rightfully)

When you say "ignores," do you mean that AMD ignores a data breakpoint
or single-step trap generated by MOV-SS, or it ignores the fact that
delivering such a #DB trap between the MOV-SS and the subsequent
MOV-ESP will create a stack frame in the wrong place?

>    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
>    I understand that it will be written by CPU in case we have VM exit at the moment
>    where #DB is already pending but not yet delivered.
>
>    That field can also be (sadly) used to "inject" #DB to the guest, if the hypervisor sets it,
>    and this #DB will actually update DR6 and such, and might be delayed/lost.

Injecting a #DB this way (if the hypervisor just emulated MOV-SS) is
easier than emulating the next instruction or using MTF to step
through the next instruction, and getting all of the deferred #DB
delivery rules right. :-)

>
> 3. Facts about MTF:
>
>    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
>      instruction', and is enabled in primary execution controls.
>
>    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
>      and that has no connection to the 'feature', although usually this injection will be useful
>      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
>
>    * MTF event can be lost, if higher priority VM exit happens, this is why the SDM says about 'pending MTF',
>      which means that MTF vmexit should happen unless something else prevents it and/or higher priority VM exit
>      overrides it.

Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."

>    * MTF event is raised (when the primary execution controls bit is enabled) when:
>
>         - after an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
>           (that is, the stack was switched, stuff was pushed to the new exception stack, RIP updated to the handler)
>           I am not 100% sure about this but this seems to be what PRM implies:
>
>           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
>           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
>           VM entry."
>
>         - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
>           then once again the MTF VM exit will happen on the first instruction of the exception/interrupt handler
>
>           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
>           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
>           on the instruction boundary following delivery of the event (or any nested exception)."
>
>           That means that #DB has higher priority than MTF, but it is not specified whether that is the fault-like or the trap-like #DB

These are single-step, I/O and data breakpoint traps.

>         - If an instruction causes an exception, once again, the MTF VM exit will happen on the first instruction of the exception handler.
>
>         - Otherwise after an instruction (or REP iteration) retires.
>
>
> If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!

You might be interested in my augmented Table 6-2 (from volume 3 of
the SDM): https://docs.google.com/spreadsheets/d/e/2PACX-1vR8TkbSl4TqXtD62agRUs1QY3SY-98mKtOh-s8vYDzaDmDOcdfyTvlAxF9aVnHWRu7uyGhRwvHUziXT/pubhtml
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 9 months ago
On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:
> [...]
> 
> You might be interested in my augmented Table 6-2 (from volume 3 of
> the SDM): https://docs.google.com/spreadsheets/d/e/2PACX-1vR8TkbSl4TqXtD62agRUs1QY3SY-98mKtOh-s8vYDzaDmDOcdfyTvlAxF9aVnHWRu7uyGhRwvHUziXT/pubhtml
> 


This is this table, slightly processed by me:

--
=====================================================================================
My Notes:
=====================================================================================


- Events happen on the instruction boundary.

- On the instruction boundary, the previous instruction has fully finished executing,
  which means that it is retired; in other words, the arch state changes made by it
  are fully committed, and that includes the transfer to an exception handler if
  that instruction caused a fault-like exception.

  (The statement about the transfer to an exception handler is not 100% true in KVM, since
   we use hardware to inject exceptions, thus when we deal with events, we still have to
   finish the last instruction by delivering the exception it caused, if any.)

- On an instruction boundary, the next instruction might have already started executing, but none of
  its results have been committed to arch state YET.

- The events are ordered from highest priority (1) to lowest (10). The highest-priority event
  always wins, meaning that it is delivered, while all other events are lost.

=====================================================================================
Previous instruction is retired
=====================================================================================

1.0 Hardware Reset and Machine Checks
  - RESET
  - Machine Check


2.0 Trap on Task Switch
  - T flag in TSS is set, and the task switch was done by previous instruction


3.0 External Hardware Interventions
  - FLUSH
  - STOPCLK
  - SMI
  - INIT

3.5 Pending MTF VM-exit
   "System-management interrupts (SMIs), INIT signals, and higher priority events take priority over MTF
   VM exits. MTF VM exits take priority over debug-trap exceptions and lower priority events."

   - Note that MTF became pending due to previous instruction and/or injection.

4.0 #DB Traps on Previous Instruction
  - Breakpoints
  - Debug Trap Exceptions (TF flag set or data/IO breakpoint)

4.3 VMX-preemption timer expired VM-exit
   "Debug-trap exceptions and higher priority events take priority over VM exits caused by the VMX-preemption
    timer. VM exits caused by the VMX-preemption timer take priority over VM exits caused by the “NMI-window
    exiting” VM-execution control and lower priority events."

4.6 NMI-window exiting VM-exit
   "Debug-trap exceptions (see Section 26.7.3) and higher priority events take priority over VM exits caused by
   [NMI-window exiting]. VM exits caused by this control take priority over non-maskable interrupts (NMIs) and lower
   priority events."

5.0 Nonmaskable Interrupts (NMI)


5.5 Interrupt-window exiting VM-exit + Virtual-interrupt delivery (if "interrupt-window exiting" is 0)

  "Virtual-interrupt delivery has the same priority as that of VM exits due to the 1-setting of the “interrupt-window
  exiting” VM-execution control.1 Thus, non-maskable interrupts (NMIs) and higher priority events take priority over
  delivery of a virtual interrupt; delivery of a virtual interrupt takes priority over external interrupts and lower priority
  events. "

6.0 Maskable Hardware Interrupts
  - Real hardware interrupts

=====================================================================================
Execution of next instruction starts
=====================================================================================

7.0 #DB Fault on next instruction
  - Instruction breakpoint
  - General detect ?? 

8.0 Faults from Fetching Next Instruction
  - Code-Segment Limit Violation
  - Code Page Fault
  - Control protection exception (missing ENDBRANCH at target of indirect call or jump)

9.0 Faults from Decoding Next Instruction
  - Instruction length > 15 bytes
  - Invalid Opcode
  - Coprocessor Not Available

10. Faults on Executing Next Instruction
  - Overflow
  - Bound error
  - Invalid TSS
  - Segment Not Present
  - Stack fault
  - General Protection
  - Data Page Fault
  - Alignment Check
  - x86 FPU Floating-point exception
  - SIMD floating-point exception
  - Virtualization exception
  - Control protection exception
--


So here are my questions:


1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
   and the dual monitor handler figures out that #MTF was pending and re-injects it when it
   VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
   do anything about it.

   Or maybe pending MTF is saved to SMRAM somewhere.

   In case you will say that I am inventing this again, I am saying now that the above is
   just a guess.


2. For case 7, what about General Detect? Since the CPU needs to decode the instruction to raise it,
   it seems more natural to have it belong to case 9.


3. Finally, just to state it: it looks like MTF can only be lost due to SMI or a machine check,
   because the trap on task switch under VMX is a purely software thing (the task switch is emulated),
   thus the hypervisor can (though KVM doesn't) test that bit upon completion of the task switch,
   and do all the monitor-trap-related things it needs then.

   (RESET/INIT doesn't matter as that makes the CPU lose most of its state)


Best regards,
	Maxim Levitsky

Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 9 months ago
On Wed, Jul 6, 2022 at 4:55 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> 1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
>    and the dual monitor handler figures out that #MTF was pending and re-injects it when it
>    VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
>    do anything about it.
>
>    Or maybe pending MTF is saved to SMRAM somewhere.
>
>    In case you will say that I am inventing this again, I am saying now that the above is
>    just a guess.

This is covered in the SDM, volume 3, section 31.14.1: "Default
Treatment of SMI Delivery:"

The pseudocode above makes reference to the saving of VMX-critical
state. This state consists of the following:
(1) SS.DPL (the current privilege level); (2) RFLAGS.VM; (3) the
state of blocking by STI and by MOV SS (see Table 24-3 in Section
24.4.2); (4) the state of virtual-NMI blocking (only if the processor
is in VMX non-root operation and the “virtual NMIs” VM-execution
control is 1); and (5) an indication of whether an MTF VM exit is
pending (see Section 25.5.2). These data may be saved internal to the
processor or in the VMCS region of the current VMCS. Processors that
do not support SMI recognition while there is blocking by STI or by
MOV SS need not save the state of such blocking.

Saving VMX-critical state to SMRAM is not documented as an option.

> 2. For case 7, what about General Detect? Since, to raise it, the CPU needs to decode the instruction,
>    it's more natural to have it belong to case 9.

I think it actually belongs in case 10. The Intel table says,
"Fault-class Debug Exceptions (#DB due to instruction breakpoint),"
and I probably just blindly added "General Detect," because it is a
fault-class debug exception. You're right; the CPU has to decode the
instruction before it can deliver a #DB for general detect.
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Sean Christopherson 3 years, 9 months ago
On Wed, Jul 06, 2022, Jim Mattson wrote:
> On Wed, Jul 6, 2022 at 4:55 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> > 1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
> >    and the dual monitor handler figures out that #MTF was pending and re-injects it when it
> >    VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
> >    do anything about it.
> >
> >    Or maybe pending MTF is saved to SMRAM somewhere.
> >
> >    In case you will say that I am inventing this again, I am saying now that the above is
> >    just a guess.
> 
> This is covered in the SDM, volume 3, section 31.14.1: "Default
> Treatment of SMI Delivery:"
> 
> The pseudocode above makes reference to the saving of VMX-critical
> state. This state consists of the following: (1) SS.DPL (the current
> privilege level); (2) RFLAGS.VM; (3) the state of blocking by STI and
> by MOV SS (see Table 24-3 in Section 24.4.2); (4) the state of
> virtual-NMI blocking (only if the processor is in VMX non-root
> operation and the “virtual NMIs” VM-execution control is 1); and
> (5) an indication of whether an MTF VM exit is pending (see Section
> 25.5.2). These data may be saved internal to the processor or in the
> VMCS region of the current VMCS. Processors that do not support SMI
> recognition while there is blocking by STI or by MOV SS need not save
> the state of such blocking.
> 
> Saving VMX-critical state to SMRAM is not documented as an option.

Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
processor" as "undocumented SMRAM fields".  But I could also be misremembering
the SMI flows.

Regardless, I do like the idea of using vmcs12 instead of SMRAM.  That would provide
some extra motivation for moving away from KVM's broken pseudo VM-Exit implementation.
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 9 months ago
On Wed, Jul 6, 2022 at 10:52 AM Sean Christopherson <seanjc@google.com> wrote:

> Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
> processor" as "undocumented SMRAM fields".  But I could also be misremembering
> the SMI flows.

Start using reserved SMRAM, and you will regret it when the vendor
assigns some new bit of state to the same location.
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 9 months ago
On Wed, 2022-07-06 at 13:11 -0700, Jim Mattson wrote:
> On Wed, Jul 6, 2022 at 10:52 AM Sean Christopherson <seanjc@google.com> wrote:
> 
> > Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
> > processor" as "undocumented SMRAM fields".  But I could also be misremembering
> > the SMI flows.
> 
> Start using reserved SMRAM, and you will regret it when the vendor
> assigns some new bit of state to the same location.
> 
This is true to some extent, but our SMRAM layout doesn't follow the
spec anyway. That is why I asked all of you in the first place (I posted
an RFC as a good citizen) whether you prefer SMRAM or KVM internal state.

Anyway, if this is a concern, I can just save the interrupt shadow in KVM
and migrate it; it's not hard - in fact, v1 of my patches did exactly that.

Paolo, what should I do?

Best regards,
	Maxim Levitsky
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 9 months ago
On Wed, 2022-07-06 at 17:52 +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Jim Mattson wrote:
> > On Wed, Jul 6, 2022 at 4:55 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > 
> > > 1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
> > >    and the dual monitor handler figures out that #MTF was pending and re-injects it when it
> > >    VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
> > >    do anything about it.
> > > 
> > >    Or maybe pending MTF is saved to SMRAM somewhere.
> > > 
> > >    In case you will say that I am inventing this again, I am saying now that the above is
> > >    just a guess.
> > 
> > This is covered in the SDM, volume 3, section 31.14.1: "Default
> > Treatment of SMI Delivery:"
> > 
> > The pseudocode above makes reference to the saving of VMX-critical
> > state. This state consists of the following: (1) SS.DPL (the current
> > privilege level); (2) RFLAGS.VM; (3) the state of blocking by STI and
> > by MOV SS (see Table 24-3 in Section 24.4.2); (4) the state of
> > virtual-NMI blocking (only if the processor is in VMX non-root
> > operation and the “virtual NMIs” VM-execution control is 1); and
> > (5) an indication of whether an MTF VM exit is pending (see Section
> > 25.5.2). These data may be saved internal to the processor or in the
> > VMCS region of the current VMCS. Processors that do not support SMI
> > recognition while there is blocking by STI or by MOV SS need not save
> > the state of such blocking.
> > 
> > Saving VMX-critical state to SMRAM is not documented as an option.
> 
> Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
> processor" as "undocumented SMRAM fields".  But I could also be misremembering
> the SMI flows.
> 
> Regardless, I do like the idea of using vmcs12 instead of SMRAM.  That would provide
> some extra motivation for moving away from KVM's broken pseudo VM-Exit implementation.
> 

For preserving the pending MTF, I guess it makes sense to use vmcs12,
especially since we own its format.

Best regards,
	Maxim Levitsky

Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 10 months ago
On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > The main goal of this series is to fix KVM's longstanding bug of not
> > > honoring L1's exception intercepts wants when handling an exception that
> > > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > > 
> > > nVMX has hacked around the bug for years by overriding the #PF injector
> > > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > > lead to other hacky code, e.g. bailing from the emulator because #PF
> > > injection forced a VM-Exit and suddenly KVM is back in L1.
> > > 
> > > Everything leading up to that are related fixes and cleanups I encountered
> > > along the way; some through code inspection, some through tests.
> > > 
> > > v2:
> > >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> > >     overhaul.
> > >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
> > >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> > > 
> > > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> > > 
> > > Sean Christopherson (21):
> > >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> > >     "exit"
> > >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> > >   KVM: x86: Don't check for code breakpoints when emulating on exception
> > >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> > >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> > >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> > >     DR7.GD=1)
> > >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> > >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> > >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> > >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> > >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> > >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> > >   KVM: x86: Formalize blocking of nested pending exceptions
> > >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> > >   KVM: x86: Hoist nested event checks above event injection logic
> > >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> > >     VM-Exit
> > >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> > >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> > >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> > >     behavior
> > >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> > >   KVM: selftests: Add an x86-only test to verify nested exception
> > >     queueing
> > > 
> > >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> > >  arch/x86/include/asm/kvm_host.h               |  35 +-
> > >  arch/x86/kvm/emulate.c                        |   3 +-
> > >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> > >  arch/x86/kvm/svm/svm.c                        |  18 +-
> > >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> > >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> > >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> > >  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
> > >  arch/x86/kvm/x86.h                            |  11 +-
> > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> > >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> > >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> > >  15 files changed, 886 insertions(+), 418 deletions(-)
> > >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> > > 
> > > 
> > > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
> > 
> > Hi Sean and everyone!
> > 
> > 
> > Before I continue reviewing the patch series, I would like you to check if
> > I understand the monitor trap/pending debug exception/event injection
> > logic on VMX correctly. I was looking at the spec for several hours and I still have more
> > questions than answers about it.
> > 
> > So let me state what I understand:
> > 
> > 1. Event injection (aka eventinj in SVM terms):
> > 
> >   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
> > 
> >   If I understand correctly, all event injection types, just like on SVM, just inject
> >   and never create something pending, and/or drop the injection if the event is not allowed
> >   (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
> >   for example if you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0
> >   (I haven't checked this).
> 
> The event is never just "dropped." If it is illegal to deliver the
> event, VM-entry fails. See the second bullet under section 26.2.1.3:
> VM-Entry Control Fields, in the SDM, volume 3.
Yes, that is what I wanted to confirm.


> 
> 
> >   All event injections happen right away, don't deliver any payload (like DR6), etc.
> 
> Correct.
> 
> >   Injection types 4/5/6, do the same as injection types 0/2/3 but in addition to that,
> >   type 4/6 do a DPL check in IDT, and also these types can promote the RIP prior
> >   to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN to be consistent
> >   with cases when these trap like events are intercepted, where the interception happens
> >   on the start of the instruction despite exceptions being trap-like.
> 
> Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> at the start of the instruction.

Are you sure about that? 


This is what the SDM says for direct intercepts.

27.3.3 Saving RIP, RSP, RFLAGS, and SSP

"If the VM exit is due to a software exception (due to an execution of INT3 or INTO) or a privileged software
exception (due to an execution of INT1), the value saved references the INT3, INTO, or INT1 instruction
that caused that exception."


If these events were intercepted indirectly (via IDT_VECTORING_INFO_FIELD),
it's hard to understand from the PRM where the RIP points, but at least we at KVM
do read VM_EXIT_INSTRUCTION_LEN, and then stuff it into the VM_ENTRY_INSTRUCTION_LEN
when we re-inject the event.
(look at __vmx_complete_interrupts)


Also note that you can't intercept INTn on Intel at all,
but they can be intercepted indirectly via the IDT vectoring field.


> In early Intel VT-x parts, one could
> not easily reinject an intercepted software interrupt or exception
> using event injection, because VM-entry required a non-zero
> instruction length, and the guest RIP had already advanced. On CPUs
> that support a non-zero instruction length, one can now reinject a
> software interrupt or exception, by setting the VM-entry instruction
> length to 0.
> 
> > 2. #DB is the only trap like exception that can be pending for one more instruction
> >    if MOV SS shadow is on (any other cases?).
> 
> I believe that's it. I'm not entirely sure about RTM, though.
> 
> >    (AMD just ignores the whole thing, rightfully)
> 
> When you say "ignores," do you mean that AMD ignores a data breakpoint
> or single-step trap generated by MOV-SS, or it ignores the fact that
> delivering such a #DB trap between the MOV-SS and the subsequent
> MOV-ESP will create a stack frame in the wrong place?

Two things can be inferred from the SVM spec:
	- AMD doesn't distinguish between MOV SS and STI int shadow.
	- AMD has no 'pending debug exception field' in the vmcb.

I don't know what AMD does for a #DB that happens on MOV SS, nor whether
it distinguishes these internally; probably it just drops the #DB or
something.



> 
> >    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
> >    I understand that it will be written by CPU in case we have VM exit at the moment
> >    where #DB is already pending but not yet delivered.
> > 
> >    That field can also be (sadly) used to "inject" #DB to the guest, if the hypervisor sets it,
> >    and this #DB will actually update DR6 and such, and might be delayed/lost.
> 
> Injecting a #DB this way (if the hypervisor just emulated MOV-SS) is
> easier than emulating the next instruction or using MTF to step
> through the next instruction, and getting all of the deferred #DB
> delivery rules right. :-)


> 
> > 3. Facts about MTF:
> > 
> >    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
> >      instruction', and is enabled in primary execution controls.
> > 
> >    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
> >      and that has no connection to the 'feature', although usually this injection will be useful
> >      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
> > 
> >    * MTF event can be lost, if higher priority VM exit happens, this is why the SDM says about 'pending MTF',
> >      which means that MTF vmexit should happen unless something else prevents it and/or higher priority VM exit
> >      overrides it.
> 
> Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
Yes, though that would be mostly useful for nesting.

For a non-nesting hypervisor, if it figures out that a higher priority event overrode
the MTF, it can just process the MTF - why re-inject it?


> 
> >    * MTF event is raised (when the primary execution controls bit is enabled) when:
> > 
> >         - after an injected (vectored), aka eventinj/VM_ENTRY_INTR_INFO_FIELD, done updating the guest state
> >           (that is stack was switched, stuff was pushed to new exception stack, RIP updated to the handler)
> >           I am not 100% sure about this but this seems to be what PRM implies:
> > 
> >           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
> >           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
> >           VM entry."
> > 
> >         - If an interrupt and or #DB exception happens prior to executing first instruction of the guest,
> >           then once again MTF will happen on first instruction of the exception/interrupt handler
> > 
> >           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
> >           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
> >           on the instruction boundary following delivery of the event (or any nested exception)."
> > 
> >           That means that #DB has higher priority that MTF, but not specified if fault DB or trap DB
> 
> These are single-step, I/O and data breakpoint traps.

I am not sure what you mean. Single-step, I/O, and data breakpoints are indeed trap #DBs,
while general detect and code breakpoints are fault #DBs; we also have the task-switch #DB,
but since the hardware doesn't emulate task switches, that one has to be injected.

> 
> >         - If instruction causes exception, once again, on first instruction of the exception handler MTF will happen.
> > 
> >         - Otherwise after an instruction (or REP iteration) retires.
> > 
> > 
> > If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!
> 
> You might be interested in my augmented Table 6-2 (from volume 3 of
> the SDM): https://docs.google.com/spreadsheets/d/e/2PACX-1vR8TkbSl4TqXtD62agRUs1QY3SY-98mKtOh-s8vYDzaDmDOcdfyTvlAxF9aVnHWRu7uyGhRwvHUziXT/pubhtml
> 

I can't access this document for some reason (from my redhat account, which is gmail as well).

Thanks for the info,
Best regards,
	Maxim Levitsky




Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 10 months ago
On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:

> > Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> > at the start of the instruction.
>
> Are you sure about that?

I had been sure when I wrote that, but now that I see your response, I
have to question my memory. The SDM is definitely more authoritative
than I am.

> > When you say "ignores," do you mean that AMD ignores a data breakpoint
> > or single-step trap generated by MOV-SS, or it ignores the fact that
> > delivering such a #DB trap between the MOV-SS and the subsequent
> > MOV-ESP will create a stack frame in the wrong place?
>
> Two things which can be inferred from the SVM spec.
>         - AMD doesn't distinguish between MOV SS and STI int shadow.
>         - AMD has no 'pending debug exception field' in the vmcb.
>
> I don't know what AMD does for #DB that happens on MOV SS, nor if it
> does distinguish these internally,
> probably just drops the #DB or something.

Without carrying pending debug exceptions, it seems that the only two
choices are to deliver the #DB, with the exception frame in an
unintended location or to drop the #DB. The latter seems preferable,
but neither one seems good. What I don't understand is why you claim
that AMD does this "rightfully." Are you saying that anyone with the
audacity to run a debugger on legacy code deserves to be thrown in
front of a moving train?

> > Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
> Yes, though that would be mostly useful for nesting.
>
> For a non-nesting hypervisor, if it figures out that a higher priority event overrode
> the MTF, it can just process the MTF - why re-inject it?

You're right. The facility is probably just there to make MTF
virtualizable. Intel was paying much closer attention to
virtualizability by the time MTF came along.

> >
> > These are single-step, I/O and data breakpoint traps.
>
> I am not sure what you mean. single-step, IO, data breakpoints are indeed the trap #DB,
> while "general detect", code breakpoint are fault #DB, and we also have the task switch #DB, but since the hardware doesn't
> emulate the task switches, this has to be injected.

Just enumerating. No more, no less.
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 10 months ago
On Thu, 2022-06-30 at 09:28 -0700, Jim Mattson wrote:
> On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:
> > > Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> > > at the start of the instruction.
> > 
> > Are you sure about that?
> 
> I had been sure when I wrote that, but now that I see your response, I
> have to question my memory. The SDM is definitely more authoritative
> than I am.

x86 is like a fractal: the more I know it, the more I realize I don't.

> 
> > > When you say "ignores," do you mean that AMD ignores a data breakpoint
> > > or single-step trap generated by MOV-SS, or it ignores the fact that
> > > delivering such a #DB trap between the MOV-SS and the subsequent
> > > MOV-ESP will create a stack frame in the wrong place?
> > 
> > Two things which can be inferred from the SVM spec.
> >         - AMD doesn't distinguish between MOV SS and STI int shadow.
> >         - AMD has no 'pending debug exception field' in the vmcb.
> > 
> > I don't know what AMD does for #DB that happens on MOV SS, nor if it
> > does distinguish these internally,
> > probably just drops the #DB or something.
> 
> Without carrying pending debug exceptions, it seems that the only two
> choices are to deliver the #DB, with the exception frame in an
> unintended location or to drop the #DB. The latter seems preferable,
> but neither one seems good. What I don't understand is why you claim
> that AMD does this "rightfully." Are you saying that anyone with the
> audacity to run a debugger on legacy code deserves to be thrown in
> front of a moving train?

I understand what you mean; it's a tradeoff between a 100% compliant
implementation and the complexity the corner cases introduce. #DB can
already be missed in some cases, I think, especially in my experience
with debuggers, and even more so when debugging an OS.

It is a pain, as the OS naturally tries to switch tasks and process
interrupts all the time, I even added that _BLOCKIRQ flag to KVM to make it a bit better.

But still I understand what you mean, so maybe indeed VMX did it better.

> 
> > > Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
> > Yes, though that would be mostly useful for nesting.
> > 
> > For a non-nesting hypervisor, if it figures out that a higher priority event overrode
> > the MTF, it can just process the MTF - why re-inject it?
> 
> You're right. The facility is probably just there to make MTF
> virtualizable. Intel was paying much closer attention to
> virtualizability by the time MTF came along.

That makes sense.


> 
> > > These are single-step, I/O and data breakpoint traps.
> > 
> > I am not sure what you mean. single-step, IO, data breakpoints are indeed the trap #DB,
> > while "general detect", code breakpoint are fault #DB, and we also have the task switch #DB, but since the hardware doesn't
> > emulate the task switches, this has to be injected.
> 
> Just enumerating. No more, no less.
> 

All right, thank you very much for the help, especially for the tables you
provided; all of this should be enough now for me to review the patch series.

Thanks,
Best regards,
	Maxim Levitsky
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Jim Mattson 3 years, 10 months ago
On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> I can't access this document for some reason (from my redhat account, which is gmail as well).

Try this one: https://docs.google.com/spreadsheets/d/13Yp7Cdg3ZyKoeZ3Qebp3uWi7urlPNmo5CQU5zFlayzs
Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
Posted by Maxim Levitsky 3 years, 10 months ago
On Thu, 2022-06-30 at 05:17 -0700, Jim Mattson wrote:
> On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> > I can't access this document for some reason (from my redhat account, which is gmail as well).
> 
> Try this one: https://docs.google.com/spreadsheets/d/13Yp7Cdg3ZyKoeZ3Qebp3uWi7urlPNmo5CQU5zFlayzs
> 
Thanks, now I can access both documents.

Best regards,
	Maxim Levitsky