[PATCH 0/7] KVM: TDX: TD vcpu enter/exit

Adrian Hunter posted 7 patches 1 year, 2 months ago
There is a newer version of this series
Hi

This patch series introduces callbacks to facilitate the entry of a TD VCPU
and the corresponding save/restore of host state.

There are some outstanding things still to do (see below), so we expect to
post future revisions of this patch set, but please do review the current
patches so that they can be made ready for hand off to Paolo.  Also
direction is needed for "x86/virt/tdx: Add SEAMCALL wrapper to enter/exit
TDX guest" because it will affect KVM.

This patch set is one of several patch sets that are all needed to provide
the ability to run a functioning TD VM.  They have been split from the
"marker" sections of patch set "[PATCH v19 000/130] KVM TDX basic feature
support":

  https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@intel.com/

The recent patch sets are:

  TDX host: metadata reading
  TDX vCPU/VM creation
  TDX KVM MMU part 2
  TD vcpu enter/exit			<- this one
  TD vcpu exits/interrupts/hypercalls   <- still to come

Notably, a later patch set will deal with VCPU exits, interrupts and
hypercalls.

For x86 maintainers

This series has 1 commit that is an RFC that needs input from x86
maintainers:

  x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest

This is because wrapping TDH.VP.ENTER means dealing with multiple input and
output formats for the data in the argument registers. We would like the
maintainers to weigh in on the discussion of that patch.

Overview

A TD VCPU is entered via the SEAMCALL TDH.VP.ENTER. The TDX Module manages
the save/restore of guest state and, in conjunction with the SEAMCALL
interface, handles certain aspects of host state. However, there are
specific elements of the host state that require additional attention, as
detailed in the Intel TDX ABI documentation for TDH.VP.ENTER.

TDX is quite different from VMX in this regard.  For VMX, the host VMM is
heavily involved in restoring, managing and saving guest CPU state, whereas
for TDX this is handled by the TDX Module.  In that way, the TDX Module can
protect the confidentiality and integrity of TD CPU state.

The TDX Module does not save/restore all host CPU state because the host
VMM can do it more efficiently and selectively.  CPU state referred to
below is host CPU state.  Often values are already held in memory so no
explicit save is needed, and restoration may not be needed if the kernel
is not using a feature.

Outstanding things still to do:

  - how to wrap TDH.VP.ENTER SEAMCALL, refer to patch "x86/virt/tdx:
    Add SEAMCALL wrapper to enter/exit TDX guest"
  - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
    and guest_state_exit_irqoff() which comments say should be
    called from non-instrumentable code, but noinstr was removed
    at Sean's suggestion:
  	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
    noinstr is also needed to retain NMI-blocking by avoiding
    instrumented code that leads to an IRET which unblocks NMIs.
    A later patch set will deal with NMI VM-exits.
  - disallow TDX guest to use Intel PT
  	I think Tony will fix tdx_get_supported_xfam()
  - disallow PERFMON (TD attribute bit 63)
  - save/restore MSR IA32_UMWAIT_CONTROL or disallow guest
    CPUID(7,0).ECX.WAITPKG[5]
  - save/restore IA32_DEBUGCTL
  	VMX does:
  		vmx_vcpu_load() -> get_debugctlmsr()
  		vmx_vcpu_run() -> update_debugctlmsr()
  	TDX Module only preserves bits 1, 12 and 14
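
    The preserved-bits rule above can be captured in a small check.  A
    minimal sketch (bit names follow Linux's DEBUGCTLMSR definitions;
    the helper itself is hypothetical, not existing KVM code):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Bits of IA32_DEBUGCTL that the TDX Module preserves across
     * TDH.VP.ENTER, per msr_preservation.json: bit 1 (BTF), bit 12
     * (FREEZE_PERFMON_ON_PMI) and bit 14 (FREEZE_WHILE_SMM). */
    #define TDX_PRESERVED_DEBUGCTL	((1ULL << 1) | (1ULL << 12) | (1ULL << 14))

    /*
     * Hypothetical helper: a WRMSR is only needed after TDH.VP.ENTER if
     * the saved host value had any bits set outside the preserved set.
     */
    static bool debugctl_needs_restore(uint64_t host_debugctl)
    {
    	return (host_debugctl & ~TDX_PRESERVED_DEBUGCTL) != 0;
    }
    ```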

Key Details

Argument Passing: Similar to other SEAMCALLs, TDH.VP.ENTER passes
arguments through General Purpose Registers (GPRs). For the special case
of the TD guest invoking TDG.VP.VMCALL, nearly any GPR can be used,
as well as XMM0 to XMM15. Notably, RBP is not used, and Linux mandates the
TDX Module feature NO_RBP_MOD, which is enforced elsewhere. Additionally,
XMM registers are not required for the existing Guest Hypervisor
Communication Interface and are handled by existing KVM code should they be
modified by the guest.
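To illustrate the register-based calling convention, here is a minimal,
hypothetical sketch.  The struct layout and the accessor are illustrative
only and differ from KVM's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: TDH.VP.ENTER exchanges data through GPRs.  RBP is
 * deliberately absent from the layout because Linux mandates the TDX
 * Module's NO_RBP_MOD feature.
 */
struct vp_enter_args {
	uint64_t rcx, rdx, r8, r9, r10, r11, r12, r13, r14, r15;
};

/* Hypothetical accessor: pull one guest hypercall register out of the
 * argument block after a TDG.VP.VMCALL exit. */
static uint64_t vmcall_reg_r10(const struct vp_enter_args *args)
{
	return args->r10;
}
```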

Debug Register Handling: After TDH.VP.ENTER returns, registers DR0, DR1,
DR2, DR3, DR6, and DR7 are set to their architectural INIT values. Existing
KVM code already handles the restoration of host values as needed; refer to
vcpu_enter_guest(), which calls hw_breakpoint_restore().
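The decision hw_breakpoint_restore() effectively makes can be sketched as
follows (the struct is a stand-in; KVM tracks the host values differently):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Saved host debug registers (sketch only). */
struct host_drs {
	uint64_t dr0, dr1, dr2, dr3, dr6, dr7;
};

/*
 * After TDH.VP.ENTER the DRs hold their architectural INIT values, so
 * the host values only need rewriting when the host actually had
 * breakpoints armed, i.e. any of DR7's L0-L3/G0-G3 enable bits (the low
 * byte) were set.
 */
static bool drs_need_restore(const struct host_drs *saved)
{
	return (saved->dr7 & 0xff) != 0;
}
```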

MSR Restoration: Certain Model-Specific Registers (MSRs) need to be
restored post TDH.VP.ENTER. The Intel TDX ABI documentation provides a
detailed list in the msr_preservation.json file. Most MSRs do not require
restoration if the guest is not utilizing the corresponding feature. The
following features are currently assumed to be unsupported, and their MSRs
are not restored:
  PERFMON            (TD ATTRIBUTES[63])
  LBRs               (XFAM[15])
  User Interrupts    (XFAM[14])
  Intel PT           (XFAM[8])
The one feature that is supported:
  CET                (XFAM[11-12])	is restored via kvm_put_guest_fpu()
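
The feature gate implied by the list above can be expressed as a single
check.  A hypothetical sketch (bit positions per the TDX ABI; the helper
name is made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bits from the list above. */
#define TD_ATTR_PERFMON	(1ULL << 63)
#define XFAM_PT		(1ULL << 8)
#define XFAM_CET_U	(1ULL << 11)
#define XFAM_CET_S	(1ULL << 12)
#define XFAM_ULI	(1ULL << 14)
#define XFAM_LBR	(1ULL << 15)

/*
 * With PERFMON, PT, User Interrupts and LBRs disallowed, none of their
 * MSRs can change across TDH.VP.ENTER, so no restore is needed.  CET is
 * the one allowed feature; its state comes back via kvm_put_guest_fpu().
 */
static bool td_config_allowed(uint64_t attributes, uint64_t xfam)
{
	if (attributes & TD_ATTR_PERFMON)
		return false;
	if (xfam & (XFAM_PT | XFAM_ULI | XFAM_LBR))
		return false;
	return true;
}
```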

Other host MSR/Register Handling:

MSR IA32_XFD is already restored by KVM, refer to kvm_put_guest_fpu().
The TDX Module sets MSR IA32_XFD_ERR to its RESET value (0) which is
fine for the kernel.

MSR IA32_DEBUGCTL appears to have been overlooked.  According to
msr_preservation.json, the TDX Module preserves only bits 1, 12 and 14.
For VMX there is code to save and restore in vmx_vcpu_load() and
vmx_vcpu_run() respectively, but TDX does not use those functions.

MSR IA32_UARCH_MISC_CTL is not utilized by the kernel, so it is fine if
the TDX Module sets it to its RESET value.

MSR IA32_KERNEL_GS_BASE is addressed in patch "KVM: TDX: vcpu_run:
save/restore host state (host kernel gs)".

MSRs IA32_XSS and XCR0 are handled in patch "KVM: TDX: restore host xsave
state when exiting from the guest TD".

MSRs IA32_STAR, IA32_LSTAR, IA32_FMASK, and IA32_TSC_AUX are handled in
patch "KVM: TDX: restore user ret MSRs".

MSR IA32_TSX_CTRL is handled in patch "KVM: TDX: Add TSX_CTRL msr into
uret_msrs list".
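
The user-return MSR mechanism these patches build on can be sketched as
below.  Names and layout are hypothetical; the cache-only update is the
idea behind patch 1 ("Allow to update cached values in
kvm_user_return_msrs w/o wrmsr"):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct user_return_msr {
	uint64_t host;	/* value to restore before returning to userspace */
	uint64_t curr;	/* value currently in hardware (cached) */
};

/* Returns true if a WRMSR was (notionally) issued. */
static bool msr_set_value(struct user_return_msr *m, uint64_t value)
{
	if (value == m->curr)
		return false;	/* hardware already holds it: skip the WRMSR */
	/* wrmsrl(msr, value) in the kernel */
	m->curr = value;
	return true;
}

/*
 * After TDH.VP.ENTER, hardware is known to already hold @value, so only
 * the cache is refreshed — no WRMSR at all — and the later restore logic
 * stays correct.
 */
static void msr_update_cache(struct user_return_msr *m, uint64_t value)
{
	m->curr = value;
}
```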

MSR IA32_UMWAIT_CONTROL appears to have been overlooked.  The host value
needs to be restored if guest CPUID(7,0).ECX.WAITPKG[5] is 1, otherwise
that guest CPUID value needs to be disallowed.
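
The two options just described amount to a simple either/or on the CPUID
bit.  A hypothetical sketch of the policy helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_7_0_ECX_WAITPKG	(1u << 5)

/*
 * Either IA32_UMWAIT_CONTROL must be restored after TDH.VP.ENTER because
 * the guest can execute UMWAIT/TPAUSE, or the WAITPKG bit must be
 * stripped from the CPUID the guest is allowed to see.
 */
static bool umwait_ctrl_needs_restore(uint32_t guest_cpuid_7_0_ecx)
{
	return (guest_cpuid_7_0_ecx & CPUID_7_0_ECX_WAITPKG) != 0;
}

static uint32_t disallow_waitpkg(uint32_t guest_cpuid_7_0_ecx)
{
	return guest_cpuid_7_0_ecx & ~CPUID_7_0_ECX_WAITPKG;
}
```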

Additional Notes

The patch "KVM: TDX: Implement TDX vcpu enter/exit path" highlights that
TDX does not support "PAUSE-loop exiting".  According to the TDX Module
Base arch. spec., hypercalls are expected to be used instead.  Note that
the Linux TDX guest supports existing hypercalls via TDG.VP.VMCALL.

Base

This series is based off of a kvm-coco-queue commit and some pre-req
series:
1. commit ee69eb746754 ("KVM: x86/mmu: Prevent aliased memslot GFNs") (in
   kvm-coco-queue).
2. v7 of "TDX host: metadata reading tweaks, bug fix and info dump" [1].
3. v1 of "KVM: VMX: Initialize TDX when loading KVM module" [2], with some
   new feedback from Sean.
4. v2 of “TDX vCPU/VM creation” [3]
5. v2 of "TDX KVM MMU part 2" [4]

It requires TDX module 1.5.06.00.0744[5], or later. This is due to removal
of the workarounds for the lack of the NO_RBP_MOD feature required by the
kernel. Now NO_RBP_MOD is enabled (in VM/vCPU creation patches), and this
particular version of the TDX module has a required NO_RBP_MOD related bug
fix.
A working edk2 commit is 95d8a1c ("UnitTestFrameworkPkg: Use TianoCore
mirror of subhook submodule").

Testing

The series has been tested as part of the development branch for the TDX
base series.  The testing consisted of TDX kvm-unit-tests, booting a
Linux TD, and TDX-enhanced KVM selftests.

The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20

Matching QEMU:
https://github.com/intel-staging/qemu-tdx/commits/tdx-qemu-upstream-v6.1/

[0] https://lore.kernel.org/kvm/20240904030751.117579-1-rick.p.edgecombe@intel.com/
[1] https://lore.kernel.org/kvm/cover.1731318868.git.kai.huang@intel.com/#t
[2] https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com/
[3] https://lore.kernel.org/kvm/20241030190039.77971-1-rick.p.edgecombe@intel.com/
[4] https://lore.kernel.org/kvm/20241112073327.21979-1-yan.y.zhao@intel.com/
[5] https://github.com/intel/tdx-module/releases/tag/TDX_1.5.06

Chao Gao (1):
      KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
      wrmsr

Isaku Yamahata (4):
      KVM: TDX: Implement TDX vcpu enter/exit path
      KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
      KVM: TDX: restore host xsave state when exit from the guest TD
      KVM: TDX: restore user ret MSRs

Kai Huang (1):
      x86/virt/tdx: Add SEAMCALL wrapper to enter/exit TDX guest

Yang Weijiang (1):
      KVM: TDX: Add TSX_CTRL msr into uret_msrs list

 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/include/asm/tdx.h      |   1 +
 arch/x86/kvm/vmx/main.c         |  45 ++++++++-
 arch/x86/kvm/vmx/tdx.c          | 212 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h          |  14 +++
 arch/x86/kvm/vmx/x86_ops.h      |   9 ++
 arch/x86/kvm/x86.c              |  24 ++++-
 arch/x86/virt/vmx/tdx/tdx.c     |   8 ++
 arch/x86/virt/vmx/tdx/tdx.h     |   1 +
 9 files changed, 306 insertions(+), 9 deletions(-)

Regards
Adrian
Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Paolo Bonzini 1 year, 1 month ago
Applied to kvm-coco-queue, thanks.

Paolo
Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Binbin Wu 1 year, 2 months ago


On 11/22/2024 4:14 AM, Adrian Hunter wrote:
[...]
>    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
>      and guest_state_exit_irqoff() which comments say should be
>      called from non-instrumentable code but noinst was removed
>      at Sean's suggestion:
>    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
>      noinstr is also needed to retain NMI-blocking by avoiding
>      instrumented code that leads to an IRET which unblocks NMIs.
>      A later patch set will deal with NMI VM-exits.
>
In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
"The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
that applies to TDX.  Either that, or there are some massive bugs lurking due to
missing code."

I don't understand why handling NMI VM-Exits with NMIs blocked doesn't apply to
TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
noinstr section to avoid NMIs being unblocked by an instrumentation-induced
fault.

Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Sean Christopherson 1 year, 2 months ago
On Mon, Nov 25, 2024, Binbin Wu wrote:
> On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> [...]
> >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> >      and guest_state_exit_irqoff() which comments say should be
> >      called from non-instrumentable code but noinst was removed
> >      at Sean's suggestion:
> >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> >      noinstr is also needed to retain NMI-blocking by avoiding
> >      instrumented code that leads to an IRET which unblocks NMIs.
> >      A later patch set will deal with NMI VM-exits.
> > 
> In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> that applies to TDX.  Either that, or there are some massive bugs lurking due to
> missing code."
> 
> I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> fault.

With TDX, SEAMCALL is mechanically a VM-Exit.  KVM is the "guest" running in VMX
root mode, and the TDX-Module is the "host", running in SEAM root mode.

And for TDH.VP.ENTER, if a hardware NMI arrives while the TDX guest is active,
the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).

So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
are never blocked.

TDX is also different because KVM isn't responsible for context switching guest
state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
with a host value, e.g. due to taking a #PF because instrumentation triggered something.

All that said, I did forget that code that runs between guest_state_enter_irqoff()
and guest_state_exit_irqoff() can't be instrumented.  And at least as of patch 2
in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
as noinstr.  Just please make sure nothing else is added in the noinstr section
unless it absolutely needs to be there.
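
The shape Sean is asking for might look like the sketch below.  The stubs
exist only so the sketch is self-contained; in the kernel, noinstr is a
section attribute and the guest_state_*_irqoff() helpers are the real
context-tracking functions:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for kernel infrastructure (illustrative only). */
#define noinstr			/* no-op outside the kernel */
static int in_guest_ctx;
static void guest_state_enter_irqoff(void) { in_guest_ctx = 1; }
static void guest_state_exit_irqoff(void)  { in_guest_ctx = 0; }
static uint64_t tdh_vp_enter(void)         { return 0; /* success */ }

/*
 * The whole function is tagged noinstr; the window between enter and
 * exit contains nothing but the SEAMCALL, and nothing else should be
 * added inside the noinstr section unless it absolutely must be there.
 */
static noinstr uint64_t tdx_vcpu_enter_exit(void)
{
	uint64_t err;

	guest_state_enter_irqoff();
	err = tdh_vp_enter();
	guest_state_exit_irqoff();

	return err;
}
```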
Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Huang, Kai 1 year, 2 months ago
On Mon, 2024-11-25 at 07:19 -0800, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Binbin Wu wrote:
> > On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> > [...]
> > >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> > >      and guest_state_exit_irqoff() which comments say should be
> > >      called from non-instrumentable code but noinst was removed
> > >      at Sean's suggestion:
> > >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> > >      noinstr is also needed to retain NMI-blocking by avoiding
> > >      instrumented code that leads to an IRET which unblocks NMIs.
> > >      A later patch set will deal with NMI VM-exits.
> > > 
> > In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> > "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> > like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> > that applies to TDX.  Either that, or there are some massive bugs lurking due to
> > missing code."
> > 
> > I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> > TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> > noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> > fault.
> 
> With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
> root mode, and the TDX-Module is the "host", running in SEAM root mode.
> 
> And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
> the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
> from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
> is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).
> 
> So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
> are never blocked.

I thought NMI remains being blocked after SEAMRET?

The TDX CPU architecture extension spec says:

"
On transition to SEAM VMX root operation, the processor can inhibit NMI and SMI.
While inhibited, if these events occur, then they are tailored to stay pending
and be delivered when the inhibit state is removed. NMI and external interrupts
can be uninhibited in SEAM VMX-root operation. In SEAM VMX-root operation,
MSR_INTR_PENDING can be read to help determine status of any pending events.

On transition to SEAM VMX non-root operation using a VM entry, NMI and INTR
inhibit states are, by design, updated based on the configuration of the TD VMCS
used to perform the VM entry.

...

On transition to legacy VMX root operation using SEAMRET, the NMI and SMI
inhibit state can be restored to the inhibit state at the time of the previous
SEAMCALL and any pending NMI/SMI would be delivered if not inhibited.
"

Here NMI is inhibited in SEAM VMX root, but is never inhibited in VMX root.  

And the last paragraph does say "any pending NMI would be delivered if not
inhibited".  

But I thought this applies to the case when "NMI happens in SEAM VMX root", but
not "NMI happens in SEAM VMX non-root"?  I thought the NMI is already
"delivered" when CPU is in "SEAM VMX non-root", but I guess I was wrong here..

> 
> TDX is also different because KVM isn't responsible for context switching guest
> state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
> where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
> with a host value, e.g. due to take a #PF due instrumentation triggering something.
> 
> All that said, I did forget that code that runs between guest_state_enter_irqoff()
> and guest_state_exit_irqoff() can't be instrumeneted.  And at least as of patch 2
> in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
> as noinstr.  Just please make sure nothing else is added in the noinstr section
> unless it absolutely needs to be there.

If NMI is not a concern, is below also an option?

	guest_state_enter_irqoff();

	instrumentation_begin();
	tdh_vp_enter();
	instrumentation_end();

	guest_state_exit_irqoff();

Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Sean Christopherson 1 year, 2 months ago
On Mon, Nov 25, 2024, Kai Huang wrote:
> On Mon, 2024-11-25 at 07:19 -0800, Sean Christopherson wrote:
> > On Mon, Nov 25, 2024, Binbin Wu wrote:
> > > On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> > > [...]
> > > >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> > > >      and guest_state_exit_irqoff() which comments say should be
> > > >      called from non-instrumentable code but noinst was removed
> > > >      at Sean's suggestion:
> > > >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> > > >      noinstr is also needed to retain NMI-blocking by avoiding
> > > >      instrumented code that leads to an IRET which unblocks NMIs.
> > > >      A later patch set will deal with NMI VM-exits.
> > > > 
> > > In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> > > "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> > > like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> > > that applies to TDX.  Either that, or there are some massive bugs lurking due to
> > > missing code."
> > > 
> > > I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> > > TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> > > noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> > > fault.
> > 
> > With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
> > root mode, and the TDX-Module is the "host", running in SEAM root mode.
> > 
> > And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
> > the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
> > from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
> > is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).
> > 
> > So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
> > are never blocked.
> 
> I thought NMI remains being blocked after SEAMRET?

No, because NMIs weren't blocked at SEAMCALL.

> The TDX CPU architecture extension spec says:
> 
> "
> On transition to SEAM VMX root operation, the processor can inhibit NMI and SMI.
> While inhibited, if these events occur, then they are tailored to stay pending
> and be delivered when the inhibit state is removed. NMI and external interrupts
> can be uninhibited in SEAM VMX-root operation. In SEAM VMX-root operation,
> MSR_INTR_PENDING can be read to help determine status of any pending events.
> 
> On transition to SEAM VMX non-root operation using a VM entry, NMI and INTR
> inhibit states are, by design, updated based on the configuration of the TD VMCS
> used to perform the VM entry.
> 
> ...
> 
> On transition to legacy VMX root operation using SEAMRET, the NMI and SMI
> inhibit state can be restored to the inhibit state at the time of the previous
> SEAMCALL and any pending NMI/SMI would be delivered if not inhibited.
> "
> 
> Here NMI is inhibited in SEAM VMX root, but is never inhibited in VMX root.  

Yep.

> And the last paragraph does say "any pending NMI would be delivered if not
> inhibited".  

That's referring to the scenario where an NMI becomes pending while the CPU is in
SEAM, i.e. has NMIs blocked.

> But I thought this applies to the case when "NMI happens in SEAM VMX root", but
> not "NMI happens in SEAM VMX non-root"?  I thought the NMI is already
> "delivered" when CPU is in "SEAM VMX non-root", but I guess I was wrong here..

When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
TDX-Module that propagates that blocking to SEAMCALL VMCS.

Hmm, actually, this means that TDX has a causality inversion, which may become
visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
and handled before NMI X.

So the TDX-Module needs something like this:

diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
index eecfb2e..b5c17c3 100644
--- a/src/td_transitions/td_exit.c
+++ b/src/td_transitions/td_exit.c
@@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
         load_xmms_by_mask(tdvps_ptr, xmm_select);
     }
 
+    if (<is NMI VM-Exit => SEAMRET)
+    {
+        set_guest_inter_blocking_by_nmi();
+    }
+
     // 7.   Run the common SEAMRET routine.
     tdx_vmm_post_dispatching();


and then KVM should indeed handle NMI exits prior to leaving the noinstr section.
 
> > TDX is also different because KVM isn't responsible for context switching guest
> > state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
> > where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
> > with a host value, e.g. due to take a #PF due instrumentation triggering something.
> > 
> > All that said, I did forget that code that runs between guest_state_enter_irqoff()
> > and guest_state_exit_irqoff() can't be instrumeneted.  And at least as of patch 2
> > in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
> > as noinstr.  Just please make sure nothing else is added in the noinstr section
> > unless it absolutely needs to be there.
> 
> If NMI is not a concern, is below also an option?

No, because instrumentation needs to be prohibited for the entire time between
guest_state_enter_irqoff() and guest_state_exit_irqoff().

> 	guest_state_enter_irqoff();
> 
> 	instructmentation_begin();
> 	tdh_vp_enter();
> 	instructmentation_end();
> 
> 	guest_state_exit_irqoff();
> 
Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Huang, Kai 1 year, 2 months ago
On Mon, 2024-11-25 at 14:51 -0800, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Kai Huang wrote:
> > On Mon, 2024-11-25 at 07:19 -0800, Sean Christopherson wrote:
> > > On Mon, Nov 25, 2024, Binbin Wu wrote:
> > > > On 11/22/2024 4:14 AM, Adrian Hunter wrote:
> > > > [...]
> > > > >    - tdx_vcpu_enter_exit() calls guest_state_enter_irqoff()
> > > > >      and guest_state_exit_irqoff() which comments say should be
> > > > >      called from non-instrumentable code but noinst was removed
> > > > >      at Sean's suggestion:
> > > > >    	https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com/
> > > > >      noinstr is also needed to retain NMI-blocking by avoiding
> > > > >      instrumented code that leads to an IRET which unblocks NMIs.
> > > > >      A later patch set will deal with NMI VM-exits.
> > > > > 
> > > > In https://lore.kernel.org/all/Zg8tJspL9uBmMZFO@google.com, Sean mentioned:
> > > > "The reason the VM-Enter flows for VMX and SVM need to be noinstr is they do things
> > > > like load the guest's CR2, and handle NMI VM-Exits with NMIs blocks.  None of
> > > > that applies to TDX.  Either that, or there are some massive bugs lurking due to
> > > > missing code."
> > > > 
> > > > I don't understand why handle NMI VM-Exits with NMIs blocks doesn't apply to
> > > > TDX.  IIUIC, similar to VMX, TDX also needs to handle the NMI VM-exit in the
> > > > noinstr section to avoid the unblock of NMIs due to instrumentation-induced
> > > > fault.
> > > 
> > > With TDX, SEAMCALL is mechnically a VM-Exit.  KVM is the "guest" running in VMX
> > > root mode, and the TDX-Module is the "host", running in SEAM root mode.
> > > 
> > > And for TDH.VP.ENTER, if a hardware NMI arrives with the TDX guest is active,
> > > the initial NMI VM-Exit, which consumes the NMI and blocks further NMIs, goes
> > > from SEAM non-root to SEAM root.  The SEAMRET from SEAM root to VMX root (KVM)
> > > is effectively a VM-Enter, and does NOT block NMIs in VMX root (at least, AFAIK).
> > > 
> > > So trying to handle the NMI "exit" in a noinstr section is pointless because NMIs
> > > are never blocked.
> > 
> > I thought NMI remains being blocked after SEAMRET?
> 
> No, because NMIs weren't blocked at SEAMCALL.
> 
> > The TDX CPU architecture extension spec says:
> > 
> > "
> > On transition to SEAM VMX root operation, the processor can inhibit NMI and SMI.
> > While inhibited, if these events occur, then they are tailored to stay pending
> > and be delivered when the inhibit state is removed. NMI and external interrupts
> > can be uninhibited in SEAM VMX-root operation. In SEAM VMX-root operation,
> > MSR_INTR_PENDING can be read to help determine status of any pending events.
> > 
> > On transition to SEAM VMX non-root operation using a VM entry, NMI and INTR
> > inhibit states are, by design, updated based on the configuration of the TD VMCS
> > used to perform the VM entry.
> > 
> > ...
> > 
> > On transition to legacy VMX root operation using SEAMRET, the NMI and SMI
> > inhibit state can be restored to the inhibit state at the time of the previous
> > SEAMCALL and any pending NMI/SMI would be delivered if not inhibited.
> > "
> > 
> > Here NMI is inhibited in SEAM VMX root, but is never inhibited in VMX root.  
> 
> Yep.
> 
> > And the last paragraph does say "any pending NMI would be delivered if not
> > inhibited".  
> 
> That's referring to the scenario where an NMI becomes pending while the CPU is in
> SEAM, i.e. has NMIs blocked.
> 
> > But I thought this applies to the case when "NMI happens in SEAM VMX root", but
> > not "NMI happens in SEAM VMX non-root"?  I thought the NMI is already
> > "delivered" when CPU is in "SEAM VMX non-root", but I guess I was wrong here..
> 
> When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> TDX-Module that propagates that blocking to SEAMCALL VMCS.

Oh, I didn't read the module code, but was trying to find clues in the
TDX specs.  It was a surprise to me that the VMX case and the TDX case have
different behaviour in terms of "NMI blocking when exiting to the _host_ VMM".

I was thinking SEAMRET (or hardware in general) should have done something to
make sure of it.

> 
> Hmm, actually, this means that TDX has a causality inversion, which may become
> visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> and handled before NMI X.

Sorry, NMI X was acked by the CPU before NMI X+1, so why is NMI X+1 delivered
before NMI X?

> 
> So the TDX-Module needs something like this:
> 
> diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
> index eecfb2e..b5c17c3 100644
> --- a/src/td_transitions/td_exit.c
> +++ b/src/td_transitions/td_exit.c
> @@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
>          load_xmms_by_mask(tdvps_ptr, xmm_select);
>      }
>  
> +    if (<is NMI VM-Exit => SEAMRET)
> +    {
> +        set_guest_inter_blocking_by_nmi();
> +    }
> +
>      // 7.   Run the common SEAMRET routine.
>      tdx_vmm_post_dispatching();
> 
> 
> and then KVM should indeed handle NMI exits prior to leaving the noinstr section.

Yeah, to me it should be done unconditionally, as that gives the same behaviour
as the normal VMX VM-Exit case, i.e. NMI is left blocked after exiting to the
host VMM.  The NMI exit reason is passed to the host VMM anyway.

If the NMI is handled immediately after SEAMRET, KVM won't have any chance to do
additional things before handling the NMI, like below for VMX:

  kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
  // call NMI handling routine
  kvm_after_interrupt(vcpu);

I suppose this should be a concern?


Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Binbin Wu 1 year, 2 months ago


On 11/26/2024 6:51 AM, Sean Christopherson wrote:

[...]
> When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> TDX-Module that propagates that blocking to SEAMCALL VMCS.
I see, thanks for the explanation!

>
> Hmm, actually, this means that TDX has a causality inversion, which may become
> visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> and handled before NMI X.

This example can also cause an issue without FRED.
1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
2. NMI X+1 becomes pending while SEAM root is active.
3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
4. NMI X+1 is delivered and handled before NMI X.
    (NMI handler could handle all NMI source events, including the source
     triggered NMI X)
5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
In step 5, because the source event that caused NMI X has already been handled,
and NMI X will not be detected as the second half of back-to-back NMIs, the
Linux NMI handler will consider it an unknown NMI.

Actually, the issue can happen whenever NMI X+1 occurs after the exit to SEAM
root mode and before KVM handles the VM-Exit caused by NMI X.
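
The ordering inversion can be sketched as a toy model (purely illustrative;
no TDX semantics beyond the delivery order):

```c
#include <assert.h>

/* Toy model: NMI X is consumed by the VM-Exit in SEAM non-root; NMI X+1
 * pends while SEAM root runs with NMIs blocked.  Because SEAMRET does not
 * propagate NMI blocking to VMX root, NMI X+1 is delivered on return,
 * before KVM's handler ever sees X's exit. */
static int delivered[2];
static int n_delivered;

static void deliver(int nmi)
{
	delivered[n_delivered++] = nmi;
}

/* Returns the NMI the host handles first. */
static int seamret_without_blocking(int exit_nmi, int pending_nmi)
{
	deliver(pending_nmi);	/* unblocked at SEAMRET: X+1 fires first */
	deliver(exit_nmi);	/* only now does KVM handle the exit for X */
	return delivered[0];
}
```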


>
> So the TDX-Module needs something like this:
>
> diff --git a/src/td_transitions/td_exit.c b/src/td_transitions/td_exit.c
> index eecfb2e..b5c17c3 100644
> --- a/src/td_transitions/td_exit.c
> +++ b/src/td_transitions/td_exit.c
> @@ -527,6 +527,11 @@ void td_vmexit_to_vmm(uint8_t vcpu_state, uint8_t last_td_exit, uint64_t scrub_m
>           load_xmms_by_mask(tdvps_ptr, xmm_select);
>       }
>   
> +    if (<is NMI VM-Exit => SEAMRET>)
> +    {
> +        set_guest_inter_blocking_by_nmi();
> +    }
> +
>       // 7.   Run the common SEAMRET routine.
>       tdx_vmm_post_dispatching();
>
>
> and then KVM should indeed handle NMI exits prior to leaving the noinstr section.
>   
>>> TDX is also different because KVM isn't responsible for context switching guest
>>> state.  Specifically, CR2 is managed by the TDX Module, and so there is no window
>>> where KVM runs with guest CR2, and thus there is no risk of clobbering guest CR2
>>> with a host value, e.g. due to taking a #PF because instrumentation triggered something.
>>>
>>> All that said, I did forget that code that runs between guest_state_enter_irqoff()
>>> and guest_state_exit_irqoff() can't be instrumented.  And at least as of patch 2
>>> in this series, the simplest way to make that happen is to tag tdx_vcpu_enter_exit()
>>> as noinstr.  Just please make sure nothing else is added in the noinstr section
>>> unless it absolutely needs to be there.
>> If NMI is not a concern, is below also an option?
> No, because instrumentation needs to be prohibited for the entire time between
> guest_state_enter_irqoff() and guest_state_exit_irqoff().
>
>> 	guest_state_enter_irqoff();
>>
>> 	instrumentation_begin();
>> 	tdh_vp_enter();
>> 	instrumentation_end();
>>
>> 	guest_state_exit_irqoff();
>>

Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Huang, Kai 1 year, 2 months ago
On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
> 
> 
> On 11/26/2024 6:51 AM, Sean Christopherson wrote:
> 
> [...]
> > When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> > performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> > TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> > load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> > TDX-Module that propagates that blocking to SEAMCALL VMCS.
> I see, thanks for the explanation!
> 
> > 
> > Hmm, actually, this means that TDX has a causality inversion, which may become
> > visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> > and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> > TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> > and handled before NMI X.
> 
> This example can also cause an issue without FRED.
> 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
> 2. NMI X+1 becomes pending while SEAM root is active.
> 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
> 4. NMI X+1 is delivered and handled before NMI X.
>     (NMI handler could handle all NMI source events, including the source
>      triggered NMI X)
> 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
> In step 5, because the source event caused NMI X has been handled, and NMI X
> will not be detected as a second half of back-to-back NMIs, according to
> Linux NMI handler, it will be considered as an unknown NMI.

I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
SEAMRET.

> 
> Actually, the issue could happen if NMI X+1 occurs after exiting to SEAM root
> mode and before KVM handling the VM-Exit caused by NMI X.
> 

If we can make sure NMI is still blocked upon SEAMRET then everything follows
the current VMX flow IIUC.  We should make that happen IMHO.


Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Binbin Wu 1 year, 2 months ago


On 11/26/2024 11:52 AM, Huang, Kai wrote:
> On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
>>
>> On 11/26/2024 6:51 AM, Sean Christopherson wrote:
>>
>> [...]
>>> When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
>>> performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
>>> TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
>>> load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
>>> TDX-Module that propagates that blocking to SEAMCALL VMCS.
>> I see, thanks for the explanation!
>>
>>> Hmm, actually, this means that TDX has a causality inversion, which may become
>>> visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
>>> and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
>>> TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
>>> and handled before NMI X.
>> This example can also cause an issue without FRED.
>> 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
>> 2. NMI X+1 becomes pending while SEAM root is active.
>> 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
>> 4. NMI X+1 is delivered and handled before NMI X.
>>      (NMI handler could handle all NMI source events, including the source
>>       triggered NMI X)
>> 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
>> In step 5, because the source event caused NMI X has been handled, and NMI X
>> will not be detected as a second half of back-to-back NMIs, according to
>> Linux NMI handler, it will be considered as an unknown NMI.
> I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
> SEAMRET.

IIUC, KVM has to, because the NMI that triggered the VM-Exit won't cause the
NMI handler to be invoked automatically, even if NMIs are unblocked upon SEAMRET.

>
>> Actually, the issue could happen if NMI X+1 occurs after exiting to SEAM root
>> mode and before KVM handling the VM-Exit caused by NMI X.
>>
> If we can make sure NMI is still blocked upon SEAMRET then everything follows
> the current VMX flow IIUC.  We should make that happen IMHO.
>
>
Agree.
Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Sean Christopherson 1 year, 2 months ago
On Tue, Nov 26, 2024, Binbin Wu wrote:
> On 11/26/2024 11:52 AM, Huang, Kai wrote:
> > On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
> > > 
> > > On 11/26/2024 6:51 AM, Sean Christopherson wrote:
> > > 
> > > [...]
> > > > When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> > > > performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> > > > TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> > > > load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> > > > TDX-Module that propagates that blocking to SEAMCALL VMCS.
> > > I see, thanks for the explanation!
> > > 
> > > > Hmm, actually, this means that TDX has a causality inversion, which may become
> > > > visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> > > > and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> > > > TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> > > > and handled before NMI X.
> > > This example can also cause an issue without FRED.
> > > 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
> > > 2. NMI X+1 becomes pending while SEAM root is active.
> > > 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
> > > 4. NMI X+1 is delivered and handled before NMI X.
> > >      (NMI handler could handle all NMI source events, including the source
> > >       triggered NMI X)
> > > 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
> > > In step 5, because the source event caused NMI X has been handled, and NMI X
> > > will not be detected as a second half of back-to-back NMIs, according to
> > > Linux NMI handler, it will be considered as an unknown NMI.
> > I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
> > SEAMRET.
> 
> IIUC, KVM has to, because the NMI triggered the VM-Exit can't trigger the
> NMI handler to be invoked automatically even if NMI is unblocked upon SEAMRET.

Yep.  The NMI is consumed by the VM-Exit, for all intents and purposes.  KVM must
manually invoke the NMI handler.

Which is how the ordering gets messed up: NMI X+1 arrives before KVM has a chance
to manually invoke the handler for NMI X.
Re: [PATCH 0/7] KVM: TDX: TD vcpu enter/exit
Posted by Huang, Kai 1 year, 2 months ago
On Tue, 2024-11-26 at 13:29 +0800, Binbin Wu wrote:
> 
> 
> On 11/26/2024 11:52 AM, Huang, Kai wrote:
> > On Tue, 2024-11-26 at 09:44 +0800, Binbin Wu wrote:
> > > 
> > > On 11/26/2024 6:51 AM, Sean Christopherson wrote:
> > > 
> > > [...]
> > > > When an NMI happens in non-root, the NMI is acknowledged by the CPU prior to
> > > > performing VM-Exit.  In regular VMX, NMIs are blocked after such VM-Exits.  With
> > > > TDX, that blocking happens for SEAM root, but the SEAMRET back to VMX root will
> > > > load interruptibility from the SEAMCALL VMCS, and I don't see any code in the
> > > > TDX-Module that propagates that blocking to SEAMCALL VMCS.
> > > I see, thanks for the explanation!
> > > 
> > > > Hmm, actually, this means that TDX has a causality inversion, which may become
> > > > visible with FRED's NMI source reporting.  E.g. NMI X arrives in SEAM non-root
> > > > and triggers a VM-Exit.  NMI X+1 becomes pending while SEAM root is active.
> > > > TDX-Module SEAMRETs to VMX root, NMIs are unblocked, and so NMI X+1 is delivered
> > > > and handled before NMI X.
> > > This example can also cause an issue without FRED.
> > > 1. NMI X arrives in SEAM non-root and triggers a VM-Exit.
> > > 2. NMI X+1 becomes pending while SEAM root is active.
> > > 3. TDX-Module SEAMRETs to VMX root, NMIs are unblocked.
> > > 4. NMI X+1 is delivered and handled before NMI X.
> > >      (NMI handler could handle all NMI source events, including the source
> > >       triggered NMI X)
> > > 5. KVM calls exc_nmi() to handle the VM Exit caused by NMI X
> > > In step 5, because the source event caused NMI X has been handled, and NMI X
> > > will not be detected as a second half of back-to-back NMIs, according to
> > > Linux NMI handler, it will be considered as an unknown NMI.
> > I don't think KVM should call exc_nmi() anymore if NMI is unblocked upon
> > SEAMRET.
> 
> IIUC, KVM has to, because the NMI triggered the VM-Exit can't trigger the
> NMI handler to be invoked automatically even if NMI is unblocked upon SEAMRET.

Ah I missed this.  You mean unblocking NMIs won't invoke the NMI handler via
IDT vector 2.  Then I see why NMI X+1 is handled before NMI X.  Thanks.