[RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by Lai Jiangshan 1 year, 9 months ago
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

This RFC series proposes a new virtualization framework built upon the
KVM hypervisor that does not require hardware-assisted virtualization.
PVM (Pagetable-based Virtual Machine) is implemented as a new vendor for
KVM x86 and is compatible with the existing KVM virtualization software
stack, including Kata Containers, a secure container technology for
cloud-native environments.

The work also led to a paper being accepted at SOSP 2023 [sosp-2023-acm]
[sosp-2023-pdf], and Lai delivered a presentation at the symposium in
Germany in October 2023 [sosp-2023-slides]:

	PVM: Efficient Shadow Paging for Deploying Secure Containers in
	Cloud-native Environment

PVM has been adopted by Alibaba Cloud and Ant Group in production to
host tens of thousands of secure containers daily, and it has also been
adopted by the Openanolis community.

Motivation
==========
A team in Ant Group, co-creator of Kata Containers along with Intel,
deploys VM-based containers in public cloud VMs to satisfy dynamic
resource requests and various workload-isolation needs. However, for
safety, nested virtualization is disabled in the L0 hypervisor, so we
cannot use KVM directly. Additionally, the current nested architecture
involves complex and expensive transitions between the L0 and L1
hypervisors.

So the overarching goals of PVM are to completely decouple secure
container hosting from the host hypervisor and hardware virtualization
support to:
  1) enable nested virtualization within any IaaS cloud without affecting
  the security, flexibility, and complexity of the cloud platform;
  2) avoid costly exits to the host hypervisor and devise efficient
  world-switching mechanisms.

Why PVM
=======
The PVM hypervisor has the following features:

- Compatible with KVM ecosystems.

- No requirement for hardware assistance.  Many cloud providers don't
  enable nested virtualization.  PVM can also enable KVM inside TDX/SEV
  guests.

- Flexible. Businesses with secure containers can easily expand in the
  cloud when demand surges, instead of waiting to acquire bare metal.
  Cloud vendors often offer lower pricing for spot instances or
  preemptible VMs.

- Helpful for kernel CI, with fast [re-]booting of PVM guest kernels
  nested in cheaper VMs.

- Enable lightweight container kernels.

Design
======
The design details can be found in our paper published at SOSP 2023.

The framework contains 3 main objects:

"Switcher" - The code and data that handling the VM enter and VM exit.

"PVM hypervisor" - A new vendor implementation for KVM x86, it uses
                   existed software emulation in KVM for virtualization,
                   e.g., shadow paging, APIC emulation, x86 instruction
                   emulator.

"PVM paravirtual guest" - A PIE linux kernel runs in hardware CPL3, and
                          use existed PVOPS to implement optimization.


                shadowed-user-pagetable shadowed-kernel-pagetable
                            +----------|-----------+
                            |   user   |  kernel   |
    h_ring3                 |  (umod)  |  (smod)   |
                            +---+------|--------+--+
                        syscall |    ^      ^   | hypercall/
            interrupt/exception |    |      |   | interrupt/exception
--------------------------------|----|------|---|------------------------------------
                                |    |sysret|   |
    h_ring0                     v    | /iret|   v
                              +------+------+----+
                              |     switcher     |
                              +---------+--------+
                            vm entry ^  | vm exit
                      (function call)|  v (function return)
      +..............................+..........................................+
      .                                                                         .
      .     +---------------+                      +--------------+             .
      .     |    kvm.ko     |                      |  kvm-pvm.ko  |             .
      .     +---------------+                      +--------------+             .
      .                         Virtualization                                  .
      .   memory virtualization                  CPU Virtualization             .
      +.........................................................................+
                                 PVM hypervisor


1. Switcher: For simplicity, we reuse the host entry code to handle VM
             enter and VM exit. A flag is introduced in the entry code
             to mark that the guest world is active or being switched.
             Therefore, the guest looks almost like a normal userspace
             process to the host (see the first sketch after this
             list).

2. Host MMU: The switcher needs to be accessible to the guest, similar
             to the CPU entry area mapped for userspace with KPTI.
             Therefore, for simplicity, a range of PGDs is reserved for
             the guest, and the guest kernel is only allowed to run in
             this range. During root SP allocation, the host PGDs of
             the switcher are cloned into the guest SPT (see the second
             sketch after this list).

3. Event delivery: A new event delivery mechanism, similar to FRED, is
                   used instead of IDT-based event delivery.
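
To illustrate item 1, here is a minimal, self-contained C sketch of the
switcher model (illustrative only; names such as pvm_vcpu_run() and
switcher_enter_guest() are hypothetical, not the symbols used in the
patchset). The point is that VM enter is an ordinary function call and
VM exit is the matching return, with a flag marking the guest world:

#include <stdbool.h>
#include <stdio.h>

struct pvm_vcpu {
        unsigned long guest_rip;
        bool in_guest;          /* checked by the shared host entry code */
        int exit_reason;
};

/* "VM enter" is an ordinary call; "VM exit" is simply returning from it. */
static int switcher_enter_guest(struct pvm_vcpu *vcpu)
{
        vcpu->in_guest = true;
        /*
         * The real switcher would switch CR3/stack here and run the
         * guest in hardware ring3 until an event (syscall, exception,
         * interrupt, hypercall) brings control back to the host entry
         * code.
         */
        vcpu->in_guest = false;
        return vcpu->exit_reason;
}

static void pvm_vcpu_run(struct pvm_vcpu *vcpu)
{
        int reason = switcher_enter_guest(vcpu);

        printf("guest exited, reason %d\n", reason);
}

int main(void)
{
        struct pvm_vcpu vcpu = { .guest_rip = 0, .exit_reason = 1 };

        pvm_vcpu_run(&vcpu);
        return 0;
}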
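
And a rough sketch of the host MMU idea in item 2: when a guest shadow
root is allocated, only the host PGD entries that back the switcher are
copied in, while the guest kernel itself is confined to a reserved PGD
range. The constants and the pgd_t type below are simplified
placeholders, not the real kernel definitions:

#include <string.h>

typedef unsigned long pgd_t;
#define PTRS_PER_PGD            512

/* Hypothetical PGD slots reserved for the PVM guest kernel. */
#define PVM_GUEST_PGD_START     128
#define PVM_GUEST_PGD_END       256

/* Hypothetical host PGD slots that map the switcher. */
#define SWITCHER_PGD_START      384
#define SWITCHER_PGD_END        386

void pvm_init_guest_root_spt(pgd_t *guest_root, const pgd_t *host_pgd)
{
        int i;

        memset(guest_root, 0, PTRS_PER_PGD * sizeof(pgd_t));

        /*
         * Clone only the switcher's host entries into the shadow root;
         * everything in [PVM_GUEST_PGD_START, PVM_GUEST_PGD_END) is
         * filled in later by shadow paging as the guest runs.
         */
        for (i = SWITCHER_PGD_START; i < SWITCHER_PGD_END; i++)
                guest_root[i] = host_pgd[i];
}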

Design Decisions
================
In designing PVM, many decisions have been made and are explained in the
patches. "Integral entry", "Exclusive address space separation and PIE
guest", and "Simple spec design" are among the important decisions,
besides "KVM ecosystems" and "Ring3+Pagetable for privilege separation".

Integral entry
--------------
The PVM switcher is integrated into the host kernel's entry code,
providing the following advantages:

- Full control: In XENPV/Lguest, the host Linux (dom0) entry code is
  subordinate to the hypervisor/switcher, and the host Linux kernel
  loses control over the entry code. This can cause inconvenience if
  there is a need to update something when there is a bug in the
  switcher or hardware.  Integral entry gives the control back to the
  host kernel.

- Zero overhead incurred: The integrated entry code doesn't add any
  overhead to the host Linux entry path, thanks to the discrete design
  that keeps the PVM code in the switcher, where the PVM path is
  bypassed for host events.  In XENPV/Lguest, by contrast, host events
  must be handled by the hypervisor/switcher before being processed.

- Integral design allows all aspects of the entry and switcher to be
  considered together.

This RFC patchset doesn't include the complete design for integral
entry. It requires fixing the issue with IST [atomic-ist-entry], and it
would benefit from converting some ASM code to C code [asm-to-c] (the
link provided is not the final version, and some partial patchsets have
been sent separately since then). New versions of the patches for
converting the ASM code and fixing the IST problem will be updated and
sent separately later.

Without the complete integral entry code, this patchset still has
unresolved issues related to IST, KPTI, and so on.

Exclusive address space separation and PIE guest
------------------------------------------------
In the higher half of the address space (where the most significant
bits of the addresses are 1s), the address ranges that a PVM guest is
allowed to use are exclusive of the host kernel's (a small range-check
sketch follows the list below).

- The exclusivity of the address ranges makes the integral entry design
  possible because the switcher needs to be mapped for all guests.

- The exclusivity of the address ranges allows the host kernel to keep
  utilizing global pages and save TLB entries. (XENPV doesn't allow
  this.)

- With exclusivity, the existing shadow page table code can be reused
  with very few changes. The shadow page table contains both the guest
  portions and the host portions.

- Exclusivity necessitates the use of a Position-Independent Executable
  (PIE) guest since the host kernel occupies the top 2GB of the address
  space.

- With a PIE kernel, the PVM guest kernel running in hardware ring3 can
  be located in the lower half of the address space in the future, when
  Linear Address Space Separation (LASS) is enabled.
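
As a small illustration of the exclusivity above, the hypervisor only
needs a range check to decide whether a higher-half address may be used
by the guest kernel. The bounds below are invented for the example and
do not match the actual layout in the patches:

#include <stdbool.h>

/* Hypothetical bounds of the address range reserved for PVM guests. */
#define PVM_GUEST_KERNEL_START  0xffffa00000000000UL
#define PVM_GUEST_KERNEL_END    0xffffc00000000000UL

bool pvm_guest_address_allowed(unsigned long addr)
{
        /* Lower-half (user) addresses are handled as usual ... */
        if (addr < (1UL << 47))
                return true;

        /*
         * ... higher-half addresses are allowed only inside the range
         * reserved for the guest, which the host kernel never uses.
         */
        return addr >= PVM_GUEST_KERNEL_START && addr < PVM_GUEST_KERNEL_END;
}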

This RFC patchset doesn't contain the PIE patches, which are not
specific to PVM; our effort to make the Linux kernel PIE continues.

Simple spec design
------------------
Designing a new paravirtualized guest is not an ideal opportunity to
redesign the specification. However, in order to avoid the known flaws
of x86_64 and enable the paravirtualized ABI on hardware ring3, the x86
PVM specification has some moderate differences from the x86
specification.

- Remove/Ignore most indirect tables and 32-bit supervisor mode.

- Simplified event delivery and the removal of IST.

- Add some software synthetic instructions.

See patch 1 for more details; it contains the whole x86 PVM
specification.

Status
======
Currently, some features are not supported or are disabled in PVM.

- SMAP/SMEP can't be enabled directly; however, we can use PKU to
  emulate SMAP and NX to emulate SMEP (see the sketch after this list).

- 5-level paging is not fully implemented.

- Speculative control for guest is disabled.

- LDT is not supported.

- PMU virtualization is not included in this patchset. We have
  implemented it by reusing the current code in pmu_intel.c and
  pmu_amd.c.
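
Here is a rough sketch of the PKU-based SMAP emulation mentioned in the
first item above. The pkey number and function names are made up for
illustration: guest user pages are tagged with a dedicated protection
key, and the hypervisor flips its access-disable bit in PKRU when the
vCPU switches between guest supervisor mode (smod) and guest user mode
(umod):

/*
 * PKRU has 2 bits per key: bit 2*k is Access-Disable (AD) and
 * bit 2*k+1 is Write-Disable (WD).
 */
#define PVM_USER_PKEY   1       /* hypothetical pkey for guest user pages */

static inline unsigned int pkru_deny(unsigned int pkru, int pkey)
{
        return pkru | (1u << (2 * pkey));       /* set AD */
}

static inline unsigned int pkru_allow(unsigned int pkru, int pkey)
{
        return pkru & ~(3u << (2 * pkey));      /* clear AD and WD */
}

/*
 * Entering guest supervisor mode (smod): user pages become
 * inaccessible, which is roughly what SMAP would enforce.
 */
unsigned int pvm_smod_pkru(unsigned int pkru)
{
        return pkru_deny(pkru, PVM_USER_PKEY);
}

/* Returning to guest user mode (umod): user pages are accessible again. */
unsigned int pvm_umod_pkru(unsigned int pkru)
{
        return pkru_allow(pkru, PVM_USER_PKEY);
}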

PVM has been adopted in Alibaba Cloud and Ant Group for hosting secure
containers, providing a more performant and cost-effective option for
cloud users.

Performance drawback
====================
The most significant drawback of PVM is shadow paging. Shadow paging
results in very bad performance when guest applications frequently
modify page tables, for example with excessive process forking.

However, many long-running cloud services, such as Java workloads,
modify page tables less frequently and can perform very well with
shadow paging. In some cases, they can even outperform EPT since they
avoid consuming EPT TLB entries. Furthermore, PVM can utilize host
PCIDs for guest processes, providing a finer-grained approach compared
to VPID/ASID.

To mitigate the performance problem, we designed several optimizations
for the shadow MMU (not included in this patchset) and are also
planning to build a shadow EPT in L0 for L2 PVM guests.

See the paper for more optimizations and the performance details.

Future plans
============
Some optimizations are not covered in this series yet.

- Parallel Page fault for SPT and Paravirtualized MMU Optimization.

- Post interrupt emulation.

- Relocate guest kernel into userspace address range.

- More flexible container solutions based on it.

Patches layout
==============
[01-02]: PVM ABI documentation and header
[03-04]: Switcher implementation
[05-49]: PVM hypervisor implementation
       - 05-13: KVM module involved changes
       - 14-49: PVM module implementation
                patch 15: Add a vmalloc helper to reserve a kernel
                          address range for guest.
                patch 19: Export 32-bit ignore syscall for PVM.

[50-73]: PVM guest implementation
       - 50-52: Pack relocation information into vmlinux and allow
                it to do relocation.
       - 53: Introduce Kconfig and cpu features.
       - 54-59: Relocate guest kernel to the allowed range.
       - 60-65: Event handling and hypercall.
       - 66-69: PVOPS implementation.
       - 70-73: Disable some features and syscalls.

Code base
=========
The code base is the branch [linux-pie], which is commit ceb6a6f023fd
("Linux 6.7-rc6") plus the PIE series [pie-patchset].

Complete code can be found at [linux-pvm].

Testing
=======
Testing with Kata Containers can be found at [pvm-get-started].

We also provide a VM image based on the `Official Ubuntu Cloud Image`,
which has containerd, Kata, the PVM hypervisor/guest, and the needed
configurations prepared, and which you can use to test Kata Containers
with PVM directly. [pvm-get-started-nested-in-vm]



[sosp-2023-acm]: https://dl.acm.org/doi/10.1145/3600006.3613158
[sosp-2023-pdf]: https://github.com/virt-pvm/misc/blob/main/sosp2023-pvm-paper.pdf
[sosp-2023-slides]: https://github.com/virt-pvm/misc/blob/main/sosp2023-pvm-slides.pptx
[asm-to-c]: https://lore.kernel.org/lkml/20211126101209.8613-1-jiangshanlai@gmail.com/
[atomic-ist-entry]: https://lore.kernel.org/lkml/20230403140605.540512-1-jiangshanlai@gmail.com/
[pie-patchset]: https://lore.kernel.org/lkml/cover.1682673542.git.houwenlong.hwl@antgroup.com
[linux-pie]: https://github.com/virt-pvm/linux/tree/pie
[linux-pvm]: https://github.com/virt-pvm/linux/tree/pvm
[pvm-get-started]: https://github.com/virt-pvm/misc/blob/main/pvm-get-started-with-kata.md
[pvm-get-started-nested-in-vm]: https://github.com/virt-pvm/misc/blob/main/pvm-get-started-with-kata.md#verify-kata-containers-with-pvm-using-vm-image



Hou Wenlong (22):
  KVM: x86: Allow hypercall handling to not skip the instruction
  KVM: x86: Implement gpc refresh for guest usage
  KVM: x86/emulator: Reinject #GP if instruction emulation failed for
    PVM
  mm/vmalloc: Add a helper to reserve a contiguous and aligned kernel
    virtual area
  x86/entry: Export 32-bit ignore syscall entry and __ia32_enabled
    variable
  KVM: x86/PVM: Support for kvm_exit() tracepoint
  KVM: x86/PVM: Support for CPUID faulting
  x86/tools/relocs: Cleanup cmdline options
  x86/tools/relocs: Append relocations into input file
  x86/boot: Allow to do relocation for uncompressed kernel
  x86/pvm: Relocate kernel image to specific virtual address range
  x86/pvm: Relocate kernel image early in PVH entry
  x86/pvm: Make cpu entry area and vmalloc area variable
  x86/pvm: Relocate kernel address space layout
  x86/pvm: Allow to install a system interrupt handler
  x86/pvm: Add early kernel event entry and dispatch code
  x86/pvm: Enable PVM event delivery
  x86/pvm: Use new cpu feature to describe XENPV and PVM
  x86/pvm: Don't use SWAPGS for gsbase read/write
  x86/pvm: Adapt pushf/popf in this_cpu_cmpxchg16b_emu()
  x86/pvm: Use RDTSCP as default in vdso_read_cpunode()
  x86/pvm: Disable some unsupported syscalls and features

Lai Jiangshan (51):
  KVM: Documentation: Add the specification for PVM
  x86/ABI/PVM: Add PVM-specific ABI header file
  x86/entry: Implement switcher for PVM VM enter/exit
  x86/entry: Implement direct switching for the switcher
  KVM: x86: Set 'vcpu->arch.exception.injected' as true before vendor
    callback
  KVM: x86: Move VMX interrupt/nmi handling into kvm.ko
  KVM: x86/mmu: Adapt shadow MMU for PVM
  KVM: x86: Add PVM virtual MSRs into emulated_msrs_all[]
  KVM: x86: Introduce vendor feature to expose vendor-specific CPUID
  KVM: x86: Add NR_VCPU_SREG in SREG enum
  KVM: x86: Create stubs for PVM module as a new vendor
  KVM: x86/PVM: Implement host mmu initialization
  KVM: x86/PVM: Implement module initialization related callbacks
  KVM: x86/PVM: Implement VM/VCPU initialization related callbacks
  KVM: x86/PVM: Implement vcpu_load()/vcpu_put() related callbacks
  KVM: x86/PVM: Implement vcpu_run() callbacks
  KVM: x86/PVM: Handle some VM exits before enable interrupts
  KVM: x86/PVM: Handle event handling related MSR read/write operation
  KVM: x86/PVM: Introduce PVM mode switching
  KVM: x86/PVM: Implement APIC emulation related callbacks
  KVM: x86/PVM: Implement event delivery flags related callbacks
  KVM: x86/PVM: Implement event injection related callbacks
  KVM: x86/PVM: Handle syscall from user mode
  KVM: x86/PVM: Implement allowed range checking for #PF
  KVM: x86/PVM: Implement segment related callbacks
  KVM: x86/PVM: Implement instruction emulation for #UD and #GP
  KVM: x86/PVM: Enable guest debugging functions
  KVM: x86/PVM: Handle VM-exit due to hardware exceptions
  KVM: x86/PVM: Handle ERETU/ERETS synthetic instruction
  KVM: x86/PVM: Handle PVM_SYNTHETIC_CPUID synthetic instruction
  KVM: x86/PVM: Handle KVM hypercall
  KVM: x86/PVM: Use host PCID to reduce guest TLB flushing
  KVM: x86/PVM: Handle hypercalls for privilege instruction emulation
  KVM: x86/PVM: Handle hypercall for CR3 switching
  KVM: x86/PVM: Handle hypercall for loading GS selector
  KVM: x86/PVM: Allow to load guest TLS in host GDT
  KVM: x86/PVM: Enable direct switching
  KVM: x86/PVM: Implement TSC related callbacks
  KVM: x86/PVM: Add dummy PMU related callbacks
  KVM: x86/PVM: Handle the left supported MSRs in msrs_to_save_base[]
  KVM: x86/PVM: Implement system registers setting callbacks
  KVM: x86/PVM: Implement emulation for non-PVM mode
  x86/pvm: Add Kconfig option and the CPU feature bit for PVM guest
  x86/pvm: Detect PVM hypervisor support
  x86/pti: Force enabling KPTI for PVM guest
  x86/pvm: Add event entry/exit and dispatch code
  x86/pvm: Add hypercall support
  x86/kvm: Patch KVM hypercall as PVM hypercall
  x86/pvm: Implement cpu related PVOPS
  x86/pvm: Implement irq related PVOPS
  x86/pvm: Implement mmu related PVOPS

 Documentation/virt/kvm/x86/pvm-spec.rst  |  989 +++++++
 arch/x86/Kconfig                         |   32 +
 arch/x86/Makefile.postlink               |    9 +-
 arch/x86/entry/Makefile                  |    4 +
 arch/x86/entry/calling.h                 |   47 +-
 arch/x86/entry/common.c                  |    1 +
 arch/x86/entry/entry_64.S                |   75 +-
 arch/x86/entry/entry_64_pvm.S            |  189 ++
 arch/x86/entry/entry_64_switcher.S       |  270 ++
 arch/x86/entry/vsyscall/vsyscall_64.c    |    4 +
 arch/x86/include/asm/alternative.h       |   14 +
 arch/x86/include/asm/cpufeatures.h       |    2 +
 arch/x86/include/asm/disabled-features.h |    8 +-
 arch/x86/include/asm/idtentry.h          |   12 +-
 arch/x86/include/asm/init.h              |    5 +
 arch/x86/include/asm/kvm-x86-ops.h       |    2 +
 arch/x86/include/asm/kvm_host.h          |   30 +-
 arch/x86/include/asm/kvm_para.h          |    7 +
 arch/x86/include/asm/page_64.h           |    3 +
 arch/x86/include/asm/paravirt.h          |   14 +-
 arch/x86/include/asm/pgtable_64_types.h  |   14 +-
 arch/x86/include/asm/processor.h         |    5 +
 arch/x86/include/asm/ptrace.h            |    5 +
 arch/x86/include/asm/pvm_para.h          |  103 +
 arch/x86/include/asm/segment.h           |   14 +-
 arch/x86/include/asm/switcher.h          |  119 +
 arch/x86/include/uapi/asm/kvm_para.h     |    8 +-
 arch/x86/include/uapi/asm/pvm_para.h     |  131 +
 arch/x86/kernel/Makefile                 |    1 +
 arch/x86/kernel/asm-offsets_64.c         |   31 +
 arch/x86/kernel/cpu/common.c             |   11 +
 arch/x86/kernel/head64.c                 |   10 +
 arch/x86/kernel/head64_identity.c        |  108 +-
 arch/x86/kernel/head_64.S                |   34 +
 arch/x86/kernel/idt.c                    |    2 +
 arch/x86/kernel/kvm.c                    |    2 +
 arch/x86/kernel/ldt.c                    |    3 +
 arch/x86/kernel/nmi.c                    |    8 +-
 arch/x86/kernel/process_64.c             |   10 +-
 arch/x86/kernel/pvm.c                    |  579 ++++
 arch/x86/kernel/traps.c                  |    3 +
 arch/x86/kernel/vmlinux.lds.S            |   18 +
 arch/x86/kvm/Kconfig                     |    9 +
 arch/x86/kvm/Makefile                    |    5 +-
 arch/x86/kvm/cpuid.c                     |   26 +-
 arch/x86/kvm/cpuid.h                     |    3 +
 arch/x86/kvm/host_entry.S                |   50 +
 arch/x86/kvm/mmu/mmu.c                   |   36 +-
 arch/x86/kvm/mmu/paging_tmpl.h           |    3 +
 arch/x86/kvm/mmu/spte.c                  |    4 +
 arch/x86/kvm/pvm/host_mmu.c              |  119 +
 arch/x86/kvm/pvm/pvm.c                   | 3257 ++++++++++++++++++++++
 arch/x86/kvm/pvm/pvm.h                   |  169 ++
 arch/x86/kvm/svm/svm.c                   |    4 +
 arch/x86/kvm/trace.h                     |    7 +-
 arch/x86/kvm/vmx/vmenter.S               |   43 -
 arch/x86/kvm/vmx/vmx.c                   |   18 +-
 arch/x86/kvm/x86.c                       |   33 +-
 arch/x86/kvm/x86.h                       |   18 +
 arch/x86/mm/dump_pagetables.c            |    3 +-
 arch/x86/mm/kaslr.c                      |    8 +-
 arch/x86/mm/pti.c                        |    7 +
 arch/x86/platform/pvh/enlighten.c        |   22 +
 arch/x86/platform/pvh/head.S             |    4 +
 arch/x86/tools/relocs.c                  |   88 +-
 arch/x86/tools/relocs.h                  |   20 +-
 arch/x86/tools/relocs_common.c           |   38 +-
 arch/x86/xen/enlighten_pv.c              |    1 +
 include/linux/kvm_host.h                 |   10 +
 include/linux/vmalloc.h                  |    2 +
 include/uapi/Kbuild                      |    4 +
 mm/vmalloc.c                             |   10 +
 virt/kvm/pfncache.c                      |    2 +-
 73 files changed, 6793 insertions(+), 166 deletions(-)
 create mode 100644 Documentation/virt/kvm/x86/pvm-spec.rst
 create mode 100644 arch/x86/entry/entry_64_pvm.S
 create mode 100644 arch/x86/entry/entry_64_switcher.S
 create mode 100644 arch/x86/include/asm/pvm_para.h
 create mode 100644 arch/x86/include/asm/switcher.h
 create mode 100644 arch/x86/include/uapi/asm/pvm_para.h
 create mode 100644 arch/x86/kernel/pvm.c
 create mode 100644 arch/x86/kvm/host_entry.S
 create mode 100644 arch/x86/kvm/pvm/host_mmu.c
 create mode 100644 arch/x86/kvm/pvm/pvm.c
 create mode 100644 arch/x86/kvm/pvm/pvm.h

-- 
2.19.1.6.gb485710b
Re: [RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by Like Xu 1 year, 9 months ago
Hi Jiangshan,

On 26/2/2024 10:35 pm, Lai Jiangshan wrote:
> Performance drawback
> ====================
> The most significant drawback of PVM is shadowpaging. Shadowpaging
> results in very bad performance when guest applications frequently
> modify pagetable, including excessive processes forking.

Some numbers are needed here to show how bad this RFC virt-pvm version
without SPT optimization is in terms of performance. Compared to an
L2-VM based on nested EPT-on-EPT, the following benchmarks show a
significant performance loss in a PVM-based L2-VM (per
pvm-get-started-with-kata.md):

- byte/UnixBench-shell1: -67%
- pts/sysbench-1.1.0 [Test: RAM / Memory]: -55%
- Mmap Latency [lmbench]: -92%
- Context switching [lmbench]: -83%
- syscall_get_pid_latency: -77%

Not sure if these performance conclusions are reproducible on your VM,
but they reveal the concern of potential users that there is not a
strong enough incentive to offload the burden of maintaining kvm-pvm.ko
onto the upstream community until a publicly available SPT
optimization, based on your or any state-of-the-art MMU PV-ops
implementation, is brought to the ring.

There are other kernel technologies used by PVM that have use cases
outside of PVM (e.g. unikernels/kernel-level sandboxes), and it seems to
me that there are opportunities for all of them to be absorbed upstream
individually and sequentially. But getting the KVM community to take
kvm-pvm.ko seriously may depend more on how much room there is for
performance optimization based on your "Parallel Page fault for SPT
and Paravirtualized MMU Optimization" implementation, and on the
optimization headroom developers can squeeze out of the legacy
EPT-on-EPT solution.

> 
> However, many long-running cloud services, such as Java, modify
> pagetables less frequently and can perform very well with shadowpaging.
> In some cases, they can even outperform EPT since they can avoid EPT TLB
> entries. Furthermore, PVM can utilize host PCIDs for guest processes,
> providing a finer-grained approach compared to VPID/ASID.
> 
> To mitigate the performance problem, we designed several optimizations
> for the shadow MMU (not included in the patchset) and also planning to
> build a shadow EPT in L0 for L2 PVM guests.
> 
> See the paper for more optimizations and the performance details.
> 
> Future plans
> ============
> Some optimizations are not covered in this series now.
> 
> - Parallel Page fault for SPT and Paravirtualized MMU Optimization.
Re: [RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by Paolo Bonzini 1 year, 9 months ago
On Mon, Feb 26, 2024 at 3:34 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> - Full control: In XENPV/Lguest, the host Linux (dom0) entry code is
>   subordinate to the hypervisor/switcher, and the host Linux kernel
>   loses control over the entry code. This can cause inconvenience if
>   there is a need to update something when there is a bug in the
>   switcher or hardware.  Integral entry gives the control back to the
>   host kernel.
>
> - Zero overhead incurred: The integrated entry code doesn't cause any
>   overhead in host Linux entry path, thanks to the discreet design with
>   PVM code in the switcher, where the PVM path is bypassed on host events.
>   While in XENPV/Lguest, host events must be handled by the
>   hypervisor/switcher before being processed.

Lguest... Now that's a name I haven't heard in a long time. :)  To be
honest, it's a bit weird to see yet another PV hypervisor. I think
what really killed Xen PV was the impossibility to protect from
various speculation side channel attacks, and I would like to
understand how PVM fares here.

You obviously did a great job in implementing this within the KVM
framework; the changes in arch/x86/ are impressively small. On the
other hand this means it's also not really my call to decide whether
this is suitable for merging upstream. The bulk of the changes are
really in arch/x86/kernel/ and arch/x86/entry/, and those are well
outside my maintenance area.

Paolo
Re: [RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by Lai Jiangshan 1 year, 9 months ago
Hello, Paolo

On Mon, Feb 26, 2024 at 10:49 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Mon, Feb 26, 2024 at 3:34 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > - Full control: In XENPV/Lguest, the host Linux (dom0) entry code is
> >   subordinate to the hypervisor/switcher, and the host Linux kernel
> >   loses control over the entry code. This can cause inconvenience if
> >   there is a need to update something when there is a bug in the
> >   switcher or hardware.  Integral entry gives the control back to the
> >   host kernel.
> >
> > - Zero overhead incurred: The integrated entry code doesn't cause any
> >   overhead in host Linux entry path, thanks to the discreet design with
> >   PVM code in the switcher, where the PVM path is bypassed on host events.
> >   While in XENPV/Lguest, host events must be handled by the
> >   hypervisor/switcher before being processed.
>
> Lguest... Now that's a name I haven't heard in a long time. :)  To be
> honest, it's a bit weird to see yet another PV hypervisor. I think
> what really killed Xen PV was the impossibility to protect from
> various speculation side channel attacks, and I would like to
> understand how PVM fares here.

How does the host kernel protect itself from a guest's speculation side
channel attacks?

PVM is primarily designed for secure containers like Kata containers,
where safety and security are of utmost importance.

Guests run in hardware ring3 and are treated the same as normal user
applications from the perspective of the host kernel's protections and
mitigations. The code applies all of the current protections and
mitigations for kernel/user interactions to host/guest interactions,
with extra protection from pagetable isolation and with the
protections/mitigations usually used for host/VT-x-or-AMD-V guests
(with VM enter/exit code similar to that in vmx/ and svm/). All of this
is fairly easily achieved by the "integral entry" design, and the
"distinct separation of the address spaces" design also helps with
protection.

How does the guest kernel protect itself from guest users' speculation
side channel attacks?

The code also tries its best to provide all of the current native
kernel/user protections and mitigations for the virtualized kernel/user
boundary. Obviously, the PVM virtualized kernel operates in hardware
ring3, so you can't expect every method to be effective. Since the
Linux kernel can provide protections when switching between threads of
different user processes, PVM can potentially offer similar protections
between guest kernel/user through the PVM hypervisor's support.

I'm not familiar with XENPV's handling of, and solutions for,
speculation side channel attacks (including the claimed impossibility),
so I cannot provide additional insights or assurances in this context.

PVM is not designed as a general-purpose virtualization solution. The
primary objectives are secure containers and Linux kernel testing. PVM
intends to allow the universal deployment of Kata Containers inside
cloud VMs leased from any provider around the world.

For Kata Containers, the protection between host/guest is much more
important, and every container serves only a single tenant, in which
the guest kernel is not part of the TCB of the external container
services. This means the protection requirements between guest
kernel/user are more flexible and can be customized.


>
> You obviously did a great job in implementing this within the KVM
> framework; the changes in arch/x86/ are impressively small.

Thanks for your appreciation.

> On the
> other hand this means it's also not really my call to decide whether
> this is suitable for merging upstream. The bulk of the changes are
> really in arch/x86/kernel/ and arch/x86/entry/, and those are well
> outside my maintenance area.
>

Thanks
Lai
Re: [RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by Sean Christopherson 1 year, 9 months ago
On Mon, Feb 26, 2024, Paolo Bonzini wrote:
> On Mon, Feb 26, 2024 at 3:34 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > - Full control: In XENPV/Lguest, the host Linux (dom0) entry code is
> >   subordinate to the hypervisor/switcher, and the host Linux kernel
> >   loses control over the entry code. This can cause inconvenience if
> >   there is a need to update something when there is a bug in the
> >   switcher or hardware.  Integral entry gives the control back to the
> >   host kernel.
> >
> > - Zero overhead incurred: The integrated entry code doesn't cause any
> >   overhead in host Linux entry path, thanks to the discreet design with
> >   PVM code in the switcher, where the PVM path is bypassed on host events.
> >   While in XENPV/Lguest, host events must be handled by the
> >   hypervisor/switcher before being processed.
> 
> Lguest... Now that's a name I haven't heard in a long time. :)  To be
> honest, it's a bit weird to see yet another PV hypervisor. I think
> what really killed Xen PV was the impossibility to protect from
> various speculation side channel attacks, and I would like to
> understand how PVM fares here.
> 
> You obviously did a great job in implementing this within the KVM
> framework; the changes in arch/x86/ are impressively small. On the
> other hand this means it's also not really my call to decide whether
> this is suitable for merging upstream. The bulk of the changes are
> really in arch/x86/kernel/ and arch/x86/entry/, and those are well
> outside my maintenance area.

The bulk of changes in _this_ patchset are outside of arch/x86/kvm, but there are
more changes on the horizon:

 : To mitigate the performance problem, we designed several optimizations
 : for the shadow MMU (not included in the patchset) and also planning to
 : build a shadow EPT in L0 for L2 PVM guests.

 : - Parallel Page fault for SPT and Paravirtualized MMU Optimization.

And even absent _new_ shadow paging functionality, merging PVM would effectively
shatter any hopes of ever removing KVM's existing, complex shadow paging code.

Specifically, unsync 4KiB PTE support in KVM provides almost no benefit for nested
TDP.  So if we can ever drop support for legacy shadow paging, which is a big if,
but not completely impossible, then we could greatly simplify KVM's shadow MMU.

Which is a good segue into my main question: was there any one thing that was
_the_ motivating factor for taking on the cost+complexity of shadow paging?  And
as alluded to by Paolo, taking on the downsides of reduced isolation?

It doesn't seem like avoiding L0 changes was the driving decision, since IIUC
you have plans to make changes there as well.

 : To mitigate the performance problem, we designed several optimizations
 : for the shadow MMU (not included in the patchset) and also planning to
 : build a shadow EPT in L0 for L2 PVM guests.

Performance I can kinda sorta understand, but my gut feeling is that the problems
with nested virtualization are solvable by adding nested paravirtualization between
L0<=>L1, with likely lower overall cost+complexity than paravirtualizing L1<=>L2.

The bulk of the pain with nested hardware virtualization lies in having to emulate
VMX/SVM, and shadow L1's TDP page tables.  Hyper-V's eVMCS takes some of the sting
off nVMX in particular, but eVMCS is still hobbled by its desire to be almost
drop-in compatible with VMX.

If we're willing to define a fully PV interface between L0 and L1 hypervisors, I
suspect we provide performance far, far better than nVMX/nSVM.  E.g. if L0 provides
a hypercall to map an L2=>L1 GPA, then L0 doesn't need to shadow L1 TDP, and L1
doesn't even need to maintain hardware-defined page tables, it can use whatever
software-defined data structure best fits it needs.
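
A purely illustrative sketch of such an interface (the hypercall
number, names, and flag bits here are invented, not a proposal):

/* Hypothetical L0 hypercall: map an L2 GPA to an L1 GPA directly. */
#define HC_MAP_L2_GPA           0x100
#define MAP_FLAG_WRITE          (1UL << 0)
#define MAP_FLAG_EXEC           (1UL << 1)

extern long l0_hypercall3(unsigned long nr, unsigned long a0,
                          unsigned long a1, unsigned long a2);

/*
 * L1 tracks L2 memory in whatever software structure it prefers and
 * simply forwards L2 stage-2 faults to L0 via the hypercall, so L0
 * never has to shadow L1-maintained page tables.
 */
long l1_map_l2_page(unsigned long l2_gpa, unsigned long l1_gpa,
                    unsigned long flags)
{
        return l0_hypercall3(HC_MAP_L2_GPA, l2_gpa, l1_gpa, flags);
}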

And if we limit support to 64-bit L2 kernels and drop support for unnecessary cruft,
the L1<=>L2 entry/exit paths could be drastically simplified and streamlined.  And
it should be very doable to concoct an ABI between L0 and L2 that allows L0 to
directly emulate "hot" instructions from L2, e.g. CPUID, common MSRs, etc.  I/O
would likely be solvable too, e.g. maybe with a mediated device type solution that
allows L0 to handle the data path for L2?

The one thing that I don't see line of sight to supporting is taking L0 out of the
TCB, i.e. running L2 VMs inside TDX/SNP guests.  But for me at least, that alone
isn't sufficient justification for adding a PV flavor of KVM.
Re: [RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by Lai Jiangshan 1 year, 9 months ago
Hello, Sean

On Wed, Feb 28, 2024 at 1:27 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Feb 26, 2024, Paolo Bonzini wrote:
> > On Mon, Feb 26, 2024 at 3:34 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> > > - Full control: In XENPV/Lguest, the host Linux (dom0) entry code is
> > >   subordinate to the hypervisor/switcher, and the host Linux kernel
> > >   loses control over the entry code. This can cause inconvenience if
> > >   there is a need to update something when there is a bug in the
> > >   switcher or hardware.  Integral entry gives the control back to the
> > >   host kernel.
> > >
> > > - Zero overhead incurred: The integrated entry code doesn't cause any
> > >   overhead in host Linux entry path, thanks to the discreet design with
> > >   PVM code in the switcher, where the PVM path is bypassed on host events.
> > >   While in XENPV/Lguest, host events must be handled by the
> > >   hypervisor/switcher before being processed.
> >
> > Lguest... Now that's a name I haven't heard in a long time. :)  To be
> > honest, it's a bit weird to see yet another PV hypervisor. I think
> > what really killed Xen PV was the impossibility to protect from
> > various speculation side channel attacks, and I would like to
> > understand how PVM fares here.
> >
> > You obviously did a great job in implementing this within the KVM
> > framework; the changes in arch/x86/ are impressively small. On the
> > other hand this means it's also not really my call to decide whether
> > this is suitable for merging upstream. The bulk of the changes are
> > really in arch/x86/kernel/ and arch/x86/entry/, and those are well
> > outside my maintenance area.
>
> The bulk of changes in _this_ patchset are outside of arch/x86/kvm, but there are
> more changes on the horizon:
>
>  : To mitigate the performance problem, we designed several optimizations
>  : for the shadow MMU (not included in the patchset) and also planning to
>  : build a shadow EPT in L0 for L2 PVM guests.
>
>  : - Parallel Page fault for SPT and Paravirtualized MMU Optimization.
>
> And even absent _new_ shadow paging functionality, merging PVM would effectively
> shatter any hopes of ever removing KVM's existing, complex shadow paging code.
>
> Specifically, unsync 4KiB PTE support in KVM provides almost no benefit for nested
> TDP.  So if we can ever drop support for legacy shadow paging, which is a big if,
> but not completely impossible, then we could greatly simplify KVM's shadow MMU.
>

One of the important goals of open-sourcing PVM is to allow for the
optimization of shadow paging, especially through paravirtualization
methods, and potentially even to eliminate the need for shadow paging.

1) Technology: Shadow paging is a technique for page table compaction in
   the category of "one-dimensional paging", which includes the direct
   paging technology in XenPV. When the page tables are stable,
   one-dimensional paging can outperform TDP because it saves on TLB
   resources. For performance, it would be better to introduce another
   one-dimensional paging technology before shadow paging is removed.

2) Naming: The reason we use the term shadow paging in our paper and
   the cover letter is that it is more widely recognized and makes it
   easier for people to understand how PVM implements its page tables.
   It also demonstrates that PVM is able to implement a paging mechanism
   with very little code on top of KVM. However, this does not mean we
   adhere to shadow paging. Any one-dimensional paging technology can
   work here too.

3) Paravirt: As you mentioned, the best way to eliminate shadow paging
   is by using a paravirtualization (PV) approach. PVM is inherently
   suitable for having PV since it is a paravirt solution and has a
   corresponding framework. However, PV pagetables leads to a complex
   patchset, which we prefer not to include in the initial PVM patchset
   introduction.

4) Pave the path: One of the purposes of open-sourcing PVM is to bring
   in a new scenario for possibly introducing PV pagetable interfaces
   and optimizing shadow paging. Moreover, investing development effort
   in shadow paging is the only way to ultimately remove it.

5) Optimizations: We have experimented with numerous optimizations
   including at least two categories: parallel-pagetable and
   enlightened-pagetable. The parallel pagetable overhauls the locking
   mechanism within the shadow paging. The enlightened-pagetable
   introduces PVOPS in the guest to modify the page tables. One set of
   PVOPS, used on 4KiB PTEs, queues the pointers of the modified GPTEs
   in a hypervisor-guest shared ring buffer. Although the overall
   mechanism, including TLB handling, is not simple, the hypervisor
   portion is simpler than the unsync-sp method, and it bypasses many
   unsync-sp related code paths. The other set of PVOPS targets larger
   page table entries and directly issues hypercalls. Should both sets
   of PVOPS be utilized, write-protection for SPs is unneeded and
   shadow paging could be considered removed (a sketch of the first set
   follows).
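
A guest-side sketch of that first set of PVOPS (all names, sizes, and
the hypercall number are hypothetical, not the posted PVM ABI):

#define PVM_PTE_RING_SIZE       256
#define PVM_HC_SYNC_PTES        0x1000  /* made-up hypercall number */

struct pvm_pte_ring {
        unsigned long ptep[PVM_PTE_RING_SIZE]; /* addresses of dirty GPTEs */
        unsigned int head;
};

extern struct pvm_pte_ring *pvm_pte_ring;      /* shared with the hypervisor */
extern long pvm_hypercall1(unsigned long nr, unsigned long arg);

static void pvm_flush_pte_ring(void)
{
        /* The hypervisor walks the queued GPTE addresses and syncs the SPT. */
        pvm_hypercall1(PVM_HC_SYNC_PTES, pvm_pte_ring->head);
        pvm_pte_ring->head = 0;
}

/*
 * PV replacement for set_pte() on 4KiB leaf entries: no write-protection
 * fault per GPTE write, just a queued notification.
 */
void pvm_set_pte(unsigned long *ptep, unsigned long pteval)
{
        *ptep = pteval;

        if (pvm_pte_ring->head == PVM_PTE_RING_SIZE)
                pvm_flush_pte_ring();
        pvm_pte_ring->ptep[pvm_pte_ring->head++] = (unsigned long)ptep;
}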


> Which is a good segue into my main question: was there any one thing that was
> _the_ motivating factor for taking on the cost+complexity of shadow paging?  And
> as alluded to be Paolo, taking on the downsides of reduced isolation?
>
> It doesn't seem like avoiding L0 changes was the driving decision, since IIUC
> you have plans to make changes there as well.
>
>  : To mitigate the performance problem, we designed several optimizations
>  : for the shadow MMU (not included in the patchset) and also planning to
>  : build a shadow EPT in L0 for L2 PVM guests.
>


Getting every cloud provider to adopt a technology is more challenging
than developing the technology itself. It is easy to compile a list that
includes many technologies for L0 that have been merged into upstream
KVM for quite some time, yet not all major cloud providers use or
support them.

The purpose of PVM includes enabling the use of KVM within various cloud
VMs, allowing for easy operation of businesses with secure containers.
Therefore, it cannot rely on whether cloud providers make such changes
to L0.

The reason we are experimenting with modifications to L0 is that we
have many physical machines. Developing this technology with help from
L0 for L2 paging could provide us, and others who have their own
physical machines, with an additional option.


> Performance I can kinda sorta understand, but my gut feeling is that the problems
> with nested virtualization are solvable by adding nested paravirtualization between
> L0<=>L1, with likely lower overall cost+complexity than paravirtualizing L1<=>L2.
>
> The bulk of the pain with nested hardware virtualization lies in having to emulate
> VMX/SVM, and shadow L1's TDP page tables.  Hyper-V's eVMCS takes some of the sting
> off nVMX in particular, but eVMCS is still hobbled by its desire to be almost
> drop-in compatible with VMX.
>
> If we're willing to define a fully PV interface between L0 and L1 hypervisors, I
> suspect we provide performance far, far better than nVMX/nSVM.  E.g. if L0 provides
> a hypercall to map an L2=>L1 GPA, then L0 doesn't need to shadow L1 TDP, and L1
> doesn't even need to maintain hardware-defined page tables, it can use whatever
> software-defined data structure best fits it needs.
>
> And if we limit support to 64-bit L2 kernels and drop support for unnecessary cruft,
> the L1<=>L2 entry/exit paths could be drastically simplified and streamlined.  And
> it should be very doable to concoct an ABI between L0 and L2 that allows L0 to
> directly emulate "hot" instructions from L2, e.g. CPUID, common MSRs, etc.  I/O
> would likely be solvable too, e.g. maybe with a mediated device type solution that
> allows L0 to handle the data path for L2?
>
> The one thing that I don't see line of sight to supporting is taking L0 out of the
> TCB, i.e. running L2 VMs inside TDX/SNP guests.  But for me at least, that alone
> isn't sufficient justification for adding a PV flavor of KVM.


I didn't mean to suggest that running PVM inside TDX is an important
use case; I just used it to emphasize PVM's universal accessibility in
all environments, including inside an environment such as TDX where
nested virtualization is otherwise notoriously impossible, as Paolo
said in an LWN comment:
https://lwn.net/Articles/865807/

 : TDX cannot be used in a nested VM, and you cannot use nested
 : virtualization inside a TDX virtual machine.

(and actually, support for PVM in TDX/SNP is not complete yet)

Thanks
Lai
Re: [RFC PATCH 00/73] KVM: x86/PVM: Introduce a new hypervisor
Posted by David Woodhouse 1 year, 9 months ago
On Tue, 2024-02-27 at 09:27 -0800, Sean Christopherson wrote:
> 
> The bulk of the pain with nested hardware virtualization lies in having to emulate
> VMX/SVM, and shadow L1's TDP page tables.  Hyper-V's eVMCS takes some of the sting
> off nVMX in particular, but eVMCS is still hobbled by its desire to be almost
> drop-in compatible with VMX.
> 
> If we're willing to define a fully PV interface between L0 and L1 hypervisors, I
> suspect we provide performance far, far better than nVMX/nSVM.  E.g. if L0 provides
> a hypercall to map an L2=>L1 GPA, then L0 doesn't need to shadow L1 TDP, and L1
> doesn't even need to maintain hardware-defined page tables, it can use whatever
> software-defined data structure best fits it needs.

I'd like to understand how, if at all, this intersects with the
requirements we have for pKVM on x86. For example, would pKVM run the
untrusted part of the kernel as a PVM guest using this model? Would the
PV interface of which you speak also map to the calls from the kernel
into the secure pKVM hypervisor... ?