Hello all,
as previously announced here, we are working on an integration that will
expose the HyperV hypervisor to QEMU on Linux hosts. HyperV is a Type 1
hypervisor with a layered architecture that features a "root partition"
alongside VMs as "child partitions" that will interface with the
hypervisor and has access to the hardware. (https://aka.ms/hypervarch)
The effort to run Linux on such a Root Partition and expose HyperV to
such a management partition is called "MSHV". Sometimes we refer to the
root partition as "Dom0 Linux". Today we are targetting nested
virtualization, that is: the creation + management of L2 VMs on an L1
VM (L0 would indicate bare metal).
+-------------+ +----------------+ +--------------+
| | | | | |
| Azure Host | | L1 Linux Dom0 | | L2 Guest VM |
| | | | | |
| OS | | | | |
| | | +------------+ | | |
| | | | Qemu VMM | | | |
| | | +------------+ | | |
| | | +------------+ | | |
| | | | Kernel | | | |
| | | +-----+------+ | | |
| | +-------|--------+ +--------------+
| | +-------v-------------------------+
| | | Microsoft Hypervisor (L1) |
+-------------+ +-------+-------------------------+
|
+-----------------------v-------------------------+
| Microsoft Hypervisor (L0) |
+-------------------------------------------------+
+-------------------------------------------------+
| |
| Hardware |
| |
+-------------------------------------------------+
This submission is a port of the existing MSHV integration that is
shipped in Cloud-Hypervisor and MSHV-specific rust-crates in rust-vmm.
There are various products like AKS Pod Sandboxing and AKS Confidential
Pods built on MSHV and Cloud-Hypervisor. We hope to achieve a seamless
integration into the QEMU accelerator framework, similar to existing
integrations like KVM, HVF or WHPX.
The patch set has been split into chunks that should be applicable and
buildable individually, but only the full set of commits will allow
launching MSHV-accelerated guests on supported kernels and environments.
The toggle to enable the feature at build time would be: `./configure
--enable-mshv`.
When launching a VM, the accelerator `mshv` can be enabled via
`-accel mshv` or `-machine q35,accel=mshv`.
We concluded the porting, but we haven't performed any comprehensive
testing yet. We opted to send our submission early to receive feedback
about the general structure and potential problems of our integration.
Most likely we will uncover problems during testing and address those in
upcoming revisions of the patch set.
The configuration we are using during development:
machine q35 + OVMF + various recent linux distros as guests (fedora
42, ubuntu 22.04)
We would welcome any feedback around the structure and integration
points that we chose, so we can incorporate them into upcoming
revisions.
Some notes/caveats about the initial submission:
- The relevant MSHV kernel code has been accepted for inclusion in the
upcoming 6.15 release, which should be released shortly. To allow
building it on older kernel we vendored the kernel headers that define
the MSHV ABI into the patch set. We might remove it in later
revisions of the patch set, or put it behind a feature toggle. Once
the kernel is released we plan to published a preconfigured Azure
image, which can be used to test the MSHV accelerator.
- QEMU is mapping regions into the guest that might partially overlap in
their userspace_addr range (e.g. for ROMs in early boot). Currently
MSHV will reject such overlaps. We are looking into whether we
can/want to relax that restriction. To work around this we maintain a
list of mapping references and swap in/out regions if there's a GPA
fault and we find a valid candidate region in our list. (see last
commit). Maybe there are alternative, less invasive, suggestions. We'd
be happy to hear those.
- We noticed that when using SeaBIOS, in certain permutations of guest
configuration (> 2GB ram & >1 virtio-blk-pci devices), we run into
unmapped GPA errors. We suspect it has to do with SeaBIOS addressing
memory in the 4GB+ region in those cases. We are investigating, and
will hopefully be able to issue a fix soon. For the time being this
can be worked around by using OVMF as firmware:
- Since the MHSV accelerator requires a HyperV hypervisor to be present,
it would make sense to provide testing infrastructure for integration
testing on Azure. We are looking into options how to implement that.
best,
magnus
Magnus Kulke (25):
accel: Add Meson and config support for MSHV accelerator
target/i386/emulate: allow instruction decoding from stream
target/i386/mshv: Add x86 decoder/emu implementation
hw/intc: Generalize APIC helper names from kvm_* to accel_*
include/hw/hyperv: Add MSHV ABI header definitions
accel/mshv: Add accelerator skeleton
accel/mshv: Register memory region listeners
accel/mshv: Initialize VM partition
accel/mshv: Register guest memory regions with hypervisor
accel/mshv: Add ioeventfd support
accel/mshv: Add basic interrupt injection support
accel/mshv: Add vCPU creation and execution loop
accel/mshv: Add vCPU signal handling
target/i386/mshv: Add CPU create and remove logic
target/i386/mshv: Implement mshv_store_regs()
target/i386/mshv: Implement mshv_get_standard_regs()
target/i386/mshv: Implement mshv_get_special_regs()
target/i386/mshv: Implement mshv_arch_put_registers()
target/i386/mshv: Set local interrupt controller state
target/i386/mshv: Register CPUID entries with MSHV
target/i386/mshv: Register MSRs with MSHV
target/i386/mshv: Integrate x86 instruction decoder/emulator
target/i386/mshv: Write MSRs to the hypervisor
target/i386/mshv: Implement mshv_vcpu_run()
accel/mshv: Add memory remapping workaround
accel/Kconfig | 3 +
accel/accel-irq.c | 95 ++
accel/meson.build | 3 +-
accel/mshv/irq.c | 370 +++++++
accel/mshv/mem.c | 434 ++++++++
accel/mshv/meson.build | 9 +
accel/mshv/mshv-all.c | 731 ++++++++++++
accel/mshv/msr.c | 375 +++++++
accel/mshv/trace-events | 20 +
accel/mshv/trace.h | 1 +
hw/intc/apic.c | 9 +
hw/intc/ioapic.c | 20 +-
hw/virtio/virtio-pci.c | 19 +-
include/hw/hyperv/hvgdk.h | 20 +
include/hw/hyperv/hvhdk.h | 165 +++
include/hw/hyperv/hvhdk_mini.h | 106 ++
include/hw/hyperv/linux-mshv.h | 1038 ++++++++++++++++++
include/system/accel-irq.h | 26 +
include/system/mshv.h | 237 ++++
meson.build | 17 +
meson_options.txt | 2 +
scripts/meson-buildoptions.sh | 3 +
target/i386/cpu.h | 2 +-
target/i386/emulate/meson.build | 7 +-
target/i386/emulate/x86_decode.c | 32 +-
target/i386/emulate/x86_decode.h | 11 +
target/i386/emulate/x86_emu.c | 3 +-
target/i386/emulate/x86_emu.h | 1 +
target/i386/meson.build | 2 +
target/i386/mshv/meson.build | 8 +
target/i386/mshv/mshv-cpu.c | 1768 ++++++++++++++++++++++++++++++
target/i386/mshv/x86.c | 330 ++++++
32 files changed, 5841 insertions(+), 26 deletions(-)
create mode 100644 accel/accel-irq.c
create mode 100644 accel/mshv/irq.c
create mode 100644 accel/mshv/mem.c
create mode 100644 accel/mshv/meson.build
create mode 100644 accel/mshv/mshv-all.c
create mode 100644 accel/mshv/msr.c
create mode 100644 accel/mshv/trace-events
create mode 100644 accel/mshv/trace.h
create mode 100644 include/hw/hyperv/hvgdk.h
create mode 100644 include/hw/hyperv/hvhdk.h
create mode 100644 include/hw/hyperv/hvhdk_mini.h
create mode 100644 include/hw/hyperv/linux-mshv.h
create mode 100644 include/system/accel-irq.h
create mode 100644 include/system/mshv.h
create mode 100644 target/i386/mshv/meson.build
create mode 100644 target/i386/mshv/mshv-cpu.c
create mode 100644 target/i386/mshv/x86.c
--
2.34.1