[PATCH 00/15] KVM: x86: Introduce new ioctl KVM_TRANSLATE2

Nikolas Wipper posted 15 patches 1 year, 4 months ago
[PATCH 00/15] KVM: x86: Introduce new ioctl KVM_TRANSLATE2
Posted by Nikolas Wipper 1 year, 4 months ago
This series introduces a new ioctl, KVM_TRANSLATE2, which expands on
KVM_TRANSLATE. It is required to implement Hyper-V's
HvTranslateVirtualAddress hypercall as part of the ongoing effort to
emulate Hyper-V's Virtual Secure Mode (VSM) within KVM and QEMU. The
hypercall requires several new KVM APIs, one of which is KVM_TRANSLATE2,
which implements the core functionality of the hypercall. The rest of the
required functionality will be implemented in subsequent series.

Other than translating guest virtual addresses, the ioctl allows the
caller to control whether the accessed and dirty bits are set during the
page walk. It also allows specifying an access mode instead of returning
viable access modes, which enables setting the bits up to the level that
caused a failure. Additionally, the ioctl provides more information about
why the page walk failed, and which page table is responsible. This
functionality is not available within KVM_TRANSLATE, and can't be added
without breaking backwards compatibility, thus a new ioctl is required.

The ioctl was designed to facilitate as many other use cases as possible
apart from VSM. The error codes were intentionally chosen to be broad
enough to avoid exposing architecture-specific details. Even though
HvTranslateVirtualAddress only really needs one flag to set the accessed
and dirty bits whenever possible, that was split into several flags so
that future users can choose at a finer granularity when these bits
should be set.
Furthermore, as much information as possible is provided to the caller.

The patch series includes selftests for the ioctl, as well as fuzz
testing with random garbage guest page table entries. All previously passing
KVM selftests and KVM unit tests still pass.

Series overview:
- 1: Document the new ioctl
- 2-11: Update the page walker in preparation
- 12-14: Implement the ioctl
- 15: Implement testing

This series, alongside the series by Nicolas Saenz Julienne [1]
introducing the core building blocks for VSM and the accompanying QEMU
implementation [2], is capable of booting Windows Server 2019.

Both series are also available on GitHub [3].

[1] https://lore.kernel.org/linux-hyperv/20240609154945.55332-1-nsaenz@amazon.com/
[2] https://github.com/vianpl/qemu/tree/vsm/next
[3] https://github.com/vianpl/linux/tree/vsm/next

Best,
Nikolas

Nikolas Wipper (15):
  KVM: Add API documentation for KVM_TRANSLATE2
  KVM: x86/mmu: Abort page walk if permission checks fail
  KVM: x86/mmu: Introduce exception flag for unmapped GPAs
  KVM: x86/mmu: Store GPA in exception if applicable
  KVM: x86/mmu: Introduce flags parameter to page walker
  KVM: x86/mmu: Implement PWALK_SET_ACCESSED in page walker
  KVM: x86/mmu: Implement PWALK_SET_DIRTY in page walker
  KVM: x86/mmu: Implement PWALK_FORCE_SET_ACCESSED in page walker
  KVM: x86/mmu: Introduce status parameter to page walker
  KVM: x86/mmu: Implement PWALK_STATUS_READ_ONLY_PTE_GPA in page walker
  KVM: x86: Introduce generic gva to gpa translation function
  KVM: Introduce KVM_TRANSLATE2
  KVM: Add KVM_TRANSLATE2 stub
  KVM: x86: Implement KVM_TRANSLATE2
  KVM: selftests: Add test for KVM_TRANSLATE2

 Documentation/virt/kvm/api.rst                | 131 ++++++++
 arch/x86/include/asm/kvm_host.h               |  18 +-
 arch/x86/kvm/hyperv.c                         |   3 +-
 arch/x86/kvm/kvm_emulate.h                    |   8 +
 arch/x86/kvm/mmu.h                            |  10 +-
 arch/x86/kvm/mmu/mmu.c                        |   7 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |  80 +++--
 arch/x86/kvm/x86.c                            | 123 ++++++-
 include/linux/kvm_host.h                      |   6 +
 include/uapi/linux/kvm.h                      |  33 ++
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/kvm_translate2.c     | 310 ++++++++++++++++++
 virt/kvm/kvm_main.c                           |  41 +++
 13 files changed, 724 insertions(+), 47 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c

-- 
2.40.1




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Re: [PATCH 00/15] KVM: x86: Introduce new ioctl KVM_TRANSLATE2
Posted by Nikolas Wipper 1 year, 4 months ago
I saw this on another series[*]:

> if KVM_TRANSLATE2 lands (though I'm somewhat curious as to why QEMU doesn't do
> the page walks itself).

The simple reason for keeping this functionality in KVM is that it already
has a mature, production-level page walker (which is already exposed), and
creating something similar in QEMU would take a lot longer and would be much
harder to maintain than just creating an API that leverages the existing
walker.

[*] https://lore.kernel.org/lkml/ZvJseVoT7gN_GBG3@google.com/T/#mb0b23a1f5023192a442db4a16629d9ca74eb6b5e

PS: This is also a gentle ping for review, in case this got lost in between
conferences.



Re: [PATCH 00/15] KVM: x86: Introduce new ioctl KVM_TRANSLATE2
Posted by Sean Christopherson 1 year, 1 month ago
On Tue, Sep 10, 2024, Nikolas Wipper wrote:
> This series introduces a new ioctl, KVM_TRANSLATE2, which expands on
> KVM_TRANSLATE. It is required to implement Hyper-V's
> HvTranslateVirtualAddress hypercall as part of the ongoing effort to
> emulate Hyper-V's Virtual Secure Mode (VSM) within KVM and QEMU. The
> hypercall requires several new KVM APIs, one of which is KVM_TRANSLATE2,
> which implements the core functionality of the hypercall. The rest of the
> required functionality will be implemented in subsequent series.
> 
> Other than translating guest virtual addresses, the ioctl allows the
> caller to control whether the accessed and dirty bits are set during the
> page walk. It also allows specifying an access mode instead of returning
> viable access modes, which enables setting the bits up to the level that
> caused a failure. Additionally, the ioctl provides more information about
> why the page walk failed, and which page table is responsible. This
> functionality is not available within KVM_TRANSLATE, and can't be added
> without breaking backwards compatibility, thus a new ioctl is required.

...

>  Documentation/virt/kvm/api.rst                | 131 ++++++++
>  arch/x86/include/asm/kvm_host.h               |  18 +-
>  arch/x86/kvm/hyperv.c                         |   3 +-
>  arch/x86/kvm/kvm_emulate.h                    |   8 +
>  arch/x86/kvm/mmu.h                            |  10 +-
>  arch/x86/kvm/mmu/mmu.c                        |   7 +-
>  arch/x86/kvm/mmu/paging_tmpl.h                |  80 +++--
>  arch/x86/kvm/x86.c                            | 123 ++++++-
>  include/linux/kvm_host.h                      |   6 +
>  include/uapi/linux/kvm.h                      |  33 ++
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/x86_64/kvm_translate2.c     | 310 ++++++++++++++++++
>  virt/kvm/kvm_main.c                           |  41 +++
>  13 files changed, 724 insertions(+), 47 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c

...

> The simple reason for keeping this functionality in KVM is that it already
> has a mature, production-level page walker (which is already exposed), and
> creating something similar in QEMU would take a lot longer and would be much
> harder to maintain than just creating an API that leverages the existing
> walker.

I'm not convinced that implementing targeted support in QEMU (or any other VMM)
would be at all challenging or a burden to maintain.  I do think duplicating
functionality across multiple VMMs is undesirable, but that's an argument for
creating modular userspace libraries for such functionality.  E.g. I/O APIC
emulation is another one I'd love to move to a common library.

Traversing page tables isn't difficult.  Checking permission bits isn't complex.
Tedious, perhaps.  But not complex.  KVM's rather insane code comes from KVM's
desire to make the checks as performant as possible, because eking out every little
bit of performance matters for legacy shadow paging.  I doubt VSM needs _that_
level of performance.

I say "targeted", because I assume the only use case for VSM is 64-bit non-nested
guests.  QEMU already has rudimentary support for walking guest page tables,
and that code is all of 40 LoC.  Granted, it's heinous and lacks permission checks
and A/D updates, but I would expect a clean implementation with permission checks
and A/D support would clock in around 200 LoC.  Maybe 300.
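
To make the size estimate above concrete, here is a minimal sketch of what
such a targeted userspace walker could look like.  It is not QEMU's or KVM's
code.  All names (walk_gva, struct guest_mem, the ACC_* and WALK_* constants,
demo_map_page) are hypothetical, and the sketch assumes 4-level paging with
CR0.WP=1, skips SMEP/SMAP/PKU and reserved-bit checks, reports permission
failures at the leaf level only, and updates A/D bits non-atomically and only
when the walk succeeds.  Only the PTE bit positions are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86-64 page table entry bits. */
#define PTE_P    (1ULL << 0)   /* present */
#define PTE_W    (1ULL << 1)   /* writable */
#define PTE_U    (1ULL << 2)   /* user-accessible */
#define PTE_A    (1ULL << 5)   /* accessed */
#define PTE_D    (1ULL << 6)   /* dirty */
#define PTE_PS   (1ULL << 7)   /* large page (levels 2 and 3) */
#define PTE_NX   (1ULL << 63)  /* no-execute */
#define PTE_ADDR 0x000ffffffffff000ULL

/* Hypothetical names for illustration; not taken from the series. */
enum { ACC_WRITE = 1, ACC_USER = 2, ACC_EXEC = 4 };
enum { WALK_SET_AD = 1 };
enum walk_err { WALK_OK, WALK_NOT_PRESENT, WALK_PERM, WALK_BAD_GPA };

struct walk_result { enum walk_err err; int level; uint64_t gpa; };
struct guest_mem { uint64_t *mem; uint64_t size; /* bytes */ };

static struct walk_result walk_gva(struct guest_mem *m, uint64_t cr3,
                                   uint64_t gva, unsigned access, unsigned flags)
{
    struct walk_result r = { WALK_OK, 4, 0 };
    uint64_t *ptes[4], table = cr3 & PTE_ADDR, mask;
    bool w = true, u = true, x = true;
    int n = 0, i;

    for (;; r.level--) {
        int shift = 12 + 9 * (r.level - 1);
        uint64_t pte_gpa = table + ((gva >> shift) & 0x1ff) * 8;

        if (pte_gpa + 8 > m->size) { r.err = WALK_BAD_GPA; return r; }
        ptes[n] = &m->mem[pte_gpa / 8];
        if (!(*ptes[n] & PTE_P)) { r.err = WALK_NOT_PRESENT; return r; }

        /* Effective permissions are the AND of all levels (NX is an OR). */
        w = w && (*ptes[n] & PTE_W);
        u = u && (*ptes[n] & PTE_U);
        x = x && !(*ptes[n] & PTE_NX);
        table = *ptes[n++] & PTE_ADDR;
        if (r.level == 1 || (r.level < 4 && (*ptes[n - 1] & PTE_PS)))
            break;  /* reached a 4K, 2M, or 1G leaf */
    }

    if (((access & ACC_WRITE) && !w) || ((access & ACC_USER) && !u) ||
        ((access & ACC_EXEC) && !x)) { r.err = WALK_PERM; return r; }

    if (flags & WALK_SET_AD) {
        /* A real implementation would need atomic updates (cmpxchg). */
        for (i = 0; i < n; i++)
            *ptes[i] |= PTE_A;
        if (access & ACC_WRITE)
            *ptes[n - 1] |= PTE_D;
    }

    mask = (1ULL << (12 + 9 * (r.level - 1))) - 1;
    r.gpa = (table & ~mask) | (gva & mask);
    return r;
}

/*
 * Demo helper: map one 4K page at gva, using tables at GPAs 0x1000..0x4000.
 * Returns the value to pass as CR3.
 */
static uint64_t demo_map_page(struct guest_mem *m, uint64_t gva,
                              uint64_t page_gpa, uint64_t leaf_bits)
{
    uint64_t table = 0x1000;

    for (int level = 4; level >= 1; level--) {
        uint64_t idx = (gva >> (12 + 9 * (level - 1))) & 0x1ff;
        uint64_t next = level > 1 ? table + 0x1000 : page_gpa;

        assert(table + 0x1000 <= m->size);
        m->mem[table / 8 + idx] =
            next | (level > 1 ? (PTE_P | PTE_W | PTE_U) : leaf_bits);
        table = next;
    }
    return 0x1000;
}
```

With a read-only user leaf built by demo_map_page(), a user read through
walk_gva() yields the GPA, while a write request fails with WALK_PERM at
level 1 and leaves the A/D bits untouched.  The different A/D policies
discussed in this thread (e.g. setting bits up to the failing level) would
replace the single WALK_SET_AD flag in this sketch.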

And ignoring docs and selftests, that's roughly what's being added in this series.
Much of the code being added is quite simple, but there are non-trivial changes
here as well.  E.g. the different ways of setting A/D bits.

My biggest concern is taking on ABI that restricts what KVM can do in its walker.
E.g. I *really* don't like the PKU change.  Yeah, Intel doesn't explicitly define
architectural behavior, but diverging from hardware behavior is rarely a good idea.

Similarly, the behavior of FNAME(protect_clean_gpte)() probably isn't desirable
for the VSM use case.