This is a v2 of KVM Userfault, mostly unchanged from v1[5]. Changelog here:
v1->v2:
- For arm64, no longer zap stage 2 when disabling KVM_MEM_USERFAULT
(thanks Oliver).
- Fix the userfault_bitmap validation and casts (thanks kernel test
robot).
- Fix _Atomic cast for the userfault bitmap in the selftest (thanks
kernel test robot).
- Pick up Reviewed-by on doc changes (thanks Bagas).
And here is a trimmed-down cover letter from v1, slightly modified
given the small arm64 change:
Please see the RFC[1] for the problem description. In summary,
guest_memfd VMs have no mechanism for doing post-copy live migration.
KVM Userfault provides such a mechanism.
There is a second problem that KVM Userfault solves: userfaultfd-based
post-copy doesn't scale very well. KVM Userfault, when used with
userfaultfd, scales much better in the common case where most post-copy
demand fetches result from vCPU access violations. This is a
continuation of the solution Anish was working on[3]. This aspect of
KVM Userfault is important for userfaultfd-based live migration when
scaling up to hundreds of vCPUs with ~30us network latency for a
PAGE_SIZE demand-fetch.
The implementation in this series is different from the RFC[1]. It adds
(see the sketch below):
1. a new memslot flag: KVM_MEM_USERFAULT,
2. a new field, userfault_bitmap, in struct kvm_memory_slot,
3. a new KVM_RUN exit reason: KVM_MEMORY_EXIT_FLAG_USERFAULT,
4. a new KVM capability: KVM_CAP_USERFAULT.
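To make the uAPI concrete, here is a minimal, hypothetical VMM-side
sketch of enabling the flag. KVM_SET_USER_MEMORY_REGION2 and struct
kvm_userspace_memory_region2 are existing uAPI, but KVM_MEM_USERFAULT,
KVM_CAP_USERFAULT, and the userfault_bitmap field come from this series,
and the exact field layout is an assumption:

/*
 * Hypothetical sketch: enable KVM_MEM_USERFAULT on an existing memslot,
 * with every page initially marked as userfault.
 */
#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

static int enable_userfault(int vm_fd, struct kvm_userspace_memory_region2 *r)
{
	size_t npages = r->memory_size / 4096;
	size_t bytes = ((npages + 63) / 64) * sizeof(unsigned long);
	unsigned long *bitmap;

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_USERFAULT) != 1)
		return -1;

	bitmap = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bitmap == MAP_FAILED)
		return -1;
	memset(bitmap, 0xff, bytes);	/* set bit == exit to userspace */

	r->flags |= KVM_MEM_USERFAULT;
	r->userfault_bitmap = (__u64)(unsigned long)bitmap; /* assumed field */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, r);
}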
KVM Userfault does not attempt to catch KVM's own accesses to guest
memory. That is left up to userfaultfd.
When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
are zapped, and new faults will check `userfault_bitmap` to see if the
fault should exit to userspace.
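On the return side, a hedged sketch of what the VMM's run loop might do
with such an exit. The memory_fault layout is the existing
KVM_CAP_MEMORY_FAULT_INFO one; KVM_MEMORY_EXIT_FLAG_USERFAULT is from
this series; fetch_page_from_source() is a made-up placeholder, and the
bitmap indexing assumes a slot based at GPA 0:

#include <linux/kvm.h>

extern void fetch_page_from_source(__u64 gfn);	/* hypothetical */

/* Hypothetical run-loop fragment: resolve a userfault exit. */
static void handle_userfault_exit(struct kvm_run *run, unsigned long *bitmap)
{
	if (run->exit_reason == KVM_EXIT_MEMORY_FAULT &&
	    (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_USERFAULT)) {
		__u64 gfn = run->memory_fault.gpa >> 12;

		fetch_page_from_source(gfn);	/* placeholder demand fetch */

		/*
		 * Clear the bit atomically; KVM re-checks the bitmap on the
		 * retried fault after the vCPU re-enters with KVM_RUN.
		 */
		__atomic_fetch_and(&bitmap[gfn / 64],
				   ~(1UL << (gfn % 64)), __ATOMIC_RELEASE);
	}
}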
When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
permitted.
When disabling KVM_MEM_USERFAULT, huge mappings are handled consistently
with dirty-log disabling: on x86, huge mappings will be reconstructed,
but on arm64, they won't be.
KVM Userfault is not compatible with async page faults. Nikita has
proposed a new, more userspace-driven implementation of async page
faults that *is* compatible with KVM Userfault[4].
See v1 for more performance details[5]. They are unchanged in this v2.
This series is based on the latest kvm/next.
[1]: https://lore.kernel.org/kvm/20240710234222.2333120-1-jthoughton@google.com/
[2]: https://lpc.events/event/18/contributions/1757/
[3]: https://lore.kernel.org/all/20240215235405.368539-1-amoorthy@google.com/
[4]: https://lore.kernel.org/kvm/20241118123948.4796-1-kalyazin@amazon.com/#t
[5]: https://lore.kernel.org/kvm/20241204191349.1730936-1-jthoughton@google.com/
James Houghton (13):
KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
KVM: arm64: Add support for KVM_MEM_USERFAULT
KVM: selftests: Fix vm_mem_region_set_flags docstring
KVM: selftests: Fix prefault_mem logic
KVM: selftests: Add va_start/end into uffd_desc
KVM: selftests: Add KVM Userfault mode to demand_paging_test
KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT
details
Documentation/virt/kvm/api.rst | 33 +++-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/mmu.c | 26 +++-
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 27 +++-
arch/x86/kvm/mmu/mmu_internal.h | 20 ++-
arch/x86/kvm/x86.c | 36 +++--
include/linux/kvm_host.h | 19 ++-
include/uapi/linux/kvm.h | 6 +-
.../selftests/kvm/demand_paging_test.c | 145 ++++++++++++++++--
.../testing/selftests/kvm/include/kvm_util.h | 5 +
.../selftests/kvm/include/userfaultfd_util.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 42 ++++-
.../selftests/kvm/lib/userfaultfd_util.c | 2 +
.../selftests/kvm/set_memory_region_test.c | 33 ++++
virt/kvm/Kconfig | 3 +
virt/kvm/kvm_main.c | 54 ++++++-
17 files changed, 419 insertions(+), 36 deletions(-)
base-commit: 10b2c8a67c4b8ec15f9d07d177f63b563418e948
--
2.47.1.613.gc27f4b7a9f-goog
On Thu, Jan 09, 2025, James Houghton wrote:
> James Houghton (13):
> KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
> KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
> KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
> KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
Starting with some series-wide feedback, the granularity of these first few
patches is too fine. I normally like to split things up, but honestly, this is
such a small feature that I don't see much point in separating the uAPI from the
infrastructure.
To avoid cyclical dependencies between common KVM and arch code, we can do all
the prep, but not fully enable+advertise support on any architecture until all
targeted architectures are fully ready.
In other words, I think we should squish these into one patch, minus this bit at
the very end of the series (spoiler alert):
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ce7bf5de6d72..0106d6d461a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1545,6 +1545,9 @@ static int check_memory_region_flags(struct kvm *kvm,
 	    !(mem->flags & KVM_MEM_GUEST_MEMFD))
 		valid_flags |= KVM_MEM_READONLY;
 
+	if (IS_ENABLED(CONFIG_KVM_GENERIC_PAGE_FAULT))
+		valid_flags |= KVM_MEM_USERFAULT;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -4824,6 +4827,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_CHECK_EXTENSION_VM:
 	case KVM_CAP_ENABLE_CAP_VM:
 	case KVM_CAP_HALT_POLL:
+#ifdef CONFIG_KVM_GENERIC_PAGE_FAULT
+	case KVM_CAP_USERFAULT:
+#endif
 		return 1;
 #ifdef CONFIG_KVM_MMIO
 	case KVM_CAP_COALESCED_MMIO:
On Thu, Jan 09, 2025, James Houghton wrote:
> KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
> KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
> KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
> KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
> [...]

I didn't look at the selftests changes, but nothing in this series scares
me. We bikeshedded most of this to death in the "exit on missing" series,
so for me at least, the only real question is whether or not we want to
add the uAPI. AFAIK, this is the best proposal for post-copy guest_memfd
support (and not just because it's the only proposal :-D).

So... yes?

Attached is a variation on the series using the common "struct
kvm_page_fault" idea. The documentation change could be squashed with the
final enablement patch.

Compile tested only. I would not be the least bit surprised if I
completely butchered something.
On Tue, May 6, 2025 at 8:13 PM Sean Christopherson <seanjc@google.com> wrote:
> On Thu, Jan 09, 2025, James Houghton wrote:
> > KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
> > [...]
>
> I didn't look at the selftests changes, but nothing in this series scares
> me. We bikeshedded most of this to death in the "exit on missing" series,
> so for me at least, the only real question is whether or not we want to
> add the uAPI. AFAIK, this is the best proposal for post-copy guest_memfd
> support (and not just because it's the only proposal :-D).

The only thing that I want to call out again is that this UAPI works
great when we are going from userfault --> !userfault. That is, it works
well for postcopy (both for guest_memfd and for standard memslots where
userfaultfd scalability is a concern).

But there is another use case worth bringing up: unmapping pages that
the VMM is emulating as poisoned.

Normally this can be handled by mm (e.g. with UFFDIO_POISON), but for 4K
poison within a HugeTLB-backed memslot (if the HugeTLB page remains
mapped in userspace), KVM Userfault is the only option (if we don't want
to punch holes in memslots). This leaves us with three problems:

1. If using KVM Userfault to emulate poison, we are stuck with small
   pages in stage 2 for the entire memslot.
2. We must unmap everything when toggling on KVM Userfault just to
   unmap a single page.
3. If KVM Userfault is already enabled, we have no choice but to toggle
   KVM Userfault off and on again to unmap the newly poisoned pages
   (i.e., there is no ioctl to scan the bitmap and unmap newly-userfault
   pages).

All of these are non-issues if we emulate poison by removing memslots,
and I think that's possible. But if that proves too slow, we'd need to
be a little bit more clever with hugepage recovery and with unmapping
newly-userfault pages, both of which I think can be solved by adding
some kind of bitmap re-scan ioctl. We can do that later if the need
arises.

> So... yes?

Thanks Sean!

> Attached is a variation on the series using the common "struct
> kvm_page_fault" idea. The documentation change could be squashed with
> the final enablement patch.
>
> Compile tested only. I would not be the least bit surprised if I
> completely butchered something.

Looks good! The new selftests work just fine.
On Wed, May 28, 2025, James Houghton wrote:
> The only thing that I want to call out again is that this UAPI works
> great when we are going from userfault --> !userfault. That is, it
> works well for postcopy (both for guest_memfd and for standard
> memslots where userfaultfd scalability is a concern).
>
> But there is another use case worth bringing up: unmapping pages that
> the VMM is emulating as poisoned.
> [...]

Hmm.

On the one hand, punching a hole in a memslot is generally gross, e.g.
requires deleting the entire memslot and thus unmapping large swaths of
guest memory (or all of guest memory for most x86 VMs).

On the other hand, unless userspace sets KVM_MEM_USERFAULT from time
zero, KVM will need to unmap guest memory (or demote the mapping size a
la eager page splitting?) when KVM_MEM_USERFAULT is toggled from 0=>1.

One thought would be to change the behavior of KVM's processing of the
userfault bitmap, such that KVM doesn't infer *anything* about the
mapping sizes, and instead give userspace more explicit control over the
mapping size. However, on non-x86 architectures, implementing such a
control would require a non-trivial amount of code and complexity, and
would incur overhead that doesn't exist today (i.e. we'd need to
implement infrastructure equivalent to x86's disallow_lpage tracking).

And IIUC, another problem with KVM Userfault is that it wouldn't Just
Work for KVM accesses to guest memory. E.g. if the HugeTLB page is still
mapped into userspace, then depending on the flow that gets hit, I'm
pretty sure that emulating an access to the poisoned memory would result
in KVM_EXIT_INTERNAL_ERROR, whereas punching a hole in a memslot would
result in a much more friendly KVM_EXIT_MMIO.

All in all, given that KVM needs to correctly handle hugepage vs.
memslot alignment/size issues no matter what, and that KVM has
well-established behavior for handling no-memslot accesses, I'm leaning
towards saying userspace should punch a hole in the memslot in order to
emulate a poisoned page. The only reason I can think of for preferring a
different approach is if userspace can't provide the desired
latency/performance characteristics when punching a hole in a memslot.
Hopefully reacting to a poisoned page is a fairly slow path?
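For concreteness, punching a hole here means deleting the covering slot
and re-adding two slots around the poisoned page. A hypothetical sketch
using the stock memslot API (setting memory_size to 0 deletes a slot;
the second slot id is made up, and guest_memfd offset fixup is omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Hypothetical sketch: emulate a poisoned 4K page at 'gpa' by deleting
 * the covering memslot and re-adding the two pieces around the hole.
 */
static void punch_memslot_hole(int vm_fd,
			       struct kvm_userspace_memory_region2 *old,
			       __u64 gpa)
{
	struct kvm_userspace_memory_region2 del = *old, lo = *old, hi = *old;

	del.memory_size = 0;			/* size 0 deletes the slot */
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &del);

	lo.memory_size = gpa - old->guest_phys_addr;
	if (lo.memory_size)
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &lo);

	hi.slot = old->slot + 1;		/* illustrative new slot id */
	hi.guest_phys_addr = gpa + 4096;
	hi.userspace_addr = old->userspace_addr +
			    (hi.guest_phys_addr - old->guest_phys_addr);
	hi.memory_size = old->memory_size -
			 (hi.guest_phys_addr - old->guest_phys_addr);
	if (hi.memory_size)
		ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &hi);
}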
On Thu, May 29, 2025 at 11:28 AM Sean Christopherson <seanjc@google.com> wrote:
> On Wed, May 28, 2025, James Houghton wrote:
> > [...]
>
> And IIUC, another problem with KVM Userfault is that it wouldn't Just
> Work for KVM accesses to guest memory. E.g. if the HugeTLB page is still
> mapped into userspace, then depending on the flow that gets hit, I'm
> pretty sure that emulating an access to the poisoned memory would result
> in KVM_EXIT_INTERNAL_ERROR, whereas punching a hole in a memslot would
> result in a much more friendly KVM_EXIT_MMIO.

Oh, yes, of course. KVM Userfault is not enough for memory poison
emulation on non-guest-memfd memslots. Just as these memslots need
userfaultfd to do post-copy properly, they still need userfaultfd for
memory poison (so 4K emulated poison within a HugeTLB memslot is not
possible). So yeah, in this case (4K poison in a still-mapped HugeTLB
page), we would need to punch a hole and get KVM_EXIT_MMIO. SGTM.

For guest_memfd memslots, we can handle uaccess to emulated poison like
tmpfs: with UFFDIO_POISON (Nikita has already started on UFFDIO_CONTINUE
support[1]; a minimal sketch of the poison call appears after this
message). We *could* make the gmem page fault handler (what Fuad is
implementing) respect KVM Userfault, but that isn't necessary (and would
look quite like a reimplementation of userfaultfd).

[1]: https://lore.kernel.org/kvm/20250404154352.23078-1-kalyazin@amazon.com/

> All in all, given that KVM needs to correctly handle hugepage vs.
> memslot alignment/size issues no matter what, and that KVM has
> well-established behavior for handling no-memslot accesses, I'm leaning
> towards saying userspace should punch a hole in the memslot in order to
> emulate a poisoned page. The only reason I can think of for preferring a
> different approach is if userspace can't provide the desired
> latency/performance characteristics when punching a hole in a memslot.
> Hopefully reacting to a poisoned page is a fairly slow path?

In general, yes it is. Memory poison is rare. For non-HugeTLB (tmpfs or
guest_memfd), I don't think we need to punch a hole, so that's good.

For HugeTLB, there are two circumstances that are perhaps concerning:

1. Learning about poison during post-copy? This should be vanishingly
   rare, as most poison is discovered in the first pre-copy pass. If we
   didn't do *any* pre-copy passes, then it could be a concern.
2. Learning about poison during pre-copy after shattering? If doing lazy
   page splitting with incremental dirty log clearing, this isn't a
   *huge* problem, otherwise it could be.

I think userspace has two ways out: (1) don't make super large memslots,
or (2) don't use HugeTLB.

Just to be clear, this isn't really an issue with KVM Userfault -- in
its current form (not preventing KVM's uaccess), it cannot help here.
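For reference, the UFFDIO_POISON call mentioned above is a real,
upstream userfaultfd ioctl (since Linux 6.6). A minimal sketch, assuming
'uffd' is already created and registered over the region:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Sketch: install poison for one 4K page; subsequent faults on it then
 * report poison rather than resolving the fault.
 */
static int poison_page(int uffd, unsigned long addr)
{
	struct uffdio_poison p = {
		.range = { .start = addr, .len = 4096 },
		.mode = 0,
	};

	return ioctl(uffd, UFFDIO_POISON, &p);
}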