Async PF [1] allows other processes to run on a vCPU while the host handles a
stage-2 fault caused by a process on that vCPU. When using VM-exit-based
stage-2 fault handling [2], async PF functionality is lost because KVM does
not run the vCPU while a fault is being handled, so no other process can
execute on the vCPU. This patch series extends VM-exit-based stage-2 fault
handling with async PF support by letting userspace handle faults instead of
the kernel, hence the "async PF user" name.

I circulated the idea with Paolo, Sean, David H, and James H at the LPC, and
the only concern I heard was about injecting the "page not present" event via
#PF exception in the CoCo case, where it may not work. In my implementation,
I reused the existing code for doing that, so the async PF user implementation
is on par with the present async PF implementation in this regard, and support
for the CoCo case can be added separately.

Please note that this series is applied on top of the VM-exit-based stage-2
fault handling RFC [2].

Implementation

The following workflow is implemented:
 - A process in the guest causes a stage-2 fault.
 - KVM checks whether the fault can be handled asynchronously. If it can, KVM
   prepares VM exit info containing a newly added "async PF" flag and an async
   PF token corresponding to the fault.
 - Userspace reads the VM exit info and resumes the vCPU immediately.
   Meanwhile, it processes the fault.
 - When the fault is resolved, userspace calls a new async ioctl with the
   token to notify KVM.
 - KVM communicates to the guest that the process can be resumed.

Notes:
 - No changes to the x86 async PF PV interface are required.
 - The series does not introduce new dependencies on x86 compared to the
   existing async PF.

Testing

Inspired by [3], I built a Firecracker-based setup, where Firecracker
implemented the VM-exit-based fault handling. I observed that a workload
consisting of CPU-bound and memory-bound threads running concurrently executed
faster with async PF user enabled: with 10 ms-long fault processing, it was
26% faster.

It is difficult to provide an objective performance comparison between async
PF kernel and async PF user, because async PF user can only work with
VM-exit-based fault handling, which has its own performance characteristics
compared to in-kernel fault handling or UserfaultFD.

The patch series is built on top of the VM-exit-based stage-2 fault handling
RFC [2]. Patch 1 updates the documentation to reflect the [2] changes.
Patches 2-6 add the implementation of async PF user.

Questions:
 - Are there any general concerns about the approach?
 - Can we leave the CoCo use case aside for now, or do we need to support it
   straight away?
 - What is the desired level of coupling between async PF and async PF user?
   For now, I kept the coupling to the bare minimum (only the PV-related data
   structure is shared between the two).
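[Editor's note] As an illustration of the workflow described in the cover letter above, here is a rough sketch of the VMM side. The exit flag (KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER) and the memory_fault.async_pf_user_token field are the additions discussed later in this thread; the completion ioctl name and number, its argument struct, and the page-fetching helpers are hypothetical stand-ins, not anything defined by the series.

/*
 * VMM-side sketch of the async PF user flow.
 * Hypothetical: KVM_ASYNC_PF_USER_READY (placeholder ioctl number),
 * struct kvm_async_pf_user_ready, fetch_page_from_remote() and
 * install_page_in_guest_memory().
 */
#include <linux/kvm.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Hypothetical VMM helpers, stand-ins for the real memory plumbing. */
extern void *fetch_page_from_remote(__u64 gpa);
extern void install_page_in_guest_memory(__u64 gpa, void *data);

/* Hypothetical uAPI: completion ioctl argument and placeholder number. */
struct kvm_async_pf_user_ready {
	__u32 token;
};
#define KVM_ASYNC_PF_USER_READY _IOW(KVMIO, 0xd5, struct kvm_async_pf_user_ready)

struct apf_user_job {
	int vcpu_fd;
	__u64 gpa;
	__u32 token;
};

/* I/O thread: resolve the fault, then report the token back to KVM. */
static void *apf_user_worker(void *arg)
{
	struct apf_user_job *job = arg;
	struct kvm_async_pf_user_ready ready = { .token = job->token };

	install_page_in_guest_memory(job->gpa, fetch_page_from_remote(job->gpa));
	ioctl(job->vcpu_fd, KVM_ASYNC_PF_USER_READY, &ready);

	free(job);
	return NULL;
}

static void handle_memory_fault_exit(int vcpu_fd, struct kvm_run *run)
{
	if (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER) {
		struct apf_user_job *job = malloc(sizeof(*job));
		pthread_t tid;

		job->vcpu_fd = vcpu_fd;
		job->gpa = run->memory_fault.gpa;
		job->token = run->memory_fault.async_pf_user_token;

		/*
		 * Resolve the fault in the background and re-enter the guest
		 * immediately so it can schedule another process meanwhile.
		 */
		pthread_create(&tid, NULL, apf_user_worker, job);
		pthread_detach(tid);
		return; /* caller loops straight back into KVM_RUN */
	}

	/* Plain userfault: resolve synchronously before re-entering. */
	install_page_in_guest_memory(run->memory_fault.gpa,
				     fetch_page_from_remote(run->memory_fault.gpa));
}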
[1] https://kvm-forum.qemu.org/2021/sdei_apf_for_arm64_gavin.pdf
[2] https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
[3] https://lore.kernel.org/all/20200508032919.52147-1-gshan@redhat.com/

Nikita

Nikita Kalyazin (6):
  Documentation: KVM: add userfault KVM exit flag
  Documentation: KVM: add async pf user doc
  KVM: x86: add async ioctl support
  KVM: trace events: add type argument to async pf
  KVM: x86: async_pf_user: add infrastructure
  KVM: x86: async_pf_user: hook to fault handling and add ioctl

 Documentation/virt/kvm/api.rst  |  35 ++++++
 arch/x86/include/asm/kvm_host.h |  12 +-
 arch/x86/kvm/Kconfig            |   7 ++
 arch/x86/kvm/lapic.c            |   2 +
 arch/x86/kvm/mmu/mmu.c          |  68 ++++++++++-
 arch/x86/kvm/x86.c              | 101 +++++++++++++++-
 arch/x86/kvm/x86.h              |   2 +
 include/linux/kvm_host.h        |  30 +++++
 include/linux/kvm_types.h       |   1 +
 include/trace/events/kvm.h      |  50 +++++---
 include/uapi/linux/kvm.h        |  12 +-
 virt/kvm/Kconfig                |   3 +
 virt/kvm/Makefile.kvm           |   1 +
 virt/kvm/async_pf.c             |   2 +-
 virt/kvm/async_pf_user.c        | 197 ++++++++++++++++++++++++++++++++
 virt/kvm/async_pf_user.h        |  24 ++++
 virt/kvm/kvm_main.c             |  14 +++
 17 files changed, 535 insertions(+), 26 deletions(-)
 create mode 100644 virt/kvm/async_pf_user.c
 create mode 100644 virt/kvm/async_pf_user.h

base-commit: 15f01813426bf9672e2b24a5bac7b861c25de53b
--
2.40.1
On Mon, Nov 18, 2024 at 4:40 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>
> Async PF [1] allows to run other processes on a vCPU while the host
> handles a stage-2 fault caused by a process on that vCPU. When using
> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
> because KVM does not run the vCPU while a fault is being handled so no
> other process can execute on the vCPU. This patch series extends
> VM-exit-based stage-2 fault handling with async PF support by letting
> userspace handle faults instead of the kernel, hence the "async PF user"
> name.
>
> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
> and the only concern I heard was about injecting the "page not present"
> event via #PF exception in the CoCo case, where it may not work. In my
> implementation, I reused the existing code for doing that, so the async
> PF user implementation is on par with the present async PF
> implementation in this regard, and support for the CoCo case can be
> added separately.
>
> Please note that this series is applied on top of the VM-exit-based
> stage-2 fault handling RFC [2].

Thanks, Nikita! I'll post a new version of [2] very soon. The new
version contains the simplifications we talked about at LPC but is
conceptually the same (so this async PF series is motivated the same
way), and it shouldn't have many/any conflicts with the main bits of
this series.

> [2] https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
On 19/11/2024 01:26, James Houghton wrote:
>> Please note that this series is applied on top of the VM-exit-based
>> stage-2 fault handling RFC [2].
>
> Thanks, Nikita! I'll post a new version of [2] very soon. The new
> version contains the simplifications we talked about at LPC but is
> conceptually the same (so this async PF series is motivated the same
> way), and it shouldn't have many/any conflicts with the main bits of
> this series.

Great news, looking forward to seeing it!

>
>> [2] https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> Async PF [1] allows to run other processes on a vCPU while the host
> handles a stage-2 fault caused by a process on that vCPU. When using
> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
> because KVM does not run the vCPU while a fault is being handled so no
> other process can execute on the vCPU. This patch series extends
> VM-exit-based stage-2 fault handling with async PF support by letting
> userspace handle faults instead of the kernel, hence the "async PF user"
> name.
>
> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
> and the only concern I heard was about injecting the "page not present"
> event via #PF exception in the CoCo case, where it may not work. In my
> implementation, I reused the existing code for doing that, so the async
> PF user implementation is on par with the present async PF
> implementation in this regard, and support for the CoCo case can be
> added separately.
>
> Please note that this series is applied on top of the VM-exit-based
> stage-2 fault handling RFC [2].
...
> Nikita Kalyazin (6):
> Documentation: KVM: add userfault KVM exit flag
> Documentation: KVM: add async pf user doc
> KVM: x86: add async ioctl support
> KVM: trace events: add type argument to async pf
> KVM: x86: async_pf_user: add infrastructure
> KVM: x86: async_pf_user: hook to fault handling and add ioctl
>
> Documentation/virt/kvm/api.rst | 35 ++++++
> arch/x86/include/asm/kvm_host.h | 12 +-
> arch/x86/kvm/Kconfig | 7 ++
> arch/x86/kvm/lapic.c | 2 +
> arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
> arch/x86/kvm/x86.c | 101 +++++++++++++++-
> arch/x86/kvm/x86.h | 2 +
> include/linux/kvm_host.h | 30 +++++
> include/linux/kvm_types.h | 1 +
> include/trace/events/kvm.h | 50 +++++---
> include/uapi/linux/kvm.h | 12 +-
> virt/kvm/Kconfig | 3 +
> virt/kvm/Makefile.kvm | 1 +
> virt/kvm/async_pf.c | 2 +-
> virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
> virt/kvm/async_pf_user.h | 24 ++++
> virt/kvm/kvm_main.c | 14 +++
> 17 files changed, 535 insertions(+), 26 deletions(-)
I am supportive of the idea, but there is way too much copy+paste in this series.
And it's not just the code itself, it's all the structures and concepts. Off the
top of my head, I can't think of any reason there needs to be a separate queue,
separate lock(s), etc. The only difference between kernel APF and user APF is
what chunk of code is responsible for faulting in the page.
I suspect a good place to start would be something along the lines of the below
diff, and go from there. Given that KVM already needs to special case the fake
"wake all" items, I'm guessing it won't be terribly difficult to teach the core
flows about userspace async #PF.
I'm also not sure that injecting async #PF for all userfaults is desirable. For
in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
be handled via async #PF. The obvious answer is to have userspace only enable
userspace async #PF when it's useful, but "an all or nothing" approach isn't
great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
doesn't make sense either :-/
Exiting to userspace in vCPU context is also kludgy. It makes sense for base
userfault, because the vCPU can't make forward progress until the fault is
resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
James' series. Anyways, it definitely doesn't make sense for async #PF, because
the whole point is to let the vCPU run. Signalling userspace would definitely
add complexity, but only because of the need to communicate the token and wait
for userspace to consume said token. I'll think more on that.
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 0ee4816b079a..fc31b47cf9c5 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
* success, 'false' on failure (page fault has to be handled synchronously).
*/
bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- unsigned long hva, struct kvm_arch_async_pf *arch)
+ unsigned long hva, struct kvm_arch_async_pf *arch,
+ bool userfault)
{
struct kvm_async_pf *work;
@@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
work->addr = hva;
work->arch = *arch;
- INIT_WORK(&work->work, async_pf_execute);
-
list_add_tail(&work->queue, &vcpu->async_pf.queue);
vcpu->async_pf.queued++;
work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
- schedule_work(&work->work);
+ if (userfault) {
+ work->userfault = true;
+ } else {
+ INIT_WORK(&work->work, async_pf_execute);
+ schedule_work(&work->work);
+ }
return true;
}
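[Editor's note] To complement the setup-side diff above, the completion direction (userspace reporting the token back) might then look roughly like the sketch below if the existing per-vCPU queue and "done" list are reused. kvm_async_pf_user_ready() and the work->userfault flag are illustrative only; work->arch.token assumes x86, where kvm_arch_async_pf carries the token; and real code would need to synchronize against the vCPU task, which currently manipulates vcpu->async_pf.queue without holding async_pf.lock.

/*
 * Illustrative only: how a userspace-reported completion could reuse the
 * existing async PF bookkeeping.  The item is left on the "queue" list,
 * as kernel APF does, and is handed to the "done" list for the vCPU to
 * process on its next completion check.
 */
static int kvm_async_pf_user_ready(struct kvm_vcpu *vcpu, u32 token)
{
	struct kvm_async_pf *work;

	spin_lock(&vcpu->async_pf.lock);
	list_for_each_entry(work, &vcpu->async_pf.queue, queue) {
		if (!work->userfault || work->arch.token != token)
			continue;

		/* Hand the item over to the existing "ready" machinery. */
		list_add_tail(&work->link, &vcpu->async_pf.done);
		spin_unlock(&vcpu->async_pf.lock);

		kvm_arch_async_page_present_queued(vcpu);
		kvm_vcpu_kick(vcpu);
		return 0;
	}
	spin_unlock(&vcpu->async_pf.lock);

	return -EINVAL;
}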
On 11/02/2025 21:17, Sean Christopherson wrote:
> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>> Async PF [1] allows to run other processes on a vCPU while the host
>> handles a stage-2 fault caused by a process on that vCPU. When using
>> VM-exit-based stage-2 fault handling [2], async PF functionality is lost
>> because KVM does not run the vCPU while a fault is being handled so no
>> other process can execute on the vCPU. This patch series extends
>> VM-exit-based stage-2 fault handling with async PF support by letting
>> userspace handle faults instead of the kernel, hence the "async PF user"
>> name.
>>
>> I circulated the idea with Paolo, Sean, David H, and James H at the LPC,
>> and the only concern I heard was about injecting the "page not present"
>> event via #PF exception in the CoCo case, where it may not work. In my
>> implementation, I reused the existing code for doing that, so the async
>> PF user implementation is on par with the present async PF
>> implementation in this regard, and support for the CoCo case can be
>> added separately.
>>
>> Please note that this series is applied on top of the VM-exit-based
>> stage-2 fault handling RFC [2].
>
> ...
>
>> Nikita Kalyazin (6):
>> Documentation: KVM: add userfault KVM exit flag
>> Documentation: KVM: add async pf user doc
>> KVM: x86: add async ioctl support
>> KVM: trace events: add type argument to async pf
>> KVM: x86: async_pf_user: add infrastructure
>> KVM: x86: async_pf_user: hook to fault handling and add ioctl
>>
>> Documentation/virt/kvm/api.rst | 35 ++++++
>> arch/x86/include/asm/kvm_host.h | 12 +-
>> arch/x86/kvm/Kconfig | 7 ++
>> arch/x86/kvm/lapic.c | 2 +
>> arch/x86/kvm/mmu/mmu.c | 68 ++++++++++-
>> arch/x86/kvm/x86.c | 101 +++++++++++++++-
>> arch/x86/kvm/x86.h | 2 +
>> include/linux/kvm_host.h | 30 +++++
>> include/linux/kvm_types.h | 1 +
>> include/trace/events/kvm.h | 50 +++++---
>> include/uapi/linux/kvm.h | 12 +-
>> virt/kvm/Kconfig | 3 +
>> virt/kvm/Makefile.kvm | 1 +
>> virt/kvm/async_pf.c | 2 +-
>> virt/kvm/async_pf_user.c | 197 ++++++++++++++++++++++++++++++++
>> virt/kvm/async_pf_user.h | 24 ++++
>> virt/kvm/kvm_main.c | 14 +++
>> 17 files changed, 535 insertions(+), 26 deletions(-)
>
> I am supportive of the idea, but there is way too much copy+paste in this series.
Hi Sean,
Yes, like I mentioned in the cover letter, I left the new implementation
isolated on purpose to make the scope of the change clear. There is
certainly lots of duplication that should be removed later on.
> And it's not just the code itself, it's all the structures and concepts. Off the
> top of my head, I can't think of any reason there needs to be a separate queue,
> separate lock(s), etc. The only difference between kernel APF and user APF is
> what chunk of code is responsible for faulting in the page.
There are two queues involved:
- "queue": stores in-flight faults. APF-kernel uses it to cancel all
works if needed. APF-user does not have a way to "cancel" userspace
works, but it uses the queue to look up the struct by the token when
userspace reports a completion.
- "ready": stores completed faults until KVM finds a chance to tell
guest about them.
I agree that the "ready" queue can be shared between APF-kernel and
-user as it's used in the same way. As for the "queue" queue, do you
think it's ok to process its elements differently based on the "type" of
them in a single loop [1] instead of having two separate queues?
[1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
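[Editor's note] For illustration, a single flush loop handling both kinds of items could look something like the sketch below. It loosely mirrors the loop at [1], with the hypothetical work->userfault flag from the earlier diff; real code would also have to deal with a late userspace completion racing with the free.

/*
 * Sketch only: one "queue" list, with the item type deciding how it is
 * torn down.
 */
while (!list_empty(&vcpu->async_pf.queue)) {
	struct kvm_async_pf *work =
		list_first_entry(&vcpu->async_pf.queue,
				 typeof(*work), queue);

	list_del(&work->queue);

	if (work->userfault) {
		/*
		 * Nothing to cancel in the kernel: userspace owns the I/O.
		 * Just drop the bookkeeping entry (modulo synchronizing
		 * with a late completion ioctl).
		 */
		kmem_cache_free(async_pf_cache, work);
		continue;
	}

	/* Kernel APF: cancel or flush the workqueue item as today. */
	if (cancel_work_sync(&work->work)) {
		mmput(work->mm);
		kmem_cache_free(async_pf_cache, work);
	}
}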
> I suspect a good place to start would be something along the lines of the below
> diff, and go from there. Given that KVM already needs to special case the fake
> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> flows about userspace async #PF.
That sounds sensible. I can certainly approach it in a "bottom up" way
by sparingly adding handling where it's different in APF-user rather
than adding it side by side and trying to merge common parts.
> I'm also not sure that injecting async #PF for all userfaults is desirable. For
> in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> be handled via async #PF. The obvious answer is to have userspace only enable
> userspace async #PF when it's useful, but "an all or nothing" approach isn't
> great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
> doesn't make sense either :-/
I wasn't able to locate the code that would check whether faulting would
sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]).
Please let me know if I'm missing something here.
[2]
https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
> Exiting to userspace in vCPU context is also kludgy. It makes sense for base
> userfault, because the vCPU can't make forward progress until the fault is
> resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
Even though we exit to userspace, in case of APF-user, userspace is
supposed to VM enter straight after scheduling the async job, which is
then executed concurrently with the vCPU.
> James' series. Anyways, it definitely doesn't make sense for async #PF, because
> the whole point is to let the vCPU run. Signalling userspace would definitely
> add complexity, but only because of the need to communicate the token and wait
> for userspace to consume said token. I'll think more on that.
By signalling userspace you mean a new non-exit-to-userspace mechanism
similar to UFFD? What advantage can you see in it over exiting to
userspace (which already exists in James's series)?
Thanks,
Nikita
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 0ee4816b079a..fc31b47cf9c5 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> * success, 'false' on failure (page fault has to be handled synchronously).
> */
> bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> - unsigned long hva, struct kvm_arch_async_pf *arch)
> + unsigned long hva, struct kvm_arch_async_pf *arch,
> + bool userfault)
> {
> struct kvm_async_pf *work;
>
> @@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->addr = hva;
> work->arch = *arch;
>
> - INIT_WORK(&work->work, async_pf_execute);
> -
> list_add_tail(&work->queue, &vcpu->async_pf.queue);
> vcpu->async_pf.queued++;
> work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);
>
> - schedule_work(&work->work);
> + if (userfault) {
> + work->userfault = true;
> + } else {
> + INIT_WORK(&work->work, async_pf_execute);
> + schedule_work(&work->work);
> + }
>
> return true;
> }
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> On 11/02/2025 21:17, Sean Christopherson wrote:
> > On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> > And it's not just the code itself, it's all the structures and concepts. Off the
> > top of my head, I can't think of any reason there needs to be a separate queue,
> > separate lock(s), etc. The only difference between kernel APF and user APF is
> > what chunk of code is responsible for faulting in the page.
>
> There are two queues involved:
> - "queue": stores in-flight faults. APF-kernel uses it to cancel all works
> if needed. APF-user does not have a way to "cancel" userspace works, but it
> uses the queue to look up the struct by the token when userspace reports a
> completion.
> - "ready": stores completed faults until KVM finds a chance to tell guest
> about them.
>
> I agree that the "ready" queue can be shared between APF-kernel and -user as
> it's used in the same way. As for the "queue" queue, do you think it's ok
> to process its elements differently based on the "type" of them in a single
> loop [1] instead of having two separate queues?
Yes.
> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
>
> > I suspect a good place to start would be something along the lines of the below
> > diff, and go from there. Given that KVM already needs to special case the fake
> > "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> > flows about userspace async #PF.
>
> That sounds sensible. I can certainly approach it in a "bottom up" way by
> sparingly adding handling where it's different in APF-user rather than
> adding it side by side and trying to merge common parts.
>
> > I'm also not sure that injecting async #PF for all userfaults is desirable. For
> > in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
> > userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> > be handled via async #PF. The obvious answer is to have userspace only enable
> > userspace async #PF when it's useful, but "an all or nothing" approach isn't
> > great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
> > doesn't make sense either :-/
>
> I wasn't able to locate the code that would check whether faulting would
> sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]). Please let
> me know if I'm missing something here.
kvm_can_do_async_pf() will be reached if and only if faulting in the memory
requires waiting. If a page is swapped out, but faulting it back in doesn't
require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
then the initial __kvm_faultin_pfn() with FOLL_NO_WAIT will succeed.
/*
* If resolving the page failed because I/O is needed to fault-in the
* page, then either set up an asynchronous #PF to do the I/O, or if
* doing an async #PF isn't possible, retry with I/O allowed. All
* other failures are terminal, i.e. retrying won't help.
*/
if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
return RET_PF_CONTINUE;
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
return RET_PF_RETRY;
} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
return RET_PF_RETRY;
}
}
The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
userspace will hand off to an I/O task. If the fault is handled synchronously,
KVM will needlessly inject a not-present #PF and a present IRQ.
But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
userspace faults are either synchronous or asynchronous.
> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
>
> > Exiting to userspace in vCPU context is also kludgy. It makes sense for base
> > userfault, because the vCPU can't make forward progress until the fault is
> > resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
>
> Even though we exit to userspace, in case of APF-user, userspace is supposed
> to VM enter straight after scheduling the async job, which is then executed
> concurrently with the vCPU.
>
> > James' series. Anyways, it definitely doesn't make sense for async #PF, because
> > the whole point is to let the vCPU run. Signalling userspace would definitely
> > add complexity, but only because of the need to communicate the token and wait
> > for userspace to consume said token. I'll think more on that.
>
> By signalling userspace you mean a new non-exit-to-userspace mechanism
> similar to UFFD?
Yes.
> What advantage can you see in it over exiting to userspace (which already exists
> in James's series)?
It doesn't exit to userspace :-)
If userspace simply wakes a different task in response to the exit, then KVM
should be able to wake said task, e.g. by signalling an eventfd, and resume the
guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
or not such an optimization is worth the complexity is an entirely different
question though.
On 19/02/2025 15:17, Sean Christopherson wrote:
> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>> On 11/02/2025 21:17, Sean Christopherson wrote:
>>> On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
>>> And it's not just the code itself, it's all the structures and concepts. Off the
>>> top of my head, I can't think of any reason there needs to be a separate queue,
>>> separate lock(s), etc. The only difference between kernel APF and user APF is
>>> what chunk of code is responsible for faulting in the page.
>>
>> There are two queues involved:
>> - "queue": stores in-flight faults. APF-kernel uses it to cancel all works
>> if needed. APF-user does not have a way to "cancel" userspace works, but it
>> uses the queue to look up the struct by the token when userspace reports a
>> completion.
>> - "ready": stores completed faults until KVM finds a chance to tell guest
>> about them.
>>
>> I agree that the "ready" queue can be shared between APF-kernel and -user as
>> it's used in the same way. As for the "queue" queue, do you think it's ok
>> to process its elements differently based on the "type" of them in a single
>> loop [1] instead of having two separate queues?
>
> Yes.
>
>> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
>>
>>> I suspect a good place to start would be something along the lines of the below
>>> diff, and go from there. Given that KVM already needs to special case the fake
>>> "wake all" items, I'm guessing it won't be terribly difficult to teach the core
>>> flows about userspace async #PF.
>>
>> That sounds sensible. I can certainly approach it in a "bottom up" way by
>> sparingly adding handling where it's different in APF-user rather than
>> adding it side by side and trying to merge common parts.
>>
>>> I'm also not sure that injecting async #PF for all userfaults is desirable. For
>>> in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
>>> userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
>>> be handled via async #PF. The obvious answer is to have userspace only enable
>>> userspace async #PF when it's useful, but "an all or nothing" approach isn't
>>> great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
>>> doesn't make sense either :-/
>>
>> I wasn't able to locate the code that would check whether faulting would
>> sleep in APF-kernel. KVM spins APF-kernel whenever it can ([2]). Please let
>> me know if I'm missing something here.
>
> kvm_can_do_async_pf() will be reached if and only if faulting in the memory
> requires waiting. If a page is swapped out, but faulting it back in doesn't
> require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
> then the initial __kvm_faultin_pfn() with FOLL_NO_WAIT will succeed.
>
> /*
> * If resolving the page failed because I/O is needed to fault-in the
> * page, then either set up an asynchronous #PF to do the I/O, or if
> * doing an async #PF isn't possible, retry with I/O allowed. All
> * other failures are terminal, i.e. retrying won't help.
> */
> if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
> return RET_PF_CONTINUE;
>
> if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
> trace_kvm_try_async_get_page(fault->addr, fault->gfn);
> if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
> trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
> kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> return RET_PF_RETRY;
> } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> return RET_PF_RETRY;
> }
> }
>
> The conundrum with userspace async #PF is that if userspace is given only a single
> bit per gfn to force an exit, then KVM won't be able to differentiate between
> "faults" that will be handled synchronously by the vCPU task, and faults that
> usersepace will hand off to an I/O task. If the fault is handled synchronously,
> KVM will needlessly inject a not-present #PF and a present IRQ.
Right, but from the guest's point of view, async PF means "it will
probably take a while for the host to get the page, so I may consider
doing something else in the meantime (ie schedule another process if
available)". If we are exiting to userspace, it isn't going to be quick
anyway, so we can consider all such faults "long" and warranting the
execution of the async PF protocol. So always injecting a not-present
#PF and page ready IRQ doesn't look too wrong in that case.
> But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
> userspace faults are either synchronous or asynchronous.
Yes, pretty much. The user will be choosing the extreme that is more
performant for their specific usecase.
>> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
>>
>>> Exiting to userspace in vCPU context is also kludgy. It makes sense for base
>>> userfault, because the vCPU can't make forward progress until the fault is
>>> resolved. Actually, I'm not even sure it makes sense there. I'll follow-up in
>>
>> Even though we exit to userspace, in case of APF-user, userspace is supposed
>> to VM enter straight after scheduling the async job, which is then executed
>> concurrently with the vCPU.
>>
>>> James' series. Anyways, it definitely doesn't make sense for async #PF, because
>>> the whole point is to let the vCPU run. Signalling userspace would definitely
>>> add complexity, but only because of the need to communicate the token and wait
>>> for userspace to consume said token. I'll think more on that.
>>
>> By signalling userspace you mean a new non-exit-to-userspace mechanism
>> similar to UFFD?
>
> Yes.
>
>> What advantage can you see in it over exiting to userspace (which already exists
>> in James's series)?
>
> It doesn't exit to userspace :-)
>
> If userspace simply wakes a different task in response to the exit, then KVM
> should be able to wake said task, e.g. by signalling an eventfd, and resume the
> guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
> or not such an optimization is worth the complexity is an entirely different
> question though.
This reminds me of the discussion about VMA-less UFFD that was coming up
several times, such as [1], but AFAIK hasn't materialised into something
actionable. I may be wrong, but James was looking into that and
couldn't figure out a way to scale it sufficiently for his use case and
had to stick with the VM-exit-based approach. Can you see a world where
VM-exit userfaults coexist with no-VM-exit way of handling async PFs?
[1]: https://lore.kernel.org/kvm/ZqwKuzfAs7pvdHAN@x1n/
On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
> On 19/02/2025 15:17, Sean Christopherson wrote:
> > On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> > The conundrum with userspace async #PF is that if userspace is given only a single
> > bit per gfn to force an exit, then KVM won't be able to differentiate between
> > "faults" that will be handled synchronously by the vCPU task, and faults that
> > usersepace will hand off to an I/O task. If the fault is handled synchronously,
> > KVM will needlessly inject a not-present #PF and a present IRQ.
>
> Right, but from the guest's point of view, async PF means "it will probably
> take a while for the host to get the page, so I may consider doing something
> else in the meantime (ie schedule another process if available)".

Except in this case, the guest never gets a chance to run, i.e. it can't do
something else. From the guest point of view, if KVM doesn't inject what is
effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
long time to execute.

> If we are exiting to userspace, it isn't going to be quick anyway, so we can
> consider all such faults "long" and warranting the execution of the async PF
> protocol. So always injecting a not-present #PF and page ready IRQ doesn't
> look too wrong in that case.

There is no "wrong", it's simply wasteful. The fact that the userspace exit is
"long" is completely irrelevant. Decompressing zswap is also slow, but it is
done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
#PFs.

In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
of that #PF.

> > > What advantage can you see in it over exiting to userspace (which already exists
> > > in James's series)?
> >
> > It doesn't exit to userspace :-)
> >
> > If userspace simply wakes a different task in response to the exit, then KVM
> > should be able to wake said task, e.g. by signalling an eventfd, and resume the
> > guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
> > or not such an optimization is worth the complexity is an entirely different
> > question though.
>
> This reminds me of the discussion about VMA-less UFFD that was coming up
> several times, such as [1], but AFAIK hasn't materialised into something
> actionable. I may be wrong, but James was looking into that and couldn't
> figure out a way to scale it sufficiently for his use case and had to stick
> with the VM-exit-based approach. Can you see a world where VM-exit
> userfaults coexist with no-VM-exit way of handling async PFs?

The issue with UFFD is that it's difficult to provide a generic "point of contact",
whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
per-vCPU buffers/structures to aid communication.

That said, supporting "exitless" KVM userfault would most definitely be premature
optimization without strong evidence it would benefit a real world use case.
On 20/02/2025 18:49, Sean Christopherson wrote:
> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>> usersepace will hand off to an I/O task. If the fault is handled synchronously,
>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>
>> Right, but from the guest's point of view, async PF means "it will probably
>> take a while for the host to get the page, so I may consider doing something
>> else in the meantime (ie schedule another process if available)".
>
> Except in this case, the guest never gets a chance to run, i.e. it can't do
> something else. From the guest point of view, if KVM doesn't inject what is
> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> long time to execute.

Sorry, I didn't get that. If userspace learns from the
kvm_run::memory_fault::flags that the exit is due to an async PF, it should
call kvm run immediately, inject the not-present PF and allow the guest to
reschedule. What do you mean by "the guest never gets a chance to run"?

>> If we are exiting to userspace, it isn't going to be quick anyway, so we can
>> consider all such faults "long" and warranting the execution of the async PF
>> protocol. So always injecting a not-present #PF and page ready IRQ doesn't
>> look too wrong in that case.
>
> There is no "wrong", it's simply wasteful. The fact that the userspace exit is
> "long" is completely irrelevant. Decompressing zswap is also slow, but it is
> done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
> #PFs.
>
> In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
> vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
> of that #PF.

Is this practically likely? At least in our scenario (Firecracker snapshot
restore) and probably in live migration postcopy, if a vCPU hits a fault,
it's probably because the content of the page is somewhere remote (eg on the
source machine or wherever the snapshot data is stored) and isn't going to be
available quickly. Conversely, if the page content is available, it must have
already been prepopulated into guest memory pagecache, the bit in the bitmap
is cleared and no exit to userspace occurs.

>>>> What advantage can you see in it over exiting to userspace (which already exists
>>>> in James's series)?
>>>
>>> It doesn't exit to userspace :-)
>>>
>>> If userspace simply wakes a different task in response to the exit, then KVM
>>> should be able to wake said task, e.g. by signalling an eventfd, and resume the
>>> guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
>>> or not such an optimization is worth the complexity is an entirely different
>>> question though.
>>
>> This reminds me of the discussion about VMA-less UFFD that was coming up
>> several times, such as [1], but AFAIK hasn't materialised into something
>> actionable. I may be wrong, but James was looking into that and couldn't
>> figure out a way to scale it sufficiently for his use case and had to stick
>> with the VM-exit-based approach. Can you see a world where VM-exit
>> userfaults coexist with no-VM-exit way of handling async PFs?
>
> The issue with UFFD is that it's difficult to provide a generic "point of contact",
> whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
> per-vCPU buffers/structures to aid communication.
>
> That said, supporting "exitless" KVM userfault would most definitely be premature
> optimization without strong evidence it would benefit a real world use case.

Does that mean that the "exitless" solution for async PF is a long-term one
(if required), while the short-term would still be "exitful" (if we find a
way to do it sensibly)?
On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
> On 20/02/2025 18:49, Sean Christopherson wrote:
> > On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
> > > On 19/02/2025 15:17, Sean Christopherson wrote:
> > > > On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> > > > The conundrum with userspace async #PF is that if userspace is given only a single
> > > > bit per gfn to force an exit, then KVM won't be able to differentiate between
> > > > "faults" that will be handled synchronously by the vCPU task, and faults that
> > > > usersepace will hand off to an I/O task. If the fault is handled synchronously,
> > > > KVM will needlessly inject a not-present #PF and a present IRQ.
> > >
> > > Right, but from the guest's point of view, async PF means "it will probably
> > > take a while for the host to get the page, so I may consider doing something
> > > else in the meantime (ie schedule another process if available)".
> >
> > Except in this case, the guest never gets a chance to run, i.e. it can't do
> > something else. From the guest point of view, if KVM doesn't inject what is
> > effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> > long time to execute.
>
> Sorry, I didn't get that. If userspace learns from the
> kvm_run::memory_fault::flags that the exit is due to an async PF, it should
> call kvm run immediately, inject the not-present PF and allow the guest to
> reschedule. What do you mean by "the guest never gets a chance to run"?

What I'm saying is that, as proposed, the API doesn't precisely tell userspace
an exit happened due to an "async #PF". KVM has absolutely zero clue as to
whether or not userspace is going to do an async #PF, or if userspace wants to
intercept the fault for some entirely different purpose.

> > > If we are exiting to userspace, it isn't going to be quick anyway, so we can
> > > consider all such faults "long" and warranting the execution of the async PF
> > > protocol. So always injecting a not-present #PF and page ready IRQ doesn't
> > > look too wrong in that case.
> >
> > There is no "wrong", it's simply wasteful. The fact that the userspace exit is
> > "long" is completely irrelevant. Decompressing zswap is also slow, but it is
> > done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
> > #PFs.
> >
> > In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
> > vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
> > of that #PF.
>
> Is this practically likely?

Yes, I think it's quite possible.

> At least in our scenario (Firecracker snapshot restore) and probably in live
> migration postcopy, if a vCPU hits a fault, it's probably because the content
> of the page is somewhere remote (eg on the source machine or wherever the
> snapshot data is stored) and isn't going to be available quickly.

Unless the remote page was already requested, e.g. by a different vCPU, or by a
prefetching algorithm.

> Conversely, if the page content is available, it must have already been
> prepopulated into guest memory pagecache, the bit in the bitmap is cleared
> and no exit to userspace occurs.

But that doesn't happen instantaneously. Even if the VMM somehow atomically
receives the page and marks it present, it's still possible for marking the page
present to race with KVM checking the bitmap.

> > > > > What advantage can you see in it over exiting to userspace (which already exists
> > > > > in James's series)?
> > > >
> > > > It doesn't exit to userspace :-)
> > > >
> > > > If userspace simply wakes a different task in response to the exit, then KVM
> > > > should be able to wake said task, e.g. by signalling an eventfd, and resume the
> > > > guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
> > > > or not such an optimization is worth the complexity is an entirely different
> > > > question though.
> > >
> > > This reminds me of the discussion about VMA-less UFFD that was coming up
> > > several times, such as [1], but AFAIK hasn't materialised into something
> > > actionable. I may be wrong, but James was looking into that and couldn't
> > > figure out a way to scale it sufficiently for his use case and had to stick
> > > with the VM-exit-based approach. Can you see a world where VM-exit
> > > userfaults coexist with no-VM-exit way of handling async PFs?
> >
> > The issue with UFFD is that it's difficult to provide a generic "point of contact",
> > whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
> > per-vCPU buffers/structures to aid communication.
> >
> > That said, supporting "exitless" KVM userfault would most definitely be premature
> > optimization without strong evidence it would benefit a real world use case.
>
> Does that mean that the "exitless" solution for async PF is a long-term one
> (if required), while the short-term would still be "exitful" (if we find a
> way to do it sensibly)?

My question on exitless support was purely exploratory, just ignore it for now.
On 26/02/2025 00:58, Sean Christopherson wrote:
> On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
>> On 20/02/2025 18:49, Sean Christopherson wrote:
>>> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>>>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>>>> usersepace will hand off to an I/O task. If the fault is handled synchronously,
>>>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>>>
>>>> Right, but from the guest's point of view, async PF means "it will probably
>>>> take a while for the host to get the page, so I may consider doing something
>>>> else in the meantime (ie schedule another process if available)".
>>>
>>> Except in this case, the guest never gets a chance to run, i.e. it can't do
>>> something else. From the guest point of view, if KVM doesn't inject what is
>>> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
>>> long time to execute.
>>
>> Sorry, I didn't get that. If userspace learns from the
>> kvm_run::memory_fault::flags that the exit is due to an async PF, it should
>> call kvm run immediately, inject the not-present PF and allow the guest to
>> reschedule. What do you mean by "the guest never gets a chance to run"?
>
> What I'm saying is that, as proposed, the API doesn't precisely tell userspace
> an exit happened due to an "async #PF". KVM has absolutely zero clue as to
> whether or not userspace is going to do an async #PF, or if userspace wants to
> intercept the fault for some entirely different purpose.
Userspace is supposed to know whether the PF is async from the dedicated
flag added in the memory_fault structure:
KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM managed to
inject page-not-present. Are you saying it isn't sufficient?
@@ -4396,6 +4412,35 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
{
bool async;
+ /* Pre-check for userfault and bail out early. */
+ if (gfn_has_userfault(fault->slot->kvm, fault->gfn)) {
+ bool report_async = false;
+ u32 token = 0;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ !fault->prefetch && kvm_can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(fault->addr, fault->gfn, 1);
+ if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
+ trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn, 1);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return RET_PF_RETRY;
+ } else if (kvm_can_deliver_async_pf(vcpu) &&
+ kvm_arch_setup_async_pf_user(vcpu, fault, &token)) {
+ report_async = true;
+ }
+ }
+
+ fault->pfn = KVM_PFN_ERR_USERFAULT;
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+
+ if (report_async) {
+ vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER;
+ vcpu->run->memory_fault.async_pf_user_token = token;
+ }
+
+ return -EFAULT;
+ }
+
>>>> If we are exiting to userspace, it isn't going to be quick anyway, so we can
>>>> consider all such faults "long" and warranting the execution of the async PF
>>>> protocol. So always injecting a not-present #PF and page ready IRQ doesn't
>>>> look too wrong in that case.
>>>
>>> There is no "wrong", it's simply wasteful. The fact that the userspace exit is
>>> "long" is completely irrelevant. Decompressing zswap is also slow, but it is
>>> done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
>>> #PFs.
>>>
>>> In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
>>> vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
>>> of that #PF.
>>
>> Is this practically likely?
>
> Yes, I think's it's quite possible.
>
>> At least in our scenario (Firecracker snapshot restore) and probably in live
>> migration postcopy, if a vCPU hits a fault, it's probably because the content
>> of the page is somewhere remote (eg on the source machine or wherever the
>> snapshot data is stored) and isn't going to be available quickly.
>
> Unless the remote page was already requested, e.g. by a different vCPU, or by a
> prefetching algorithim.
>
>> Conversely, if the page content is available, it must have already been
>> prepopulated into guest memory pagecache, the bit in the bitmap is cleared
>> and no exit to userspace occurs.
>
> But that doesn't happen instantaneously. Even if the VMM somehow atomically
> receives the page and marks it present, it's still possible for marking the page
> present to race with KVM checking the bitmap.
That looks like a generic problem of the VM-exit fault handling. Eg
when one vCPU exits, userspace handles the fault and races setting the
bitmap with another vCPU that is about to fault the same page, which may
cause a spurious exit.
On the other hand, is it malignant? The only downside is additional
overhead of the async PF protocol, but if the race occurs infrequently,
it shouldn't be a problem.
>>>>>> What advantage can you see in it over exiting to userspace (which already exists
>>>>>> in James's series)?
>>>>>
>>>>> It doesn't exit to userspace :-)
>>>>>
>>>>> If userspace simply wakes a different task in response to the exit, then KVM
>>>>> should be able to wake said task, e.g. by signalling an eventfd, and resume the
>>>>> guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
>>>>> or not such an optimization is worth the complexity is an entirely different
>>>>> question though.
>>>>
>>>> This reminds me of the discussion about VMA-less UFFD that was coming up
>>>> several times, such as [1], but AFAIK hasn't materialised into something
>>>> actionable. I may be wrong, but James was looking into that and couldn't
>>>> figure out a way to scale it sufficiently for his use case and had to stick
>>>> with the VM-exit-based approach. Can you see a world where VM-exit
>>>> userfaults coexist with no-VM-exit way of handling async PFs?
>>>
>>> The issue with UFFD is that it's difficult to provide a generic "point of contact",
>>> whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
>>> per-vCPU buffers/structures to aid communication.
>>>
>>> That said, supporting "exitless" KVM userfault would most definitely be premature
>>> optimization without strong evidence it would benefit a real world use case.
>>
>> Does that mean that the "exitless" solution for async PF is a long-term one
>> (if required), while the short-term would still be "exitful" (if we find a
>> way to do it sensibly)?
>
> My question on exitless support was purely exploratory, just ignore it for now.
On Wed, Feb 26, 2025, Nikita Kalyazin wrote:
> On 26/02/2025 00:58, Sean Christopherson wrote:
> > On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
> > > On 20/02/2025 18:49, Sean Christopherson wrote:
> > > > On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
> > > > > On 19/02/2025 15:17, Sean Christopherson wrote:
> > > > > > On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> > > > > > The conundrum with userspace async #PF is that if userspace is given only a single
> > > > > > bit per gfn to force an exit, then KVM won't be able to differentiate between
> > > > > > "faults" that will be handled synchronously by the vCPU task, and faults that
> > > > > > usersepace will hand off to an I/O task. If the fault is handled synchronously,
> > > > > > KVM will needlessly inject a not-present #PF and a present IRQ.
> > > > >
> > > > > Right, but from the guest's point of view, async PF means "it will probably
> > > > > take a while for the host to get the page, so I may consider doing something
> > > > > else in the meantime (ie schedule another process if available)".
> > > >
> > > > Except in this case, the guest never gets a chance to run, i.e. it can't do
> > > > something else. From the guest point of view, if KVM doesn't inject what is
> > > > effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
> > > > long time to execute.
> > >
> > > Sorry, I didn't get that. If userspace learns from the
> > > kvm_run::memory_fault::flags that the exit is due to an async PF, it should
> > > call kvm run immediately, inject the not-present PF and allow the guest to
> > > reschedule. What do you mean by "the guest never gets a chance to run"?
> >
> > What I'm saying is that, as proposed, the API doesn't precisely tell userspace
^^^^^^^^^
KVM
> > an exit happened due to an "async #PF". KVM has absolutely zero clue as to
> > whether or not userspace is going to do an async #PF, or if userspace wants to
> > intercept the fault for some entirely different purpose.
>
> Userspace is supposed to know whether the PF is async from the dedicated
> flag added in the memory_fault structure:
> KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM managed to
> inject page-not-present. Are you saying it isn't sufficient?
Gah, sorry, typo. The API doesn't tell *KVM* that userfault exit is due to an
async #PF.
> > Unless the remote page was already requested, e.g. by a different vCPU, or by a
> > prefetching algorithim.
> >
> > > Conversely, if the page content is available, it must have already been
> > > prepopulated into guest memory pagecache, the bit in the bitmap is cleared
> > > and no exit to userspace occurs.
> >
> > But that doesn't happen instantaneously. Even if the VMM somehow atomically
> > receives the page and marks it present, it's still possible for marking the page
> > present to race with KVM checking the bitmap.
>
> That looks like a generic problem of the VM-exit fault handling. Eg when
Heh, it's a generic "problem" for faults in general. E.g. modern x86 CPUs will
take "spurious" page faults on write accesses if a PTE is writable in memory but
the CPU has a read-only mapping cached in its TLB.
It's all a matter of cost. E.g. pre-Nehalem Intel CPUs didn't take such spurious
read-only faults as they would re-walk the in-memory page tables, but that ended
up being a net negative because the cost of re-walking for all read-only faults
outweighed the benefits of avoiding spurious faults in the unlikely scenario the
fault had already been fixed.
For a spurious async #PF + IRQ, the cost could be signficant, e.g. due to causing
unwanted context switches in the guest, in addition to the raw overhead of the
faults, interrupts, and exits.
> one vCPU exits, userspace handles the fault and races setting the bitmap
> with another vCPU that is about to fault the same page, which may cause a
> spurious exit.
>
> On the other hand, is it malignant? The only downside is additional
> overhead of the async PF protocol, but if the race occurs infrequently, it
> shouldn't be a problem.
When it comes to uAPI, I want to try and avoid statements along the lines of
"IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI,
I wouldn't care as much, i.e. I'd be much more willing to iterate as needed.
I'm not saying we should go straight for a complex implementation. Quite the
opposite. But I do want us to consider the possible ramifications of using a
single bit for all userfaults, so that we can at least try to design something
that is extensible and won't be a pain to maintain.
On 27/02/2025 16:44, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Nikita Kalyazin wrote:
>> On 26/02/2025 00:58, Sean Christopherson wrote:
>>> On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
>>>> On 20/02/2025 18:49, Sean Christopherson wrote:
>>>>> On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
>>>>>> On 19/02/2025 15:17, Sean Christopherson wrote:
>>>>>>> On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
>>>>>>> The conundrum with userspace async #PF is that if userspace is given only a single
>>>>>>> bit per gfn to force an exit, then KVM won't be able to differentiate between
>>>>>>> "faults" that will be handled synchronously by the vCPU task, and faults that
>>>>>>> usersepace will hand off to an I/O task. If the fault is handled synchronously,
>>>>>>> KVM will needlessly inject a not-present #PF and a present IRQ.
>>>>>>
>>>>>> Right, but from the guest's point of view, async PF means "it will probably
>>>>>> take a while for the host to get the page, so I may consider doing something
>>>>>> else in the meantime (ie schedule another process if available)".
>>>>>
>>>>> Except in this case, the guest never gets a chance to run, i.e. it can't do
>>>>> something else. From the guest point of view, if KVM doesn't inject what is
>>>>> effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
>>>>> long time to execute.
>>>>
>>>> Sorry, I didn't get that. If userspace learns from the
>>>> kvm_run::memory_fault::flags that the exit is due to an async PF, it should
>>>> call kvm run immediately, inject the not-present PF and allow the guest to
>>>> reschedule. What do you mean by "the guest never gets a chance to run"?
>>>
>>> What I'm saying is that, as proposed, the API doesn't precisely tell userspace
> ^^^^^^^^^
> KVM
>>> an exit happened due to an "async #PF". KVM has absolutely zero clue as to
>>> whether or not userspace is going to do an async #PF, or if userspace wants to
>>> intercept the fault for some entirely different purpose.
>>
>> Userspace is supposed to know whether the PF is async from the dedicated
>> flag added in the memory_fault structure:
>> KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER. It will be set when KVM managed to
>> inject page-not-present. Are you saying it isn't sufficient?
>
> Gah, sorry, typo. The API doesn't tell *KVM* that userfault exit is due to an
> async #PF.
>
>>> Unless the remote page was already requested, e.g. by a different vCPU, or by a
>>> prefetching algorithim.
>>>
>>>> Conversely, if the page content is available, it must have already been
>>>> prepopulated into guest memory pagecache, the bit in the bitmap is cleared
>>>> and no exit to userspace occurs.
>>>
>>> But that doesn't happen instantaneously. Even if the VMM somehow atomically
>>> receives the page and marks it present, it's still possible for marking the page
>>> present to race with KVM checking the bitmap.
>>
>> That looks like a generic problem of the VM-exit fault handling. Eg when
>
> Heh, it's a generic "problem" for faults in general. E.g. modern x86 CPUs will
> take "spurious" page faults on write accesses if a PTE is writable in memory but
> the CPU has a read-only mapping cached in its TLB.
>
> It's all a matter of cost. E.g. pre-Nehalem Intel CPUs didn't take such spurious
> read-only faults as they would re-walk the in-memory page tables, but that ended
> up being a net negative because the cost of re-walking for all read-only faults
> outweighed the benefits of avoiding spurious faults in the unlikely scenario the
> fault had already been fixed.
>
> For a spurious async #PF + IRQ, the cost could be signficant, e.g. due to causing
> unwanted context switches in the guest, in addition to the raw overhead of the
> faults, interrupts, and exits.
>
>> one vCPU exits, userspace handles the fault and races setting the bitmap
>> with another vCPU that is about to fault the same page, which may cause a
>> spurious exit.
>>
>> On the other hand, is it malignant? The only downside is additional
>> overhead of the async PF protocol, but if the race occurs infrequently, it
>> shouldn't be a problem.
>
> When it comes to uAPI, I want to try and avoid statements along the lines of
> "IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI,
> I wouldn't care as much, i.e. I'd be much more willing iterate as needed.
>
> I'm not saying we should go straight for a complex implementation. Quite the
> opposite. But I do want us to consider the possible ramifications of using a
> single bit for all userfaults, so that we can at least try to design something
> that is extensible and won't be a pain to maintain.

So you would've liked the "two-bit per gfn" approach more, as in: provide 2
interception points, for sync and async exits, with the former chosen by
userspace when it "knows" that the content is already in memory? What makes
it a conundrum then? It looks like an incremental change to what has already
been proposed. There is a complication that 2-bit operations aren't atomic,
but even 1 bit is racy between KVM and userspace.
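[Editor's note] Purely as an illustration of the "two interception points" idea (this is not code from the series), the __kvm_faultin_pfn() pre-check quoted earlier could be imagined with a per-gfn flags lookup instead of a single bit. gfn_userfault_flags(), KVM_GFN_USERFAULT_SYNC and KVM_GFN_USERFAULT_ASYNC are hypothetical names, and the repeated-fault/APF-halt handling is omitted for brevity.

/*
 * Hypothetical two-flag variant of the userfault pre-check: only gfns
 * marked "async" pay for the async PF protocol; gfns marked "sync" take
 * a plain userfault exit.  Simplified from the snippet quoted earlier.
 */
unsigned long uf_flags = gfn_userfault_flags(fault->slot->kvm, fault->gfn);

if (uf_flags & (KVM_GFN_USERFAULT_SYNC | KVM_GFN_USERFAULT_ASYNC)) {
	bool report_async = false;
	u32 token = 0;

	if ((uf_flags & KVM_GFN_USERFAULT_ASYNC) &&
	    !fault->prefetch && kvm_can_do_async_pf(vcpu) &&
	    kvm_arch_setup_async_pf_user(vcpu, fault, &token))
		report_async = true;

	fault->pfn = KVM_PFN_ERR_USERFAULT;
	kvm_mmu_prepare_memory_fault_exit(vcpu, fault);

	if (report_async) {
		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER;
		vcpu->run->memory_fault.async_pf_user_token = token;
	}

	return -EFAULT;
}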
On Thu, Feb 27, 2025, Nikita Kalyazin wrote:
> On 27/02/2025 16:44, Sean Christopherson wrote:
> > When it comes to uAPI, I want to try and avoid statements along the lines of
> > "IF 'x' holds true, then 'y' SHOULDN'T be a problem". If this didn't impact uAPI,
> > I wouldn't care as much, i.e. I'd be much more willing iterate as needed.
> >
> > I'm not saying we should go straight for a complex implementation. Quite the
> > opposite. But I do want us to consider the possible ramifications of using a
> > single bit for all userfaults, so that we can at least try to design something
> > that is extensible and won't be a pain to maintain.
>
> So you would've liked more the "two-bit per gfn" approach as in: provide 2
> interception points, for sync and async exits, with the former chosen by
> userspace when it "knows" that the content is already in memory?

No, all I'm saying is I want people to think about what the future will look
like, to minimize the chances of ending up with a mess.