Firecracker currently allows populating guest memory from a separate
process via UserfaultFD [1]. This helps keep the VMM codebase and
functionality concise and generic, while offloading the logic of
obtaining guest memory to another process. UserfaultFD is currently not
supported for guest_memfd, because it binds to a VMA, while guest_memfd
does not need to (or cannot) be mapped to userspace, especially for
private memory. [2] proposes an alternative to UserfaultFD for
intercepting stage-2 faults, while this series conceptually complements
it with the ability to populate guest memory backed by guest_memfd for
`KVM_X86_SW_PROTECTED_VM` VMs.

Patches 1-3 add a new ioctl, `KVM_GUEST_MEMFD_POPULATE`, that uses a
vendor-agnostic implementation of the `post_populate` callback.

Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
has been prohibited by [3], but I have not been able to locate the exact
justification for the requirement.

Questions:
 - Does exposing a generic population interface via ioctl look sensible
   in this form?
 - Is there a path where the "only the VMM can call the KVM API"
   requirement is relaxed? If not, what is the recommended efficient
   alternative for populating guest memory from outside the VMM?

[1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
[2]: https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4e4c4fca5be806b888d606894d914847e82d78

Nikita

Nikita Kalyazin (4):
  KVM: guest_memfd: add generic post_populate callback
  KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
  KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
  KVM: document KVM_GUEST_MEMFD_POPULATE ioctl

 Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
 include/linux/kvm_host.h       |  3 +++
 include/uapi/linux/kvm.h       |  9 +++++++++
 virt/kvm/guest_memfd.c         | 28 ++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c            | 19 ++++++++++++++++++-
 5 files changed, 81 insertions(+), 1 deletion(-)


base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
--
2.40.1
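[For illustration, a userspace invocation of the new ioctl might look roughly like the sketch below. The argument struct layout and field names, and the choice of the VM fd as the ioctl target, are assumptions made here for illustration only; the actual UAPI added to include/uapi/linux/kvm.h by the series is not quoted in this thread, so this does not compile without it.]

/*
 * Hypothetical sketch only: the struct layout and field names below are
 * invented for illustration; the real UAPI from the series is not shown
 * in this thread. Issuing the ioctl on the VM fd is also an assumption.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct guest_memfd_populate_args {      /* hypothetical layout */
        uint32_t guest_memfd;           /* fd from KVM_CREATE_GUEST_MEMFD */
        uint32_t flags;
        uint64_t offset;                /* byte offset into the guest_memfd */
        uint64_t size;                  /* number of bytes to populate */
        uint64_t src;                   /* userspace buffer with the contents */
};

static int populate_range(int vm_fd, int gmem_fd, uint64_t offset,
                          void *src, uint64_t size)
{
        struct guest_memfd_populate_args args = {
                .guest_memfd = gmem_fd,
                .offset      = offset,
                .size        = size,
                .src         = (uint64_t)(uintptr_t)src,
        };

        /* KVM_GUEST_MEMFD_POPULATE is defined by the series' uapi header. */
        return ioctl(vm_fd, KVM_GUEST_MEMFD_POPULATE, &args);
}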
On 10/24/24 11:54, Nikita Kalyazin wrote:
> Firecracker currently allows populating guest memory from a separate
> process via UserfaultFD [1]. This helps keep the VMM codebase and
> functionality concise and generic, while offloading the logic of
> obtaining guest memory to another process. UserfaultFD is currently not
> supported for guest_memfd, because it binds to a VMA, while guest_memfd
> does not need to (or cannot) be mapped to userspace, especially for
> private memory. [2] proposes an alternative to UserfaultFD for
> intercepting stage-2 faults, while this series conceptually complements
> it with the ability to populate guest memory backed by guest_memfd for
> `KVM_X86_SW_PROTECTED_VM` VMs.
>
> Patches 1-3 add a new ioctl, `KVM_GUEST_MEMFD_POPULATE`, that uses a
> vendor-agnostic implementation of the `post_populate` callback.
>
> Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
> has been prohibited by [3], but I have not been able to locate the exact
> justification for the requirement.

The justification is that the "struct kvm" has a long-lived tie to a
host process's address space.

Invoking ioctls like KVM_SET_USER_MEMORY_REGION and KVM_RUN from
different processes would make things very messy, because it is not
clear which mm you are working with: the MMU notifier is registered for
kvm->mm, but some functions, such as get_user_pages(), do not take an mm
and always operate on current->mm.

In your case, it should be enough to add an ioctl on the guest_memfd
instead?

But the real question is, what are you using KVM_X86_SW_PROTECTED_VM
for?

Paolo

> Questions:
>  - Does exposing a generic population interface via ioctl look sensible
>    in this form?
>  - Is there a path where the "only the VMM can call the KVM API"
>    requirement is relaxed? If not, what is the recommended efficient
>    alternative for populating guest memory from outside the VMM?
>
> [1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
> [2]: https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4e4c4fca5be806b888d606894d914847e82d78
>
> Nikita
>
> Nikita Kalyazin (4):
>   KVM: guest_memfd: add generic post_populate callback
>   KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
>   KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
>   KVM: document KVM_GUEST_MEMFD_POPULATE ioctl
>
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/linux/kvm_host.h       |  3 +++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  virt/kvm/guest_memfd.c         | 28 ++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c            | 19 ++++++++++++++++++-
>  5 files changed, 81 insertions(+), 1 deletion(-)
>
>
> base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
On 20/11/2024 13:55, Paolo Bonzini wrote:
>> Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
>> has been prohibited by [3], but I have not been able to locate the
>> exact justification for the requirement.
>
> The justification is that the "struct kvm" has a long-lived tie to a
> host process's address space.
>
> Invoking ioctls like KVM_SET_USER_MEMORY_REGION and KVM_RUN from
> different processes would make things very messy, because it is not
> clear which mm you are working with: the MMU notifier is registered for
> kvm->mm, but some functions, such as get_user_pages(), do not take an mm
> and always operate on current->mm.

That's fair, thanks for the explanation.

> In your case, it should be enough to add an ioctl on the guest_memfd
> instead?

That's right, that would be sufficient indeed. Is that something that
could be considered? Would that be some non-KVM API, with guest_memfd
moving to an mm library?

> But the real question is, what are you using KVM_X86_SW_PROTECTED_VM
> for?

The concrete use case is VM restoration from a snapshot in Firecracker
[1]. In the current setup, the VMM registers a UFFD against the guest
memory and sends the UFFD handle to an external process that knows how
to obtain the snapshotted memory. We would like to preserve these
semantics, but also remove the guest memory from the direct map [2].
Mimicking this with guest_memfd would mean sending some form of a
guest_memfd handle to that process, which would use it to populate
guest_memfd.

[1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md#userfaultfd
[2]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/T/

> Paolo
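[For reference, the existing UFFD-based flow described above looks roughly like the sketch below on the handler-process side: it receives the userfaultfd from the VMM, reads fault events, and resolves them with UFFDIO_COPY from the snapshot contents. This is a minimal sketch; the unix-socket fd passing, the event loop, error handling, and huge-page handling are omitted, and names are placeholders.]

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

static void serve_one_fault(int uffd, char *snapshot_base,
                            uint64_t region_start, size_t page_size)
{
        struct uffd_msg msg;

        /* Block until a page of the registered range is faulted on. */
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
            msg.event != UFFD_EVENT_PAGEFAULT)
                return;

        uint64_t fault_addr = msg.arg.pagefault.address & ~(page_size - 1);

        /* Resolve the fault with the corresponding page of the snapshot. */
        struct uffdio_copy copy = {
                .dst = fault_addr,
                .src = (uint64_t)(uintptr_t)(snapshot_base +
                                             (fault_addr - region_start)),
                .len = page_size,
                .mode = 0,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
}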
On 24/10/2024 10:54, Nikita Kalyazin wrote:
> [2] proposes an alternative to UserfaultFD for intercepting stage-2
> faults, while this series conceptually complements it with the ability
> to populate guest memory backed by guest_memfd for
> `KVM_X86_SW_PROTECTED_VM` VMs.

+David
+Sean
+mm

While measuring memory population performance of guest_memfd using this
series, I noticed that guest_memfd population takes longer than my
baseline, which is filling anonymous private memory via UFFDIO_COPY.

I am using x86_64 for my measurements and a 3 GiB memory region:
 - anon/private UFFDIO_COPY: 940 ms
 - guest_memfd: 1371 ms (+46%)

It turns out that the effect is observable not only for guest_memfd, but
also for any type of shared memory, eg memfd or anonymous memory mapped
as shared. Below are measurements of a plain mmap(MAP_POPULATE)
operation:

mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
vs
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
     MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

Results:
 - MAP_PRIVATE: 968 ms
 - MAP_SHARED: 1646 ms

I am seeing this effect on a range of kernels. The oldest I used was
5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).

When profiling with perf, I observe the following hottest operations
(kvm-next). Attaching the full distributions at the end of the email.

MAP_PRIVATE:
 - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)

MAP_SHARED:
 - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is the atomic
   setting of the PG_uptodate bit
 - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)

Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate,
which sets the PG_uptodate bit with a plain (non-atomic) store, while
MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
PG_uptodate bit atomically.

While this logic is intuitive, its performance effect is more
significant than I would expect.

The questions are:
 - Is this a well-known behaviour?
 - Is there a way to mitigate that, ie make shared memory (including
   guest_memfd) population faster/comparable to private memory?
Nikita

Appendix: full call trees obtained via perf

MAP_PRIVATE:

   - 87.97% __mmap
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        vm_mmap_pgoff
        __mm_populate
        populate_vma_page_range
      - __get_user_pages
         - 77.94% handle_mm_fault
            - 76.90% __handle_mm_fault
               - 72.70% do_anonymous_page
                  - 31.92% vma_alloc_folio_noprof
                     - 30.74% alloc_pages_mpol_noprof
                        - 29.60% __alloc_pages_noprof
                           - 28.40% get_page_from_freelist
                                19.72% clear_page_erms
                              - 3.00% __rmqueue_pcplist
                                   __mod_zone_page_state
                             1.18% _raw_spin_trylock
                  - 20.03% __pte_offset_map_lock
                     - 15.96% _raw_spin_lock
                          1.50% preempt_count_add
                     - 2.27% __pte_offset_map
                          __rcu_read_lock
                  - 7.22% __folio_batch_add_and_move
                     - 4.68% folio_batch_move_lru
                        - 3.77% lru_add
                           + 0.95% __mod_zone_page_state
                             0.86% __mod_node_page_state
                        0.84% folios_put_refs
                        0.55% check_preemption_disabled
                  - 2.85% folio_add_new_anon_rmap
                     - __folio_mod_stat
                          __mod_node_page_state
                  - 1.15% pte_offset_map_nolock
                       __pte_offset_map
         - 7.59% follow_page_pte
            - 4.56% __pte_offset_map_lock
               - 2.27% _raw_spin_lock
                    preempt_count_add
                 1.13% __pte_offset_map
              0.75% folio_mark_accessed

MAP_SHARED:

   - 77.89% __mmap
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        vm_mmap_pgoff
        __mm_populate
        populate_vma_page_range
      - __get_user_pages
         - 72.11% handle_mm_fault
            - 71.67% __handle_mm_fault
               - 69.62% do_fault
                  - 44.61% __do_fault
                     - shmem_fault
                        - 43.94% shmem_get_folio_gfp
                           - 17.20% shmem_alloc_and_add_folio.constprop.0
                              - 5.10% shmem_alloc_folio
                                 - 4.58% folio_alloc_mpol_noprof
                                    - alloc_pages_mpol_noprof
                                       - 4.00% __alloc_pages_noprof
                                          - 3.31% get_page_from_freelist
                                               1.24% __rmqueue_pcplist
                              - 5.07% shmem_add_to_page_cache
                                 - 1.44% __mod_node_page_state
                                      0.61% check_preemption_disabled
                                   0.78% xas_store
                                   0.74% xas_find_conflict
                                   0.66% _raw_spin_lock_irq
                              - 3.96% __folio_batch_add_and_move
                                 - 2.41% folio_batch_move_lru
                                      1.88% lru_add
                              - 1.56% shmem_inode_acct_blocks
                                 - 1.24% __dquot_alloc_space
                                    - 0.77% inode_add_bytes
                                         _raw_spin_lock
                              - 0.77% shmem_recalc_inode
                                   _raw_spin_lock
                             10.98% clear_page_erms
                           - 1.17% filemap_get_entry
                                0.78% xas_load
                  - 20.26% filemap_map_pages
                     - 12.23% next_uptodate_folio
                        - 1.27% xas_find
                             xas_load
                     - 1.16% __pte_offset_map_lock
                          0.59% _raw_spin_lock
                  - 3.48% finish_fault
                     - 1.28% set_pte_range
                          0.96% folio_add_file_rmap_ptes
                     - 0.91% __pte_offset_map_lock
                          0.54% _raw_spin_lock
                       0.57% pte_offset_map_nolock
         - 4.11% follow_page_pte
            - 2.36% __pte_offset_map_lock
               - 1.32% _raw_spin_lock
                    preempt_count_add
                 0.54% __pte_offset_map
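[A self-contained harness for the plain MAP_PRIVATE vs MAP_SHARED comparison quoted above could look like the sketch below. The exact program used for the numbers in this thread is not shown, so treat this as an approximation of the methodology rather than the original benchmark.]

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (3ull << 30)   /* 3 GiB, as in the measurements above */

static double time_populate(int vis_flag)
{
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       vis_flag | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        clock_gettime(CLOCK_MONOTONIC, &b);

        if (p == MAP_FAILED)
                return -1.0;
        munmap(p, SIZE);
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
        printf("MAP_PRIVATE: %.0f ms\n", time_populate(MAP_PRIVATE));
        printf("MAP_SHARED:  %.0f ms\n", time_populate(MAP_SHARED));
        return 0;
}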
On 20.11.24 13:09, Nikita Kalyazin wrote:
> On 24/10/2024 10:54, Nikita Kalyazin wrote:
>> [2] proposes an alternative to UserfaultFD for intercepting stage-2
>> faults, while this series conceptually complements it with the ability
>> to populate guest memory backed by guest_memfd for
>> `KVM_X86_SW_PROTECTED_VM` VMs.
>
> +David
> +Sean
> +mm

Hi!

> While measuring memory population performance of guest_memfd using this
> series, I noticed that guest_memfd population takes longer than my
> baseline, which is filling anonymous private memory via UFFDIO_COPY.
>
> I am using x86_64 for my measurements and a 3 GiB memory region:
>  - anon/private UFFDIO_COPY: 940 ms
>  - guest_memfd: 1371 ms (+46%)
>
> It turns out that the effect is observable not only for guest_memfd, but
> also for any type of shared memory, eg memfd or anonymous memory mapped
> as shared. Below are measurements of a plain mmap(MAP_POPULATE)
> operation:
>
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
>      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> vs
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
>      MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>
> Results:
>  - MAP_PRIVATE: 968 ms
>  - MAP_SHARED: 1646 ms

At least here it is expected to some degree: as soon as the page cache
is involved, map/unmap gets slower, because we are effectively
maintaining two data structures (page tables + page cache) instead of
only a single one (page tables).

Can you make sure that THP/large folios don't interfere in your
experiments (e.g., madvise(MADV_NOHUGEPAGE))?

> I am seeing this effect on a range of kernels. The oldest I used was
> 5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
>
> When profiling with perf, I observe the following hottest operations
> (kvm-next). Attaching the full distributions at the end of the email.
>
> MAP_PRIVATE:
>  - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
>
> MAP_SHARED:
>  - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is the atomic
>    setting of the PG_uptodate bit
>  - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)

Interesting.

> Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate,
> which sets the PG_uptodate bit with a plain (non-atomic) store, while
> MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
> PG_uptodate bit atomically.
>
> While this logic is intuitive, its performance effect is more
> significant than I would expect.

Yes. How much of the performance difference would remain if you hack out
the atomic op just to play with it? I suspect there will still be some
difference.

> The questions are:
>  - Is this a well-known behaviour?
>  - Is there a way to mitigate that, ie make shared memory (including
>    guest_memfd) population faster/comparable to private memory?

Likely. But your experiment above measures something different from what
guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
would assume guest_memfd will be faster than MAP_POPULATE.

How do you end up allocating memory for guest_memfd? Using a simple
fallocate()?

Note that we might improve allocation times with guest_memfd when
allocating larger folios.

--
Cheers,

David / dhildenb
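[One way to rule out THP per-mapping, rather than globally, is sketched below: madvise(MADV_NOHUGEPAGE) is applied before any page is touched, and pages are then faulted in manually instead of via MAP_POPULATE. This mirrors the intent of the suggestion above and is not a quoted test; the 4096-byte page size is an assumption for x86_64.]

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define SIZE (3ull << 30)

static void *populate_no_thp(void)
{
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return NULL;

        /* Must be applied before the first fault to take effect. */
        madvise(p, SIZE, MADV_NOHUGEPAGE);

        /* Touch every page to trigger population, one base page at a time. */
        for (size_t off = 0; off < SIZE; off += 4096)
                p[off] = 0;

        return p;
}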
>> The questions are:
>>  - Is this a well-known behaviour?
>>  - Is there a way to mitigate that, ie make shared memory (including
>>    guest_memfd) population faster/comparable to private memory?
>
> Likely. But your experiment above measures something different from what
> guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
> would assume guest_memfd will be faster than MAP_POPULATE.
>
> How do you end up allocating memory for guest_memfd? Using a simple
> fallocate()?

Heh, now I spot that your comment was a reply to a series.

If your ioctl is supposed to do more than "allocating memory" like
MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice,
because for allocating memory we would want to use fallocate() instead.
I assume you want to "allocate+copy"?

I'll note that, as we're moving in the direction of moving guest_memfd.c
into mm/guestmem.c, we'll likely want to avoid "KVM_*" ioctls and think
about something generic.

Any clue how your new ioctl will interact with the WIP to have shared
memory as part of guest_memfd? For example, could it be reasonable to
"populate" the shared memory first (via VMA) and then convert that
"allocated+filled" memory to private?

--
Cheers,

David / dhildenb
On 20/11/2024 15:13, David Hildenbrand wrote:
> Hi!

Hi! :)

>> Results:
>>  - MAP_PRIVATE: 968 ms
>>  - MAP_SHARED: 1646 ms
>
> At least here it is expected to some degree: as soon as the page cache
> is involved, map/unmap gets slower, because we are effectively
> maintaining two data structures (page tables + page cache) instead of
> only a single one (page tables).
>
> Can you make sure that THP/large folios don't interfere in your
> experiments (e.g., madvise(MADV_NOHUGEPAGE))?

I was using the transparent_hugepage=never command line argument in my
testing.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Is that sufficient to exclude the THP/large folio factor?

>> While this logic is intuitive, its performance effect is more
>> significant than I would expect.
>
> Yes. How much of the performance difference would remain if you hack out
> the atomic op just to play with it? I suspect there will still be some
> difference.

I have tried that, but could not see any noticeable difference in the
overall results.

It looks like a big portion of the bottleneck has moved from
shmem_get_folio_gfp/folio_mark_uptodate to
finish_fault/__pte_offset_map_lock somehow. I have no good explanation
for why:

Orig:
 - 69.62% do_fault
    + 44.61% __do_fault
    + 20.26% filemap_map_pages
    + 3.48% finish_fault
Hacked:
 - 67.39% do_fault
    + 32.45% __do_fault
    + 21.87% filemap_map_pages
    + 11.97% finish_fault

Orig:
 - 3.48% finish_fault
    - 1.28% set_pte_range
         0.96% folio_add_file_rmap_ptes
    - 0.91% __pte_offset_map_lock
         0.54% _raw_spin_lock
Hacked:
 - 11.97% finish_fault
    - 8.59% __pte_offset_map_lock
       - 6.27% _raw_spin_lock
            preempt_count_add
         1.00% __pte_offset_map
    - 1.28% set_pte_range
       - folio_add_file_rmap_ptes
            __mod_node_page_state

> Note that we might improve allocation times with guest_memfd when
> allocating larger folios.

I suppose it may not always be an option depending on requirements for
consistency of the allocation latency. Eg if a large folio isn't
available at the time, the performance would degrade to the base case
(please correct me if I'm missing something).

> Heh, now I spot that your comment was a reply to a series.

Yeah, sorry if it wasn't obvious.

> If your ioctl is supposed to do more than "allocating memory" like
> MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice,
> because for allocating memory we would want to use fallocate() instead.
> I assume you want to "allocate+copy"?

Yes, the ultimate use case is "allocate+copy".

> I'll note that, as we're moving in the direction of moving guest_memfd.c
> into mm/guestmem.c, we'll likely want to avoid "KVM_*" ioctls and think
> about something generic.

Good point, thanks. Are we at the stage where some concrete API has been
proposed yet? I might have missed that.

> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?

No, I can't immediately see why it shouldn't work. My main concern would
probably still be about the latency of the population stage, as I can't
see why it would improve compared to what we have now, because my
feeling is this is linked to the sharedness property of guest_memfd.

> Cheers,
>
> David / dhildenb
On 20.11.24 16:58, Nikita Kalyazin wrote:
> On 20/11/2024 15:13, David Hildenbrand wrote:
>> Hi!
>
> Hi! :)
>
>>> Results:
>>>  - MAP_PRIVATE: 968 ms
>>>  - MAP_SHARED: 1646 ms
>>
>> At least here it is expected to some degree: as soon as the page cache
>> is involved, map/unmap gets slower, because we are effectively
>> maintaining two data structures (page tables + page cache) instead of
>> only a single one (page tables).
>>
>> Can you make sure that THP/large folios don't interfere in your
>> experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I was using the transparent_hugepage=never command line argument in my
> testing.
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> Is that sufficient to exclude the THP/large folio factor?

Yes!

>>> While this logic is intuitive, its performance effect is more
>>> significant than I would expect.
>>
>> Yes. How much of the performance difference would remain if you hack
>> out the atomic op just to play with it? I suspect there will still be
>> some difference.
>
> I have tried that, but could not see any noticeable difference in the
> overall results.
>
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:

That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(

> Orig:
>  - 69.62% do_fault
>     + 44.61% __do_fault
>     + 20.26% filemap_map_pages
>     + 3.48% finish_fault
> Hacked:
>  - 67.39% do_fault
>     + 32.45% __do_fault
>     + 21.87% filemap_map_pages
>     + 11.97% finish_fault
>
> Orig:
>  - 3.48% finish_fault
>     - 1.28% set_pte_range
>          0.96% folio_add_file_rmap_ptes
>     - 0.91% __pte_offset_map_lock
>          0.54% _raw_spin_lock
> Hacked:
>  - 11.97% finish_fault
>     - 8.59% __pte_offset_map_lock
>        - 6.27% _raw_spin_lock
>             preempt_count_add
>          1.00% __pte_offset_map
>     - 1.28% set_pte_range
>        - folio_add_file_rmap_ptes
>             __mod_node_page_state
>
>> Note that we might improve allocation times with guest_memfd when
>> allocating larger folios.
>
> I suppose it may not always be an option depending on requirements for
> consistency of the allocation latency. Eg if a large folio isn't
> available at the time, the performance would degrade to the base case
> (please correct me if I'm missing something).

Yes, there are cons to that.

>> Heh, now I spot that your comment was a reply to a series.
>
> Yeah, sorry if it wasn't obvious.
>
>> If your ioctl is supposed to do more than "allocating memory" like
>> MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice,
>> because for allocating memory we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
>
> Yes, the ultimate use case is "allocate+copy".
>
>> I'll note that, as we're moving in the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls and think about something generic.
>
> Good point, thanks. Are we at the stage where some concrete API has been
> proposed yet? I might have missed that.

People are working on it, and we're figuring out some remaining details
(e.g., a page_type to intercept folio_put()). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.

>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
>
> No, I can't immediately see why it shouldn't work. My main concern would
> probably still be about the latency of the population stage, as I can't
> see why it would improve compared to what we have now, because my
> feeling is this is linked to the sharedness property of guest_memfd.

If the problem is the "pagecache" overhead, then yes, it will be a harder
nut to crack. But maybe there are some low-hanging fruits to optimize?
Finding the main cause for the added overhead would be interesting.

--
Cheers,

David / dhildenb
>> No, I can't immediately see why it shouldn't work. My main concern would
>> probably still be about the latency of the population stage, as I can't
>> see why it would improve compared to what we have now, because my
>> feeling is this is linked to the sharedness property of guest_memfd.
>
> If the problem is the "pagecache" overhead, then yes, it will be a harder
> nut to crack. But maybe there are some low-hanging fruits to optimize?
> Finding the main cause for the added overhead would be interesting.

Can you compare uffdio_copy() when using anonymous memory vs. shmem?
That's likely the best we could currently achieve with guest_memfd.

There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
if that is of any help; it SEGFAULTs for me right now with a (likely)
division by 0.

--
Cheers,

David / dhildenb
On 20/11/2024 16:44, David Hildenbrand wrote:
>> If the problem is the "pagecache" overhead, then yes, it will be a
>> harder nut to crack. But maybe there are some low-hanging fruits to
>> optimize? Finding the main cause for the added overhead would be
>> interesting.

Agreed, knowing the exact root cause would be really nice.

> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
> That's likely the best we could currently achieve with guest_memfd.

Yeah, I was doing that too. It was about ~28% slower in my setup, while
with guest_memfd it was ~34% slower. The variance of the data was quite
high, so the difference may well be just noise. In other words, I'd be
much happier if we could bring guest_memfd (or even shmem) performance
closer to anon/private than if we just equalised guest_memfd with shmem
(which are probably already pretty close).

> There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
> if that is of any help; it SEGFAULTs for me right now with a (likely)
> division by 0.

Thanks for the pointer, will take a look!

> Cheers,
>
> David / dhildenb
>
On 20.11.24 18:21, Nikita Kalyazin wrote:
> On 20/11/2024 16:44, David Hildenbrand wrote:
>>> If the problem is the "pagecache" overhead, then yes, it will be a
>>> harder nut to crack. But maybe there are some low-hanging fruits to
>>> optimize? Finding the main cause for the added overhead would be
>>> interesting.
>
> Agreed, knowing the exact root cause would be really nice.
>
>> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
>> That's likely the best we could currently achieve with guest_memfd.
>
> Yeah, I was doing that too. It was about ~28% slower in my setup, while
> with guest_memfd it was ~34% slower.

I looked into uffdio_copy() for shmem, and we still walk+modify page
tables. In theory, we could try hacking that out: for filling the
pagecache we would only need the vma properties, not the page table
properties; that would then really resemble "only modify the pagecache".

That would likely resemble what we would expect with guest_memfd: work
only on the pagecache and not the page tables. So it's rather surprising
that guest_memfd is slower than that, as it currently doesn't mess with
user page tables at all.

> The variance of the data was quite high, so the difference may well be
> just noise. In other words, I'd be much happier if we could bring
> guest_memfd (or even shmem) performance closer to anon/private than if
> we just equalised guest_memfd with shmem (which are probably already
> pretty close).

Makes sense. Best we can do is:

anon: work only on page tables
shmem/guest_memfd: work only on the pagecache

So at least "only one treelike structure to update".

--
Cheers,

David / dhildenb
On 20/11/2024 18:29, David Hildenbrand wrote:
> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?

Patrick and I synced internally on this. What may actually work for
guest_memfd population is the following.

Non-CoCo use case:
 - fallocate syscall to fill the page cache, with no page content
   initialisation (like it is now)
 - pwrite syscall to initialise the content and mark it up-to-date
   (mark prepared); no specific preparation logic is required

The pwrite will have "once" semantics until a subsequent
fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" that
the page is already prepared and return EIO/ENOSPC or something.

SEV-SNP use case (no changes):
 - fallocate as above
 - KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare

We don't think fallocate/pwrite have dependencies on the current->mm
assumptions that Paolo mentioned in [1], so they should be safe to be
called on guest_memfd from a non-VMM process.

[1]: https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c

> Makes sense. Best we can do is:
>
> anon: work only on page tables
> shmem/guest_memfd: work only on the pagecache
>
> So at least "only one treelike structure to update".

This seems to hold with the above reasoning.

> --
> Cheers,
>
> David / dhildenb
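[To make the proposed non-CoCo flow concrete, a userspace sketch could look like the following. Note this illustrates the *proposal* discussed above: pwrite() on a guest_memfd with these "populate once" semantics does not exist today, and the exact error returned on a second write is an open question; fallocate() on guest_memfd is how allocation/deallocation works already.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int restore_region(int gmem_fd, off_t offset, const void *snapshot,
                          size_t len)
{
        /* 1. Allocate backing pages; no content initialisation happens here. */
        if (fallocate(gmem_fd, 0, offset, len))
                return -1;

        /*
         * 2. Initialise the content and mark it up-to-date/prepared. Per the
         * proposal, this succeeds only once per range until a subsequent
         * fallocate(FALLOC_FL_PUNCH_HOLE); a second attempt would fail.
         */
        if (pwrite(gmem_fd, snapshot, len, offset) != (ssize_t)len)
                return -1;

        return 0;
}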