Firecracker currently allows populating guest memory from a separate
process via UserfaultFD [1]. This helps keep the VMM codebase and
functionality concise and generic, while offloading the logic of
obtaining guest memory to another process. UserfaultFD is currently not
supported for guest_memfd, because it binds to a VMA, while guest_memfd
does not need to (or cannot) be mapped to userspace, especially for
private memory. [2] proposes an alternative to UserfaultFD for
intercepting stage-2 faults, while this series conceptually complements
it with the ability to populate guest memory backed by guest_memfd for
`KVM_X86_SW_PROTECTED_VM` VMs.

Patches 1-3 add a new ioctl, `KVM_GUEST_MEMFD_POPULATE`, that uses a
vendor-agnostic implementation of the `post_populate` callback.

Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
has been prohibited by [3], but I have not been able to locate the exact
justification for the requirement.

Questions:
 - Does exposing a generic population interface via ioctl look sensible
   in this form?
 - Is there a path where the "only the VMM can call the KVM API"
   requirement is relaxed? If not, what is the recommended efficient
   alternative for populating guest memory from outside the VMM?

[1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
[2]: https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4e4c4fca5be806b888d606894d914847e82d78

Nikita

Nikita Kalyazin (4):
  KVM: guest_memfd: add generic post_populate callback
  KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
  KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
  KVM: document KVM_GUEST_MEMFD_POPULATE ioctl

 Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
 include/linux/kvm_host.h       |  3 +++
 include/uapi/linux/kvm.h       |  9 +++++++++
 virt/kvm/guest_memfd.c         | 28 ++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c            | 19 ++++++++++++++++++-
 5 files changed, 81 insertions(+), 1 deletion(-)


base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
--
2.40.1
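[For illustration, a userspace invocation of the new ioctl might look roughly like the sketch below. The argument struct layout and field names, and the choice of the VM fd as the ioctl target, are assumptions made here for illustration only; the actual UAPI added to include/uapi/linux/kvm.h by the series is not quoted in this thread, so this does not compile without it.]

/*
 * Hypothetical sketch only: the struct layout and field names below are
 * invented for illustration; the real UAPI from the series is not shown
 * in this thread. Issuing the ioctl on the VM fd is also an assumption.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct guest_memfd_populate_args {      /* hypothetical layout */
        uint32_t guest_memfd;           /* fd from KVM_CREATE_GUEST_MEMFD */
        uint32_t flags;
        uint64_t offset;                /* byte offset into the guest_memfd */
        uint64_t size;                  /* number of bytes to populate */
        uint64_t src;                   /* userspace buffer with the contents */
};

static int populate_range(int vm_fd, int gmem_fd, uint64_t offset,
                          void *src, uint64_t size)
{
        struct guest_memfd_populate_args args = {
                .guest_memfd = gmem_fd,
                .offset      = offset,
                .size        = size,
                .src         = (uint64_t)(uintptr_t)src,
        };

        /* KVM_GUEST_MEMFD_POPULATE is defined by the series' uapi header. */
        return ioctl(vm_fd, KVM_GUEST_MEMFD_POPULATE, &args);
}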
On 10/24/24 11:54, Nikita Kalyazin wrote:
> Firecracker currently allows populating guest memory from a separate
> process via UserfaultFD [1]. This helps keep the VMM codebase and
> functionality concise and generic, while offloading the logic of
> obtaining guest memory to another process. UserfaultFD is currently not
> supported for guest_memfd, because it binds to a VMA, while guest_memfd
> does not need to (or cannot) be mapped to userspace, especially for
> private memory. [2] proposes an alternative to UserfaultFD for
> intercepting stage-2 faults, while this series conceptually complements
> it with the ability to populate guest memory backed by guest_memfd for
> `KVM_X86_SW_PROTECTED_VM` VMs.
>
> Patches 1-3 add a new ioctl, `KVM_GUEST_MEMFD_POPULATE`, that uses a
> vendor-agnostic implementation of the `post_populate` callback.
>
> Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
> has been prohibited by [3], but I have not been able to locate the exact
> justification for the requirement.

The justification is that the "struct kvm" has a long-lived tie to a
host process's address space.

Invoking ioctls like KVM_SET_USER_MEMORY_REGION and KVM_RUN from
different processes would make things very messy, because it is not
clear which mm you are working with: the MMU notifier is registered for
kvm->mm, but some functions, such as get_user_pages(), do not take an mm
and always operate on current->mm.

In your case, it should be enough to add an ioctl on the guest_memfd
instead?

But the real question is, what are you using KVM_X86_SW_PROTECTED_VM
for?

Paolo

> Questions:
>  - Does exposing a generic population interface via ioctl look sensible
>    in this form?
>  - Is there a path where the "only the VMM can call the KVM API"
>    requirement is relaxed? If not, what is the recommended efficient
>    alternative for populating guest memory from outside the VMM?
>
> [1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md
> [2]: https://lore.kernel.org/kvm/CADrL8HUHRMwUPhr7jLLBgD9YLFAnVHc=N-C=8er-x6GUtV97pQ@mail.gmail.com/T/
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d4e4c4fca5be806b888d606894d914847e82d78
>
> Nikita
>
> Nikita Kalyazin (4):
>   KVM: guest_memfd: add generic post_populate callback
>   KVM: add KVM_GUEST_MEMFD_POPULATE ioctl for guest_memfd
>   KVM: allow KVM_GUEST_MEMFD_POPULATE in another mm
>   KVM: document KVM_GUEST_MEMFD_POPULATE ioctl
>
>  Documentation/virt/kvm/api.rst | 23 +++++++++++++++++++++++
>  include/linux/kvm_host.h       |  3 +++
>  include/uapi/linux/kvm.h       |  9 +++++++++
>  virt/kvm/guest_memfd.c         | 28 ++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c            | 19 ++++++++++++++++++-
>  5 files changed, 81 insertions(+), 1 deletion(-)
>
>
> base-commit: c8d430db8eec7d4fd13a6bea27b7086a54eda6da
On 20/11/2024 13:55, Paolo Bonzini wrote:
>> Patch 4 allows calling the ioctl from a separate (non-VMM) process. It
>> has been prohibited by [3], but I have not been able to locate the
>> exact justification for the requirement.
>
> The justification is that the "struct kvm" has a long-lived tie to a
> host process's address space.
>
> Invoking ioctls like KVM_SET_USER_MEMORY_REGION and KVM_RUN from
> different processes would make things very messy, because it is not
> clear which mm you are working with: the MMU notifier is registered for
> kvm->mm, but some functions, such as get_user_pages(), do not take an mm
> and always operate on current->mm.

That's fair, thanks for the explanation.

> In your case, it should be enough to add an ioctl on the guest_memfd
> instead?

That's right, that would be sufficient indeed. Is that something that
could be considered? Would that be some non-KVM API, with guest_memfd
moving to an mm library?

> But the real question is, what are you using KVM_X86_SW_PROTECTED_VM
> for?

The concrete use case is VM restoration from a snapshot in Firecracker
[1]. In the current setup, the VMM registers a UFFD against the guest
memory and sends the UFFD handle to an external process that knows how
to obtain the snapshotted memory. We would like to preserve these
semantics, but also remove the guest memory from the direct map [2].
Mimicking this with guest_memfd would mean sending some form of a
guest_memfd handle to that process, which would use it to populate
guest_memfd.

[1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md#userfaultfd
[2]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/T/

> Paolo
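[For reference, the existing UFFD-based flow described above looks roughly like the sketch below on the handler-process side: it receives the userfaultfd from the VMM, reads fault events, and resolves them with UFFDIO_COPY from the snapshot contents. This is a minimal sketch; the unix-socket fd passing, the event loop, error handling, and huge-page handling are omitted, and names are placeholders.]

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

static void serve_one_fault(int uffd, char *snapshot_base,
                            uint64_t region_start, size_t page_size)
{
        struct uffd_msg msg;

        /* Block until a page of the registered range is faulted on. */
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
            msg.event != UFFD_EVENT_PAGEFAULT)
                return;

        uint64_t fault_addr = msg.arg.pagefault.address & ~(page_size - 1);

        /* Resolve the fault with the corresponding page of the snapshot. */
        struct uffdio_copy copy = {
                .dst = fault_addr,
                .src = (uint64_t)(uintptr_t)(snapshot_base +
                                             (fault_addr - region_start)),
                .len = page_size,
                .mode = 0,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
}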
On 24/10/2024 10:54, Nikita Kalyazin wrote:
> [2] proposes an alternative to UserfaultFD for intercepting stage-2
> faults, while this series conceptually complements it with the ability
> to populate guest memory backed by guest_memfd for
> `KVM_X86_SW_PROTECTED_VM` VMs.

+David
+Sean
+mm

While measuring memory population performance of guest_memfd using this
series, I noticed that guest_memfd population takes longer than my
baseline, which is filling anonymous private memory via UFFDIO_COPY.

I am using x86_64 for my measurements and a 3 GiB memory region:
 - anon/private UFFDIO_COPY: 940 ms
 - guest_memfd: 1371 ms (+46%)

It turns out that the effect is observable not only for guest_memfd, but
also for any type of shared memory, eg memfd or anonymous memory mapped
as shared. Below are measurements of a plain mmap(MAP_POPULATE)
operation:

mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
vs
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
     MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

Results:
 - MAP_PRIVATE: 968 ms
 - MAP_SHARED: 1646 ms

I am seeing this effect on a range of kernels. The oldest I used was
5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).

When profiling with perf, I observe the following hottest operations
(kvm-next). Attaching the full distributions at the end of the email.

MAP_PRIVATE:
 - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)

MAP_SHARED:
 - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is the atomic
   setting of the PG_uptodate bit
 - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)

Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate,
which sets the PG_uptodate bit with a plain (non-atomic) store, while
MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
PG_uptodate bit atomically.

While this logic is intuitive, its performance effect is more
significant than I would expect.

The questions are:
 - Is this a well-known behaviour?
 - Is there a way to mitigate that, ie make shared memory (including
   guest_memfd) population faster/comparable to private memory?
Nikita

Appendix: full call trees obtained via perf

MAP_PRIVATE:

   - 87.97% __mmap
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        vm_mmap_pgoff
        __mm_populate
        populate_vma_page_range
      - __get_user_pages
         - 77.94% handle_mm_fault
            - 76.90% __handle_mm_fault
               - 72.70% do_anonymous_page
                  - 31.92% vma_alloc_folio_noprof
                     - 30.74% alloc_pages_mpol_noprof
                        - 29.60% __alloc_pages_noprof
                           - 28.40% get_page_from_freelist
                                19.72% clear_page_erms
                              - 3.00% __rmqueue_pcplist
                                   __mod_zone_page_state
                             1.18% _raw_spin_trylock
                  - 20.03% __pte_offset_map_lock
                     - 15.96% _raw_spin_lock
                          1.50% preempt_count_add
                     - 2.27% __pte_offset_map
                          __rcu_read_lock
                  - 7.22% __folio_batch_add_and_move
                     - 4.68% folio_batch_move_lru
                        - 3.77% lru_add
                           + 0.95% __mod_zone_page_state
                             0.86% __mod_node_page_state
                        0.84% folios_put_refs
                        0.55% check_preemption_disabled
                  - 2.85% folio_add_new_anon_rmap
                     - __folio_mod_stat
                          __mod_node_page_state
                  - 1.15% pte_offset_map_nolock
                       __pte_offset_map
         - 7.59% follow_page_pte
            - 4.56% __pte_offset_map_lock
               - 2.27% _raw_spin_lock
                    preempt_count_add
                 1.13% __pte_offset_map
              0.75% folio_mark_accessed

MAP_SHARED:

   - 77.89% __mmap
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        vm_mmap_pgoff
        __mm_populate
        populate_vma_page_range
      - __get_user_pages
         - 72.11% handle_mm_fault
            - 71.67% __handle_mm_fault
               - 69.62% do_fault
                  - 44.61% __do_fault
                     - shmem_fault
                        - 43.94% shmem_get_folio_gfp
                           - 17.20% shmem_alloc_and_add_folio.constprop.0
                              - 5.10% shmem_alloc_folio
                                 - 4.58% folio_alloc_mpol_noprof
                                    - alloc_pages_mpol_noprof
                                       - 4.00% __alloc_pages_noprof
                                          - 3.31% get_page_from_freelist
                                               1.24% __rmqueue_pcplist
                              - 5.07% shmem_add_to_page_cache
                                 - 1.44% __mod_node_page_state
                                      0.61% check_preemption_disabled
                                   0.78% xas_store
                                   0.74% xas_find_conflict
                                   0.66% _raw_spin_lock_irq
                              - 3.96% __folio_batch_add_and_move
                                 - 2.41% folio_batch_move_lru
                                      1.88% lru_add
                              - 1.56% shmem_inode_acct_blocks
                                 - 1.24% __dquot_alloc_space
                                    - 0.77% inode_add_bytes
                                         _raw_spin_lock
                              - 0.77% shmem_recalc_inode
                                   _raw_spin_lock
                             10.98% clear_page_erms
                           - 1.17% filemap_get_entry
                                0.78% xas_load
                  - 20.26% filemap_map_pages
                     - 12.23% next_uptodate_folio
                        - 1.27% xas_find
                             xas_load
                     - 1.16% __pte_offset_map_lock
                          0.59% _raw_spin_lock
                  - 3.48% finish_fault
                     - 1.28% set_pte_range
                          0.96% folio_add_file_rmap_ptes
                     - 0.91% __pte_offset_map_lock
                          0.54% _raw_spin_lock
                       0.57% pte_offset_map_nolock
         - 4.11% follow_page_pte
            - 2.36% __pte_offset_map_lock
               - 1.32% _raw_spin_lock
                    preempt_count_add
                 0.54% __pte_offset_map
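[A self-contained harness for the plain MAP_PRIVATE vs MAP_SHARED comparison quoted above could look like the sketch below. The exact program used for the numbers in this thread is not shown, so treat this as an approximation of the methodology rather than the original benchmark.]

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE (3ull << 30)   /* 3 GiB, as in the measurements above */

static double time_populate(int vis_flag)
{
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       vis_flag | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        clock_gettime(CLOCK_MONOTONIC, &b);

        if (p == MAP_FAILED)
                return -1.0;
        munmap(p, SIZE);
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
        printf("MAP_PRIVATE: %.0f ms\n", time_populate(MAP_PRIVATE));
        printf("MAP_SHARED:  %.0f ms\n", time_populate(MAP_SHARED));
        return 0;
}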
On 20.11.24 13:09, Nikita Kalyazin wrote:
> On 24/10/2024 10:54, Nikita Kalyazin wrote:
>> [2] proposes an alternative to UserfaultFD for intercepting stage-2
>> faults, while this series conceptually complements it with the ability
>> to populate guest memory backed by guest_memfd for
>> `KVM_X86_SW_PROTECTED_VM` VMs.
>
> +David
> +Sean
> +mm

Hi!

> While measuring memory population performance of guest_memfd using this
> series, I noticed that guest_memfd population takes longer than my
> baseline, which is filling anonymous private memory via UFFDIO_COPY.
>
> I am using x86_64 for my measurements and a 3 GiB memory region:
>  - anon/private UFFDIO_COPY: 940 ms
>  - guest_memfd: 1371 ms (+46%)
>
> It turns out that the effect is observable not only for guest_memfd, but
> also for any type of shared memory, eg memfd or anonymous memory mapped
> as shared. Below are measurements of a plain mmap(MAP_POPULATE)
> operation:
>
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
>      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> vs
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
>      MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>
> Results:
>  - MAP_PRIVATE: 968 ms
>  - MAP_SHARED: 1646 ms

At least here it is expected to some degree: as soon as the page cache
is involved, map/unmap gets slower, because we are effectively
maintaining two data structures (page tables + page cache) instead of
only a single one (page tables).

Can you make sure that THP/large folios don't interfere in your
experiments (e.g., madvise(MADV_NOHUGEPAGE))?

> I am seeing this effect on a range of kernels. The oldest I used was
> 5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
>
> When profiling with perf, I observe the following hottest operations
> (kvm-next). Attaching the full distributions at the end of the email.
>
> MAP_PRIVATE:
>  - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
>
> MAP_SHARED:
>  - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is the atomic
>    setting of the PG_uptodate bit
>  - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)

Interesting.

> Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate,
> which sets the PG_uptodate bit with a plain (non-atomic) store, while
> MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
> PG_uptodate bit atomically.
>
> While this logic is intuitive, its performance effect is more
> significant than I would expect.

Yes. How much of the performance difference would remain if you hack out
the atomic op just to play with it? I suspect there will still be some
difference.

> The questions are:
>  - Is this a well-known behaviour?
>  - Is there a way to mitigate that, ie make shared memory (including
>    guest_memfd) population faster/comparable to private memory?

Likely. But your experiment above measures something different from what
guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
would assume guest_memfd will be faster than MAP_POPULATE.

How do you end up allocating memory for guest_memfd? Using a simple
fallocate()?

Note that we might improve allocation times with guest_memfd when
allocating larger folios.

--
Cheers,

David / dhildenb
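[One way to rule out THP per-mapping, rather than globally, is sketched below: madvise(MADV_NOHUGEPAGE) is applied before any page is touched, and pages are then faulted in manually instead of via MAP_POPULATE. This mirrors the intent of the suggestion above and is not a quoted test; the 4096-byte page size is an assumption for x86_64.]

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define SIZE (3ull << 30)

static void *populate_no_thp(void)
{
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return NULL;

        /* Must be applied before the first fault to take effect. */
        madvise(p, SIZE, MADV_NOHUGEPAGE);

        /* Touch every page to trigger population, one base page at a time. */
        for (size_t off = 0; off < SIZE; off += 4096)
                p[off] = 0;

        return p;
}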
>> The questions are:
>>  - Is this a well-known behaviour?
>>  - Is there a way to mitigate that, ie make shared memory (including
>>    guest_memfd) population faster/comparable to private memory?
>
> Likely. But your experiment above measures something different from what
> guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
> would assume guest_memfd will be faster than MAP_POPULATE.
>
> How do you end up allocating memory for guest_memfd? Using a simple
> fallocate()?

Heh, now I spot that your comment was a reply to a series.

If your ioctl is supposed to do more than "allocating memory" like
MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice,
because for allocating memory we would want to use fallocate() instead.
I assume you want to "allocate+copy"?

I'll note that, as we're moving in the direction of moving guest_memfd.c
into mm/guestmem.c, we'll likely want to avoid "KVM_*" ioctls and think
about something generic.

Any clue how your new ioctl will interact with the WIP to have shared
memory as part of guest_memfd? For example, could it be reasonable to
"populate" the shared memory first (via VMA) and then convert that
"allocated+filled" memory to private?

--
Cheers,

David / dhildenb
On 20/11/2024 15:13, David Hildenbrand wrote:
> Hi!

Hi! :)

>> Results:
>>  - MAP_PRIVATE: 968 ms
>>  - MAP_SHARED: 1646 ms
>
> At least here it is expected to some degree: as soon as the page cache
> is involved, map/unmap gets slower, because we are effectively
> maintaining two data structures (page tables + page cache) instead of
> only a single one (page tables).
>
> Can you make sure that THP/large folios don't interfere in your
> experiments (e.g., madvise(MADV_NOHUGEPAGE))?

I was using the transparent_hugepage=never command line argument in my
testing.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Is that sufficient to exclude the THP/large folio factor?

>> While this logic is intuitive, its performance effect is more
>> significant than I would expect.
>
> Yes. How much of the performance difference would remain if you hack out
> the atomic op just to play with it? I suspect there will still be some
> difference.

I have tried that, but could not see any noticeable difference in the
overall results.

It looks like a big portion of the bottleneck has moved from
shmem_get_folio_gfp/folio_mark_uptodate to
finish_fault/__pte_offset_map_lock somehow. I have no good explanation
for why:

Orig:
 - 69.62% do_fault
    + 44.61% __do_fault
    + 20.26% filemap_map_pages
    + 3.48% finish_fault
Hacked:
 - 67.39% do_fault
    + 32.45% __do_fault
    + 21.87% filemap_map_pages
    + 11.97% finish_fault

Orig:
 - 3.48% finish_fault
    - 1.28% set_pte_range
         0.96% folio_add_file_rmap_ptes
    - 0.91% __pte_offset_map_lock
         0.54% _raw_spin_lock
Hacked:
 - 11.97% finish_fault
    - 8.59% __pte_offset_map_lock
       - 6.27% _raw_spin_lock
            preempt_count_add
         1.00% __pte_offset_map
    - 1.28% set_pte_range
       - folio_add_file_rmap_ptes
            __mod_node_page_state

> Note that we might improve allocation times with guest_memfd when
> allocating larger folios.

I suppose it may not always be an option depending on requirements for
consistency of the allocation latency. Eg if a large folio isn't
available at the time, the performance would degrade to the base case
(please correct me if I'm missing something).

> Heh, now I spot that your comment was a reply to a series.

Yeah, sorry if it wasn't obvious.

> If your ioctl is supposed to do more than "allocating memory" like
> MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice,
> because for allocating memory we would want to use fallocate() instead.
> I assume you want to "allocate+copy"?

Yes, the ultimate use case is "allocate+copy".

> I'll note that, as we're moving in the direction of moving guest_memfd.c
> into mm/guestmem.c, we'll likely want to avoid "KVM_*" ioctls and think
> about something generic.

Good point, thanks. Are we at the stage where some concrete API has been
proposed yet? I might have missed that.

> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?

No, I can't immediately see why it shouldn't work. My main concern would
probably still be about the latency of the population stage, as I can't
see why it would improve compared to what we have now, because my
feeling is this is linked to the sharedness property of guest_memfd.

> Cheers,
>
> David / dhildenb
On 20.11.24 16:58, Nikita Kalyazin wrote:
> On 20/11/2024 15:13, David Hildenbrand wrote:
>> Hi!
>
> Hi! :)
>
>>> Results:
>>>  - MAP_PRIVATE: 968 ms
>>>  - MAP_SHARED: 1646 ms
>>
>> At least here it is expected to some degree: as soon as the page cache
>> is involved, map/unmap gets slower, because we are effectively
>> maintaining two data structures (page tables + page cache) instead of
>> only a single one (page tables).
>>
>> Can you make sure that THP/large folios don't interfere in your
>> experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I was using the transparent_hugepage=never command line argument in my
> testing.
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> Is that sufficient to exclude the THP/large folio factor?

Yes!

>>> While this logic is intuitive, its performance effect is more
>>> significant than I would expect.
>>
>> Yes. How much of the performance difference would remain if you hack
>> out the atomic op just to play with it? I suspect there will still be
>> some difference.
>
> I have tried that, but could not see any noticeable difference in the
> overall results.
>
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:

That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(

> Orig:
>  - 69.62% do_fault
>     + 44.61% __do_fault
>     + 20.26% filemap_map_pages
>     + 3.48% finish_fault
> Hacked:
>  - 67.39% do_fault
>     + 32.45% __do_fault
>     + 21.87% filemap_map_pages
>     + 11.97% finish_fault
>
> Orig:
>  - 3.48% finish_fault
>     - 1.28% set_pte_range
>          0.96% folio_add_file_rmap_ptes
>     - 0.91% __pte_offset_map_lock
>          0.54% _raw_spin_lock
> Hacked:
>  - 11.97% finish_fault
>     - 8.59% __pte_offset_map_lock
>        - 6.27% _raw_spin_lock
>             preempt_count_add
>          1.00% __pte_offset_map
>     - 1.28% set_pte_range
>        - folio_add_file_rmap_ptes
>             __mod_node_page_state
>
>> Note that we might improve allocation times with guest_memfd when
>> allocating larger folios.
>
> I suppose it may not always be an option depending on requirements for
> consistency of the allocation latency. Eg if a large folio isn't
> available at the time, the performance would degrade to the base case
> (please correct me if I'm missing something).

Yes, there are cons to that.

>> Heh, now I spot that your comment was a reply to a series.
>
> Yeah, sorry if it wasn't obvious.
>
>> If your ioctl is supposed to do more than "allocating memory" like
>> MAP_POPULATE/MADV_POPULATE_* ... then POPULATE is a suboptimal choice,
>> because for allocating memory we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
>
> Yes, the ultimate use case is "allocate+copy".
>
>> I'll note that, as we're moving in the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls and think about something generic.
>
> Good point, thanks. Are we at the stage where some concrete API has been
> proposed yet? I might have missed that.

People are working on it, and we're figuring out some remaining details
(e.g., a page_type to intercept folio_put()). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.

>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
>
> No, I can't immediately see why it shouldn't work. My main concern would
> probably still be about the latency of the population stage, as I can't
> see why it would improve compared to what we have now, because my
> feeling is this is linked to the sharedness property of guest_memfd.

If the problem is the "pagecache" overhead, then yes, it will be a harder
nut to crack. But maybe there are some low-hanging fruits to optimize?
Finding the main cause for the added overhead would be interesting.

--
Cheers,

David / dhildenb
>> No, I can't immediately see why it shouldn't work. My main concern would
>> probably still be about the latency of the population stage, as I can't
>> see why it would improve compared to what we have now, because my
>> feeling is this is linked to the sharedness property of guest_memfd.
>
> If the problem is the "pagecache" overhead, then yes, it will be a harder
> nut to crack. But maybe there are some low-hanging fruits to optimize?
> Finding the main cause for the added overhead would be interesting.

Can you compare uffdio_copy() when using anonymous memory vs. shmem?
That's likely the best we could currently achieve with guest_memfd.

There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
if that is of any help; it SEGFAULTs for me right now with a (likely)
division by 0.

--
Cheers,

David / dhildenb
On 20/11/2024 16:44, David Hildenbrand wrote:
>> If the problem is the "pagecache" overhead, then yes, it will be a
>> harder nut to crack. But maybe there are some low-hanging fruits to
>> optimize? Finding the main cause for the added overhead would be
>> interesting.

Agreed, knowing the exact root cause would be really nice.

> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
> That's likely the best we could currently achieve with guest_memfd.

Yeah, I was doing that too. It was about ~28% slower in my setup, while
with guest_memfd it was ~34% slower. The variance of the data was quite
high, so the difference may well be just noise. In other words, I'd be
much happier if we could bring guest_memfd (or even shmem) performance
closer to anon/private than if we just equalised guest_memfd with shmem
(which are probably already pretty close).

> There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
> if that is of any help; it SEGFAULTs for me right now with a (likely)
> division by 0.

Thanks for the pointer, will take a look!

> Cheers,
>
> David / dhildenb
>
On 20.11.24 18:21, Nikita Kalyazin wrote:
> On 20/11/2024 16:44, David Hildenbrand wrote:
>>> If the problem is the "pagecache" overhead, then yes, it will be a
>>> harder nut to crack. But maybe there are some low-hanging fruits to
>>> optimize? Finding the main cause for the added overhead would be
>>> interesting.
>
> Agreed, knowing the exact root cause would be really nice.
>
>> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
>> That's likely the best we could currently achieve with guest_memfd.
>
> Yeah, I was doing that too. It was about ~28% slower in my setup, while
> with guest_memfd it was ~34% slower.

I looked into uffdio_copy() for shmem, and we still walk+modify page
tables. In theory, we could try hacking that out: for filling the
pagecache we would only need the vma properties, not the page table
properties; that would then really resemble "only modify the pagecache".

That would likely resemble what we would expect with guest_memfd: work
only on the pagecache and not the page tables. So it's rather surprising
that guest_memfd is slower than that, as it currently doesn't mess with
user page tables at all.

> The variance of the data was quite high, so the difference may well be
> just noise. In other words, I'd be much happier if we could bring
> guest_memfd (or even shmem) performance closer to anon/private than if
> we just equalised guest_memfd with shmem (which are probably already
> pretty close).

Makes sense. Best we can do is:

anon: work only on page tables
shmem/guest_memfd: work only on the pagecache

So at least "only one treelike structure to update".

--
Cheers,

David / dhildenb
On 20/11/2024 18:29, David Hildenbrand wrote:
> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?

Patrick and I synced internally on this. What may actually work for
guest_memfd population is the following.

Non-CoCo use case:
 - fallocate syscall to fill the page cache, with no page content
   initialisation (like it is now)
 - pwrite syscall to initialise the content and mark it up-to-date
   (mark prepared); no specific preparation logic is required

The pwrite will have "once" semantics until a subsequent
fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" that
the page is already prepared and return EIO/ENOSPC or something.

SEV-SNP use case (no changes):
 - fallocate as above
 - KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare

We don't think fallocate/pwrite have dependencies on the current->mm
assumptions that Paolo mentioned in [1], so they should be safe to be
called on guest_memfd from a non-VMM process.

[1]: https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c

> Makes sense. Best we can do is:
>
> anon: work only on page tables
> shmem/guest_memfd: work only on the pagecache
>
> So at least "only one treelike structure to update".

This seems to hold with the above reasoning.

> --
> Cheers,
>
> David / dhildenb
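[To make the proposed non-CoCo flow concrete, a userspace sketch could look like the following. Note this illustrates the *proposal* discussed above: pwrite() on a guest_memfd with these "populate once" semantics does not exist today, and the exact error returned on a second write is an open question; fallocate() on guest_memfd is how allocation/deallocation works already.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int restore_region(int gmem_fd, off_t offset, const void *snapshot,
                          size_t len)
{
        /* 1. Allocate backing pages; no content initialisation happens here. */
        if (fallocate(gmem_fd, 0, offset, len))
                return -1;

        /*
         * 2. Initialise the content and mark it up-to-date/prepared. Per the
         * proposal, this succeeds only once per range until a subsequent
         * fallocate(FALLOC_FL_PUNCH_HOLE); a second attempt would fail.
         */
        if (pwrite(gmem_fd, snapshot, len, offset) != (ssize_t)len)
                return -1;

        return 0;
}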