[ based on kvm/next ]

Implement guest_memfd allocation and population via the write syscall.
This is useful in non-CoCo use cases where the host can access guest
memory.  Even though the same can also be achieved via userspace mapping
and memcpying from userspace, write provides a more performant option
because it does not need to set page tables and it does not cause a page
fault for every page like memcpy would.  Note that memcpy cannot be
accelerated via MADV_POPULATE_WRITE as it is not supported by
guest_memfd and relies on GUP.

Populating 512MiB of guest_memfd on a x86 machine:
 - via memcpy: 436 ms
 - via write:  202 ms (-54%)

v5:
 - Replace the call to the unexported filemap_remove_folio with zeroing
   the bytes that could not be copied
 - Fix checkpatch findings

v4:
 - https://lore.kernel.org/kvm/20250828153049.3922-1-kalyazin@amazon.com
 - Switch from implementing the write callback to write_iter
 - Remove conditional compilation

v3:
 - https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com
 - David/Mike D: Only compile support for the write syscall if
   CONFIG_KVM_GMEM_SHARED_MEM (now gone) is enabled.

v2:
 - https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com
 - Switch from an ioctl to the write syscall to implement population

v1:
 - https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com

Nikita Kalyazin (2):
  KVM: guest_memfd: add generic population via write
  KVM: selftests: update guest_memfd write tests

 .../testing/selftests/kvm/guest_memfd_test.c | 86 +++++++++++++++++--
 virt/kvm/guest_memfd.c                       | 62 ++++++++++++-
 2 files changed, 141 insertions(+), 7 deletions(-)

base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
--
2.50.1
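For readers who want to see what "population via write" might look like from userspace, here is a minimal sketch. It is not taken from the series. The helper names populate_via_write() and run_write_demo() are hypothetical, and a temporary regular file stands in for a guest_memfd, because creating a real one needs a KVM VM fd. Whether guest_memfd accepts unaligned or partial writes is not stated in the cover letter, so the short-write loop is a defensive assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): copy len bytes from src into fd
 * at the current file position, retrying on EINTR and on short writes.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int populate_via_write(int fd, const void *src, size_t len)
{
	const unsigned char *p = src;

	assert(src != NULL);
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data: retry */
			return -1;
		}
		if (n == 0) {
			errno = EIO;		/* no progress; avoid spinning forever */
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Demo on a stand-in fd: an unlinked temporary file, NOT a guest_memfd.
 * Writes two 4 KiB chunks worth of data, reads them back and compares.
 * Returns 0 if the round trip matches, -1 otherwise.
 */
static int run_write_demo(void)
{
	enum { LEN = 2 * 4096 };
	static unsigned char src[LEN], dst[LEN];
	char path[] = "/tmp/gmem-write-XXXXXX";
	int fd = mkstemp(path);
	int ok;

	if (fd < 0)
		return -1;
	unlink(path);
	memset(src, 0xAB, LEN);
	ok = populate_via_write(fd, src, LEN) == 0 &&
	     pread(fd, dst, LEN, 0) == (ssize_t)LEN &&
	     memcmp(src, dst, LEN) == 0;
	close(fd);
	return ok ? 0 : -1;
}
```

With a real guest_memfd, the fd would come from the KVM_CREATE_GUEST_MEMFD ioctl, and the data would pass through the kernel's write path without the caller ever mapping the memory.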
On Tue, Sep 2, 2025 at 4:20 AM Kalyazin, Nikita <kalyazin@amazon.co.uk> wrote:
>
> [ based on kvm/next ]
>
> Implement guest_memfd allocation and population via the write syscall.
> This is useful in non-CoCo use cases where the host can access guest
> memory.  Even though the same can also be achieved via userspace mapping
> and memcpying from userspace, write provides a more performant option
> because it does not need to set page tables and it does not cause a page
> fault for every page like memcpy would.  Note that memcpy cannot be
> accelerated via MADV_POPULATE_WRITE as it is not supported by
> guest_memfd and relies on GUP.
>
> Populating 512MiB of guest_memfd on a x86 machine:
>  - via memcpy: 436 ms
>  - via write:  202 ms (-54%)

Silly question: can you remind me why this speed-up is important?

Also, I think we can get the same effect as MADV_POPULATE_WRITE just
by making a second VMA for the memory file and reading the first byte
of each page. Is that a viable strategy for your use case?

Seems fine to me to allow write() for guest_memfd anyway. :)
On 10/09/2025 22:37, James Houghton wrote:
> On Tue, Sep 2, 2025 at 4:20 AM Kalyazin, Nikita <kalyazin@amazon.co.uk> wrote:
>>
>> [ based on kvm/next ]
>>
>> Implement guest_memfd allocation and population via the write syscall.
>> This is useful in non-CoCo use cases where the host can access guest
>> memory.  Even though the same can also be achieved via userspace mapping
>> and memcpying from userspace, write provides a more performant option
>> because it does not need to set page tables and it does not cause a page
>> fault for every page like memcpy would.  Note that memcpy cannot be
>> accelerated via MADV_POPULATE_WRITE as it is not supported by
>> guest_memfd and relies on GUP.
>>
>> Populating 512MiB of guest_memfd on a x86 machine:
>>  - via memcpy: 436 ms
>>  - via write:  202 ms (-54%)
>
> Silly question: can you remind me why this speed-up is important?

The speed-up is important for the Firecracker use case [1] because
population is likely to sit on the hot path of the snapshot restore
process.  Even though we aim to prepopulate the guest memory before it
gets accessed by the guest, for large VMs the guest has a good chance of
hitting a page that isn't yet populated, triggering on-demand fault
handling, which is much slower, and we'd like to avoid those faults as
much as we can.

[1]: https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/handling-page-faults-on-snapshot-resume.md

>
> Also, I think we can get the same effect as MADV_POPULATE_WRITE just
> by making a second VMA for the memory file and reading the first byte
> of each page. Is that a viable strategy for your use case?

If I understand correctly what you mean, it doesn't look much different
from the memcpy option I mention above.  All those one-byte read
accesses will trigger user mapping faults for every page, and they are
quite slow.  write() avoids them completely.

>
> Seems fine to me to allow write() for guest_memfd anyway. :)

Glad to hear that!
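The "read the first byte of each page" approach discussed above can be sketched as follows. This is an illustration under assumptions, not code from the thread. touch_each_page() and run_touch_demo() are hypothetical names, and a temporary regular file stands in for the guest_memfd mapping. Each volatile read is the access that would take a user mapping fault on a page not yet mapped, which is the per-page cost the reply refers to.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte at the start of every page in
 * [base, base + len). The volatile qualifier keeps the compiler from
 * dropping the loads. Returns the number of pages touched.
 */
static size_t touch_each_page(const volatile unsigned char *base,
			      size_t len, size_t page_size)
{
	size_t touched = 0;

	assert(page_size > 0);
	for (size_t off = 0; off < len; off += page_size) {
		(void)base[off];	/* one read access per page */
		touched++;
	}
	return touched;
}

/*
 * Demo on a stand-in: a 16-page unlinked temporary file mapped MAP_SHARED,
 * NOT a guest_memfd. Returns the number of pages touched, or -1 on error.
 */
static long run_touch_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char path[] = "/tmp/gmem-touch-XXXXXX";
	size_t len, n;
	void *map;
	int fd;

	if (page <= 0)
		return -1;
	len = 16 * (size_t)page;
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	n = touch_each_page(map, len, (size_t)page);
	munmap(map, len);
	close(fd);
	return (long)n;
}
```

The loop makes one access per page regardless of how much data is needed, so its cost scales with the page count. That matches the objection in the reply: the faults are what is slow, and write() avoids them.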