Based on David's suggestion for speeding up guest_memfd memory
population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
this adds `filemap_grab_folios` that grabs multiple folios at a time.

Motivation

When profiling guest_memfd population and comparing the results with
population of anonymous memory via UFFDIO_COPY, I observed that the
former was up to 20% slower, mainly due to adding newly allocated pages
to the pagecache. As far as I can see, the two main contributors to it
are pagecache locking and tree traversals needed for every folio. The
RFC attempts to partially mitigate those by adding multiple folios at a
time to the pagecache.

Testing

With the change applied, I was able to observe a 10.3% (708 to 635 ms)
speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
custom Firecracker version, both on Intel Ice Lake.

Limitations

While `filemap_grab_folios` handles THP/large folios internally and
deals with reclaim artifacts in the pagecache (shadows), for simplicity
reasons, the RFC does not support those as it demonstrates the
optimisation applied to guest_memfd, which only uses small folios and
does not support reclaim at the moment.

Implementation

I am aware of existing filemap APIs operating on folio batches, but I
was not able to find one that fits this use case.  I also considered
using the `folio_batch` struct, but could not convince myself that it
was useful here.  Instead, a plain array of folio pointers is allocated
on the stack and passed down the call chain.  A bitmap is used to keep
track of the indexes whose folios were already present in the
pagecache, to avoid allocating those again.  This does not look very
clean to me and I am more than open to hearing about better approaches.

Not being an expert in xarray, I do not know an idiomatic way to
advance the index when `xas_next` is called directly on a freshly
instantiated state that has never been walked, so I used a call to
`xas_set` instead.

While the series focuses on optimising _adding_ folios to the
pagecache, I also experimented with batching of pagecache _querying_.
Specifically, I tried to make use of `filemap_get_folios` instead of
`filemap_get_entry`, but I could not observe any visible speedup.

The series is applied on top of [1].  The 1st patch implements
`filemap_grab_folios`, while the 2nd patch makes use of it in
guest_memfd's write syscall as a first user.

Questions:
 - Does the approach look reasonable in general?
 - Can the API be kept specialised to the non-reclaim-supported case,
   or does it need to be generic?
 - Would it be sensible to add a specialised small-folio-only version
   of `filemap_grab_folios` at the beginning and extend it to large
   folios later on?
 - Are there better ways to implement batching, or even to achieve the
   optimisation goal in another way?

[1]: https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@amazon.com/T/
[2]: https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0

Thanks
Nikita

Nikita Kalyazin (2):
  mm: filemap: add filemap_grab_folios
  KVM: guest_memfd: use filemap_grab_folios in write

 include/linux/pagemap.h |  31 +++++
 mm/filemap.c            | 263 ++++++++++++++++++++++++++++++++++++++++
 virt/kvm/guest_memfd.c  | 176 ++++++++++++++++++++++-----
 3 files changed, 437 insertions(+), 33 deletions(-)

base-commit: 643cff38ebe84c39fbd5a0fc3ab053cd941b9f94
--
2.40.1
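For illustration, below is a caller-side sketch of how such a batched
grab could be used to populate a range of a mapping.  The signature of
`filemap_grab_folios`, the helper name `populate_range`, and the batch
size `NR_BATCH` are assumptions made for this sketch, not taken from
the patch; it only mirrors the cover letter's idea of a plain on-stack
array of folio pointers.

  #include <linux/pagemap.h>

  #define NR_BATCH 16	/* arbitrary batch size for the example */

  /*
   * Assumed interface: like filemap_grab_folio(), but returns up to @nr
   * locked, referenced folios for consecutive indexes starting at @index,
   * allocating and adding the ones not yet present in the pagecache.
   */
  static int populate_range(struct address_space *mapping, pgoff_t index)
  {
	struct folio *folios[NR_BATCH];	/* plain on-stack array, as in the RFC */
	long nr, i;

	nr = filemap_grab_folios(mapping, index, folios, NR_BATCH);
	if (nr < 0)
		return nr;

	for (i = 0; i < nr; i++) {
		/* copy or clear the guest data for folios[i] here */
		folio_mark_uptodate(folios[i]);
		folio_unlock(folios[i]);
		folio_put(folios[i]);
	}

	return 0;
  }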
On 10.01.25 16:46, Nikita Kalyazin wrote:
> Based on David's suggestion for speeding up guest_memfd memory
> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>

Hi,

> Motivation
>
> When profiling guest_memfd population and comparing the results with
> population of anonymous memory via UFFDIO_COPY, I observed that the
> former was up to 20% slower, mainly due to adding newly allocated pages
> to the pagecache. As far as I can see, the two main contributors to it
> are pagecache locking and tree traversals needed for every folio. The
> RFC attempts to partially mitigate those by adding multiple folios at a
> time to the pagecache.
>
> Testing
>
> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
> custom Firecracker version, both on Intel Ice Lake.

Does that mean that it's still 10% slower (based on the 20% above), or
were the 20% from a different micro-benchmark?

>
> Limitations
>
> While `filemap_grab_folios` handles THP/large folios internally and
> deals with reclaim artifacts in the pagecache (shadows), for simplicity
> reasons, the RFC does not support those as it demonstrates the
> optimisation applied to guest_memfd, which only uses small folios and
> does not support reclaim at the moment.

It might be worth pointing out that, while support for larger folios is
in the works, there will be scenarios where small folios are unavoidable
in the future (mixture of shared and private memory).

How hard would it be to just naturally support large folios as well?

We do have memfd_pin_folios() that can deal with that and provides a
slightly similar interface (struct folio **folios).

For reference, the interface is:

long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
                      struct folio **folios, unsigned int max_folios,
                      pgoff_t *offset)

Maybe what you propose could even be used to further improve
memfd_pin_folios() internally?  However, it must do this FOLL_PIN
thingy, so it must process each and every folio it pins.

--
Cheers,

David / dhildenb
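For context, here is a rough usage sketch of the existing
memfd_pin_folios() interface quoted above.  The wrapper name and the
batch size are made up for the example, headers are indicative, and
error handling is trimmed.

  #include <linux/mm.h>
  #include <linux/memfd.h>

  static long pin_a_few_folios(struct file *memfd, loff_t start, loff_t end)
  {
	struct folio *folios[32];	/* arbitrary batch size */
	pgoff_t offset;			/* set by memfd_pin_folios() for the first folio */
	long nr;

	/* Takes a FOLL_PIN reference on every folio it returns. */
	nr = memfd_pin_folios(memfd, start, end, folios,
			      ARRAY_SIZE(folios), &offset);
	if (nr > 0)
		unpin_folios(folios, nr);	/* drop the pins again */

	return nr;
  }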
On 10/01/2025 17:01, David Hildenbrand wrote:
> On 10.01.25 16:46, Nikita Kalyazin wrote:
>> Based on David's suggestion for speeding up guest_memfd memory
>> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
>> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>>
>
> Hi,

Hi :)

>
>> Motivation
>>
>> When profiling guest_memfd population and comparing the results with
>> population of anonymous memory via UFFDIO_COPY, I observed that the
>> former was up to 20% slower, mainly due to adding newly allocated pages
>> to the pagecache. As far as I can see, the two main contributors to it
>> are pagecache locking and tree traversals needed for every folio. The
>> RFC attempts to partially mitigate those by adding multiple folios at a
>> time to the pagecache.
>>
>> Testing
>>
>> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
>> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
>> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
>> custom Firecracker version, both on Intel Ice Lake.
>
> Does that mean that it's still 10% slower (based on the 20% above), or
> were the 20% from a different micro-benchmark?

Yes, it is still slower:
 - isolated/selftest: 2.3%
 - Firecracker setup: 8.9%

Not sure why the values are so different though.  I'll try to find an
explanation.

>>
>> Limitations
>>
>> While `filemap_grab_folios` handles THP/large folios internally and
>> deals with reclaim artifacts in the pagecache (shadows), for simplicity
>> reasons, the RFC does not support those as it demonstrates the
>> optimisation applied to guest_memfd, which only uses small folios and
>> does not support reclaim at the moment.
>
> It might be worth pointing out that, while support for larger folios is
> in the works, there will be scenarios where small folios are unavoidable
> in the future (mixture of shared and private memory).
>
> How hard would it be to just naturally support large folios as well?

I don't think it's going to be impossible.  It's just one more dimension
that needs to be handled.  The `__filemap_add_folio` logic is already
rather complex, and processing multiple folios while also splitting them
correctly when necessary looks substantially convoluted to me.  So my
idea was to discuss/validate the multi-folio approach first before
rolling the sleeves up.

> We do have memfd_pin_folios() that can deal with that and provides a
> slightly similar interface (struct folio **folios).
>
> For reference, the interface is:
>
> long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
>                       struct folio **folios, unsigned int max_folios,
>                       pgoff_t *offset)
>
> Maybe what you propose could even be used to further improve
> memfd_pin_folios() internally?  However, it must do this FOLL_PIN
> thingy, so it must process each and every folio it pins.

Thanks for the pointer.  Yeah, I see what you mean.  I guess it can
potentially allocate/add folios in a batch and then pin them?  Although
the swap/readahead logic may make it more difficult to implement.

> --
> Cheers,
>
> David / dhildenb
On 10.01.25 19:54, Nikita Kalyazin wrote:
> On 10/01/2025 17:01, David Hildenbrand wrote:
>> On 10.01.25 16:46, Nikita Kalyazin wrote:
>>> Based on David's suggestion for speeding up guest_memfd memory
>>> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
>>> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>>>
>>
>> Hi,
>
> Hi :)
>
>>
>>> Motivation
>>>
>>> When profiling guest_memfd population and comparing the results with
>>> population of anonymous memory via UFFDIO_COPY, I observed that the
>>> former was up to 20% slower, mainly due to adding newly allocated pages
>>> to the pagecache. As far as I can see, the two main contributors to it
>>> are pagecache locking and tree traversals needed for every folio. The
>>> RFC attempts to partially mitigate those by adding multiple folios at a
>>> time to the pagecache.
>>>
>>> Testing
>>>
>>> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
>>> speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
>>> 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
>>> custom Firecracker version, both on Intel Ice Lake.
>>
>> Does that mean that it's still 10% slower (based on the 20% above), or
>> were the 20% from a different micro-benchmark?
>
> Yes, it is still slower:
>  - isolated/selftest: 2.3%
>  - Firecracker setup: 8.9%
>
> Not sure why the values are so different though.  I'll try to find an
> explanation.

The 2.3% looks very promising.

>
>>>
>>> Limitations
>>>
>>> While `filemap_grab_folios` handles THP/large folios internally and
>>> deals with reclaim artifacts in the pagecache (shadows), for simplicity
>>> reasons, the RFC does not support those as it demonstrates the
>>> optimisation applied to guest_memfd, which only uses small folios and
>>> does not support reclaim at the moment.
>>
>> It might be worth pointing out that, while support for larger folios is
>> in the works, there will be scenarios where small folios are unavoidable
>> in the future (mixture of shared and private memory).
>>
>> How hard would it be to just naturally support large folios as well?
>
> I don't think it's going to be impossible.  It's just one more dimension
> that needs to be handled.  The `__filemap_add_folio` logic is already
> rather complex, and processing multiple folios while also splitting them
> correctly when necessary looks substantially convoluted to me.  So my
> idea was to discuss/validate the multi-folio approach first before
> rolling the sleeves up.

We should likely try making this as generic as possible, meaning we'll
support roughly what filemap_grab_folio() would have supported (e.g.,
also large folios).

Now I find filemap_get_folios_contig() [that is already used in memfd
code], and wonder if that could be reused/extended fairly easily.

--
Cheers,

David / dhildenb
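For reference, the lookup-side helper mentioned above only returns
folios that are already present in the pagecache and stops at the
first gap, so a "grab" variant would additionally have to allocate and
insert the missing ones.  A minimal usage sketch follows; the wrapper
name is made up for the example.

  #include <linux/pagemap.h>
  #include <linux/pagevec.h>

  static unsigned int get_contig_folios(struct address_space *mapping,
					pgoff_t *start, pgoff_t end)
  {
	struct folio_batch fbatch;
	unsigned int nr;

	folio_batch_init(&fbatch);
	/* Finds only folios already in the pagecache, up to the first hole. */
	nr = filemap_get_folios_contig(mapping, start, end, &fbatch);

	/* fbatch.folios[0..nr-1] are referenced but not locked; use them here. */

	folio_batch_release(&fbatch);	/* drop the references taken above */
	return nr;
  }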
On 13/01/2025 12:20, David Hildenbrand wrote:
> On 10.01.25 19:54, Nikita Kalyazin wrote:
>> On 10/01/2025 17:01, David Hildenbrand wrote:
>>> On 10.01.25 16:46, Nikita Kalyazin wrote:
>>>> Based on David's suggestion for speeding up guest_memfd memory
>>>> population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
>>>> this adds `filemap_grab_folios` that grabs multiple folios at a time.
>>>>
>>>
>>> Hi,
>>
>> Hi :)
>>
>>>
>>>> Motivation
>>>>
>>>> When profiling guest_memfd population and comparing the results with
>>>> population of anonymous memory via UFFDIO_COPY, I observed that the
>>>> former was up to 20% slower, mainly due to adding newly allocated pages
>>>> to the pagecache. As far as I can see, the two main contributors to it
>>>> are pagecache locking and tree traversals needed for every folio. The
>>>> RFC attempts to partially mitigate those by adding multiple folios at a
>>>> time to the pagecache.
>>>>
>>>> Testing
>>>>
>>>> With the change applied, I was able to observe a 10.3% (708 to 635 ms)
>>>> speedup in a selftest that populated 3GiB guest_memfd and a 9.5%
>>>> (990 to 904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot
>>>> using a custom Firecracker version, both on Intel Ice Lake.
>>>
>>> Does that mean that it's still 10% slower (based on the 20% above), or
>>> were the 20% from a different micro-benchmark?
>>
>> Yes, it is still slower:
>>  - isolated/selftest: 2.3%
>>  - Firecracker setup: 8.9%
>>
>> Not sure why the values are so different though.  I'll try to find an
>> explanation.
>
> The 2.3% looks very promising.

It does.  I sorted out my Firecracker setup and saw a similar figure
there, which made me more confident.

>>
>>>>
>>>> Limitations
>>>>
>>>> While `filemap_grab_folios` handles THP/large folios internally and
>>>> deals with reclaim artifacts in the pagecache (shadows), for simplicity
>>>> reasons, the RFC does not support those as it demonstrates the
>>>> optimisation applied to guest_memfd, which only uses small folios and
>>>> does not support reclaim at the moment.
>>>
>>> It might be worth pointing out that, while support for larger folios is
>>> in the works, there will be scenarios where small folios are unavoidable
>>> in the future (mixture of shared and private memory).
>>>
>>> How hard would it be to just naturally support large folios as well?
>>
>> I don't think it's going to be impossible.  It's just one more dimension
>> that needs to be handled.  The `__filemap_add_folio` logic is already
>> rather complex, and processing multiple folios while also splitting them
>> correctly when necessary looks substantially convoluted to me.  So my
>> idea was to discuss/validate the multi-folio approach first before
>> rolling the sleeves up.
>
> We should likely try making this as generic as possible, meaning we'll
> support roughly what filemap_grab_folio() would have supported (e.g.,
> also large folios).
>
> Now I find filemap_get_folios_contig() [that is already used in memfd
> code], and wonder if that could be reused/extended fairly easily.

Fair, I will look into how it could be made generic.

> --
> Cheers,
>
> David / dhildenb
>