This patchset is also available at:
https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
a snapshot of his patches[1] to provide tracking of whether or not
sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
before guest access:
d55475f23cea KVM: gmem: track preparedness a page at a time
64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
e3449f6841ef KVM: gmem: allocate private data for the gmem inode
[1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
This series addresses some of the pending review comments for those patches
(feel free to squash/rework as-needed), and implements a first real user in
the form of a reworked version of Sean's original 2MB THP support for gmem.
It is still a bit up in the air as to whether or not gmem should support
THP at all rather than moving straight to 2MB/1GB hugepages in the form of
something like HugeTLB folios[2] or the lower-level PFN range allocator
presented by Yu Zhao during the guest_memfd call last week. The main
argument against THP, as I understand it, is that THPs will become
split over time due to hole-punching and rarely have an opportunity to get
rebuilt due to the lack of memory migration support in current CoCo hypervisor
implementations like SNP (and adding migration support to resolve that
would not necessarily result in a net performance gain). The current
plan for SNP, as discussed during the first guest_memfd call, is to
implement something similar to 2MB HugeTLB, and disallow hole-punching
at sub-2MB granularity.
However, there have also been some discussions during recent PUCK calls
where the KVM maintainers have still expressed some interest in pulling
in gmem THP support in a more official capacity. The thinking there is that
hole-punching is a userspace policy, and that userspace could in theory avoid
hole-punching sub-2MB GFN ranges to prevent degradation over time.
And if there's a desire to enforce this from the kernel-side by blocking
sub-2MB hole-punching from the host-side, this would provide similar
semantics/behavior to the 2MB HugeTLB-like approach above.
So maybe there is still some room for discussion about these approaches.
Outside that, there are a number of other development areas where it would
be useful to at least have some experimental 2MB support in place so that
those efforts can be pursued in parallel: the preparedness tracking touched
on here, and exploring how that will intersect with other work such as
using gmem for both shared and private memory, mmap support, the
guest_memfd library, etc. My hope is that this approach could be useful
for that purpose at least, even if only as an out-of-tree stop-gap.
Thoughts/comments welcome!
[2] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/
Testing
-------
Currently, this series does not enable 2M support by default; instead, it
can be switched on/off dynamically via a module parameter:
echo 1 >/sys/module/kvm/parameters/gmem_2m_enabled
echo 0 >/sys/module/kvm/parameters/gmem_2m_enabled
This can be useful for simulating things like host pressure where we start
getting a mix of 4K/2MB allocations. I've used this to help test that the
preparedness-tracking still handles things properly in these situations.
But if we do decide to pull in THP support upstream, it would make more
sense to drop the parameter completely.
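
For reference, a minimal sketch of how a knob like this could gate the
allocation order; the function name and wiring below are illustrative
assumptions, not the actual code from this series:

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/pgtable.h>

static bool gmem_2m_enabled;
module_param(gmem_2m_enabled, bool, 0644);

/*
 * Illustrative only: try an order-9 (2MB with 4K pages) folio when the
 * knob is on, and fall back to a single 4K page otherwise or when the
 * huge allocation fails.
 */
static struct folio *gmem_alloc_folio(gfp_t gfp)
{
        struct folio *folio;

        if (gmem_2m_enabled) {
                folio = folio_alloc(gfp | __GFP_NOWARN, PMD_SHIFT - PAGE_SHIFT);
                if (folio)
                        return folio;
        }
        return folio_alloc(gfp, 0);
}

In a sketch like this, flipping the parameter at runtime only affects
folios allocated afterwards, which lines up with using the toggle to
produce a mix of 4K and 2MB allocations for testing.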
----------------------------------------------------------------
Michael Roth (4):
KVM: gmem: Don't rely on __kvm_gmem_get_pfn() for preparedness
KVM: gmem: Don't clear pages that have already been prepared
KVM: gmem: Hold filemap invalidate lock while allocating/preparing folios
KVM: SEV: Improve handling of large ranges in gmem prepare callback
Sean Christopherson (1):
KVM: Add hugepage support for dedicated guest memory
arch/x86/kvm/svm/sev.c | 163 ++++++++++++++++++++++++++------------------
include/linux/kvm_host.h | 2 +
virt/kvm/guest_memfd.c | 173 ++++++++++++++++++++++++++++++++++-------------
virt/kvm/kvm_main.c | 4 ++
4 files changed, 228 insertions(+), 114 deletions(-)
On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@amd.com> wrote:
>
> This patchset is also available at:
>
> https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
>
> and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> a snapshot of his patches[1] to provide tracking of whether or not
> sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> before guest access:
>
> d55475f23cea KVM: gmem: track preparedness a page at a time
> 64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> 17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> e3449f6841ef KVM: gmem: allocate private data for the gmem inode
>
> [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
>
> This series addresses some of the pending review comments for those patches
> (feel free to squash/rework as-needed), and implements a first real user in
> the form of a reworked version of Sean's original 2MB THP support for gmem.
>
Looking at the work targeted by Fuad to add in-place memory conversion
support via [1], and by Ackerley in the future to add hugetlb page
support, can the state tracking for preparedness be simplified as follows?
i) prepare guest memfd ranges when "first time an offset with
mappability = GUEST is allocated or first time an allocated offset has
mappability = GUEST". Some scenarios that would lead to guest memfd
range preparation:
- Create file with default mappability to host, fallocate, convert
- Create file with default mappability to Guest, guest faults on
private memory
ii) Unprepare guest memfd ranges when "first time an offset with
mappability = GUEST is deallocated or first time an allocated offset
has lost mappability = GUEST attribute". Some scenarios that would
lead to guest memfd range unprepare:
- Truncation
- Conversion
iii) To handle scenarios with hugepages, page splitting/merging in
guest memfd can also signal change in page granularities.
[1] https://lore.kernel.org/kvm/20250117163001.2326672-1-tabba@google.com/
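
A rough sketch of the triggers described in i)/ii) above; every name
below (the mappability states and the gmem_* helpers) is illustrative
only and does not come from Fuad's series or from this one:

enum gmem_mappability { MAPPABLE_ALL, MAPPABLE_GUEST };

struct gmem_range;                                /* some (inode, start, nr) tuple */
void gmem_prepare_range(struct gmem_range *r);    /* e.g. RMP update for SNP */
void gmem_unprepare_range(struct gmem_range *r);  /* e.g. return to host-owned */
bool gmem_range_allocated(struct gmem_range *r);
enum gmem_mappability gmem_mappability(struct gmem_range *r);

/* (i): first allocation of an offset that is already guest-only */
static void gmem_on_allocate(struct gmem_range *r)
{
        if (gmem_mappability(r) == MAPPABLE_GUEST)
                gmem_prepare_range(r);
}

/* (i)/(ii): mappability change on an already-allocated offset */
static void gmem_on_convert(struct gmem_range *r, enum gmem_mappability new)
{
        if (!gmem_range_allocated(r))
                return;
        if (new == MAPPABLE_GUEST)
                gmem_prepare_range(r);
        else
                gmem_unprepare_range(r);          /* truncation similarly unprepares */
}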
On Mon, Feb 10, 2025 at 05:16:33PM -0800, Vishal Annapurve wrote:
> On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > This patchset is also available at:
> >
> > https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> >
> > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> > a snapshot of his patches[1] to provide tracking of whether or not
> > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> > before guest access:
> >
> > d55475f23cea KVM: gmem: track preparedness a page at a time
> > 64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> > 17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> > e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> >
> > [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
> >
> > This series addresses some of the pending review comments for those patches
> > (feel free to squash/rework as-needed), and implements a first real user in
> > the form of a reworked version of Sean's original 2MB THP support for gmem.
> >
>
> Looking at the work targeted by Fuad to add in-place memory conversion
> support via [1] and Ackerley in future to address hugetlb page
> support, can the state tracking for preparedness be simplified as?
> i) prepare guest memfd ranges when "first time an offset with
> mappability = GUEST is allocated or first time an allocated offset has
> mappability = GUEST". Some scenarios that would lead to guest memfd
> range preparation:
> - Create file with default mappability to host, fallocate, convert
> - Create file with default mappability to Guest, guest faults on
> private memory

Yes, this seems like a compelling approach. One aspect that still
remains is knowing *when* the preparation has been done, so that the
next time a private page is accessed, either to re-fault into the guest
(e.g. because it was originally mapped 2MB and then a sub-page got
converted to shared so the still-private pages need to get re-faulted
in as 4K), or maybe some other path where KVM needs to grab the private
PFN via kvm_gmem_get_pfn() but not actually read/write to it (I think
the GHCB AP_CREATION path for bringing up APs might do this).

We could just keep re-checking the RMP table to see if the PFN was
already set to private in the RMP table, but I think one of the design
goals of the preparedness tracking was to have gmem itself be aware of
this and not farm it out to platform-specific data structures/tracking.

So as a proof of concept I've been experimenting with using Fuad's
series ([1] in your response) and adding an additional GUEST_PREPARED
state so that it can be tracked via the same mappability xarray (or
whatever data structure we end up using for mappability-tracking).
In that case GUEST becomes sort of a transient state that can be set
in advance of actual allocation/fault-time. That seems to have a lot of
nice characteristics, because (in that series at least) guest-mappable
(as opposed to all-mappable) specifically corresponds to private guest
pages, which for SNP require preparation before they can be mapped into
the nested page table so it seems like a natural fit.
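
As a rough illustration of the GUEST_PREPARED idea described above (the
enum values, helper names, and the use of a plain xarray are
assumptions, not code from Fuad's series or from this one):

#include <linux/xarray.h>

enum gmem_mappability {
        MAPPABLE_ALL,
        MAPPABLE_GUEST,                 /* private, prepare still pending */
        MAPPABLE_GUEST_PREPARED,        /* private, e.g. RMP entry installed */
};

static bool gmem_needs_prepare(struct xarray *mappability, pgoff_t index)
{
        void *entry = xa_load(mappability, index);

        return entry && xa_to_value(entry) == MAPPABLE_GUEST;
}

static int gmem_mark_prepared(struct xarray *mappability, pgoff_t index)
{
        return xa_err(xa_store(mappability, index,
                               xa_mk_value(MAPPABLE_GUEST_PREPARED), GFP_KERNEL));
}

With GUEST as the transient "guest-mappable but not yet prepared" state,
a fault path would consult gmem_needs_prepare() before issuing the arch
prepare hook and call gmem_mark_prepared() afterwards.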
> ii) Unprepare guest memfd ranges when "first time an offset with
> mappability = GUEST is deallocated or first time an allocated offset
> has lost mappability = GUEST attribute", some scenarios that would
> lead to guest memfd range unprepare:
> - Truncation
> - Conversion

Similar story here: it seems like a good fit. Truncation already does
the unprepare via the .free_folio->kvm_arch_gmem_invalidate callback,
and if we rework THP to behave similar to HugeTLB in that we only free
back the full 2MB folio rather than splitting it like in this series,
I think that might be sufficient for truncation. If userspace tries to
truncate a subset of a 2MB private folio we could no-op and just leave
it in GUEST_PREPARED.

If we stick with THP, my thinking is we tell userspace what the max
granularity is, and userspace will know that it must truncate with that
same granularity if it actually wants to free memory. It sounds like
the HugeTLB approach would similarly be providing this sort of
information. What's nice is that if we stick with a best-effort
THP-based allocator, and allow it to fall back to smaller page sizes,
this scheme would still work, since we'd still always be able to free
folios without splitting. But I'll try to get a better idea of what
this looks like in practice.

For conversion, we'd need to hook in an additional
kvm_arch_gmem_invalidate() somewhere to make sure the folio is
host-owned in the RMP table before transitioning to host/all-mappable,
but that seems pretty straightforward.

> iii) To handle scenarios with hugepages, page splitting/merging in
> guest memfd can also signal change in page granularities.

Not yet clear to me if extra handling for prepare/unprepare is needed
here, but it does seem like an option if needed.

Thanks,

Mike

>
> [1] https://lore.kernel.org/kvm/20250117163001.2326672-1-tabba@google.com/
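
A sketch of the "only truncate at max granularity" policy mentioned
above; the helper and GMEM_MAX_ORDER are hypothetical, and this omits
the invalidation and locking a real guest_memfd punch path would need:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pgtable.h>

#define GMEM_MAX_ORDER  (PMD_SHIFT - PAGE_SHIFT)        /* 2MB with 4K pages */

static long gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
        const loff_t gran = PAGE_SIZE << GMEM_MAX_ORDER;

        /* No-op (or -EINVAL) for punches smaller than the folio size. */
        if (!IS_ALIGNED(offset, gran) || !IS_ALIGNED(len, gran))
                return 0;

        truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
        return 0;
}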
On Wed, Feb 19, 2025 at 07:09:57PM -0600, Michael Roth wrote:
> On Mon, Feb 10, 2025 at 05:16:33PM -0800, Vishal Annapurve wrote:
> > On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@amd.com> wrote:
> > >
> > > This patchset is also available at:
> > >
> > > https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> > >
> > > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> > > a snapshot of his patches[1] to provide tracking of whether or not
> > > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> > > before guest access:
> > >
> > > d55475f23cea KVM: gmem: track preparedness a page at a time
> > > 64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> > > 17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> > > e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> > >
> > > [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
> > >
> > > This series addresses some of the pending review comments for those patches
> > > (feel free to squash/rework as-needed), and implements a first real user in
> > > the form of a reworked version of Sean's original 2MB THP support for gmem.
> > >
> >
> > Looking at the work targeted by Fuad to add in-place memory conversion
> > support via [1] and Ackerley in future to address hugetlb page
> > support, can the state tracking for preparedness be simplified as?
> > i) prepare guest memfd ranges when "first time an offset with
> > mappability = GUEST is allocated or first time an allocated offset has
> > mappability = GUEST". Some scenarios that would lead to guest memfd
> > range preparation:
> > - Create file with default mappability to host, fallocate, convert
> > - Create file with default mappability to Guest, guest faults on
> > private memory
>
> Yes, this seems like a compelling approach. One aspect that still
> remains is knowing *when* the preparation has been done, so that the
> next time a private page is accessed, either to re-fault into the guest
> (e.g. because it was originally mapped 2MB and then a sub-page got
> converted to shared so the still-private pages need to get re-faulted
> in as 4K), or maybe some other path where KVM needs to grab the private
> PFN via kvm_gmem_get_pfn() but not actually read/write to it (I think
> the GHCB AP_CREATION path for bringing up APs might do this).
>
> We could just keep re-checking the RMP table to see if the PFN was
> already set to private in the RMP table, but I think one of the design
> goals of the preparedness tracking was to have gmem itself be aware of
> this and not farm it out to platform-specific data structures/tracking.
>
> So as a proof of concept I've been experimenting with using Fuad's
> series ([1] in your response) and adding an additional GUEST_PREPARED
> state so that it can be tracked via the same mappability xarray (or
> whatever data structure we end up using for mappability-tracking).
> In that case GUEST becomes sort of a transient state that can be set
> in advance of actual allocation/fault-time.

Hi Michael,

We are currently working on enabling 2M huge pages on TDX. We noticed
this series and hope it could also work with TDX huge pages.

While disallowing <2M page conversion is also not ideal for TDX, we
also think that it would be great if we could start with 2M and
non-in-place conversion first. In that case, is memory fragmentation
caused by partial discarding a problem for you [1]?

Is page promotion a must in your initial huge page support?

Do you have any repo containing your latest POC?

Thanks
Yan

[1] https://lore.kernel.org/all/Z9PyLE%2FLCrSr2jCM@yzhao56-desk.sh.intel.com/
On 12.12.24 07:36, Michael Roth wrote:
> This patchset is also available at:
>
> https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
>
> and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> a snapshot of his patches[1] to provide tracking of whether or not
> sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> before guest access:
>
> d55475f23cea KVM: gmem: track preparedness a page at a time
> 64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> 17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> e3449f6841ef KVM: gmem: allocate private data for the gmem inode
>
> [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
>
> This series addresses some of the pending review comments for those patches
> (feel free to squash/rework as-needed), and implements a first real user in
> the form of a reworked version of Sean's original 2MB THP support for gmem.
>
> It is still a bit up in the air as to whether or not gmem should support
> THP at all rather than moving straight to 2MB/1GB hugepages in the form of
> something like HugeTLB folios[2] or the lower-level PFN range allocator
> presented by Yu Zhao during the guest_memfd call last week. The main
> arguments against THP, as I understand it, is that THPs will become
> split over time due to hole-punching and rarely have an opportunity to get
> rebuilt due to lack of memory migration support for current CoCo hypervisor
> implementations like SNP (and adding the migration support to resolve that
> not necessarily resulting in a net-gain performance-wise). The current
> plan for SNP, as discussed during the first guest_memfd call, is to
> implement something similar to 2MB HugeTLB, and disallow hole-punching
> at sub-2MB granularity.
>
> However, there have also been some discussions during recent PUCK calls
> where the KVM maintainers have some still expressed some interest in pulling
> in gmem THP support in a more official capacity. The thinking there is that
> hole-punching is a userspace policy, and that it could in theory avoid
> holepunching for sub-2MB GFN ranges to avoid degradation over time.
> And if there's a desire to enforce this from the kernel-side by blocking
> sub-2MB hole-punching from the host-side, this would provide similar
> semantics/behavior to the 2MB HugeTLB-like approach above.
>
> So maybe there is still some room for discussion about these approaches.
>
> Outside that, there are a number of other development areas where it would
> be useful to at least have some experimental 2MB support in place so that
> those efforts can be pursued in parallel, such as the preparedness
> tracking touched on here, and exploring how that will intersect with other
> development areas like using gmem for both shared and private memory, mmap
> support, guest_memfd library, etc., so my hopes are that this approach
> could be useful for that purpose at least, even if only as an out-of-tree
> stop-gap.
>
> Thoughts/comments welcome!
Sorry for the late reply, it's been a couple of crazy weeks, and I'm
trying to give at least some feedback on stuff in my inbox before even
more will pile up over Christmas :) . Let me summarize my thoughts:
THPs in Linux rely on the following principle:
(1) We try allocating a THP; if that fails we rely on khugepaged to fix
    it up later (shmem+anon). So if we cannot grab a free THP, we
    defer it to a later point.
(2) We try to be as transparent as possible: punching a hole will
usually destroy the THP (either immediately for shmem/pagecache or
deferred for anon memory) to free up the now-free pages. That's
different to hugetlb, where partial hole-punching will always zero-
out the memory only; the partial memory will not get freed up and
will get reused later.
Destroying a THP for shmem/pagecache only works if there are no
unexpected page references, so there can be cases where we fail to
free up memory. For the pagecache that's not really
an issue, because memory reclaim will fix that up at some point. For
shmem, there were discussions about scanning for zeroed pages and freeing
them up during memory reclaim, just like we do now for anon memory
as well.
(3) Memory compaction is vital for guaranteeing that we will be able to
    create THPs the longer the system has been running.
With guest_memfd we cannot rely on any daemon to fix it up as in (1) for
us later (would require page memory migration support).
We use truncate_inode_pages_range(), which will split a THP into small
pages if you partially punch-hole it, so (2) would apply; splitting
might fail as well in some cases if there are unexpected references.
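
For concreteness, the userspace operation being discussed is a partial
hole-punch that only covers part of a (potentially) huge folio in a
guest_memfd; a minimal illustration, assuming gmem_fd is a guest_memfd
file descriptor:

#define _GNU_SOURCE
#include <fcntl.h>

/* Punch a single 4K page out of what may be a 2MB folio. */
static int punch_one_page(int gmem_fd, off_t offset)
{
        /* offset assumed 4K-aligned but not 2MB-aligned */
        return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, 4096);
}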
I wonder what would happen if user space would punch a hole in private
memory, making truncate_inode_pages_range() overwrite it with 0s if
splitting the THP failed (memory write to private pages under TDX?).
Maybe something similar would happen if a private page would get 0-ed
out when freeing+reallocating it, not sure how that is handled.
guest_memfd currently actively works against (3) as soon as we (A)
fall back to allocating small pages or (B) split a THP due to hole
punching, as the remaining fragments cannot get reassembled anymore.
I assume there is some truth to "hole-punching is a userspace policy",
but this mechanism will actively work against itself as soon as you
start falling back to small pages in any way.
So I'm wondering if a better start would be to (A) always allocate huge
pages from the buddy (no fallback) and (B) either disallow partial punches
or have them only zero out the memory. But even a sequence of partial
punches that cover the whole huge page will not end up freeing all parts
if splitting failed at some point, which I quite dislike ...
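
A sketch of option (A), in contrast to the fall-back-to-4K behavior
discussed earlier in the thread; the helper name is illustrative:

#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/pgtable.h>

/* Insist on a 2MB folio from the buddy; fail rather than fall back. */
static struct folio *gmem_alloc_huge_only(gfp_t gfp)
{
        struct folio *folio;

        folio = folio_alloc(gfp | __GFP_NOWARN, PMD_SHIFT - PAGE_SHIFT);
        if (!folio)
                return ERR_PTR(-ENOMEM);        /* no small-page fallback */
        return folio;
}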
But then we'd need memory preallocation, and I suspect to make this
really useful -- just like with 2M/1G "hugetlb" support -- in-place
shared<->private conversion will be a requirement. ... at which point
we'd have reached the state where it's almost the 2M hugetlb support.
This is not a very strong push back, more a "this does not quite sound
right to me" and I have the feeling that this might get in the way of
in-place shared<->private conversion; I might be wrong about the latter
though.
With memory compaction working for guest_memfd, it would all be easier.
Note that I'm not quite sure about the "2MB" interface; should it be a
"PMD-size" interface?
--
Cheers,
David / dhildenb
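
On the "2MB" vs "PMD-size" question above, a tiny illustration of
expressing the limit in PMD terms rather than hard-coding 2MB (the
macro names are assumptions):

#include <linux/pgtable.h>

/*
 * Stays correct on architectures/configs with other base page sizes,
 * e.g. arm64 with 64K pages, where a PMD maps 512MB.
 */
#define GMEM_HUGE_ORDER (PMD_SHIFT - PAGE_SHIFT)        /* 9 -> 2MB with 4K pages */
#define GMEM_HUGE_SIZE  (1UL << PMD_SHIFT)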
On Fri, 2024-12-20 at 12:31 +0100, David Hildenbrand wrote: > On 12.12.24 07:36, Michael Roth wrote: > > This patchset is also available at: > > > > https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1 > > > > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which > > includes > > a snapshot of his patches[1] to provide tracking of whether or not > > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() > > hooks issued > > before guest access: > > > > d55475f23cea KVM: gmem: track preparedness a page at a time > > 64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the > > file > > 17df70a5ea65 KVM: gmem: add a complete set of functions to query > > page preparedness > > e3449f6841ef KVM: gmem: allocate private data for the gmem inode > > > > [1] > > https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/ > > > > This series addresses some of the pending review comments for those > > patches > > (feel free to squash/rework as-needed), and implements a first real > > user in > > the form of a reworked version of Sean's original 2MB THP support > > for gmem. > > > > It is still a bit up in the air as to whether or not gmem should > > support > > THP at all rather than moving straight to 2MB/1GB hugepages in the > > form of > > something like HugeTLB folios[2] or the lower-level PFN range > > allocator > > presented by Yu Zhao during the guest_memfd call last week. The > > main > > arguments against THP, as I understand it, is that THPs will become > > split over time due to hole-punching and rarely have an opportunity > > to get > > rebuilt due to lack of memory migration support for current CoCo > > hypervisor > > implementations like SNP (and adding the migration support to > > resolve that > > not necessarily resulting in a net-gain performance-wise). The > > current > > plan for SNP, as discussed during the first guest_memfd call, is to > > implement something similar to 2MB HugeTLB, and disallow hole- > > punching > > at sub-2MB granularity. > > > > However, there have also been some discussions during recent PUCK > > calls > > where the KVM maintainers have some still expressed some interest > > in pulling > > in gmem THP support in a more official capacity. The thinking there > > is that > > hole-punching is a userspace policy, and that it could in theory > > avoid > > holepunching for sub-2MB GFN ranges to avoid degradation over time. > > And if there's a desire to enforce this from the kernel-side by > > blocking > > sub-2MB hole-punching from the host-side, this would provide > > similar > > semantics/behavior to the 2MB HugeTLB-like approach above. > > > > So maybe there is still some room for discussion about these > > approaches. > > > > Outside that, there are a number of other development areas where > > it would > > be useful to at least have some experimental 2MB support in place > > so that > > those efforts can be pursued in parallel, such as the preparedness > > tracking touched on here, and exploring how that will intersect > > with other > > development areas like using gmem for both shared and private > > memory, mmap > > support, guest_memfd library, etc., so my hopes are that this > > approach > > could be useful for that purpose at least, even if only as an out- > > of-tree > > stop-gap. > > > > Thoughts/comments welcome! > > Sorry for the late reply, it's been a couple of crazy weeks, and I'm > trying to give at least some feedback on stuff in my inbox before > even > more will pile up over Christmas :) . 
Let me summarize my thoughts: My turn for the lateness - back from a break. I should also preface that Mike is off for at least a month more, but he will return to continue working on this. In the meantime, I've had a chat with him about this work to keep the discussion alive on the lists. > THPs in Linux rely on the following principle: > > (1) We try allocating a THP, if that fails we rely on khugepaged to > fix > it up later (shmem+anon). So id we cannot grab a free THP, we > deffer it to a later point. > > (2) We try to be as transparent as possible: punching a hole will > usually destroy the THP (either immediately for shmem/pagecache > or > deferred for anon memory) to free up the now-free pages. That's > different to hugetlb, where partial hole-punching will always > zero- > out the memory only; the partial memory will not get freed up > and > will get reused later. > > Destroying a THP for shmem/pagecache only works if there are no > unexpected page references, so there can be cases where we fail > to > free up memory. For the pagecache that's not really > an issue, because memory reclaim will fix that up at some point. > For > shmem, there were discussions to do scan for 0ed pages and free > them up during memory reclaim, just like we do now for anon > memory > as well. > > (3) Memory compaction is vital for guaranteeing that we will be able > to > create THPs the longer the system was running, > > > With guest_memfd we cannot rely on any daemon to fix it up as in (1) > for > us later (would require page memory migration support). True. And not having a huge page when requested to begin with (as in 1 above) beats the purpose entirely -- the point is to speed up SEV-SNP setup and guests by having fewer pages to work with. > We use truncate_inode_pages_range(), which will split a THP into > small > pages if you partially punch-hole it, so (2) would apply; splitting > might fail as well in some cases if there are unexpected references. > > I wonder what would happen if user space would punch a hole in > private > memory, making truncate_inode_pages_range() overwrite it with 0s if > splitting the THP failed (memory write to private pages under TDX?). > Maybe something similar would happen if a private page would get 0-ed > out when freeing+reallocating it, not sure how that is handled. > > > guest_memfd currently actively works against (3) as soon as we (A) > fallback to allocating small pages or (B) split a THP due to hole > punching, as the remaining fragments cannot get reassembled anymore. > > I assume there is some truth to "hole-punching is a userspace > policy", > but this mechanism will actively work against itself as soon as you > start falling back to small pages in any way. > > > > So I'm wondering if a better start would be to (A) always allocate > huge > pages from the buddy (no fallback) and that sounds fine.. > (B) partial punches are either > disallowed or only zero-out the memory. But even a sequence of > partial > punches that cover the whole huge page will not end up freeing all > parts > if splitting failed at some point, which I quite dislike ... ... this basically just looks like hugetlb support (i.e. without the "transparent" part), isn't it? > But then we'd need memory preallocation, and I suspect to make this > really useful -- just like with 2M/1G "hugetlb" support -- in-place > shared<->private conversion will be a requirement. ... at which point > we'd have reached the state where it's almost the 2M hugetlb support. Right, exactly. 
> This is not a very strong push back, more a "this does not quite
> sound
> right to me" and I have the feeling that this might get in the way of
> in-place shared<->private conversion; I might be wrong about the
> latter
> though.

TBH my 2c are that getting hugepage support in, and disabling THP for
SEV-SNP guests, will work fine.

But as Mike mentioned above, this series is to add a user on top of
Paolo's work - and that seems more straightforward to experiment with
and figure out hugepage support in general while getting all the other
hugepage details done in parallel.

> With memory compaction working for guest_memfd, it would all be
> easier.

... btw do you know how well this is coming along?

> Note that I'm not quite sure about the "2MB" interface, should it be
> a
> "PMD-size" interface?

I think Mike and I touched upon this aspect too - and I may be
misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
in increments -- and then fitting in PMD sizes when we've had enough of
those. That is to say he didn't want to preclude it, or gate the PMD
work on enabling all sizes first.

Amit
>> Sorry for the late reply, it's been a couple of crazy weeks, and I'm >> trying to give at least some feedback on stuff in my inbox before >> even >> more will pile up over Christmas :) . Let me summarize my thoughts: > > My turn for the lateness - back from a break. > > I should also preface that Mike is off for at least a month more, but > he will return to continue working on this. In the meantime, I've had > a chat with him about this work to keep the discussion alive on the > lists. So now it's my turn to being late again ;) As promised during the last call, a few points from my side. > >> THPs in Linux rely on the following principle: >> >> (1) We try allocating a THP, if that fails we rely on khugepaged to >> fix >> it up later (shmem+anon). So id we cannot grab a free THP, we >> deffer it to a later point. >> >> (2) We try to be as transparent as possible: punching a hole will >> usually destroy the THP (either immediately for shmem/pagecache >> or >> deferred for anon memory) to free up the now-free pages. That's >> different to hugetlb, where partial hole-punching will always >> zero- >> out the memory only; the partial memory will not get freed up >> and >> will get reused later. >> >> Destroying a THP for shmem/pagecache only works if there are no >> unexpected page references, so there can be cases where we fail >> to >> free up memory. For the pagecache that's not really >> an issue, because memory reclaim will fix that up at some point. >> For >> shmem, there were discussions to do scan for 0ed pages and free >> them up during memory reclaim, just like we do now for anon >> memory >> as well. >> >> (3) Memory compaction is vital for guaranteeing that we will be able >> to >> create THPs the longer the system was running, >> >> >> With guest_memfd we cannot rely on any daemon to fix it up as in (1) >> for >> us later (would require page memory migration support). > > True. And not having a huge page when requested to begin with (as in 1 > above) beats the purpose entirely -- the point is to speed up SEV-SNP > setup and guests by having fewer pages to work with. Right. > >> We use truncate_inode_pages_range(), which will split a THP into >> small >> pages if you partially punch-hole it, so (2) would apply; splitting >> might fail as well in some cases if there are unexpected references. >> >> I wonder what would happen if user space would punch a hole in >> private >> memory, making truncate_inode_pages_range() overwrite it with 0s if >> splitting the THP failed (memory write to private pages under TDX?). >> Maybe something similar would happen if a private page would get 0-ed >> out when freeing+reallocating it, not sure how that is handled. >> >> >> guest_memfd currently actively works against (3) as soon as we (A) >> fallback to allocating small pages or (B) split a THP due to hole >> punching, as the remaining fragments cannot get reassembled anymore. >> >> I assume there is some truth to "hole-punching is a userspace >> policy", >> but this mechanism will actively work against itself as soon as you >> start falling back to small pages in any way. >> >> >> >> So I'm wondering if a better start would be to (A) always allocate >> huge >> pages from the buddy (no fallback) and > > that sounds fine.. > >> (B) partial punches are either >> disallowed or only zero-out the memory. But even a sequence of >> partial >> punches that cover the whole huge page will not end up freeing all >> parts >> if splitting failed at some point, which I quite dislike ... > > ... 
this basically just looks like hugetlb support (i.e. without the > "transparent" part), isn't it? Yes, just using a different allocator until we have a predictable allocator with reserves. Note that I am not sure how much "transparent" here really applies, given the differences to THPs ... > >> But then we'd need memory preallocation, and I suspect to make this >> really useful -- just like with 2M/1G "hugetlb" support -- in-place >> shared<->private conversion will be a requirement. ... at which point >> we'd have reached the state where it's almost the 2M hugetlb support. > > Right, exactly. > >> This is not a very strong push back, more a "this does not quite >> sound >> right to me" and I have the feeling that this might get in the way of >> in-place shared<->private conversion; I might be wrong about the >> latter >> though. As discussed in the last bi-weekly MM meeting (and in contrast to what I assumed), Vishal was right: we should be able to support in-place shared<->private conversion as long as we can split a large folio when any page of it is getting converted to shared. (split is possible if there are no unexpected folio references; private pages cannot be GUP'ed, so it is feasible) So similar to the hugetlb work, that split would happen and would be a bit "easier", because ordinary folios (in contrast to hugetlb) are prepared to be split. So supporting larger folios for private memory might not make in-place conversion significantly harder; the important part is that shared folios may only be small. The split would just mean that we start exposing individual small folios to the core-mm, not that we would allow page migration for the shared parts etc. So the "whole 2M chunk" will remain allocated to guest_memfd. > > TBH my 2c are that getting hugepage supported, and disabling THP for > SEV-SNP guests will work fine. Likely it will not be that easy as soon as hugetlb reserves etc. will come into play. > > But as Mike mentioned above, this series is to add a user on top of > Paolo's work - and that seems more straightforward to experiment with > and figure out hugepage support in general while getting all the other > hugepage details done in parallel. I would suggest to not call this "THP". Maybe we can call it "2M folio support" for gmem. Similar to other FSes, we could just not limit ourselves to 2M folios, and simply allocate any large folios. But sticking to 2M might be beneficial in regards to memory fragmentation (below). > >> With memory compaction working for guest_memfd, it would all be >> easier. > > ... btw do you know how well this is coming along? People have been talking about that, but I suspect this is very long-term material. > >> Note that I'm not quite sure about the "2MB" interface, should it be >> a >> "PMD-size" interface? > > I think Mike and I touched upon this aspect too - and I may be > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes > in increments -- and then fitting in PMD sizes when we've had enough of > those. That is to say he didn't want to preclude it, or gate the PMD > work on enabling all sizes first. Starting with 2M is reasonable for now. The real question is how we want to deal with (a) Not being able to allocate a 2M folio reliably (b) Partial discarding Using only (unmovable) 2M folios would effectively not cause any real memory fragmentation in the system, because memory compaction operates on 2M pageblocks on x86. So that feels quite compelling. 
Ideally we'd have a 2M pagepool from which guest_memfd would allocate
pages and to which it would put back pages. Yes, this sounds similar to
hugetlb, but it might be much easier to implement, because we are not
limited by some of the hugetlb design decisions (HVO, not being able to
partially map them, etc.).

--
Cheers,
David / dhildenb
On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
> (split is possible if there are no unexpected folio references; private
> pages cannot be GUP'ed, so it is feasible)
...
> > > Note that I'm not quite sure about the "2MB" interface, should it be
> > > a
> > > "PMD-size" interface?
> >
> > I think Mike and I touched upon this aspect too - and I may be
> > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
> > in increments -- and then fitting in PMD sizes when we've had enough of
> > those. That is to say he didn't want to preclude it, or gate the PMD
> > work on enabling all sizes first.
>
> Starting with 2M is reasonable for now. The real question is how we want to
> deal with

Hi David,

I'm just trying to understand the background of in-place conversion.

Regarding to the two issues you mentioned with THP and non-in-place-conversion,
I have some questions (still based on starting with 2M):

> (a) Not being able to allocate a 2M folio reliably

If we start with fault in private pages from guest_memfd (not in page pool way)
and shared pages anonymously, is it correct to say that this is only a concern
when memory is under pressure?

> (b) Partial discarding

For shared pages, page migration and folio split are possible for shared THP?

For private pages, as you pointed out earlier, if we can ensure there are no
unexpected folio references for private memory, splitting a private huge folio
should succeed. Are you concerned about the memory fragmentation after repeated
partial conversions of private pages to and from shared?

Thanks
Yan

[1] https://lore.kernel.org/all/Z9PyLE%2FLCrSr2jCM@yzhao56-desk.sh.intel.com/

> Using only (unmovable) 2M folios would effectively not cause any real memory
> fragmentation in the system, because memory compaction operates on 2M
> pageblocks on x86. So that feels quite compelling.
>
> Ideally we'd have a 2M pagepool from which guest_memfd would allocate pages
> and to which it would putback pages. Yes, this sound similar to hugetlb, but
> might be much easier to implement, because we are not limited by some of the
> hugetlb design decisions (HVO, not being able to partially map them, etc.).
On 14.03.25 10:09, Yan Zhao wrote:
> On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
>> (split is possible if there are no unexpected folio references; private
>> pages cannot be GUP'ed, so it is feasible)
> ...
>>>> Note that I'm not quite sure about the "2MB" interface, should it be
>>>> a
>>>> "PMD-size" interface?
>>>
>>> I think Mike and I touched upon this aspect too - and I may be
>>> misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
>>> in increments -- and then fitting in PMD sizes when we've had enough of
>>> those. That is to say he didn't want to preclude it, or gate the PMD
>>> work on enabling all sizes first.
>>
>> Starting with 2M is reasonable for now. The real question is how we want to
>> deal with
> Hi David,
>

Hi!

> I'm just trying to understand the background of in-place conversion.
>
> Regarding to the two issues you mentioned with THP and non-in-place-conversion,
> I have some questions (still based on starting with 2M):
>
>> (a) Not being able to allocate a 2M folio reliably
> If we start with fault in private pages from guest_memfd (not in page pool way)
> and shared pages anonymously, is it correct to say that this is only a concern
> when memory is under pressure?

Usually, fragmentation starts being a problem under memory pressure, and
memory pressure can show up simply because the page cache makes use of as
much memory as it wants.

As soon as we start allocating a 2 MB page for guest_memfd, to then split it
up + free only some parts back to the buddy (on private->shared conversion),
we create fragmentation that cannot get resolved as long as the remaining
private pages are not freed. A new conversion from shared->private on the
previously freed parts will allocate other unmovable pages (not the freed
ones) and make fragmentation worse.

In-place conversion improves that quite a lot, because guest_memfd itself
will not cause unmovable fragmentation. Of course, under memory pressure,
when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But
then, we already had fragmentation (and did not really cause any new one).

We discussed in the upstream call that if guest_memfd (primarily) only
allocates 2M pages and frees 2M pages, it will not cause fragmentation
itself, which is pretty nice.

>
>> (b) Partial discarding
> For shared pages, page migration and folio split are possible for shared THP?

I assume by "shared" you mean "not guest_memfd, but some other memory we use
as an overlay" -- so no in-place conversion.

Yes, that should be possible as long as nothing else prevents
migration/split (e.g., longterm pinning)

>
> For private pages, as you pointed out earlier, if we can ensure there are no
> unexpected folio references for private memory, splitting a private huge folio
> should succeed.

Yes, and maybe (hopefully) we'll reach a point where private parts will not
have a refcount at all (initially, frozen refcount, discussed during the
last upstream call).

> Are you concerned about the memory fragmentation after repeated
> partial conversions of private pages to and from shared?

Not only repeated, even just a single partial conversion. But of course,
repeated partial conversions will make it worse (e.g., never getting a
private huge page back when there was a partial conversion).

--
Cheers,
David / dhildenb
On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote: > On 14.03.25 10:09, Yan Zhao wrote: > > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote: > > > (split is possible if there are no unexpected folio references; private > > > pages cannot be GUP'ed, so it is feasible) > > ... > > > > > Note that I'm not quite sure about the "2MB" interface, should it be > > > > > a > > > > > "PMD-size" interface? > > > > > > > > I think Mike and I touched upon this aspect too - and I may be > > > > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes > > > > in increments -- and then fitting in PMD sizes when we've had enough of > > > > those. That is to say he didn't want to preclude it, or gate the PMD > > > > work on enabling all sizes first. > > > > > > Starting with 2M is reasonable for now. The real question is how we want to > > > deal with > > Hi David, > > > > Hi! > > > I'm just trying to understand the background of in-place conversion. > > > > Regarding to the two issues you mentioned with THP and non-in-place-conversion, > > I have some questions (still based on starting with 2M): > > > > > (a) Not being able to allocate a 2M folio reliably > > If we start with fault in private pages from guest_memfd (not in page pool way) > > and shared pages anonymously, is it correct to say that this is only a concern > > when memory is under pressure? > > Usually, fragmentation starts being a problem under memory pressure, and > memory pressure can show up simply because the page cache makes us of as > much memory as it wants. > > As soon as we start allocating a 2 MB page for guest_memfd, to then split it > up + free only some parts back to the buddy (on private->shared conversion), > we create fragmentation that cannot get resolved as long as the remaining > private pages are not freed. A new conversion from shared->private on the > previously freed parts will allocate other unmovable pages (not the freed > ones) and make fragmentation worse. Ah, I see. The problem of fragmentation is because memory allocated by guest_memfd is unmovable. So after freeing part of a 2MB folio, the whole 2MB is still unmovable. I previously thought fragmentation would only impact the guest by providing no new huge pages. So if a confidential VM does not support merging small PTEs into a huge PMD entry in its private page table, even if the new huge memory range is physically contiguous after a private->shared->private conversion, the guest still cannot bring back huge pages. > In-place conversion improves that quite a lot, because guest_memfd tself > will not cause unmovable fragmentation. Of course, under memory pressure, > when and cannot allocate a 2M page for guest_memfd, it's unavoidable. But > then, we already had fragmentation (and did not really cause any new one). > > We discussed in the upstream call, that if guest_memfd (primarily) only > allocates 2M pages and frees 2M pages, it will not cause fragmentation > itself, which is pretty nice. Makes sense. > > > > > (b) Partial discarding > > For shared pages, page migration and folio split are possible for shared THP? > > I assume by "shared" you mean "not guest_memfd, but some other memory we use Yes, not guest_memfd, in the case of non-in-place conversion. > as an overlay" -- so no in-place conversion. 
> > Yes, that should be possible as long as nothing else prevents > migration/split (e.g., longterm pinning) > > > > > For private pages, as you pointed out earlier, if we can ensure there are no > > unexpected folio references for private memory, splitting a private huge folio > > should succeed. > > Yes, and maybe (hopefully) we'll reach a point where private parts will not > have a refcount at all (initially, frozen refcount, discussed during the > last upstream call). Yes, I also tested in TDX by not acquiring folio ref count in TDX specific code and found that partial splitting could work. > Are you concerned about the memory fragmentation after repeated > > partial conversions of private pages to and from shared? > > Not only repeated, even just a single partial conversion. But of course, > repeated partial conversions will make it worse (e.g., never getting a > private huge page back when there was a partial conversion). Thanks for the explanation! Do you think there's any chance for guest_memfd to support non-in-place conversion first?
On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote: > On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote: > > On 14.03.25 10:09, Yan Zhao wrote: > > > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote: > > > > (split is possible if there are no unexpected folio references; private > > > > pages cannot be GUP'ed, so it is feasible) > > > ... > > > > > > Note that I'm not quite sure about the "2MB" interface, should it be > > > > > > a > > > > > > "PMD-size" interface? > > > > > > > > > > I think Mike and I touched upon this aspect too - and I may be > > > > > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes > > > > > in increments -- and then fitting in PMD sizes when we've had enough of > > > > > those. That is to say he didn't want to preclude it, or gate the PMD > > > > > work on enabling all sizes first. > > > > > > > > Starting with 2M is reasonable for now. The real question is how we want to > > > > deal with > > > Hi David, > > > > > > > Hi! > > > > > I'm just trying to understand the background of in-place conversion. > > > > > > Regarding to the two issues you mentioned with THP and non-in-place-conversion, > > > I have some questions (still based on starting with 2M): > > > > > > > (a) Not being able to allocate a 2M folio reliably > > > If we start with fault in private pages from guest_memfd (not in page pool way) > > > and shared pages anonymously, is it correct to say that this is only a concern > > > when memory is under pressure? > > > > Usually, fragmentation starts being a problem under memory pressure, and > > memory pressure can show up simply because the page cache makes us of as > > much memory as it wants. > > > > As soon as we start allocating a 2 MB page for guest_memfd, to then split it > > up + free only some parts back to the buddy (on private->shared conversion), > > we create fragmentation that cannot get resolved as long as the remaining > > private pages are not freed. A new conversion from shared->private on the > > previously freed parts will allocate other unmovable pages (not the freed > > ones) and make fragmentation worse. > Ah, I see. The problem of fragmentation is because memory allocated by > guest_memfd is unmovable. So after freeing part of a 2MB folio, the whole 2MB is > still unmovable. > > I previously thought fragmentation would only impact the guest by providing no > new huge pages. So if a confidential VM does not support merging small PTEs into > a huge PMD entry in its private page table, even if the new huge memory range is > physically contiguous after a private->shared->private conversion, the guest > still cannot bring back huge pages. > > > In-place conversion improves that quite a lot, because guest_memfd tself > > will not cause unmovable fragmentation. Of course, under memory pressure, > > when and cannot allocate a 2M page for guest_memfd, it's unavoidable. But > > then, we already had fragmentation (and did not really cause any new one). > > > > We discussed in the upstream call, that if guest_memfd (primarily) only > > allocates 2M pages and frees 2M pages, it will not cause fragmentation > > itself, which is pretty nice. > Makes sense. > > > > > > > > (b) Partial discarding > > > For shared pages, page migration and folio split are possible for shared THP? > > > > I assume by "shared" you mean "not guest_memfd, but some other memory we use > Yes, not guest_memfd, in the case of non-in-place conversion. > > > as an overlay" -- so no in-place conversion. 
> > > > Yes, that should be possible as long as nothing else prevents > > migration/split (e.g., longterm pinning) > > > > > > > > For private pages, as you pointed out earlier, if we can ensure there are no > > > unexpected folio references for private memory, splitting a private huge folio > > > should succeed. > > > > Yes, and maybe (hopefully) we'll reach a point where private parts will not > > have a refcount at all (initially, frozen refcount, discussed during the > > last upstream call). > Yes, I also tested in TDX by not acquiring folio ref count in TDX specific code > and found that partial splitting could work. > > > Are you concerned about the memory fragmentation after repeated > > > partial conversions of private pages to and from shared? > > > > Not only repeated, even just a single partial conversion. But of course, > > repeated partial conversions will make it worse (e.g., never getting a > > private huge page back when there was a partial conversion). > Thanks for the explanation! > > Do you think there's any chance for guest_memfd to support non-in-place > conversion first? e.g. we can have private pages allocated from guest_memfd and allows the private pages to be THP. Meanwhile, shared pages are not allocated from guest_memfd, and let it only fault in 4K granularity. (specify it by a flag?) When we want to convert a 4K from a 2M private folio to shared, we can just split the 2M private folio as there's no extra ref count of private pages; when we do shared to private conversion, no split is required as shared pages are in 4K granularity. And even if user fails to specify the shared pages as small pages only, the worst thing is that a 2M shared folio cannot be split, and more memory is consumed. Of couse, memory fragmentation is still an issue as the private pages are allocated unmovable. But do you think it's a good simpler start before in-place conversion is ready? Thanks Yan
On 18.03.25 03:24, Yan Zhao wrote: > On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote: >> On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote: >>> On 14.03.25 10:09, Yan Zhao wrote: >>>> On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote: >>>>> (split is possible if there are no unexpected folio references; private >>>>> pages cannot be GUP'ed, so it is feasible) >>>> ... >>>>>>> Note that I'm not quite sure about the "2MB" interface, should it be >>>>>>> a >>>>>>> "PMD-size" interface? >>>>>> >>>>>> I think Mike and I touched upon this aspect too - and I may be >>>>>> misremembering - Mike suggested getting 1M, 2M, and bigger page sizes >>>>>> in increments -- and then fitting in PMD sizes when we've had enough of >>>>>> those. That is to say he didn't want to preclude it, or gate the PMD >>>>>> work on enabling all sizes first. >>>>> >>>>> Starting with 2M is reasonable for now. The real question is how we want to >>>>> deal with >>>> Hi David, >>>> >>> >>> Hi! >>> >>>> I'm just trying to understand the background of in-place conversion. >>>> >>>> Regarding to the two issues you mentioned with THP and non-in-place-conversion, >>>> I have some questions (still based on starting with 2M): >>>> >>>>> (a) Not being able to allocate a 2M folio reliably >>>> If we start with fault in private pages from guest_memfd (not in page pool way) >>>> and shared pages anonymously, is it correct to say that this is only a concern >>>> when memory is under pressure? >>> >>> Usually, fragmentation starts being a problem under memory pressure, and >>> memory pressure can show up simply because the page cache makes us of as >>> much memory as it wants. >>> >>> As soon as we start allocating a 2 MB page for guest_memfd, to then split it >>> up + free only some parts back to the buddy (on private->shared conversion), >>> we create fragmentation that cannot get resolved as long as the remaining >>> private pages are not freed. A new conversion from shared->private on the >>> previously freed parts will allocate other unmovable pages (not the freed >>> ones) and make fragmentation worse. >> Ah, I see. The problem of fragmentation is because memory allocated by >> guest_memfd is unmovable. So after freeing part of a 2MB folio, the whole 2MB is >> still unmovable. >> >> I previously thought fragmentation would only impact the guest by providing no >> new huge pages. So if a confidential VM does not support merging small PTEs into >> a huge PMD entry in its private page table, even if the new huge memory range is >> physically contiguous after a private->shared->private conversion, the guest >> still cannot bring back huge pages. >> >>> In-place conversion improves that quite a lot, because guest_memfd tself >>> will not cause unmovable fragmentation. Of course, under memory pressure, >>> when and cannot allocate a 2M page for guest_memfd, it's unavoidable. But >>> then, we already had fragmentation (and did not really cause any new one). >>> >>> We discussed in the upstream call, that if guest_memfd (primarily) only >>> allocates 2M pages and frees 2M pages, it will not cause fragmentation >>> itself, which is pretty nice. >> Makes sense. >> >>>> >>>>> (b) Partial discarding >>>> For shared pages, page migration and folio split are possible for shared THP? >>> >>> I assume by "shared" you mean "not guest_memfd, but some other memory we use >> Yes, not guest_memfd, in the case of non-in-place conversion. >> >>> as an overlay" -- so no in-place conversion. 
>>> >>> Yes, that should be possible as long as nothing else prevents >>> migration/split (e.g., longterm pinning) >>> >>>> >>>> For private pages, as you pointed out earlier, if we can ensure there are no >>>> unexpected folio references for private memory, splitting a private huge folio >>>> should succeed. >>> >>> Yes, and maybe (hopefully) we'll reach a point where private parts will not >>> have a refcount at all (initially, frozen refcount, discussed during the >>> last upstream call). >> Yes, I also tested in TDX by not acquiring folio ref count in TDX specific code >> and found that partial splitting could work. >> >>> Are you concerned about the memory fragmentation after repeated >>>> partial conversions of private pages to and from shared? >>> >>> Not only repeated, even just a single partial conversion. But of course, >>> repeated partial conversions will make it worse (e.g., never getting a >>> private huge page back when there was a partial conversion). >> Thanks for the explanation! >> >> Do you think there's any chance for guest_memfd to support non-in-place >> conversion first? > e.g. we can have private pages allocated from guest_memfd and allows the > private pages to be THP. > > Meanwhile, shared pages are not allocated from guest_memfd, and let it only > fault in 4K granularity. (specify it by a flag?) > > When we want to convert a 4K from a 2M private folio to shared, we can just > split the 2M private folio as there's no extra ref count of private pages; Yes, IIRC that's precisely what this series is doing, because the ftruncate() will try splitting the folio (which might still fail on speculative references, see my comment as rely to this series) In essence: yes, splitting to 4k should work (although speculative reference might require us to retry). But the "4k hole punch" is the ugly it. So you really want in-place conversion where the private->shared will split (but not punch) and the shared->private will collapse again if possible. > > when we do shared to private conversion, no split is required as shared pages > are in 4K granularity. And even if user fails to specify the shared pages as > small pages only, the worst thing is that a 2M shared folio cannot be split, and > more memory is consumed. > > Of couse, memory fragmentation is still an issue as the private pages are > allocated unmovable. Yes, and that you will never ever get a "THP" back when there was a conversion from private->shared of a single page that split the THP and discarded that page. But do you think it's a good simpler start before in-place > conversion is ready? There was a discussion on that on the bi-weekly upstream meeting on February the 6. The recording has more details, I summarized it as "David: Probably a good idea to focus on the long-term use case where we have in-place conversion support, and only allow truncation in hugepage (e.g., 2 MiB) size; conversion shared<->private could still be done on 4 KiB granularity as for hugetlb." In general, I think our time is better spent working on the real deal than on interim solutions that should not be called "THP support". -- Cheers, David / dhildenb
On Tue, Mar 18, 2025 at 08:13:05PM +0100, David Hildenbrand wrote:
> On 18.03.25 03:24, Yan Zhao wrote:
> > On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote:

[...]

> > Of course, memory fragmentation is still an issue as the private pages are
> > allocated unmovable.
>
> Yes, and you will never get a "THP" back when there was a conversion from
> private->shared of a single page that split the THP and discarded that page.
Yes, unless we still keep that page in the page cache, which would consume
even more memory.

> > But do you think it's a good simpler start before in-place
> > conversion is ready?
>
> There was a discussion on that in the bi-weekly upstream meeting on February
> 6. The recording has more details; I summarized it as:
>
> "David: Probably a good idea to focus on the long-term use case where we
> have in-place conversion support, and only allow truncation in hugepage
> (e.g., 2 MiB) size; conversion shared<->private could still be done on 4 KiB
> granularity as for hugetlb."
Will check and study it. Thanks for directing me to the history.

> In general, I think our time is better spent working on the real deal than
> on interim solutions that should not be called "THP support".
I see. Thanks for the explanation!
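[Editor's illustration] The "only allow truncation in hugepage size" policy
quoted above could, in its simplest form, amount to a single alignment check
on hole-punch requests. The sketch below is purely illustrative:
gmem_punch_2m_aligned() is a made-up helper, not part of this series or of
upstream guest_memfd.

#include <linux/kernel.h>
#include <linux/sizes.h>
#include <linux/types.h>

/*
 * Illustrative only: permit hole punches that cover whole 2 MiB blocks,
 * while shared<->private conversion can still happen at 4 KiB granularity.
 */
static bool gmem_punch_2m_aligned(loff_t offset, loff_t len)
{
	return IS_ALIGNED(offset, SZ_2M) && IS_ALIGNED(len, SZ_2M);
}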