The following series provides khugepaged and madvise collapse with the
capability to collapse regions to mTHPs.
To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
(defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction
on max_ptes_none is removed during the scan, to make sure we account for
the whole PMD range. max_ptes_none is mapped to a 0-100 range to
determine how full a mTHP order needs to be before collapsing it.
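To illustrate the idea, here is a rough userspace sketch of the bitmap scan
and the stack-based "binary recursion" (this is not the series' actual code;
all names and the threshold helper are illustrative, assuming x86 with 512
PTEs per PMD and one bit per 8-PTE chunk):

#include <stdint.h>
#include <stdio.h>

#define PTES_PER_PMD	512
#define MIN_ORDER	3				/* 1 bit == 8 PTEs (32K) */
#define BITMAP_BITS	(PTES_PER_PMD >> MIN_ORDER)	/* 64: one u64 on x86 */

struct range { int start, len; };

static int bits_set(uint64_t bm, int start, int len)
{
	int n = 0;

	for (int i = start; i < start + len; i++)
		n += (bm >> i) & 1;
	return n;
}

/* threshold_pct plays the role of max_ptes_none mapped to 0-100. */
static void scan_bitmap(uint64_t bm, int threshold_pct)
{
	struct range stack[BITMAP_BITS];
	int sp = 0;

	stack[sp++] = (struct range){ 0, BITMAP_BITS };
	while (sp) {
		struct range r = stack[--sp];

		if (bits_set(bm, r.start, r.len) * 100 >= r.len * threshold_pct) {
			printf("collapse %d PTEs at offset %d\n",
			       r.len << MIN_ORDER, r.start << MIN_ORDER);
		} else if (r.len > 1) {	/* not full enough: try both halves */
			stack[sp++] = (struct range){ r.start, r.len / 2 };
			stack[sp++] = (struct range){ r.start + r.len / 2,
						      r.len / 2 };
		}
	}
}

int main(void)
{
	/* lower half of the PMD fully populated, upper half empty */
	scan_bitmap(0x00000000FFFFFFFFULL, 90);
	return 0;
}

With the lower 256 PTEs populated, the full PMD range fails the threshold,
both halves are tried, and only the populated half is reported as a collapse
candidate (a 1M mTHP on x86 in this sketch).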
Some design choices to note:
- bitmap structures are allocated dynamically because on some arches
  (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
  compile time, leading to warnings.
- The recursion is expressed iteratively through a stack structure.
- An MTHP_MIN_ORDER was added to compress the bitmap and ensure it fits
  in 64 bits on x86. This provides some optimization on the bitmap
  operations (see the sizing sketch just below this list). If other
  arches/configs that have more than 512 PTEs per PMD want to compress
  their bitmap further, we can change this value per arch.
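A quick sizing example for that last design choice (illustrative arithmetic
only; the macro names below are not the series' actual macros):

#include <assert.h>

#define PTES_PER_PMD_X86	512
#define MIN_ORDER		3	/* one bit covers 2^3 = 8 PTEs */

int main(void)
{
	/* x86: 512 / 8 = 64 bits, i.e. the whole bitmap fits in one u64 */
	assert((PTES_PER_PMD_X86 >> MIN_ORDER) == 64);
	/*
	 * A config with 2048 PTEs per PMD would need 256 bits at
	 * MIN_ORDER = 3; raising MIN_ORDER to 5 gets back to a single
	 * 64-bit word, which is the per-arch knob mentioned above.
	 */
	assert((2048 >> 5) == 64);
	return 0;
}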
Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged
Patch 3: A minor "fix"/optimization
Patch 4: Refactor/rename hpage_collapse
Patch 5-7: Generalize khugepaged functions for arbitrary orders
Patch 8-11: The mTHP patches
This series acts as an alternative to Dev Jain's approach [1]. The two
series differ in a few ways:
- My approach uses a bitmap to store the state of the linear scan_pmd to
  then determine potential mTHP batches. Dev incorporates his tracking
  directly into the scan, and will try each available order.
- Dev is attempting to optimize the locking, while my approach keeps the
  locking changes to a minimum. I believe his changes are not safe for
  uffd.
- Dev's changes only work for khugepaged, not madvise_collapse (although
  I think that was by choice, and it could easily support madvise).
- Dev scales all khugepaged sysfs tunables by order, while I'm removing
  the restriction of max_ptes_none and converting it to a scale to
  determine a (m)THP threshold.
- Dev turns on khugepaged if any order is available, while mine still
  only runs if PMDs are enabled. I like Dev's approach and will most
  likely do the same in my PATCH posting.
- mTHPs need their ref count updated to 1<<order, which Dev is missing.
Patch 11 was inspired by one of Dev's changes.
[1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
Nico Pache (11):
introduce khugepaged_collapse_single_pmd to collapse a single pmd
khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
khugepaged: Don't allocate khugepaged mm_slot early
khugepaged: rename hpage_collapse_* to khugepaged_*
khugepaged: generalize hugepage_vma_revalidate for mTHP support
khugepaged: generalize alloc_charge_folio for mTHP support
khugepaged: generalize __collapse_huge_page_* for mTHP support
khugepaged: introduce khugepaged_scan_bitmap for mTHP support
khugepaged: add mTHP support
khugepaged: remove max_ptes_none restriction on the pmd scan
khugepaged: skip collapsing mTHP to smaller orders
include/linux/khugepaged.h | 4 +-
mm/huge_memory.c | 3 +-
mm/khugepaged.c | 436 +++++++++++++++++++++++++------------
3 files changed, 306 insertions(+), 137 deletions(-)
--
2.47.1
On 09/01/25 5:01 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
>
> [...]
>
> - mTHPs need their ref count updated to 1<<order, which Dev is missing.

Well, I did not miss it :)

	int nr_pages = folio_nr_pages(folio);
	folio_ref_add(folio, nr_pages - 1);

> Patch 11 was inspired by one of Dev's changes.
> [...]
On Wed, Jan 8, 2025 at 11:27 PM Dev Jain <dev.jain@arm.com> wrote:
> On 09/01/25 5:01 am, Nico Pache wrote:
> > [...]
> > - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>
> Well, I did not miss it :)

Sorry! I missed that in my initial review of your code. Seeing that would
have saved me a few hours of debugging xD

>	int nr_pages = folio_nr_pages(folio);
>	folio_ref_add(folio, nr_pages - 1);

Once I found the fix I forgot to cross reference with your series. Missing
this ref update was causing the issue I alluded to in your RFC thread.
When you said you ran into some issues on the debug configs I figured it
was the same one.

> [...]
On 09/01/25 5:01 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> [...]

Before I take a proper look at your series, can you please include any
testing you may have done?
On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
> On 09/01/25 5:01 am, Nico Pache wrote:
> > [...]
>
> Before I take a proper look at your series, can you please include any
> testing you may have done?

I built these changes for the following arches: x86_64, arm64,
arm64-64k, ppc64le, s390x

x86 testing:
- Selftests mm
- some stress-ng tests
- compile kernel
- I did some tests with my defer [1] set on top. This pushes all the
  work to khugepaged, which removes the noise of all the PF allocations.

I recently got an ARM64 machine and did some simple sanity tests (on
both 4k and 64k) like selftests, stress-ng, and playing around with
the tunables, etc.

I will also be running all the builds through our CI, and perf testing
environments before posting.

[1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
On 10/01/25 7:57 am, Nico Pache wrote:
> On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> On 09/01/25 5:01 am, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
>>> determine how full a mTHP order needs to be before collapsing it.
>>>
>>> Some design choices to note:
>>> - bitmap structures are allocated dynamically because on some arch's
>>> (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>>> compile time leading to warnings.
>>> - The recursion is masked through a stack structure.
>>> - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>>> 64bit on x86. This provides some optimization on the bitmap operations.
>>> if other arches/configs that have larger than 512 PTEs per PMD want to
>>> compress their bitmap further we can change this value per arch.
>>>
>>> Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 3: A minor "fix"/optimization
>>> Patch 4: Refactor/rename hpage_collapse
>>> Patch 5-7: Generalize khugepaged functions for arbitrary orders
>>> Patch 8-11: The mTHP patches
>>>
>>> This series acts as an alternative to Dev Jain's approach [1]. The two
>>> series differ in a few ways:
>>> - My approach uses a bitmap to store the state of the linear scan_pmd to
>>> then determine potential mTHP batches. Devs incorporates his directly
>>> into the scan, and will try each available order.
>>> - Dev is attempting to optimize the locking, while my approach keeps the
>>> locking changes to a minimum. I believe his changes are not safe for
>>> uffd.
>>> - Dev's changes only work for khugepaged not madvise_collapse (although
>>> i think that was by choice and it could easily support madvise)
>>> - Dev scales all khugepaged sysfs tunables by order, while im removing
>>> the restriction of max_ptes_none and converting it to a scale to
>>> determine a (m)THP threshold.
>>> - Dev turns on khugepaged if any order is available while mine still
>>> only runs if PMDs are enabled. I like Dev's approach and will most
>>> likely do the same in my PATCH posting.
>>> - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>>>
>>> Patch 11 was inspired by one of Dev's changes.
>>>
>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>>
>>> Nico Pache (11):
>>> introduce khugepaged_collapse_single_pmd to collapse a single pmd
>>> khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>>> khugepaged: Don't allocate khugepaged mm_slot early
>>> khugepaged: rename hpage_collapse_* to khugepaged_*
>>> khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>> khugepaged: generalize alloc_charge_folio for mTHP support
>>> khugepaged: generalize __collapse_huge_page_* for mTHP support
>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>> khugepaged: add mTHP support
>>> khugepaged: remove max_ptes_none restriction on the pmd scan
>>> khugepaged: skip collapsing mTHP to smaller orders
>>>
>>> include/linux/khugepaged.h | 4 +-
>>> mm/huge_memory.c | 3 +-
>>> mm/khugepaged.c | 436 +++++++++++++++++++++++++------------
>>> 3 files changed, 306 insertions(+), 137 deletions(-)
>>
>> Before I take a proper look at your series, can you please include any testing
>> you may have done?
>
> I Built these changes for the following arches: x86_64, arm64,
> arm64-64k, ppc64le, s390x
>
> x86 testing:
> - Selftests mm
> - some stress-ng tests
> - compile kernel
> - I did some tests with my defer [1] set on top. This pushes all the
> work to khugepaged, which removes the noise of all the PF allocations.
>
> I recently got an ARM64 machine and did some simple sanity tests (on
> both 4k and 64k) like selftests, stress-ng, and playing around with
> the tunables, etc.
>
> I will also be running all the builds through our CI, and perf testing
> environments before posting.
>
> [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
>
>>
>
I tested your series with the program I was using and it is not working;
can you please confirm it.
diff --git a/mytests/mthp.c b/mytests/mthp.c
new file mode 100644
index 000000000000..e3029dbcf035
--- /dev/null
+++ b/mytests/mthp.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *
+ * Author: Dev Jain <dev.jain@arm.com>
+ *
+ * Program to test khugepaged mTHP collapse
+ */
+
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/random.h>
+#include <assert.h>
+
+int main(int argc, char *argv[])
+{
+ char *ptr;
+ unsigned long mthp_size = (1UL << 16);
+ size_t chunk_size = (1UL << 25);
+
+ ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (((unsigned long)ptr) != (1UL << 30)) {
+ printf("mmap did not work on required address\n");
+ return 1;
+ }
+
+ /* Fill first pte in every 64K interval */
+ for (int i = 0; i < chunk_size; i += mthp_size)
+ ptr[i] = i;
+
+ if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
+ perror("madvise");
+ return 1;
+ }
+ sleep(100);
+ return 0;
+}
--
2.30.2
Set enabled = madvise, hugepages-2048k/enabled = hugepages-64k/enabled =
inherit. Run the program in the background, then run tools/mm/thpmaps.
You will see PMD collapse correctly, but when you echo never into
hugepages-2048k/enabled and test this again, you won't see contpte 64K
collapse. With my series, you will see something like
anon-cont-pte-aligned-64kB : 32768 kB (100%).
On Thu, Jan 9, 2025 at 9:56 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 10/01/25 7:57 am, Nico Pache wrote:
> > On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >> On 09/01/25 5:01 am, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> >>> determine how full a mTHP order needs to be before collapsing it.
> >>>
> >>> Some design choices to note:
> >>> - bitmap structures are allocated dynamically because on some arch's
> >>> (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >>> compile time leading to warnings.
> >>> - The recursion is masked through a stack structure.
> >>> - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >>> 64bit on x86. This provides some optimization on the bitmap operations.
> >>> if other arches/configs that have larger than 512 PTEs per PMD want to
> >>> compress their bitmap further we can change this value per arch.
> >>>
> >>> Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 3: A minor "fix"/optimization
> >>> Patch 4: Refactor/rename hpage_collapse
> >>> Patch 5-7: Generalize khugepaged functions for arbitrary orders
> >>> Patch 8-11: The mTHP patches
> >>>
> >>> This series acts as an alternative to Dev Jain's approach [1]. The two
> >>> series differ in a few ways:
> >>> - My approach uses a bitmap to store the state of the linear scan_pmd to
> >>> then determine potential mTHP batches. Devs incorporates his directly
> >>> into the scan, and will try each available order.
> >>> - Dev is attempting to optimize the locking, while my approach keeps the
> >>> locking changes to a minimum. I believe his changes are not safe for
> >>> uffd.
> >>> - Dev's changes only work for khugepaged not madvise_collapse (although
> >>> i think that was by choice and it could easily support madvise)
> >>> - Dev scales all khugepaged sysfs tunables by order, while im removing
> >>> the restriction of max_ptes_none and converting it to a scale to
> >>> determine a (m)THP threshold.
> >>> - Dev turns on khugepaged if any order is available while mine still
> >>> only runs if PMDs are enabled. I like Dev's approach and will most
> >>> likely do the same in my PATCH posting.
> >>> - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >>>
> >>> Patch 11 was inspired by one of Dev's changes.
> >>>
> >>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>
> >>> Nico Pache (11):
> >>> introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >>> khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >>> khugepaged: Don't allocate khugepaged mm_slot early
> >>> khugepaged: rename hpage_collapse_* to khugepaged_*
> >>> khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>> khugepaged: generalize alloc_charge_folio for mTHP support
> >>> khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>> khugepaged: add mTHP support
> >>> khugepaged: remove max_ptes_none restriction on the pmd scan
> >>> khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>> include/linux/khugepaged.h | 4 +-
> >>> mm/huge_memory.c | 3 +-
> >>> mm/khugepaged.c | 436 +++++++++++++++++++++++++------------
> >>> 3 files changed, 306 insertions(+), 137 deletions(-)
> >>
> >> Before I take a proper look at your series, can you please include any testing
> >> you may have done?
> >
> > I Built these changes for the following arches: x86_64, arm64,
> > arm64-64k, ppc64le, s390x
> >
> > x86 testing:
> > - Selftests mm
> > - some stress-ng tests
> > - compile kernel
> > - I did some tests with my defer [1] set on top. This pushes all the
> > work to khugepaged, which removes the noise of all the PF allocations.
> >
> > I recently got an ARM64 machine and did some simple sanity tests (on
> > both 4k and 64k) like selftests, stress-ng, and playing around with
> > the tunables, etc.
> >
> > I will also be running all the builds through our CI, and perf testing
> > environments before posting.
> >
> > [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
> >
> >>
> >
> I tested your series with the program I was using and it is not working;
> can you please confirm it.
Yes, this is expected, because you are not fully filling any 32K chunk
(MIN_MTHP_ORDER), so no bit is ever set.

I should probably add some threshold to scan_pmd so we set the bit if at
least half of the region is full, or scale it by max_ptes_none like I do
in scan_bitmap.
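To make that concrete, here is a tiny illustrative sketch (not the series'
code; the all-or-nothing bit rule and the 8-PTE chunk size come from the
discussion above, and the helper name is made up):

#include <stdio.h>

#define PTES_PER_PMD	512
#define CHUNK_PTES	8	/* one bitmap bit covers a 32K chunk */

/* Illustrative rule from this RFC: a bit is set only if the chunk is full. */
static int chunk_bit_set(int present)
{
	return present == CHUNK_PTES;
}

int main(void)
{
	int set = 0;

	/*
	 * The test touches 1 page per 64K: the first 32K chunk of each
	 * 64K window has 1 present PTE, the second has 0.
	 */
	for (int chunk = 0; chunk < PTES_PER_PMD / CHUNK_PTES; chunk++)
		set += chunk_bit_set((chunk % 2 == 0) ? 1 : 0);

	printf("bits set: %d of %d\n", set, PTES_PER_PMD / CHUNK_PTES);
	/* prints "bits set: 0 of 64": no candidate range, so no collapse */
	return 0;
}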
Thanks for your review-- Have a good weekend!
>
> diff --git a/mytests/mthp.c b/mytests/mthp.c
> new file mode 100644
> index 000000000000..e3029dbcf035
> --- /dev/null
> +++ b/mytests/mthp.c
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *
> + * Author: Dev Jain <dev.jain@arm.com>
> + *
> + * Program to test khugepaged mTHP collapse
> + */
> +
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <string.h>
> +#include <stdint.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/mman.h>
> +#include <sys/time.h>
> +#include <sys/random.h>
> +#include <assert.h>
> +
> +int main(int argc, char *argv[])
> +{
> + char *ptr;
> + unsigned long mthp_size = (1UL << 16);
> + size_t chunk_size = (1UL << 25);
> +
> + ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
> + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> + if (((unsigned long)ptr) != (1UL << 30)) {
> + printf("mmap did not work on required address\n");
> + return 1;
> + }
> +
> + /* Fill first pte in every 64K interval */
> + for (int i = 0; i < chunk_size; i += mthp_size)
> + ptr[i] = i;
> +
> + if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
> + perror("madvise");
> + return 1;
> + }
> + sleep(100);
> + return 0;
> +}
> --
> 2.30.2
>
> Set enabled = madvise, hugepages-2048k/enabled = hugepages-64k/enabled =
> inherit. Run the program in the background, then run tools/mm/thpmaps.
> You will see PMD collapse correctly, but when you echo never into
> hugepages-2048k/enabled and test this again, you won't see contpte 64K
> collapse. With my series, you will see something like
>
> anon-cont-pte-aligned-64kB : 32768 kB (100%).
>
On 11/01/25 3:31 am, Nico Pache wrote:
> On Thu, Jan 9, 2025 at 9:56 PM Dev Jain <dev.jain@arm.com> wrote:
> > [...]
> > I tested your series with the program I was using and it is not working;
> > can you please confirm it.
>
> Yes, this is expected because you are not fully filling any 32K chunk
> (MIN_MTHP_ORDER) so no bit is ever set.

That is weird, because if this is the case, then PMD-collapse should
have also failed, but that succeeded. Do you have some userspace program
I can test with?
On Sun, Jan 12, 2025 at 7:12 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/01/25 3:31 am, Nico Pache wrote:
> > On Thu, Jan 9, 2025 at 9:56 PM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 10/01/25 7:57 am, Nico Pache wrote:
> >>> On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
> >>>>
> >>>>
> >>>> On 09/01/25 5:01 am, Nico Pache wrote:
> >>>>> The following series provides khugepaged and madvise collapse with the
> >>>>> capability to collapse regions to mTHPs.
> >>>>>
> >>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>>>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>>>> on max_ptes_none is removed during the scan, to make sure we account for
> >>>>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> >>>>> determine how full a mTHP order needs to be before collapsing it.
> >>>>>
> >>>>> Some design choices to note:
> >>>>> - bitmap structures are allocated dynamically because on some arch's
> >>>>> (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >>>>> compile time leading to warnings.
> >>>>> - The recursion is masked through a stack structure.
> >>>>> - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >>>>> 64bit on x86. This provides some optimization on the bitmap operations.
> >>>>> if other arches/configs that have larger than 512 PTEs per PMD want to
> >>>>> compress their bitmap further we can change this value per arch.
> >>>>>
> >>>>> Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged
> >>>>> Patch 3: A minor "fix"/optimization
> >>>>> Patch 4: Refactor/rename hpage_collapse
> >>>>> Patch 5-7: Generalize khugepaged functions for arbitrary orders
> >>>>> Patch 8-11: The mTHP patches
> >>>>>
> >>>>> This series acts as an alternative to Dev Jain's approach [1]. The two
> >>>>> series differ in a few ways:
> >>>>> - My approach uses a bitmap to store the state of the linear scan_pmd to
> >>>>> then determine potential mTHP batches. Devs incorporates his directly
> >>>>> into the scan, and will try each available order.
> >>>>> - Dev is attempting to optimize the locking, while my approach keeps the
> >>>>> locking changes to a minimum. I believe his changes are not safe for
> >>>>> uffd.
> >>>>> - Dev's changes only work for khugepaged not madvise_collapse (although
> >>>>> i think that was by choice and it could easily support madvise)
> >>>>> - Dev scales all khugepaged sysfs tunables by order, while im removing
> >>>>> the restriction of max_ptes_none and converting it to a scale to
> >>>>> determine a (m)THP threshold.
> >>>>> - Dev turns on khugepaged if any order is available while mine still
> >>>>> only runs if PMDs are enabled. I like Dev's approach and will most
> >>>>> likely do the same in my PATCH posting.
> >>>>> - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >>>>>
> >>>>> Patch 11 was inspired by one of Dev's changes.
> >>>>>
> >>>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>>>
> >>>>> Nico Pache (11):
> >>>>> introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >>>>> khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >>>>> khugepaged: Don't allocate khugepaged mm_slot early
> >>>>> khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>> khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>>> khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>> khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>> khugepaged: add mTHP support
> >>>>> khugepaged: remove max_ptes_none restriction on the pmd scan
> >>>>> khugepaged: skip collapsing mTHP to smaller orders
> >>>>>
> >>>>> include/linux/khugepaged.h | 4 +-
> >>>>> mm/huge_memory.c | 3 +-
> >>>>> mm/khugepaged.c | 436 +++++++++++++++++++++++++------------
> >>>>> 3 files changed, 306 insertions(+), 137 deletions(-)
> >>>>
> >>>> Before I take a proper look at your series, can you please include any testing
> >>>> you may have done?
> >>>
> >>> I Built these changes for the following arches: x86_64, arm64,
> >>> arm64-64k, ppc64le, s390x
> >>>
> >>> x86 testing:
> >>> - Selftests mm
> >>> - some stress-ng tests
> >>> - compile kernel
> >>> - I did some tests with my defer [1] set on top. This pushes all the
> >>> work to khugepaged, which removes the noise of all the PF allocations.
> >>>
> >>> I recently got an ARM64 machine and did some simple sanity tests (on
> >>> both 4k and 64k) like selftests, stress-ng, and playing around with
> >>> the tunables, etc.
> >>>
> >>> I will also be running all the builds through our CI, and perf testing
> >>> environments before posting.
> >>>
> >>> [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
> >>>
> >>>>
> >>>
> >> I tested your series with the program I was using and it is not working;
> >> can you please confirm it.
> >
> > Yes, this is expected because you are not fully filling any 32K chunk
> > (MIN_MTHP_ORDER) so no bit is ever set.
>
> That is weird, because if this is the case, then PMD-collapse should
> have also failed, but that succeeded. Do you have some userspace program
> I can test with?
Not exactly; if max_ptes_none is still 511, the old behavior is kept.

I modified your program to set the first 8 pages (a 32k chunk) in every
64k region:
#include <unistd.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/random.h>
#include <assert.h>

int main(int argc, char *argv[])
{
	char *ptr;
	unsigned long mthp_size = (1UL << 16); // 64 KB chunk size
	size_t chunk_size = (1UL << 25);       // 32 MB total size

	// mmap() to allocate memory at a specific address (1 GB)
	ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (((unsigned long)ptr) != (1UL << 30)) {
		printf("mmap did not work on required address\n");
		return 1;
	}

	// Touch the first 8 pages (8 * 4 KB = 32 KB) in every 64 KB chunk
	for (size_t i = 0; i < chunk_size; i += mthp_size) {
		for (size_t j = 0; j < 8; ++j)
			ptr[i + j * 4096] = i + j * 4096; // Touch the first byte of each page
	}

	// Use madvise() to advise the kernel to use huge pages for this memory
	if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
		perror("madvise");
		return 1;
	}
	sleep(100); // Sleep to allow time for the kernel to process the advice
	return 0;
}
There are some rounding errors in how I compute the threshold_bits... I
think I will adopt how you do the max_ptes_none shifting for better
accuracy. Currently, if you run this with max_ptes_none=255 (or even
lower values like 200...) it will still collapse to a 64k chunk, when in
reality it should only do 32k, because only half the bitmap is set for
this order and 255 < 50% of 512.

I'm adding a threshold to the bitmap_set, and doing better scaling like
you do. My next version should handle the example code better.
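For reference, a small sketch of the shift-based scaling idea (illustrative
only; the function name is made up, and this assumes the per-order shift
approach from Dev's series rather than the 0-100 mapping):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD on x86 with 4K pages */

/*
 * Scale the PMD-level max_ptes_none tunable down to a smaller collapse
 * order by bit-shifting, instead of mapping it to a 0-100 percentage
 * (which loses precision to integer rounding).
 */
static int scaled_max_ptes_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	/* max_ptes_none = 255: just under half of a 512-PTE PMD may be empty */
	for (int order = 4; order <= HPAGE_PMD_ORDER; order++)
		printf("order %d (%d PTEs): up to %d none PTEs allowed\n",
		       order, 1 << order, scaled_max_ptes_none(255, order));
	return 0;
}

With max_ptes_none=255 the order-4 (64k) limit becomes 255 >> 5 = 7 none
PTEs, so a half-empty 64k region (8 none PTEs) is correctly rejected and
only the populated 32k half qualifies.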
>
Hi Nico,

On 08/01/2025 23:31, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.

It's great to see multiple solutions for this feature being posted; I guess
that leaves us with the luxurious problem of figuring out an uber-patchset
that incorporates the best of both? :)

I haven't had a chance to review your series in detail yet, but have a few
questions below that will help me understand the key differences between
your series and Dev's.

> [...]
> Some design choices to note:
> - bitmap structures are allocated dynamically because on some arch's
>   (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>   compile time leading to warnings.

We have MAX_PTRS_PER_PTE and friends though, which are worst case and
compile time. Could these help avoid the dynamic allocation?

MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)

Although to be honest, it's not super clear to me what the benefit of the
bitmap is vs just iterating through the PTEs like Dev does; is there a
significant cost saving in practice? On the face of it, it seems like it
might be unneeded complexity.

> - The recursion is masked through a stack structure.
> - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>   64bit on x86.
> [...]

So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
pages are !pte_none()? I'm wondering what will happen if you have a pattern
of 4 set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP
is enabled, you would want to collapse every other 16K block in this case,
but I'm guessing with your scheme, all the bits will be clear and no
collapse will occur? But for arm64 at least, collapsing to order-2 (16K)
may be desired for HPA.

> [...]
> - My approach uses a bitmap to store the state of the linear scan_pmd to
>   then determine potential mTHP batches. Devs incorporates his directly
>   into the scan, and will try each available order.

So if I'm understanding, the benefit of the bitmap is to remove the need to
re-scan the "low" PTEs when moving to a lower order, which is what Dev's
approach does? Are there not some locking/consistency issues to manage if
not re-scanning?

> - Dev is attempting to optimize the locking, while my approach keeps the
>   locking changes to a minimum. I believe his changes are not safe for
>   uffd.

I agree; let's keep the locking simple for the initial effort.

> - Dev's changes only work for khugepaged not madvise_collapse (although
>   i think that was by choice and it could easily support madvise)

I agree supporting MADV_COLLAPSE is good; what exactly are the semantics
for it though? I think it ignores the sysfs settings (max_ptes_none and
friends) so presumably it will continue to be much more greedy about
collapsing to the highest possible order, and only fall back to lower
orders if the VMA boundaries force it to or if the higher order allocation
fails?

> - Dev scales all khugepaged sysfs tunables by order, while im removing
>   the restriction of max_ptes_none and converting it to a scale to
>   determine a (m)THP threshold.

I don't really understand this statement. You say you are removing the
restriction of max_ptes_none. But then you say you scale it to determine a
threshold. So are you honouring it or not? And if you're honouring it, how
is your scaling method different to Dev's? What about the other tunables
(shared and swap)?

> - Dev turns on khugepaged if any order is available while mine still
>   only runs if PMDs are enabled. I like Dev's approach and will most
>   likely do the same in my PATCH posting.

Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of
VMAs) that cover only a partial PMD entry. I think neither of your
implementations currently does that. As I understand it, Dev's v2 will add
that support. Is your approach amenable to this?

> - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>
> Patch 11 was inspired by one of Dev's changes.

I think the one problem that emerged during review of Dev's series, which
we don't have a proper solution to yet, is the issue of "creep", where
regions can be collapsed to progressively higher orders through iterative
scans. At each collapse, the required thresholds (e.g. max_ptes_none) are
met, and the collapse effectively adds more non-none ptes, so the next
scan will then collapse to an even higher order. Does your solution suffer
from this (theoretical/edge case) issue? If not, how did you solve it?

Thanks,
Ryan

> [...]
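To make the "creep" question above concrete, here is a minimal numeric
sketch (an editorial illustration, not code from either series; the
shift-scaled max_ptes_none helper is hypothetical, mirroring the per-order
scaling discussed in this thread):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9	/* 512 PTEs per PMD (x86, 4K pages) */

/* Hypothetical per-order limit derived by shifting the PMD-level tunable. */
static int scaled_max_ptes_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	int max_ptes_none = 256;	/* allow a half-empty region */
	int present = 8;		/* 8 of 16 PTEs populated initially */

	for (int order = 4; order <= HPAGE_PMD_ORDER; order++) {
		int none = (1 << order) - present;

		if (none > scaled_max_ptes_none(max_ptes_none, order)) {
			printf("order %d: %d none PTEs over limit, stop\n",
			       order, none);
			break;
		}
		/* The collapse fills every none PTE in the region... */
		present = 1 << order;
		printf("order %d collapses; region now fully populated\n",
		       order);
		/*
		 * ...so on the next scan the region is exactly half-full at
		 * the next order up, qualifies again, and creeps all the
		 * way to PMD size.
		 */
	}
	return 0;
}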
On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Nico,

Hi Ryan!

> On 08/01/2025 23:31, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
>
> It's great to see multiple solutions for this feature being posted; I
> guess that leaves us with the luxurious problem of figuring out an
> uber-patchset that incorporates the best of both? :)

I guess so! My motivation for developing this was inspired by my 'defer'
RFC, which can't really live without khugepaged having mTHP support
(i.e. having 32k mTHP=always and global=defer doesn't make sense).

> I haven't had a chance to review your series in detail yet, but have a
> few questions below that will help me understand the key differences
> between your series and Dev's.
>
> > To achieve this we generalize the khugepaged functions to no longer
> > depend on PMD_ORDER. Then during the PMD scan, we keep track of chunks
> > of pages (defined by MTHP_MIN_ORDER) that are fully utilized. This
> > info is tracked using a bitmap. After the PMD scan is done, we do
> > binary recursion on the bitmap to find the optimal mTHP sizes for the
> > PMD range. The restriction on max_ptes_none is removed during the
> > scan, to make sure we account for the whole PMD range. max_ptes_none
> > is mapped to a 0-100 range to determine how full a mTHP order needs to
> > be before collapsing it.
> >
> > Some design choices to note:
> > - bitmap structures are allocated dynamically because on some arch's
> >   (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >   compile time leading to warnings.
>
> We have MAX_PTRS_PER_PTE and friends though, which are worst case and
> compile time. Could these help avoid the dynamic allocation?
>
> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)

Is MAX_PMD_ORDER == PMD_ORDER? If not, this might introduce weird edge
cases where PMD_ORDER < MAX_PMD_ORDER.

> Although to be honest, it's not super clear to me what the benefit of
> the bitmap is vs just iterating through the PTEs like Dev does; is there
> a significant cost saving in practice? On the face of it, it seems like
> it might be unneeded complexity.

The bitmap was to encode the state of the PMD without needing rescanning
(or refactoring a lot of code). We keep the scan runtime constant at 512
(for x86). Dev did some good analysis of this here:
https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
This prevents needing to hold the read lock for longer, and prevents
needing to reacquire it too.

> > - The recursion is masked through a stack structure.
> > - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >   64bit on x86. This provides some optimization on the bitmap
> >   operations. if other arches/configs that have larger than 512 PTEs
> >   per PMD want to compress their bitmap further we can change this
> >   value per arch.
>
> So 1 bit in the bitmap represents 8 pages? And it will only be set if
> all 8 pages are !pte_none()? I'm wondering what will happen if you have
> a pattern of 4 set PTEs followed by 4 none PTEs, followed by 4 set
> PTEs... If 16K mTHP is enabled, you would want to collapse every other
> 16K block in this case, but I'm guessing with your scheme, all the bits
> will be clear and no collapse will occur? But for arm64 at least,
> collapsing to order-2 (16K) may be desired for HPA.

Yeah, on my V2 I've incorporated a threshold (like max_ptes_none) for
setting the bit. This will cover this case better (given a better default
max_ptes_none). The way I see it, 511 max_ptes_none is just wrong... we
should flip it towards the lower end of the scale (i.e. 64), and the
"always" THP setting should ignore it (like madvise does).

> > Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged
> > Patch 3: A minor "fix"/optimization
> > Patch 4: Refactor/rename hpage_collapse
> > Patch 5-7: Generalize khugepaged functions for arbitrary orders
> > Patch 8-11: The mTHP patches
> >
> > This series acts as an alternative to Dev Jain's approach [1]. The two
> > series differ in a few ways:
> > - My approach uses a bitmap to store the state of the linear scan_pmd
> >   to then determine potential mTHP batches. Devs incorporates his
> >   directly into the scan, and will try each available order.
>
> So if I'm understanding, the benefit of the bitmap is to remove the need
> to re-scan the "low" PTEs when moving to a lower order, which is what
> Dev's approach does? Are there not some locking/consistency issues to
> manage if not re-scanning?

Correct. So far I haven't found any issues (other than the bugs Dev
reported in his review) -- my fixed version of this RFC has been running
fine with no notable locking issues.

> > - Dev is attempting to optimize the locking, while my approach keeps
> >   the locking changes to a minimum. I believe his changes are not safe
> >   for uffd.
>
> I agree; let's keep the locking simple for the initial effort.
>
> > - Dev's changes only work for khugepaged not madvise_collapse
> >   (although I think that was by choice and it could easily support
> >   madvise)
>
> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics
> for it though? I think it ignores the sysfs settings (max_ptes_none and
> friends) so presumably it will continue to be much more greedy about
> collapsing to the highest possible order and only fall back to lower
> orders if the VMA boundaries force it to or if the higher order
> allocation fails?

Kind of. Because I removed the max_ptes_none check during the scan, and
reintroduced it in the bitmap scan (without a madvise restriction),
MADV_COLLAPSE and khugepaged will work more similarly.

> > - Dev scales all khugepaged sysfs tunables by order, while I'm
> >   removing the restriction of max_ptes_none and converting it to a
> >   scale to determine a (m)THP threshold.
>
> I don't really understand this statement. You say you are removing the
> restriction of max_ptes_none. But then you say you scale it to determine
> a threshold. So are you honoring it or not? And if you're honouring it,
> how is your scaling method different to Dev's? What about the other
> tunables (shared and swap)?

I removed the max_ptes_none restriction during the initial scan, so we
can account for the full PMD (which is what happens with
max_ptes_none=511 anyways). Then max_ptes_none can be used with the
bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for
finding the optimal mTHP size.

This RFC scales max_ptes_none to 0-100, but it has some really bad
rounding issues, so instead I've incorporated scaling (via bitshifting)
like Dev did in his series. I've tested this and it's more accurate now.

> > - Dev turns on khugepaged if any order is available while mine still
> >   only runs if PMDs are enabled. I like Dev's approach and will most
> >   likely do the same in my PATCH posting.
>
> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts
> of VMAs) that cover only a partial PMD entry. I think neither of your
> implementations currently do that. As I understand it, Dev's v2 will add
> that support. Is your approach amenable to this?

Yes, I believe so. I'm working on adding this too.

> > - mTHPs need their ref count updated to 1<<order, which Dev is
> >   missing.
> >
> > Patch 11 was inspired by one of Dev's changes.
>
> I think the 1 problem that emerged during review of Dev's series, which
> we don't have a proper solution to yet, is the issue of "creep", where
> regions can be collapsed to progressively higher orders through
> iterative scans. At each collapse, the required thresholds (e.g.
> max_ptes_none) are met, and the collapse effectively adds more non-none
> ptes so the next scan will then collapse to even higher order. Does your
> solution suffer from this (theoretical/edge case) issue? If not, how did
> you solve?

Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
lower as a default would "help".

I liked Zi Yan's solution of a per-VMA bit that gets set when khugepaged
collapses, and unset when the VMA changes (pf, realloc, etc). Then
khugepaged can only operate on VMAs that don't have the bit set. This way
we only collapse once, unless the mapping was changed.

Could we map the new "non-none" pages to the zero page (rather than
actually zeroing the page), so they don't actually act as new "utilized
pages" and are still counted as none pages during the scan (until they
are written to)?

> Thanks,
> Ryan

Cheers!
-- Nico

> > [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > [...]
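For reference, the bitshift-based scaling Nico describes adopting for V2
(in place of the 0-100 mapping) would look roughly like the sketch below.
This is illustrative only -- the helper name is invented, though
khugepaged_max_ptes_none and HPAGE_PMD_ORDER are the existing kernel
symbols:

  /*
   * Sketch: derive a per-order "none PTE" budget from the PMD-level
   * tunable by shifting it down with the order, avoiding the rounding
   * issues of a 0-100 percentage mapping.
   */
  static unsigned int max_ptes_none_for_order(unsigned int order)
  {
          /* e.g. 64 >> (9 - 8) = 32 for a 1 MiB (order-8) mTHP on x86 */
          return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
  }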
>> I think the 1 problem that emerged during review of Dev's series, which we don't
>> have a proper solution to yet, is the issue of "creep", where regions can be
>> collapsed to progressively higher orders through iterative scans. At each
>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>> effectively adds more non-none ptes so the next scan will then collapse to even
>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>> If not, how did you solve?
>
> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
> lower as a default would "help".
Can we just keep it simple and only support max_ptes_none = 511
("pagefault behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0
("deferred behavior") and document that the other weird configurations
will make mTHP skip, because "weird and unexpected"? :)
--
Cheers,
David / dhildenb
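A sketch of how David's simplification might gate the mTHP path --
illustrative only, with an invented helper built on the existing
khugepaged_max_ptes_none and HPAGE_PMD_NR symbols:

  /*
   * Sketch: only the two well-defined settings enable mTHP collapse;
   * anything in between makes khugepaged skip mTHP orders entirely.
   */
  static bool mthp_collapse_allowed(void)
  {
          return khugepaged_max_ptes_none == HPAGE_PMD_NR - 1 ||
                 khugepaged_max_ptes_none == 0;
  }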
On 20/01/2025 12:54, David Hildenbrand wrote:
>>> I think the 1 problem that emerged during review of Dev's series, which we don't
>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>> collapsed to progressively higher orders through iterative scans. At each
>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>>> effectively adds more non-none ptes so the next scan will then collapse to even
>>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>>> If not, how did you solve?
>>
>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>> lower as a default would "help".
>
> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
> document that the other weird configurations will make mTHP skip, because
> "weird and unexpected"? :)
>
That sounds like a great simplification in principle! We would need to consider
the swap and shared tunables too though. Perhaps we can pull a similar trick
with those?
On 20.01.25 14:37, Ryan Roberts wrote:
> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>> I think the 1 problem that emerged during review of Dev's series, which we don't
>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>> collapsed to progressively higher orders through iterative scans. At each
>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>>>> effectively adds more non-none ptes so the next scan will then collapse to even
>>>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>>>> If not, how did you solve?
>>>
>>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>>> lower as a default would "help".
>>
>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>> document that the other weird configurations will make mTHP skip, because
>> "weird and unexpected"? :)
>>
>
> That sounds like a great simplification in principle!
And certainly a much easier place to start :)
If we ever get the request to support something else, maybe that's also
where we can learn *why*, and what we would actually want to do with mTHP.
> We would need to consider
> the swap and shared tunables too though. Perhaps we can pull a similar trick
> with those?
Swapped and shared are a bit more challenging, because they are set to
"/ 2" or "/ 8" heuristics.
One simple starting point here is of course to say "when collapsing
mTHP, all have to be unshared and all have to be swapped in", so to
essentially ignore both tunables (in a memory friendly way, as if they
are set to 0) for mTHP collapse and worry about that later, when really
required.
Two alternatives I discussed with Nico for these (not sure which is
implemented here) is to calculate it proportionally to the folio order
we are collapsing:
Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1
MiB mTHP (256 PTEs), 32 PTEs would be allowed to be swapped out.
--
Cheers,
David / dhildenb
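Spelled out, David's proportional calculation is just the following --
a sketch with an invented helper name; khugepaged_max_ptes_swap and
HPAGE_PMD_NR are the existing symbols:

  /* Sketch: scale the swap budget by the fraction of the PMD collapsed. */
  static unsigned int max_ptes_swap_for_order(unsigned int order)
  {
          /* e.g. 64 * 256 / 512 = 32 for the 1 MiB example above */
          return khugepaged_max_ptes_swap * (1U << order) / HPAGE_PMD_NR;
  }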
On 20/01/2025 13:56, David Hildenbrand wrote:
> On 20.01.25 14:37, Ryan Roberts wrote:
>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>> don't
>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>> collapse
>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>> even
>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>> issue?
>>>>> If not, how did you solve?
>>>>
>>>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>>>> lower as a default would "help".
>>>
>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>> document that the other weird configurations will make mTHP skip, because
>>> "weird and unexpected"? :)
nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
perhaps it's better to say we round to closest of 0 and max?
>>>
>>
>> That sounds like a great simplification in principle!
>
> And certainly a much easier place to start :)
>
> If we ever get the request to support something else, maybe that's also where we
> can learn *why*, and what we would actually want to do with mTHP.
>
>> We would need to consider
>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>> with those?
>
> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
> "/ 8" heuristics.
>
>
> One simple starting point here is of course to say "when collapsing mTHP, all
> have to be unshared and all have to be swapped in", so to essentially ignore
> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
> collapse and worry about that later, when really required.
For swap, if we assume we start with the whole VMA swapped out, I think setting
max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
in sequentially? I guess that's creep due to faulting pattern though, so at
least it's not due to collapse. Doesn't feel ideal though.
I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
for private COWed pages, and khugepaged will trigger the COW on collapse? So
again depending on the pattern of writes we could still end up with creep in a
similar way to swap?
>
> Two alternatives I discussed with Nico for these (not sure which is implemented
> here) is to calculate it proportionally to the folio order we are collapsing:
You're only listing one option here... what's the other one you discussed?
>
> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
> (256 PTEs), 32 PTEs would be allowed to be swapped out.
Yeah this is exactly what Dev's version is doing at the moment. But that's the
behaviour that leads to the "creep" problem.
Thanks,
Ryan
>
On 20.01.25 17:27, Ryan Roberts wrote:
> On 20/01/2025 13:56, David Hildenbrand wrote:
>> On 20.01.25 14:37, Ryan Roberts wrote:
>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>> don't
>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>> collapse
>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>> even
>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>> issue?
>>>>>> If not, how did you solve?
>>>>>
>>>>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>>>>> lower as a default would "help".
>>>>
>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>> document that the other weird configurations will make mTHP skip, because
>>>> "weird and unexpected"? :)
>
> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
> perhaps it's better to say we round to closest of 0 and max?
Maybe. Rounding down always implies doing something not necessarily desired.
In any case, I assume most setups just have the default values here ... :)
>
>>>>
>>>
>>> That sounds like a great simplification in principle!
>>
>> And certainly a much easier place to start :)
>>
>> If we ever get the request to support something else, maybe that's also where we
>> can learn *why*, and what we would actually want to do with mTHP.
>>
>>> We would need to consider
>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>> with those?
>>
>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>> "/ 8" heuristics.
>>
>>
>> One simple starting point here is of course to say "when collapsing mTHP, all
>> have to be unshared and all have to be swapped in", so to essentially ignore
>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>> collapse and worry about that later, when really required.
>
> For swap, if we assume we start with the whole VMA swapped out, I think setting
> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
> in sequentially? I guess that's creep due to faulting pattern though, so at
> least it's not due to collapse. Doesn't feel ideal though.
>
> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
> for private COWed pages, and khugepaged will trigger the COW on collapse?
Yes.
> So
> again depending on the pattern of writes we could still end up with creep in a
> similar way to swap?
I think in regard to both the answer is "yes", so a simple starting
point but not necessarily what we want long term. The creep is at least
"not wasting more memory", because we don't collapse where PMD wouldn't
have collapsed.
After all, right now we don't collapse mTHP, now we would collapse mTHP
in many scenarios, so we don't have to be perfect initially.
Deriving stuff for small THP sizes when configured for PMD THP sizes is
not easy to do right.
>
>>
>> Two alternatives I discussed with Nico for these (not sure which is implemented
>> here) is to calculate it proportionally to the folio order we are collapsing:
>
> You're only listing one option here... what's the other one you discussed?
>
Ah sorry, reshuffled it and then had to rush.
The other thing I had in mind is to scan the whole PMD range, and skip
the whole PMD range if it doesn't obey the max_ptes_* stuff. Not
perfect, but it will mean that we behave just like PMD collapse would,
unless I am missing something.
>>
>> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
>> (256 PTEs), 32 PTEs would be allowed to be swapped out.
>
> Yeah this is exactly what Dev's version is doing at the moment. But that's the
> behaviour that leads to the "creep" problem.
Right.
--
Cheers,
David / dhildenb
On 20/01/2025 18:39, David Hildenbrand wrote:
> On 20.01.25 17:27, Ryan Roberts wrote:
>> On 20/01/2025 13:56, David Hildenbrand wrote:
>>> On 20.01.25 14:37, Ryan Roberts wrote:
>>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>>> don't
>>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>>> collapse
>>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>>> even
>>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>>> issue?
>>>>>>> If not, how did you solve?
>>>>>>
>>>>>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>>>>>> lower as a default would "help".
>>>>>
>>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>>> document that the other weird configurations will make mTHP skip, because
>>>>> "weird and unexpected"? :)
>>
>> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
>> perhaps it's better to say we round to closest of 0 and max?
>
> Maybe. Rounding down always implies doing something not necessarily desired.
>
> In any case, I assume most setups just have the default values here ... :)
>
>>
>>>>>
>>>>
>>>> That sounds like a great simplification in principle!
>>>
>>> And certainly a much easier place to start :)
>>>
>>> If we ever get the request to support something else, maybe that's also where we
>>> can learn *why*, and what we would actually want to do with mTHP.
>>>
>>>> We would need to consider
>>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>>> with those?
>>>
>>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>>> "/ 8" heuristics.
>>>
>>>
>>> One simple starting point here is of course to say "when collapsing mTHP, all
>>> have to be unshared and all have to be swapped in", so to essentially ignore
>>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>>> collapse and worry about that later, when really required.
>>
>> For swap, if we assume we start with the whole VMA swapped out, I think setting
>> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
>> in sequentially? I guess that's creep due to faulting pattern though, so at
>> least it's not due to collapse. Doesn't feel ideal though.
>>
>> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
>> for private COWed pages, and khugepaged will trigger the COW on collapse?
>
> Yes.
>
>> So
>> again depending on the pattern of writes we could still end up with creep in a
>> similar way to swap?
>
> I think in regard to both the answer is "yes", so a simple starting point
> but not necessarily what we want long term. The creep is at least "not
> wasting more memory", because we don't collapse where PMD wouldn't have
> collapsed.
>
> After all, right now we don't collapse mTHP, now we would collapse mTHP in many
> scenarios, so we don't have to be perfect initially.
>
> Deriving stuff for small THP sizes when configured for PMD THP sizes is not easy
> to do right.
>
>>
>>>
>>> Two alternatives I discussed with Nico for these (not sure which is implemented
>>> here) is to calculate it proportionally to the folio order we are collapsing:
>>
>> You're only listing one option here... what's the other one you discussed?
>>
>
> Ah sorry, reshuffled it and then had to rush.
>
> The other thing I had in mind is to scan the whole PMD range, and skip
> the whole PMD range if it doesn't obey the max_ptes_* stuff. Not perfect,
> but it will mean that we behave just like PMD collapse would, unless I am
> missing something.
Hmm that's an interesting idea; If I've understood, we would effectively test
the PMD for collapse as if we were collapsing to PMD-size, but then do the
actual collapse to the "highest allowed order" (dictated by what's enabled +
MADV_HUGEPAGE config).
I'm not so sure this is a good way to go; there would be no way to support VMAs
(or parts of VMAs) that don't span a full PMD. And I can imagine we might see
memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
let's say we have a 2M (aligned portion of a) VMA that does NOT have
MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
folios, 31 of which are all zeros. We have spent the same amount of memory as if
2M=always. Perhaps that's a detail that could be solved by ignoring fully none
64K blocks when collapsing to 64K...
Personally, I think your "enforce simplification of the tunables for mTHP
collapse" idea is the best we have so far.
But I'll just push against your pushback of the per-VMA cursor idea briefly. It
strikes me that this could be useful for khugepaged regardless of mTHP support.
Today, it starts scanning a VMA, collapses the first PMD it finds that meets the
requirements, then switches to scanning another VMA. When it eventually gets
back to scanning the first VMA, it starts from the beginning again. Wouldn't a
cursor help reduce the amount of scanning it has to do?
>
>
>>>
>>> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
>>> (256 PTEs), 32 PTEs would be allowed to be swapped out.
>>
>> Yeah this is exactly what Dev's version is doing at the moment. But that's the
>> behaviour that leads to the "creep" problem.
>
> Right.
>
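For concreteness, the cursor idea Ryan raises above might look like the
sketch below in its simplest form -- purely hypothetical, and (as David
notes later in the thread) vm_area_struct must not grow, so the state
would likely need to live in a side structure rather than the VMA itself:

  /* Hypothetical per-VMA scan state for khugepaged. */
  struct khugepaged_vma_state {
          unsigned long scan_cursor;      /* next address to scan */
  };

  /* Advance past a scanned range. */
  static void cursor_advance(struct khugepaged_vma_state *st,
                             unsigned long end)
  {
          st->scan_cursor = end;
  }

  /* Pull the cursor back when a fault changes an already-scanned range. */
  static void cursor_retreat(struct khugepaged_vma_state *st,
                             unsigned long addr)
  {
          if (addr < st->scan_cursor)
                  st->scan_cursor = addr;
  }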
On 21/01/25 3:18 pm, Ryan Roberts wrote:
> On 20/01/2025 18:39, David Hildenbrand wrote:
>> On 20.01.25 17:27, Ryan Roberts wrote:
>>> On 20/01/2025 13:56, David Hildenbrand wrote:
>>>> On 20.01.25 14:37, Ryan Roberts wrote:
>>>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>>>> don't
>>>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>>>> collapse
>>>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>>>> even
>>>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>>>> issue?
>>>>>>>> If not, how did you solve?
>>>>>>>
>>>>>>> Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
>>>>>>> lower as a default would "help".
>>>>>>
>>>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>>>> document that the other weird configurations will make mTHP skip, because
>>>>>> "weird and unexpected"? :)
>>>
>>> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
>>> perhaps it's better to say we round to closest of 0 and max?
>>
>> Maybe. Rounding down always implies doing something not necessarily desired.
>>
>> In any case, I assume most setups just have the default values here ... :)
>>
>>>
>>>>>>
>>>>>
>>>>> That sounds like a great simplification in principle!
>>>>
>>>> And certainly a much easier place to start :)
>>>>
>>>> If we ever get the request to support something else, maybe that's also where we
>>>> can learn *why*, and what we would actually want to do with mTHP.
>>>>
>>>>> We would need to consider
>>>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>>>> with those?
>>>>
>>>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>>>> "/ 8" heuristics.
>>>>
>>>>
>>>> One simple starting point here is of course to say "when collapsing mTHP, all
>>>> have to be unshared and all have to be swapped in", so to essentially ignore
>>>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>>>> collapse and worry about that later, when really required.
>>>
>>> For swap, if we assume we start with the whole VMA swapped out, I think setting
>>> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
>>> in sequentially? I guess that's creep due to faulting pattern though, so at
>>> least it's not due to collapse. Doesn't feel ideal though.
>>>
>>> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
>>> for private COWed pages, and khugepaged will trigger the COW on collapse?
>>
>> Yes.
>>
>>> So
>>> again depending on the pattern of writes we could still end up with creep in a
>>> similar way to swap?
>>
>> I think in regard to both the answer is "yes", so a simple starting point
>> but not necessarily what we want long term. The creep is at least "not
>> wasting more memory", because we don't collapse where PMD wouldn't have
>> collapsed.
>>
>> After all, right now we don't collapse mTHP, now we would collapse mTHP in many
>> scenarios, so we don't have to be perfect initially.
>>
>> Deriving stuff for small THP sizes when configured for PMD THP sizes is not easy
>> to do right.
>>
>>>
>>>>
>>>> Two alternatives I discussed with Nico for these (not sure which is implemented
>>>> here) is to calculate it proportionally to the folio order we are collapsing:
>>>
>>> You're only listing one option here... what's the other one you discussed?
>>>
>>
>> Ah sorry, reshuffled it and then had to rush.
>>
>> The other thing I had in mind is to scan the whole PMD range, and skip
>> the whole PMD range if it doesn't obey the max_ptes_* stuff. Not perfect,
>> but it will mean that we behave just like PMD collapse would, unless I am
>> missing something.
>
> Hmm that's an interesting idea; If I've understood, we would effectively test
> the PMD for collapse as if we were collapsing to PMD-size, but then do the
> actual collapse to the "highest allowed order" (dictated by what's enabled +
> MADV_HUGEPAGE config).
>
> I'm not so sure this is a good way to go; there would be no way to support VMAs
> (or parts of VMAs) that don't span a full PMD. And I can imagine we might see
> memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
> let's say we have a 2M (aligned portion of a) VMA that does NOT have
> MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
> but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
> folios, 31 of which are all zeros. We have spent the same amount of memory as if
> 2M=always. Perhaps that's a detail that could be solved by ignoring fully none
> 64K blocks when collapsing to 64K...
There are two ways a collapse can fail.
(1) We exceed one of max_ptes_* in the range.
(2) We fail in the function collapse_huge_page(), whose real bottleneck
is alloc_charge_folio(); i.e. we fail to find physically contiguous
memory for the corresponding scan order.
Now, we do not care whether we get a "creep" in case of (2), because
khugepaged's end goal is to collapse to the highest order. We care if we
get a creep from (1), because a smaller collapse leads us to come under
the max_ptes_* constraint for a bigger order.
So I think, what David is suggesting is to ignore (1) for smaller
orders. I'll try to formalize the proposed policy as follows:
khugepaged's goal should be to collapse to the highest order possible,
and therefore, we should bother with max_ptes_* only for the highest
order, since that really is the end goal. So, the algorithm is: check
max_ptes_* for PMD -> if success, collapse_huge_page(PMD_ORDER) -> if
this fails, we drop down the order, and we ignore the local distribution
of none, swap, shared PTEs -> collapse_huge_page(say, 64K) -> success.
Now, this won't give us any creep since PMD collapse was eligible
anyways. The only rule we should add is "collapse to a smaller order
only when at least one PTE is filled in that range" since we do not want
to collapse an empty range, as Ryan notes above. This should be easy to
do; just maintain a local bitmap in hpage_collapse_scan_pmd(). The
question we need to answer is "is the bitmap non-zero for a range?",
which we can answer in O(1), as sketched below.
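A sketch of that O(1) check, assuming (as on x86 with 512 PTEs and
8-page chunks) the bitmap fits in a single unsigned long; the helper is
invented, GENMASK() is the existing kernel macro:

  #include <linux/bits.h>

  /* Is any bit set in [start, start + nr) of a one-word bitmap? */
  static bool bitmap_range_nonzero(unsigned long bitmap,
                                   unsigned int start, unsigned int nr)
  {
          return bitmap & GENMASK(start + nr - 1, start);
  }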
The issue of bothering with max_ptes_* will come here when we drop and
reacquire locks, because the global distribution may change, so we will
be trading off accuracy...
With regards to "there would be no way to support VMAs that don't span a
full PMD", the policy should be "scale max_ptes_* to the highest order
possible for the VMA". Basically, just get rid of the PMD-case and
generalize this algorithm.
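The overall shape of that policy might be something like this sketch --
all helpers and the simplified collapse_huge_page() signature are
invented for illustration; only the SCAN_* result codes are the existing
ones:

  /*
   * Sketch of the proposed policy: gate on max_ptes_* only at the top
   * order; below that, drop orders only on allocation failure, and
   * never collapse a range with no PTEs present at all.
   */
  static int collapse_scan_range(struct mm_struct *mm,
                                 struct vm_area_struct *vma,
                                 unsigned long addr)
  {
          unsigned int order = highest_enabled_order(vma);

          if (!within_max_ptes_limits(vma, addr, order))
                  return SCAN_EXCEED_NONE_PTE;

          for (; order >= 2; order = next_lower_enabled_order(order)) {
                  if (!range_has_present_pte(vma, addr, order))
                          continue;       /* skip fully-empty ranges */
                  if (collapse_huge_page(mm, addr, order) == SCAN_SUCCEED)
                          return SCAN_SUCCEED;
                  /* allocation failed: fall through to a lower order */
          }
          return SCAN_FAIL;
  }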
>
> Personally, I think your "enforce simplification of the tunables for mTHP
> collapse" idea is the best we have so far.
What I understood: keep max_ptes_none = 511 or 0, document that keeping
it other than that may cause the creep, and consider max_ptes_swap =
max_ptes_shared = 0 for mTHP. I will vote for this too... we won't have
locking issues since we will *have* to scan ranges for smaller orders to
check that nothing is swapped out or shared. I have two justifications
to support this policy:
(1) Systems in general may have, and Android in particular has, a good
percentage of memory in swap, so we really don't want khugepaged to say
"please swap in memory for this range so I can collapse to 64K" when the
real goal is to collapse to 2MB.
(2) The reason we dropped down to a lower order is that we failed
PMD-collapse. The reason for that is that we couldn't find 2MB of
physically contiguous memory, so assume, for the sake of argument, that
we are under memory pressure. We don't want khugepaged to say "please
swap in memory and let me create even more memory pressure". The same
argument applies to the shared case.
All in all, the essence of the policy is that mTHP collapse should be
made stricter, since the fact that we failed PMD-collapse means we can't
ask for a lot of memory from the system just to do an mTHP collapse.
With regards to max_ptes_none, David says:
"If we ever get the request to support something else, maybe that's also
where we can learn *why*, and what we would actually want to do with mTHP. "
Which sounds very reasonable; since we are solving a theoretical
problem, we don't have a real-use-case justification as to why we should
bother to solve this problem :)
> Hmm that's an interesting idea; If I've understood, we would effectively
> test the PMD for collapse as if we were collapsing to PMD-size, but then
> do the actual collapse to the "highest allowed order" (dictated by
> what's enabled + MADV_HUGEPAGE config).
>
> I'm not so sure this is a good way to go; there would be no way to
> support VMAs (or parts of VMAs) that don't span a full PMD.

In Nico's approach to locking, we temporarily have to remove the PTE
table either way. While holding the mmap lock in write mode, the VMAs
cannot go away, so we could scan the whole PTE table to figure it out.

To just figure out "none" vs. "non-none" vs. "swap PTE", we probably
don't need the other VMA information. Figuring out "shared" is trickier,
because we have to obtain the folio and would have to walk the other
VMAs.

It's a good question if we would have to VMA-write-lock the other
affected VMAs as well in order to temporarily remove the PTE table that
crosses multiple VMAs, or if we'd need something different (collapse PMD
marker) so the page table walkers could handle that case properly --
keep retrying or fall back to the mmap lock.

> And I can imagine we might see memory bloat; imagine you have
> 2M=madvise, 64K=always, max_ptes_none=511, and let's say we have a 2M
> (aligned portion of a) VMA that does NOT have MADV_HUGEPAGE set and has
> a single page populated. It passes the PMD-size test, but we opt to
> collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
> folios, 31 of which are all zeros. We have spent the same amount of
> memory as if 2M=always. Perhaps that's a detail that could be solved by
> ignoring fully none 64K blocks when collapsing to 64K...

Yes, that's what I had in mind. No need to collapse where there is
nothing at all ...

> Personally, I think your "enforce simplification of the tunables for
> mTHP collapse" idea is the best we have so far.

Right.

> But I'll just push against your pushback of the per-VMA cursor idea
> briefly. It strikes me that this could be useful for khugepaged
> regardless of mTHP support.

Not a clear pushback; as you say, to me this is a different optimization
and I am missing how it could really solve the problem at hand here.

Note that we're already fighting with not growing VMAs (see the VMA
locking changes under review), but maybe we could still squeeze it in
there without requiring a bigger slab.

> Today, it starts scanning a VMA, collapses the first PMD it finds that
> meets the requirements, then switches to scanning another VMA. When it
> eventually gets back to scanning the first VMA, it starts from the
> beginning again. Wouldn't a cursor help reduce the amount of scanning it
> has to do?

Yes, that whole scanning approach sounds weird. I would have assumed
that it might nowadays be smarter to just scan the MM sequentially, and
not jump between VMAs.

Assume you only have a handful of large VMAs (like in a VMM); you'd end
up scanning the same handful of VMAs over and over again.

I think a lot of the khugepaged codebase is just full of historical
baggage that must be cleaned up and re-validated if it is still
required ...

--
Cheers,

David / dhildenb
On 21/01/25 3:49 pm, David Hildenbrand wrote:
>> Hmm that's an interesting idea; If I've understood, we would
>> effectively test the PMD for collapse as if we were collapsing to
>> PMD-size, but then do the actual collapse to the "highest allowed
>> order" (dictated by what's enabled + MADV_HUGEPAGE config).
>>
>> I'm not so sure this is a good way to go; there would be no way to
>> support VMAs (or parts of VMAs) that don't span a full PMD.
>
> In Nico's approach to locking, we temporarily have to remove the PTE
> table either way. While holding the mmap lock in write mode, the VMAs
> cannot go away, so we could scan the whole PTE table to figure it out.
>
> To just figure out "none" vs. "non-none" vs. "swap PTE", we probably
> don't need the other VMA information. Figuring out "shared" is
> trickier, because we have to obtain the folio and would have to walk
> the other VMAs.
>
> It's a good question if we would have to VMA-write-lock the other
> affected VMAs as well in order to temporarily remove the PTE table that
> crosses multiple VMAs, or if we'd need something different (collapse
> PMD marker) so the page table walkers could handle that case properly
> -- keep retrying or fall back to the mmap lock.

I missed this reply, could have saved me some trouble :)

When collapsing for VMAs < PMD, we *will* have to write-lock the VMAs,
write-lock the anon_vma's, and write-lock vma->vm_file->f_mapping for
file VMAs, otherwise someone may fault on another VMA mapping the same
PTE table.

I was trying to implement this, but cannot find a clean way: we will
have to implement it like mm_take_all_locks(), with a similar bit like
AS_MM_ALL_LOCKS, because, suppose we need to lock all anon_vma's, then
two VMAs may have the same anon_vma, and we cannot get away with the
following check:

        lock only if !rwsem_is_locked(&vma->anon_vma->root->rwsem)

since I need to skip the lock only when it is khugepaged which has taken
the lock. I guess the way to go about this then is the PMD-marker
thingy, which I am not very familiar with.

>> And I can imagine we might see memory bloat; imagine you have
>> 2M=madvise, 64K=always, max_ptes_none=511, and let's say we have a 2M
>> (aligned portion of a) VMA that does NOT have MADV_HUGEPAGE set and
>> has a single page populated. It passes the PMD-size test, but we opt
>> to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
>> folios, 31 of which are all zeros. We have spent the same amount of
>> memory as if 2M=always. Perhaps that's a detail that could be solved
>> by ignoring fully none 64K blocks when collapsing to 64K...
>
> Yes, that's what I had in mind. No need to collapse where there is
> nothing at all ...
>
>> Personally, I think your "enforce simplification of the tunables for
>> mTHP collapse" idea is the best we have so far.
>
> Right.
>
>> But I'll just push against your pushback of the per-VMA cursor idea
>> briefly. It strikes me that this could be useful for khugepaged
>> regardless of mTHP support.
>
> Not a clear pushback; as you say, to me this is a different
> optimization and I am missing how it could really solve the problem at
> hand here.
>
> Note that we're already fighting with not growing VMAs (see the VMA
> locking changes under review), but maybe we could still squeeze it in
> there without requiring a bigger slab.
>
>> Today, it starts scanning a VMA, collapses the first PMD it finds that
>> meets the requirements, then switches to scanning another VMA. When it
>> eventually gets back to scanning the first VMA, it starts from the
>> beginning again. Wouldn't a cursor help reduce the amount of scanning
>> it has to do?
>
> Yes, that whole scanning approach sounds weird. I would have assumed
> that it might nowadays be smarter to just scan the MM sequentially, and
> not jump between VMAs.
>
> Assume you only have a handful of large VMAs (like in a VMM); you'd end
> up scanning the same handful of VMAs over and over again.
>
> I think a lot of the khugepaged codebase is just full of historical
> baggage that must be cleaned up and re-validated if it is still
> required ...
On 16/01/2025 20:53, Nico Pache wrote:
> On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Nico,
>
> Hi Ryan!
>>
>> On 08/01/2025 23:31, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>
>> It's great to see multiple solutions for this feature being posted; I
>> guess that leaves us with the luxurious problem of figuring out an
>> uber-patchset that incorporates the best of both? :)
>
> I guess so! My motivation for developing this was inspired by my
> 'defer' RFC, which can't really live without khugepaged having mTHP
> support (i.e. having 32k mTHP=always and global=defer doesn't make
> sense).

I'm not sure why that wouldn't make sense? Setting global=defer would
only be picked up for a given size that sets "inherit". So "32k=always,
2m=inherit, global=defer" is the same as "32k=always, 2m=defer"; which
means you would try to allocate 32K directly in the fault handler and
defer collapse to 2m to khugepaged. I guess where it would get difficult
is if you set a size less than PMD-size to defer; at the moment
khugepaged can't actually do that; it would just end up collapsing to
2M? Anyway, I'm rambling... I get your point.

>> I haven't had a chance to review your series in detail yet, but have a
>> few questions below that will help me understand the key differences
>> between your series and Dev's.
>>
>>> [...]
>>>
>>> Some design choices to note:
>>> - bitmap structures are allocated dynamically because on some arch's
>>>   (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>>>   compile time leading to warnings.
>>
>> We have MAX_PTRS_PER_PTE and friends though, which are worst case and
>> compile time. Could these help avoid the dynamic allocation?
>>
>> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)
>
> Is MAX_PMD_ORDER == PMD_ORDER? If not, this might introduce weird edge
> cases where PMD_ORDER < MAX_PMD_ORDER.

No, MAX_PMD_ORDER becomes the largest order that could be configured at
boot. PMD_ORDER is what is actually configured at boot. My understanding
was that you were dynamically allocating your bitmap based on the
runtime value of PMD_ORDER? I was just suggesting that you could
allocate it statically (on stack or whatever) based on MAX_PMD_ORDER,
for the worst-case requirement, and only actually use the portion
required by the runtime PMD_ORDER value. It avoids the kmalloc call.

[...]

> Yeah, on my V2 I've incorporated a threshold (like max_ptes_none) for
> setting the bit. This will cover this case better (given a better
> default max_ptes_none).
> The way I see it, 511 max_ptes_none is just wrong...

You mean it's a bad default?

> we should flip it towards the lower end of the scale (i.e. 64), and the
> "always" THP setting should ignore it (like madvise does).

But user space can already get that behaviour by modifying the tunable,
right? Isn't that just a user space policy choice?

One other thing that occurs to me regarding the bitmap; in the context
of Dev's series, we have discussed policy for what to do when the source
PTEs are backed by a large folio already. I'm guessing if you are making
your smaller-than-PMD-size collapse decisions based solely on the
bitmap, you won't be able to see when the PTEs are already collapsed for
the target order? i.e. let's say you already have a 64K folio fully
mapped in an aligned way. You wouldn't want to "re-collapse" it to 64K.
Are you robust to this?

[...]

> I liked Zi Yan's solution of a per-VMA bit that gets set when
> khugepaged collapses, and unset when the VMA changes (pf, realloc,
> etc). Then khugepaged can only operate on VMAs that don't have the bit
> set. This way we only collapse once, unless the mapping was changed.

Dev raised the issue in discussion against his series that currently
khugepaged doesn't scan the entire VMA; it scans to the first PMD that
it collapses then moves to another VMA. I guess that's a fairness thing.
So a VMA flag won't quite do the trick, assuming we want to continue
with that behavior. Perhaps we could keep a "cursor" in the VMA though,
which describes the starting address of the next scan. We can move it
forwards as we scan. And move it backwards when taking a fault. Still
not perfect, but perhaps good enough?

> Could we map the new "non-none" pages to the zero page (rather than
> actually zeroing the page), so they don't actually act as new "utilized
> pages" and are still counted as none pages during the scan (until they
> are written to)?

I think you are proposing to use the zero page as a PTE marker to say
"this region is scheduled for collapse"? In which case, why not just use
a PTE marker... But you still have to do the collapse at some point
(which I guess you are now deferring to the next page fault that hits
one of those markers)? Once you have collapsed, you're still back to the
original issue. So I don't think it's bought you anything except
complexity and more latency :)

Thanks,
Ryan

> [...]
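For reference, the static allocation Ryan suggests might look like the
sketch below -- MAX_MTHP_BITMAP_SIZE is an invented name, and hanging it
off khugepaged's collapse_control is an assumption; MAX_PTRS_PER_PTE and
DECLARE_BITMAP() are the existing symbols:

  /*
   * Sketch: size the bitmap for the compile-time worst case and only
   * use the first PTRS_PER_PTE >> MTHP_MIN_ORDER bits at runtime.
   */
  #define MAX_MTHP_BITMAP_SIZE    (MAX_PTRS_PER_PTE >> MTHP_MIN_ORDER)

  struct collapse_control {
          DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
          /* ... existing fields ... */
  };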
On Mon, Jan 20, 2025 at 5:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 16/01/2025 20:53, Nico Pache wrote: > > On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi Nico, > > Hi Ryan! > >> > >> On 08/01/2025 23:31, Nico Pache wrote: > >>> The following series provides khugepaged and madvise collapse with the > >>> capability to collapse regions to mTHPs. > >> > >> It's great to see multiple solutions for this feature being posted; I guess that > >> leaves us with the luxurious problem of figuring out an uber-patchset that > >> incorporates the best of both? :) > > I guess so! My motivation for developing this was inspired by my > > 'defer' RFC. Which can't really live without khugepaged having mTHP > > support (ie having 32k mTHP= always and global=defer doesnt make > > sense). > > I'm not sure why that wouldn't make sense? setting global=defer would only be > picked up for a given size that sets "inherit". So "32k=always, 2m=inherit, > global=defer" is the same as "32k=always, 2m=defer"; which means you would try > to allocate 32K directly in the fault handler and defer collapse to 2m to > khugepaged. I guess where it would get difficult is if you set a size less than > PMD-size to defer; at the moment khugepaged can't actually do that; it would > just end up collapsing to 2M? Anyway, I'm rambling... I get your point. Yeah looks like you found one of the issues. so defer means no pf time (m)THPs. mthp sysctls need a "defer" entry, what does it mean to defer globally and have a mthp size as always or inherit? I assume for global=defer and mthps=always/inherit/defer, we defer at pf time and can collapse the mthp. and gobal=always and mthp=defer, we always allocate a thp, then khugepaged can scan for (m)thp collapse. > > >> > >> I haven't had a chance to review your series in detail yet, but have a few > >> questions below that will help me understand the key differences between your > >> series and Dev's. > >> > >>> > >>> To achieve this we generalize the khugepaged functions to no longer depend > >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages > >>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked > >>> using a bitmap. After the PMD scan is done, we do binary recursion on the > >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction > >>> on max_ptes_none is removed during the scan, to make sure we account for > >>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to > >>> determine how full a mTHP order needs to be before collapsing it. > >>> > >>> Some design choices to note: > >>> - bitmap structures are allocated dynamically because on some arch's > >>> (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at > >>> compile time leading to warnings. > >> > >> We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile > >> time. Could these help avoid the dynamic allocation? > >> > >> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) > > is the MAX_PMD_ORDER = PMD_ORDER? if not this might introduce weird > > edge cases where PMD_ORDER < MAX_PMD_ORDER. > > No, MAX_PMD_ORDER becomes the largest order that could be configured at boot. > PMD_ORDER is what is actually configured at boot. My understanding was that you > were dynamically allocating your bitmap based on the runtime value of PMD_ORDER? 
> I was just suggesting that you could allocate it statically (on stack or > whatever) based on MAX_PMD_ORDER, for the worst-case requirement and only > actually use the portion required by the runtime PMD_ORDER value. It avoids the > kmalloc call. I originally had this on the stack, but the PMD_ORDER gave me trouble for ppc. Ill try this approach to get it back on the stack! Thanks! > > > > >> > >> Althogh to be honest, it's not super clear to me what the benefit of the bitmap > >> is vs just iterating through the PTEs like Dev does; is there a significant cost > >> saving in practice? On the face of it, it seems like it might be uneeded complexity. > > The bitmap was to encode the state of PMD without needing rescanning > > (or refactor a lot of code). We keep the scan runtime constant at 512 > > (for x86). Dev did some good analysis for this here > > https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/ > > This prevents needing to hold the read lock for longer, and prevents > > needing to reacquire it too. > >>>>> - The recursion is masked through a stack structure. > >>> - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was > >>> 64bit on x86. This provides some optimization on the bitmap operations. > >>> if other arches/configs that have larger than 512 PTEs per PMD want to > >>> compress their bitmap further we can change this value per arch. > >> > >> So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8 > >> pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4 > >> set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is > >> enabled, you would want to collapse every other 16K block in this case, but I'm > >> guessing with your scheme, all the bits will be clear and no collapse will > >> occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA. > > > > Yeah on my V2 ive incorporated a threshold (like max_ptes_none) for > > setting the bit. This will covert this case better (given a better > > default max_ptes_none). > > The way i see it 511 max_ptes_none is just wrong... > > You mean it's a bad default? Yeah that's the better phrasing. > > > we should flip it > > towards the lower end of the scale (ie 64), and the "always" THP > > setting should ignore it (like madvise does). > > But user space can already get that behaviour by modifying the tunable, right? > Isn't that just a user space policy choice? technically yes, but shouldn't defaults reflect sane behavior? ofc this is my opinion, some might think 511 is not a bad default at all. My original perspective comes from the memory waste issue, 511 could be really good for performance if you benefit from PMDs; hence why I was also suggesting "always" ignores the max_ptes_none. > > One other thing that occurs to me regarding the bitmap; In the context of Dev's > series, we have discussed policy for what to do when the source PTEs are backed > by a large folio already. I'm guessing if you are making your > smaller-than-PMD-size collapse decisions based solely on the bitmap, you won't > be able to see when the PTEs are already collpsed for the target order? i.e. > let's say you already have a 64K folio fully mapped in an aligned way. You > wouldn't want to "re-collapse" it to 64K. Are you robust to this? Yes, I am also skipping the order <= folio_order case. 
> > > > >> > >>> > >>> Patch 1-2: Some refactoring to combine madvise_collapse and khugepaged > >>> Patch 3: A minor "fix"/optimization > >>> Patch 4: Refactor/rename hpage_collapse > >>> Patch 5-7: Generalize khugepaged functions for arbitrary orders > >>> Patch 8-11: The mTHP patches > >>> > >>> This series acts as an alternative to Dev Jain's approach [1]. The two > >>> series differ in a few ways: > >>> - My approach uses a bitmap to store the state of the linear scan_pmd to > >>> then determine potential mTHP batches. Devs incorporates his directly > >>> into the scan, and will try each available order. > >> > >> So if I'm understanding, the benefit of the bitmap is to remove the need to > >> re-scan the "low" PTEs when moving to a lower order, which is what Dev's > >> approach does? Are there not some locking/consistency issues to manage if not > >> re-scanning? > > Correct, so far i haven't found any issues (other than the bugs Dev > > reported in his review)-- my fixed version of this RFC has been > > running fine with no notable locking issues. > >> > >>> - Dev is attempting to optimize the locking, while my approach keeps the > >>> locking changes to a minimum. I believe his changes are not safe for > >>> uffd. > >> > >> I agree; let's keep the locking simple for the initial effort. > >> > >>> - Dev's changes only work for khugepaged not madvise_collapse (although > >>> i think that was by choice and it could easily support madvise) > >> > >> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it > >> though? I think it ignores the sysfs settings (max_ptes_none and friends) so > >> presumably it will continue to be much more greedy about collapsing to the > >> highest possible order and only fall back to lower orders if the VMA boundaries > >> force it to or if the higher order allocation fails? > > Kind of, because I removed the max_ptes_none check during the scan, > > and reintroduced it in the bitmap scan (without a madvise > > restriction), MADV_COLLAPSE and khugepaged will work more similarly. > >> > >>> - Dev scales all khugepaged sysfs tunables by order, while im removing > >>> the restriction of max_ptes_none and converting it to a scale to > >>> determine a (m)THP threshold. > >> > >> I don't really understand this statement. You say you are removing the > >> restriction of max_ptes_none. But then you say you scale it to determine a > >> threshold. So are you honoring it or not? And if you're honouring it, how is > >> your scaling method different to Dev's? What about the other tunables (shared > >> and swap)? > > I removed the max_ptes_none restriction during the initial scan, so we > > can account for the full PMD (which is what happens with > > max_ptes_none=511 anyways). Then max_ptes_none can be used with the > > bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for > > finding the optimal mTHP size. > > > > This RFC scales max_ptes_none to 0-100, but it has some really bad > > rounding issues, so instead ive incorporated scaling (via bitshifting) > > like Dev did in his series. Ive tested this and it's more accurate > > now. > >> > >>> - Dev turns on khugepaged if any order is available while mine still > >>> only runs if PMDs are enabled. I like Dev's approach and will most > >>> likely do the same in my PATCH posting. > >> > >> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs) > >> that cover only a partial PMD entry. I think neither of your implementations > >> currently do that. 
> >> As I understand it, Dev's v2 will add that support. Is your
> >> approach amenable to this?
> >
> > Yes, I believe so. I'm working on adding this too.
> >
> >>
> >>> - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >>>
> >>> Patch 11 was inspired by one of Dev's changes.
> >>
> >> I think the one problem that emerged during review of Dev's series, which we
> >> don't have a proper solution to yet, is the issue of "creep", where regions can
> >> be collapsed to progressively higher orders through iterative scans. At each
> >> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
> >> effectively adds more non-none PTEs, so the next scan will then collapse to an
> >> even higher order. Does your solution suffer from this (theoretical/edge case)
> >> issue? If not, how did you solve it?
> >
> > Yes, sadly it suffers from the same issue; bringing max_ptes_none much
> > lower as a default would "help".
> > I liked Zi Yan's solution of a per-VMA bit that gets set when
> > khugepaged collapses, and unset when the VMA changes (pf, realloc,
> > etc). Then khugepaged can only operate on VMAs that don't have the bit
> > set. This way we only collapse once, unless the mapping was changed.
>
> Dev raised the issue in discussion against his series that currently khugepaged
> doesn't scan the entire VMA; it scans to the first PMD that it collapses, then
> moves to another VMA. I guess that's a fairness thing. So a VMA flag won't quite
> do the trick, assuming we want to continue with that behavior. Perhaps we could
> keep a "cursor" in the VMA though, which describes the starting address of the
> next scan. We can move it forwards as we scan, and move it backwards when taking
> a fault. Still not perfect, but perhaps good enough?

I started playing around with some of these changes, and it seems to work,
but David raised the issue that we can't expand vm_area_struct, so I need
to find a different solution.

> >
> > Could we map the new "non-none" pages to the zero page (rather than
> > actually zeroing the page), so they don't actually act as new "utilized
> > pages" and are still counted as none pages during the scan (until they
> > are written to)?
>
> I think you are proposing to use the zero page as a PTE marker to say "this
> region is scheduled for collapse"? In which case, why not just use a PTE
> marker... But you still have to do the collapse at some point (which I guess you
> are now deferring to the next page fault that hits one of those markers)? Once
> you have collapsed, you're still back to the original issue. So I don't think
> it's bought you anything except complexity and more latency :)

Ah OK, I see! Thanks for clarifying.

>
> Thanks,
> Ryan
>
> >
> >>
> >> Thanks,
> >> Ryan
> >
> > Cheers!
> > -- Nico
> >
> >>
> >>
> >>>
> >>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>
> >>> Nico Pache (11):
> >>> introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >>> khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >>> khugepaged: Don't allocate khugepaged mm_slot early
> >>> khugepaged: rename hpage_collapse_* to khugepaged_*
> >>> khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>> khugepaged: generalize alloc_charge_folio for mTHP support
> >>> khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>> khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>> khugepaged: add mTHP support
> >>> khugepaged: remove max_ptes_none restriction on the pmd scan
> >>> khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>> include/linux/khugepaged.h | 4 +-
> >>> mm/huge_memory.c | 3 +-
> >>> mm/khugepaged.c | 436 +++++++++++++++++++++++++------------
> >>> 3 files changed, 306 insertions(+), 137 deletions(-)
> >>>
> >>
>
> >
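The bitshift scaling of max_ptes_none mentioned in the email above could
look like the following sketch. The helper name is hypothetical, while
khugepaged_max_ptes_none and HPAGE_PMD_ORDER are the existing kernel
symbols:

/*
 * Hypothetical helper: scale the PMD-order max_ptes_none budget down
 * proportionally for a smaller collapse order, avoiding the rounding
 * issues of a 0-100 percentage conversion.
 */
static unsigned int scaled_max_ptes_none(unsigned int order)
{
        /* e.g. on x86, order 4 (64K): 511 >> (9 - 4) = 15 of 16 PTEs */
        return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

The shift keeps the threshold exact for every power-of-two order, which a
percentage conversion cannot guarantee.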
--- snip ---
>>
>> Although to be honest, it's not super clear to me what the benefit of the bitmap
>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>> saving in practice? On the face of it, it seems like it might be unneeded complexity.
> The bitmap was to encode the state of the PMD without needing to rescan
> (or refactor a lot of code). We keep the scan runtime constant at 512
> (for x86). Dev did some good analysis of this here:
> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/

I think I strayed and over-analyzed, and probably did not make my
main objection clear enough, so let us cut to the chase.
*Why* is it correct to remember the state of the PMD?

In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
tunables again, since we dropped the lock. The bitmap approach you are
taking, and in general any algorithm which tries to remember the
state of the PMD, violates the entire point of max_ptes_*. Take for
example: suppose the PTE table had a lot of shared PTEs. After you drop
the PTL, you do this: scan_bitmap() -> read_unlock() ->
alloc_charge_folio() -> read_lock() -> read_unlock()... which is a lot
of stuff. Now, you do write_lock(), which means that you need to wait
for all faulting/forking/mremap/mmap etc. to stop. Suppose this process
forks and then a lot of PTEs become shared. The point of max_ptes_shared
is to stop the collapse here, since we do not want memory bloat
(collapse will grab more memory from the buddy and the old memory won't
be freed because it has a reference from the parent/child).

Another example would be: a sysadmin does not want too much memory
wastage from khugepaged, so we decide to set max_ptes_none low. When you
scan the PTE table you justify the collapse. After you drop the PTL and
the mmap_lock, a munmap() happens in the region, no longer justifying
the collapse. If you have a lot of VMAs of size <= 2MB, then any
munmap() on a VMA will happen on the single PTE table present.

So, IMHO, before even jumping into analyzing the bitmap algorithm, we need
to ask whether any algorithm remembering the state of the PMD is even
conceptually right.

Then, you have the harder task of proving that your optimization is
actually an optimization, that it is not rendered futile by its own
overhead. From a high-level mathematical PoV, you are saving
iterations. Any mathematical analysis has the underlying assumption that
every iteration is equal. But the list [pte, pte + 1, ..., pte + (1 <<
order)] is virtually and physically contiguous in memory, so prefetching
helps us. You are trying to save on PTE memory references, but then look
at the number of bitmap memory references you have created, not to
mention that you are doing a (costly?) division operation in there; you
have a while loop, a stack, new structs, and if conditions. I do not see
how this is any faster than a naive linear scan.

> This prevents needing to hold the read lock for longer, and prevents
> needing to reacquire it too.

My implementation does not hold the read lock for longer. What you mean
to say is that I need to reacquire the lock, and this is by design, to
ensure correctness, which boils down to what I wrote above.
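To make the sequence Dev objects to easier to follow, here is a heavily
condensed schematic of the collapse path. Arguments and error handling
are elided; only the mmap lock calls are real API, and the functions
named in comments are the real mm/khugepaged.c steps. The key point is
that the PTEs are re-checked against the sysfs limits only after the
locks have been dropped and retaken:

/* Simplified schematic, not the actual code. */
static void collapse_schematic(struct mm_struct *mm)
{
        mmap_read_lock(mm);
        /* First pass: scan the PTEs under the PTL, pick a candidate. */
        mmap_read_unlock(mm);

        /* alloc_charge_folio(): may sleep; PTEs can change meanwhile. */

        mmap_write_lock(mm);    /* excludes faults, fork, mremap, munmap */
        /* hugepage_vma_revalidate(): the VMA itself may have changed.   */
        /* __collapse_huge_page_isolate(): re-checks PTEs vs. max_ptes_*. */
        mmap_write_unlock(mm);
}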
On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> --- snip ---
> >>
> >> Although to be honest, it's not super clear to me what the benefit of the bitmap
> >> is vs just iterating through the PTEs like Dev does; is there a significant cost
> >> saving in practice? On the face of it, it seems like it might be unneeded complexity.
> > The bitmap was to encode the state of the PMD without needing to rescan
> > (or refactor a lot of code). We keep the scan runtime constant at 512
> > (for x86). Dev did some good analysis of this here:
> > https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>
> I think I strayed and over-analyzed, and probably did not make my
> main objection clear enough, so let us cut to the chase.
> *Why* is it correct to remember the state of the PMD?
>
> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
> tunables again, since we dropped the lock. The bitmap approach you are
> taking, and in general any algorithm which tries to remember the
> state of the PMD, violates the entire point of max_ptes_*. Take for
> example: suppose the PTE table had a lot of shared PTEs. After you drop
> the PTL, you do this: scan_bitmap() -> read_unlock() ->
> alloc_charge_folio() -> read_lock() -> read_unlock()... which is a lot

Per your recommendation I dropped the read_lock() -> read_unlock() and
made it a conditional unlock.

> of stuff. Now, you do write_lock(), which means that you need to wait
> for all faulting/forking/mremap/mmap etc. to stop. Suppose this process
> forks and then a lot of PTEs become shared. The point of max_ptes_shared
> is to stop the collapse here, since we do not want memory bloat
> (collapse will grab more memory from the buddy and the old memory won't
> be freed because it has a reference from the parent/child).

That's a fair point, but given the other feedback, my current
implementation now requires mTHPs to have no shared/swap, and I've
improved the sysfs interactions for the set_bitmap and the
max_ptes_none check in the _isolate function.

As for *why* remembering the state is correct: it simply avoids the
need to rescan.

> Another example would be: a sysadmin does not want too much memory
> wastage from khugepaged, so we decide to set max_ptes_none low. When you
> scan the PTE table you justify the collapse. After you drop the PTL and
> the mmap_lock, a munmap() happens in the region, no longer justifying
> the collapse. If you have a lot of VMAs of size <= 2MB, then any
> munmap() on a VMA will happen on the single PTE table present.
>
> So, IMHO, before even jumping into analyzing the bitmap algorithm, we need
> to ask whether any algorithm remembering the state of the PMD is even
> conceptually right.

Both the issues you raised don't really have to do with the bitmap...
they are fair points, but they are more of a criticism of my sysfs
handling. I've cleaned up the max_ptes_none interactions, and now that
we don't plan to initially support swap/shared, both these problems are
'gone'.

>
> Then, you have the harder task of proving that your optimization is
> actually an optimization, that it is not rendered futile by its own
> overhead. From a high-level mathematical PoV, you are saving
> iterations. Any mathematical analysis has the underlying assumption that
> every iteration is equal. But the list [pte, pte + 1, ..., pte + (1 <<
> order)] is virtually and physically contiguous in memory, so prefetching
> helps us. You are trying to save on PTE memory references, but then look
> at the number of bitmap memory references you have created, not to
> mention that you are doing a (costly?) division operation in there; you
> have a while loop, a stack, new structs, and if conditions. I do not see
> how this is any faster than a naive linear scan.

Yeah, it's hard to say without real performance testing. I hope to
include some performance results with my next post.

>
> > This prevents needing to hold the read lock for longer, and prevents
> > needing to reacquire it too.
>
> My implementation does not hold the read lock for longer. What you mean
> to say is that I need to reacquire the lock, and this is by design, to

Yes, sorry.

> ensure correctness, which boils down to what I wrote above.

The write lock is what ensures correctness, not the read lock. The
read lock is to gain insight into potential collapse candidates while
avoiding the cost of the write lock.

Cheers!
-- Nico
>
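For context on the "while loop, a stack, new structs" critique quoted
above, the bitmap walk under discussion has roughly the following shape:
an iterative binary recursion that halves the range until a sub-range is
full enough to collapse at its corresponding order. This is an
illustrative sketch, not the series' code; collapse_region() and the
frame layout are hypothetical:

/* Sketch only; names and structure are assumptions. */
struct scan_frame {
        int offset;     /* first bit of the region */
        int nr_bits;    /* region width in bits */
};

static void collapse_region(int offset, int nr_bits);   /* hypothetical */

static void scan_bitmap_sketch(unsigned long *bitmap, int total_bits,
                               int threshold_pct)
{
        struct scan_frame stack[16];    /* depth <= log2(total_bits) + 1 */
        int top = 0;

        stack[top++] = (struct scan_frame){ 0, total_bits };
        while (top > 0) {
                struct scan_frame f = stack[--top];
                int i, set = 0;

                for (i = f.offset; i < f.offset + f.nr_bits; i++)
                        if (test_bit(i, bitmap))
                                set++;
                if (set * 100 >= threshold_pct * f.nr_bits) {
                        /* Full enough: collapse this aligned sub-range. */
                        collapse_region(f.offset, f.nr_bits);
                } else if (f.nr_bits > 1) {
                        /* Split and retry both halves at the next lower order. */
                        stack[top++] = (struct scan_frame){ f.offset,
                                                            f.nr_bits / 2 };
                        stack[top++] = (struct scan_frame){ f.offset + f.nr_bits / 2,
                                                            f.nr_bits / 2 };
                }
        }
}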
On 24/01/25 1:54 am, Nico Pache wrote:
> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> --- snip ---
>>>>
>>>> Although to be honest, it's not super clear to me what the benefit of the bitmap
>>>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>>>> saving in practice? On the face of it, it seems like it might be unneeded complexity.
>>> The bitmap was to encode the state of the PMD without needing to rescan
>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>> (for x86). Dev did some good analysis of this here:
>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>> I think I strayed and over-analyzed, and probably did not make my
>> main objection clear enough, so let us cut to the chase.
>> *Why* is it correct to remember the state of the PMD?
>>
>> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
>> tunables again, since we dropped the lock. The bitmap approach you are
>> taking, and in general any algorithm which tries to remember the
>> state of the PMD, violates the entire point of max_ptes_*. Take for
>> example: suppose the PTE table had a lot of shared PTEs. After you drop
>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>> alloc_charge_folio() -> read_lock() -> read_unlock()... which is a lot
> Per your recommendation I dropped the read_lock() -> read_unlock() and
> made it a conditional unlock.

That's not the one I was talking about here...

>> of stuff. Now, you do write_lock(), which means that you need to wait
>> for all faulting/forking/mremap/mmap etc. to stop. Suppose this process
>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>> is to stop the collapse here, since we do not want memory bloat
>> (collapse will grab more memory from the buddy and the old memory won't
>> be freed because it has a reference from the parent/child).
>
> That's a fair point, but given the other feedback, my current
> implementation now requires mTHPs to have no shared/swap, and I've
> improved the sysfs interactions for the set_bitmap and the
> max_ptes_none check in the _isolate function.

I am guessing you are following the policy of letting the creep happen
for none PTEs, and assuming shared and swap to be zero.

>
> As for *why* remembering the state is correct: it simply avoids the
> need to rescan.

That is what I am saying... if collapse_huge_page() fails, then you have
dropped the mmap write lock, so now the state of the PTEs may have
changed, so you must rescan...

>
>> Another example would be: a sysadmin does not want too much memory
>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>> scan the PTE table you justify the collapse. After you drop the PTL and
>> the mmap_lock, a munmap() happens in the region, no longer justifying
>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>> munmap() on a VMA will happen on the single PTE table present.
>>
>> So, IMHO, before even jumping into analyzing the bitmap algorithm, we need
>> to ask whether any algorithm remembering the state of the PMD is even
>> conceptually right.
>
> Both the issues you raised don't really have to do with the bitmap...

Correct, my issue is with any general algorithm remembering PTE state.

> they are fair points, but they are more of a criticism of my sysfs
> handling.
> I've cleaned up the max_ptes_none interactions, and now that
> we don't plan to initially support swap/shared, both these problems are
> 'gone'.
>>
>> Then, you have the harder task of proving that your optimization is
>> actually an optimization, that it is not rendered futile by its own
>> overhead. From a high-level mathematical PoV, you are saving
>> iterations. Any mathematical analysis has the underlying assumption that
>> every iteration is equal. But the list [pte, pte + 1, ..., pte + (1 <<
>> order)] is virtually and physically contiguous in memory, so prefetching
>> helps us. You are trying to save on PTE memory references, but then look
>> at the number of bitmap memory references you have created, not to
>> mention that you are doing a (costly?) division operation in there; you
>> have a while loop, a stack, new structs, and if conditions. I do not see
>> how this is any faster than a naive linear scan.
>
> Yeah, it's hard to say without real performance testing. I hope to
> include some performance results with my next post.
>
>>
>>>> This prevents needing to hold the read lock for longer, and prevents
>>>> needing to reacquire it too.
>>
>> My implementation does not hold the read lock for longer. What you mean
>> to say is that I need to reacquire the lock, and this is by design, to
> Yes, sorry.
>> ensure correctness, which boils down to what I wrote above.
> The write lock is what ensures correctness, not the read lock. The
> read lock is to gain insight into potential collapse candidates while
> avoiding the cost of the write lock.
>
> Cheers!
> -- Nico
>>
>
On 24/01/25 12:43 pm, Dev Jain wrote:
>
>
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> --- snip ---
>>>>>
>>>>> Although to be honest, it's not super clear to me what the benefit of the bitmap
>>>>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>>>>> saving in practice? On the face of it, it seems like it might be unneeded complexity.
>>>> The bitmap was to encode the state of the PMD without needing to rescan
>>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>>> (for x86). Dev did some good analysis of this here:
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>
>>> I think I strayed and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
>>> tunables again, since we dropped the lock. The bitmap approach you are
>>> taking, and in general any algorithm which tries to remember the
>>> state of the PMD, violates the entire point of max_ptes_*. Take for
>>> example: suppose the PTE table had a lot of shared PTEs. After you drop
>>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>>> alloc_charge_folio() -> read_lock() -> read_unlock()... which is a lot
>> Per your recommendation I dropped the read_lock() -> read_unlock() and
>> made it a conditional unlock.
>
> That's not the one I was talking about here...
>
>>> of stuff. Now, you do write_lock(), which means that you need to wait
>>> for all faulting/forking/mremap/mmap etc. to stop. Suppose this process
>>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>>> is to stop the collapse here, since we do not want memory bloat
>>> (collapse will grab more memory from the buddy and the old memory won't
>>> be freed because it has a reference from the parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and I've
>> improved the sysfs interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
>
> I am guessing you are following the policy of letting the creep happen
> for none PTEs, and assuming shared and swap to be zero.

Ah sorry, I read the thread again, and it seems we decided on skipping
mTHP collapse when max_ptes_none is neither 0 nor 511. In any case, we
need to scan the range to check whether we have at least one filled (or
all filled) PTEs, and that none of them are shared or swap.

>
>>
>> As for *why* remembering the state is correct: it simply avoids the
>> need to rescan.
>
> That is what I am saying... if collapse_huge_page() fails, then you have
> dropped the mmap write lock, so now the state of the PTEs may have
> changed, so you must rescan...
>
>>
>>> Another example would be: a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>>> scan the PTE table you justify the collapse. After you drop the PTL and
>>> the mmap_lock, a munmap() happens in the region, no longer justifying
>>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>>> munmap() on a VMA will happen on the single PTE table present.
>>>
>>> So, IMHO, before even jumping into analyzing the bitmap algorithm, we need
>>> to ask whether any algorithm remembering the state of the PMD is even
>>> conceptually right.
>>
>> Both the issues you raised don't really have to do with the bitmap...
>
> Correct, my issue is with any general algorithm remembering PTE state.
>
>> they are fair points, but they are more of a criticism of my sysfs
>> handling. I've cleaned up the max_ptes_none interactions, and now that
>> we don't plan to initially support swap/shared, both these problems are
>> 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not rendered futile by its own
>>> overhead. From a high-level mathematical PoV, you are saving
>>> iterations. Any mathematical analysis has the underlying assumption that
>>> every iteration is equal. But the list [pte, pte + 1, ..., pte + (1 <<
>>> order)] is virtually and physically contiguous in memory, so prefetching
>>> helps us. You are trying to save on PTE memory references, but then look
>>> at the number of bitmap memory references you have created, not to
>>> mention that you are doing a (costly?) division operation in there; you
>>> have a while loop, a stack, new structs, and if conditions. I do not see
>>> how this is any faster than a naive linear scan.
>>
>> Yeah, it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>
>>>>> This prevents needing to hold the read lock for longer, and prevents
>>>>> needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you mean
>>> to say is that I need to reacquire the lock, and this is by design, to
>> Yes, sorry.
>>> ensure correctness, which boils down to what I wrote above.
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight into potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
>>>
>>
>
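The per-range check Dev describes above (at least one filled PTE, no
shared or swap entries) could be sketched as follows. The helper name is
hypothetical, and folio_likely_mapped_shared() is a heuristic rather
than an exact sharing test:

/* Sketch under assumed names; not the series' actual code. */
static bool mthp_range_suitable(struct vm_area_struct *vma, pte_t *pte,
                                unsigned long addr, int nr_ptes)
{
        int i, filled = 0;

        for (i = 0; i < nr_ptes; i++, addr += PAGE_SIZE) {
                pte_t entry = ptep_get(&pte[i]);
                struct page *page;

                if (pte_none(entry))
                        continue;
                if (!pte_present(entry))
                        return false;   /* swap entry: no mTHP collapse */
                page = vm_normal_page(vma, addr, entry);
                if (page && folio_likely_mapped_shared(page_folio(page)))
                        return false;   /* shared: no mTHP collapse */
                filled++;
        }
        return filled > 0;              /* need at least one filled PTE */
}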