introduce PGTY_mgt_entry page_type

[RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by Huan Yang 2 months, 1 week ago

Summary
==
This patchset reuses page_type to store migrate entry count during the
period from migrate entry setup to removal, enabling accelerated VMA
traversal when removing migrate entries, following a similar principle to
early termination when folio is unmapped in try_to_migrate.

In my self-constructed test scenario, the migration time can be reduced
from over 150 ms to around 30 ms, a reduction of roughly 70%.
Additionally, the flame graph shows that the share of
remove_migration_ptes can be reduced from over 80% to over 60%.

Note: "migrate entry" here refers specifically to a migrate PTE entry, as
large folios support neither page_type nor this reuse of the zero mapcount.

Principle
==
When try_to_migrate removes all PTEs mapping a page and sets up migrate PTE
entries, we can determine whether the traversal of the remaining VMAs can be
terminated early by checking whether the mapcount is zero. This optimization
helps improve performance during migration.

However, when removing migrate PTE entries and setting up PTEs for the
destination folio in remove_migration_ptes, there is no such information
available to assist in deciding whether the traversal of remaining VMAs
can be ended early. Therefore, it is necessary to traverse all VMAs
associated with this folio.

In reality, when a folio is fully unmapped and before all migrate PTE
entries are removed, the mapcount will always be zero. Since page_type
and mapcount share a union, and referring to folio_mapcount, we can
reuse page_type to record the number of migrate PTE entries of the
current folio in the system as long as it's not a large folio. This
reuse does not affect calls to folio_mapcount, which will always return
zero.

Therefore, we can set the folio's page_type to PGTY_mgt_entry when
try_to_migrate completes, the folio is already unmapped, and it's not a
large folio. The remaining 24 bits can then be used to record the number
of migrate PTE entries generated by try_to_migrate.
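
The layout described above can be sketched in userspace C. This is only an
illustration under stated assumptions: the 8-bit type / 24-bit count split
follows the text, but the macro names, the helper names and the type value
0xf9 are hypothetical and are not taken from the patches or from
include/linux/page-flags.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: top 8 bits hold the type, low 24 bits the count. */
#define MGT_TYPE_SHIFT  24
#define MGT_COUNT_MASK  ((UINT32_C(1) << MGT_TYPE_SHIFT) - 1)
#define MGT_TYPE_ENTRY  UINT32_C(0xf9)  /* made-up type value */

/* Build a page_type word that marks the folio and stores the entry count. */
static inline uint32_t mgt_pack(uint32_t nr_entries)
{
    assert(nr_entries <= MGT_COUNT_MASK);  /* count must fit in 24 bits */
    return (MGT_TYPE_ENTRY << MGT_TYPE_SHIFT) | nr_entries;
}

/* Does this page_type word carry the (hypothetical) migrate entry type? */
static inline int mgt_is_entry(uint32_t page_type)
{
    return (page_type >> MGT_TYPE_SHIFT) == MGT_TYPE_ENTRY;
}

/* Extract the migrate entry count from the low 24 bits. */
static inline uint32_t mgt_count(uint32_t page_type)
{
    return page_type & MGT_COUNT_MASK;
}
```

With 24 bits the count saturates at 16777215 entries, so a real
implementation would need a policy for overflow; the sketch simply asserts.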

Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
zero, we can terminate the VMA traversal early.

It's important to note that we need to initialize the folio's page_type
to PGTY_mgt_entry and set the migrate entry count only while holding the
rmap walk lock. This is because, while the lock is held, we can prevent
new VMA fork (which would increase migrate entries) and VMA unmap
(which would decrease migrate entries).

However, I suspect there is actually an additional critical section here, for
example with anon folios:

Process Parent                          fork
try_to_migrate
                                        anon_vma_clone
                                            write_lock
                                                avc_insert_tree tail
                                        ....
    folio_lock_anon_vma_read             copy_pte_range
        vma_iter                            pte_lock
                ....                           pte_present copy
                                            ...
                pte_lock
                    new forked pte clean
....
remove_migration_ptes
    rmap_walk_anon_lock

If my understanding is correct and such a critical section exists, it
shouldn't cause any issues—newly added PTEs can still be properly
removed and converted into migrate entries.

But in this:

Process Parent                          fork
try_to_migrate
                                        anon_vma_clone
                                            write_lock
                                                avc_insert_tree
                                        ....
    folio_lock_anon_vma_read             copy_pte_range
        vma_iter
                pte_lock
                    migrate entry set
                ....                        pte_lock
                                                pte_nonpresent copy
                                            ....
....
remove_migration_ptes
    rmap_walk_anon_lock

If the parent process first acquires the pte_lock to set a migrate
entry, the child process will then directly copy the non-present migrate
entry, resulting in an increase in migrate entries. However, since the
newly added VMA is positioned later in the rb tree of the folio's
anon_vma, when we traverse to this child-process-added migrate entry,
the count of migrate entries will still be correctly recorded, and this
will not cause any issues.

If I misunderstand, please correct me. :)

After a folio exits try_to_migrate and before remove_migration_ptes
acquires the rmap lock, the system can perform normal fork and unmap
operations. Therefore, we need to increment or decrement the migrate
entry count recorded in the folio (if it's of type PGTY_mgt_entry) when
handling copy/zap_nonpresent_pte.
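
As a rough illustration of that bookkeeping, the sketch below models the
count as a standalone counter that is adjusted when a migrate entry is
copied or zapped. The struct, the function names and the use of C11 atomics
are all assumptions made for this example; the cover letter does not say
how the updates are serialized.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-folio migrate entry count. */
struct mgt_folio {
    atomic_uint nr_mgt_entry;
};

/* fork copied a non-present migrate entry: one more entry to remove later. */
static void mgt_entry_copied(struct mgt_folio *f)
{
    atomic_fetch_add(&f->nr_mgt_entry, 1);
}

/* unmap zapped a migrate entry: one fewer entry to remove later. */
static void mgt_entry_zapped(struct mgt_folio *f)
{
    unsigned int old = atomic_fetch_sub(&f->nr_mgt_entry, 1);

    assert(old > 0);  /* zapping more entries than exist would be a bug */
}
```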

When performing remove_migration_ptes during migration to start removing
migrate entries, we need to dynamically decrement the recorded migrate
entry count. Once this count reaches zero, it indicates there are no
remaining migrate entries in the associated VMAs that need to be cleared
and replaced with the destination PFN. This allows us to safely
terminate the VMA traversal early.
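
The early termination can be modelled with a toy walk over an array of
VMAs. Everything here (struct toy_vma, toy_remove_migration_ptes) is a
hypothetical userspace model, not the rmap walk code: it only shows that
the loop may stop as soon as the recorded count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a VMA either holds a migrate entry for the folio or not. */
struct toy_vma {
    bool has_migrate_entry;
};

/*
 * Walk the VMAs, restoring entries, and stop as soon as no entries remain.
 * Returns the number of VMAs actually visited.
 */
static size_t toy_remove_migration_ptes(struct toy_vma *vmas, size_t nr_vmas,
                                        unsigned int nr_mgt_entry)
{
    size_t visited = 0;

    for (size_t i = 0; i < nr_vmas && nr_mgt_entry > 0; i++) {
        visited++;
        if (vmas[i].has_migrate_entry) {
            vmas[i].has_migrate_entry = false;  /* replaced by a present PTE */
            nr_mgt_entry--;
        }
    }
    assert(nr_mgt_entry == 0);  /* the count must match the entries found */
    return visited;
}
```

With 1000 VMAs of which only the first and third hold an entry, the walk
visits three VMAs instead of all 1000.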

However, it's important to note that if issues occur during migration
requiring an undo operation, PGTY_mgt_entry can no longer be used. This
is because the dst needs to be set back to the src, and the presence of
PGTY_mgt_entry would interfere with the normal usage of mapcount when
setting up the rmap info.

Test
==
I set up a 2-node test environment using QEMU, and used mbind to trigger
page migration between nodes for the specified VMA.

The core idea of the test scenario is to create a situation where the
number of VMAs that need to be iterated in the anon_vma is significantly
larger than the folio's mapcount.

To achieve this, I constructed an exaggerated scenario: the parent
process allocates 5MB of memory and binds it to node0, then immediately
forks 1000 child processes. Each child process immediately memsets all
of this memory so that every page is COW-ed. Afterwards, the parent process
calls mbind to migrate the memory from node0 to node1, while recording
the time consumed during this period.
Additionally, perf is used to capture a flame graph during the mbind
execution.

The time cost results are as follows:
    Patch1-9               Normal(f817b6d)
      18ms                    197ms
      58ms                    152ms
      40ms                    120ms

The hot paths shown in the flame graph:
    Patch1-9
      move_to_new_folio        38.89%
      remove_migration_ptes    61.11%
      ---------------------
      move_to_new_folio        32.76%
      remove_migration_ptes    67.24%
      ---------------------
      move_to_new_folio        37.50%
      remove_migration_ptes    62.50%

    Normal(f817b6d)
      move_to_new_folio        11.43%
      remove_migration_ptes    87.43%
      ---------------------
      move_to_new_folio        13.91%
      remove_migration_ptes    86.09%
      ---------------------
      move_to_new_folio        12.50%
      remove_migration_ptes    85.83%

It is easy to see that the time cost is reduced by approximately 75.3%
(mean of the three runs), and the share of the remove_migration_ptes
path has decreased by roughly 20 percentage points.
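
For reference, the 75.3% figure can be reproduced from the three timings
above by comparing the means. The helper names below are made up for this
check and are not part of the test program.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative reduction of the mean, in percent. */
static double reduction_pct(const double *before, const double *after, size_t n)
{
    return (1.0 - mean(after, n) / mean(before, n)) * 100.0;
}
```

Feeding it {197, 152, 120} as "before" and {18, 58, 40} as "after" gives
1 - 116/469, i.e. about 75.3%.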

Simplified test code:

```c
#include <numaif.h>     /* mbind(), MPOL_*; link with -lnuma */
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define size (5 << 20)
#define CHILD_COUNT 1000

int main(void)
{
    int *buffer = (int *)mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    unsigned long mask = 1UL << 0;
    mbind(buffer, size, MPOL_BIND, &mask, 2, 0);
    // fault in every page on node 0
    memset(buffer, 0, size);

    // fork the children
    pid_t children[CHILD_COUNT];
    for (int i = 0; i < CHILD_COUNT; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            // each child writes the whole buffer, so every page is COW-ed
            memset(buffer, 0, size);
            sleep(100000);
            _exit(0); // never fall through into the fork loop
        } else {
            children[i] = pid;
        }
    }

    // you may need to sleep here until the children have finished COW
    sleep(10);

    // attach perf here
    mask = 1UL << 1;
    // migrate this buffer from node 0 -> node 1
    mbind(buffer, size, MPOL_BIND, &mask, 4, MPOL_MF_MOVE);

    return 0;
}
```
Note: this code omits most error checks, resource cleanup, time
recording, and so on.

Why RFC
==
Memory migration is one of the most general-purpose modules.
My own tests cannot cover all system scenarios, and there
may be omissions or misunderstandings in the code modifications.

If good enough, I will send the formal patch.

Patches 1-7 do some code cleanup.
Patch 8 prepares for PGTY_mgt_entry.
Patch 9 applies it.

Huan Yang (9):
  mm: introduce PAGE_TYPE_SHIFT
  mm: add page_type value helper
  mm/rmap: simplify rmap_walk invoke
  mm/rmap: add args in rmap_walk_control done hook
  mm/rmap: introduce exit hook
  mm/rmap: introduce migrate_walk_arg
  mm/migrate: rename rmap_walk_arg folio
  mm/migrate: infrastructure for migrate entry page_type.
  mm/migrate: apply migrate entry page_type

 include/linux/page-flags.h | 106 +++++++++++++++++++++++++++++++++++--
 include/linux/rmap.h       |   7 ++-
 mm/ksm.c                   |   2 +-
 mm/memory.c                |   2 +
 mm/migrate.c               |  38 ++++++++++---
 mm/rmap.c                  |  85 ++++++++++++++++++-----------
 6 files changed, 193 insertions(+), 47 deletions(-)

--
2.34.1

Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by David Hildenbrand 2 months, 1 week ago

On 24.07.25 10:44, Huan Yang wrote:
> Summary
> ==
> This patchset reuses page_type to store migrate entry count during the
> period from migrate entry setup to removal, enabling accelerated VMA
> traversal when removing migrate entries, following a similar principle to
> early termination when folio is unmapped in try_to_migrate.

I absolutely detest (ab)using page types for that, so no from my side 
unless I am missing something important.

> 
> In my self-constructed test scenario, the migration time can be reduced

How relevant is that in practice?

> from over 150+ms to around 30+ms, achieving nearly a 70% performance
> improvement. Additionally, the flame graph shows that the proportion of
> remove_migration_ptes can be reduced from 80%+ to 60%+.
> 
> Notice: migrate entry specifically refers to migrate PTE entry, as large
> folio are not supported page type and 0 mapcount reuse.
> 
> Principle
> ==
> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
> entry, we can determine whether the traversal of remaining VMAs can be
> terminated early by checking if mapcount is zero. This optimization
> helps improve performance during migration.
> 
> However, when removing migrate PTE entries and setting up PTEs for the
> destination folio in remove_migration_ptes, there is no such information
> available to assist in deciding whether the traversal of remaining VMAs
> can be ended early. Therefore, it is necessary to traversal all VMAs
> associated with this folio.

Yes, we don't know how many migration entries are still pointing at the 
page.

> 
> In reality, when a folio is fully unmapped and before all migrate PTE
> entries are removed, the mapcount will always be zero. Since page_type
> and mapcount share a union, and referring to folio_mapcount, we can
> reuse page_type to record the number of migrate PTE entries of the
> current folio in the system as long as it's not a large folio. This
> reuse does not affect calls to folio_mapcount, which will always return
> zero.
> 
> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
> try_to_migrate completes, the folio is already unmapped, and it's not a
> large folio. The remaining 24 bits can then be used to record the number
> of migrate PTE entries generated by try_to_migrate.

In the future the page type will no longer overlay the mapcount and, 
consequently, be sticky.

> 
> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
> zero, we can terminate the VMA traversal early.
> 
> It's important to note that we need to initialize the folio's page_type
> to PGTY_mgt_entry and set the migrate entry count only while holding the
> rmap walk lock.This is because during the lock period, we can prevent
> new VMA fork (which would increase migrate entries) and VMA unmap
> (which would decrease migrate entries).

The more I read about PGTY_mgt_entry, the more I hate it.

> 
> However, I doubt there is actually an additional critical section here, for
> example anon:
> 
> Process Parent                          fork
> try_to_migrate
>                                          anon_vma_clone
>                                              write_lock
>                                                  avc_inster_tree tail
>                                          ....
>      folio_lock_anon_vma_read             copy_pte_range
>          vma_iter                            pte_lock
>                  ....                           pte_present copy
>                                              ...
>                  pte_lock
>                      new forked pte clean
> ....
> remove_migration_ptes
>      rmap_walk_anon_lock
> 
> If my understanding is correct and such a critical section exists, it
> shouldn't cause any issues—newly added PTEs can still be properly
> removed and converted into migrate entries.
> 
> But in this:
> 
> Process Parent                          fork
> try_to_migrate
>                                          anon_vma_clone
>                                              write_lock
>                                                  avc_inster_tree
>                                          ....
>      folio_lock_anon_vma_read             copy_pte_range
>          vma_iter
>                  pte_lock
>                      migrate entry set
>                  ....                        pte_lock
>                                                  pte_nonpresent copy
>                                              ....
> ....
> remove_migration_ptes
>      rmap_walk_anon_lock

Just a note: migration entries also apply to non-anon folios.

-- 
Cheers,

David / dhildenb

Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by Lorenzo Stoakes 2 months, 1 week ago

NAK. This series is completely un-upstreamable in any form.

David has responded to you already, but to underline.

The lesson here is that you really ought to discuss things with people in
the subsystem you are changing in advance of spending a lot of time doing
work like this which you intend to upstream.

On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
> Summary
> ==
> This patchset reuses page_type to store migrate entry count during the
> period from migrate entry setup to removal, enabling accelerated VMA
> traversal when removing migrate entries, following a similar principle to
> early termination when folio is unmapped in try_to_migrate.
>
> In my self-constructed test scenario, the migration time can be reduced
> from over 150+ms to around 30+ms, achieving nearly a 70% performance
> improvement. Additionally, the flame graph shows that the proportion of
> remove_migration_ptes can be reduced from 80%+ to 60%+.

This sounds completely contrived. I don't even know if you have a use case
here.

>
> Notice: migrate entry specifically refers to migrate PTE entry, as large
> folio are not supported page type and 0 mapcount reuse.
>
> Principle
> ==
> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
> entry, we can determine whether the traversal of remaining VMAs can be
> terminated early by checking if mapcount is zero. This optimization
> helps improve performance during migration.
>
> However, when removing migrate PTE entries and setting up PTEs for the
> destination folio in remove_migration_ptes, there is no such information
> available to assist in deciding whether the traversal of remaining VMAs
> can be ended early. Therefore, it is necessary to traversal all VMAs
> associated with this folio.
>
> In reality, when a folio is fully unmapped and before all migrate PTE
> entries are removed, the mapcount will always be zero. Since page_type
> and mapcount share a union, and referring to folio_mapcount, we can
> reuse page_type to record the number of migrate PTE entries of the
> current folio in the system as long as it's not a large folio. This
> reuse does not affect calls to folio_mapcount, which will always return
> zero.

OK so - if you ever find yourself thinking this way, please stop. We are in
the midst of fundamentally changing how folios and pages work.

There is absolutely ZERO room for reusing arbitrary fields in this way. Any
series that attempts to do this will be rejected.

Again, I must say - if you had raised this ahead of time we could have
saved you some effort.

>
> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
> try_to_migrate completes, the folio is already unmapped, and it's not a
> large folio. The remaining 24 bits can then be used to record the number
> of migrate PTE entries generated by try_to_migrate.

I mean there's so much wrong here. The future is large folios. Making some
fundamental change that relies on not-large folio is a mistake. 24
bits... I mean no.

>
> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
> zero, we can terminate the VMA traversal early.
>
> It's important to note that we need to initialize the folio's page_type
> to PGTY_mgt_entry and set the migrate entry count only while holding the
> rmap walk lock.This is because during the lock period, we can prevent
> new VMA fork (which would increase migrate entries) and VMA unmap
> (which would decrease migrate entries).

No, no no. NO.

You are not introducing new locking complexity for this.

I could go on, but there's no point.

This series is not upstreamable, NAK.

Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by Huan Yang 2 months, 1 week ago

On 2025/7/24 17:15, Lorenzo Stoakes wrote:
> NAK. This series is completely un-upstreamable in any form.
>
> David has responded to you already, but to underline.
>
> The lesson here is that you really ought to discuss things with people in
> the subsystem you are changing in advance of spending a lot of time doing
> work like this which you intend to upstream.

Yes, this is a very useful lesson. :)

In the future, when I have ideas in this area, I will bring them up for
discussion first, especially when they involve folios or pages.

>
> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>> Summary
>> ==
>> This patchset reuses page_type to store migrate entry count during the
>> period from migrate entry setup to removal, enabling accelerated VMA
>> traversal when removing migrate entries, following a similar principle to
>> early termination when folio is unmapped in try_to_migrate.
>>
>> In my self-constructed test scenario, the migration time can be reduced
>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>> improvement. Additionally, the flame graph shows that the proportion of
>> remove_migration_ptes can be reduced from 80%+ to 60%+.
> This sounds completely contrived. I don't even know if you have a use case
> here.

The test case I provided does have an amplified effect, but the 
optimization it demonstrates is real. It's just that when scaled up to 
the system level, the effect becomes difficult to observe.

>
>> Notice: migrate entry specifically refers to migrate PTE entry, as large
>> folio are not supported page type and 0 mapcount reuse.
>>
>> Principle
>> ==
>> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
>> entry, we can determine whether the traversal of remaining VMAs can be
>> terminated early by checking if mapcount is zero. This optimization
>> helps improve performance during migration.
>>
>> However, when removing migrate PTE entries and setting up PTEs for the
>> destination folio in remove_migration_ptes, there is no such information
>> available to assist in deciding whether the traversal of remaining VMAs
>> can be ended early. Therefore, it is necessary to traversal all VMAs
>> associated with this folio.
>>
>> In reality, when a folio is fully unmapped and before all migrate PTE
>> entries are removed, the mapcount will always be zero. Since page_type
>> and mapcount share a union, and referring to folio_mapcount, we can
>> reuse page_type to record the number of migrate PTE entries of the
>> current folio in the system as long as it's not a large folio. This
>> reuse does not affect calls to folio_mapcount, which will always return
>> zero.
> OK so - if you ever find yourself thinking this way, please stop. We are in
> the midst of fundamentally changing how folios and pages work.
>
> There is absolutely ZERO room for reusing arbitrary fields in this way. Any
> series that attempts to do this will be rejected.
>
> Again, I must say - if you had raised this ahead of time we could have
> saved you some effort.
>
>> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
>> try_to_migrate completes, the folio is already unmapped, and it's not a
>> large folio. The remaining 24 bits can then be used to record the number
>> of migrate PTE entries generated by try_to_migrate.
> I mean there's so much wrong here. The future is large folios. Making some
> fundamental change that relies on not-large folio is a mistake. 24
> bits... I mean no.
Thanks, I understand it.
>
>> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
>> zero, we can terminate the VMA traversal early.
>>
>> It's important to note that we need to initialize the folio's page_type
>> to PGTY_mgt_entry and set the migrate entry count only while holding the
>> rmap walk lock.This is because during the lock period, we can prevent
>> new VMA fork (which would increase migrate entries) and VMA unmap
>> (which would decrease migrate entries).
> No, no no. NO.
>
> You are not introducing new locking complexity for this.
>
> I could go on, but there's no point.
>
> This series is not upstreamable, NAK.
>

Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by Huang, Ying 2 months, 1 week ago

Huan Yang <link@vivo.com> writes:

> On 2025/7/24 17:15, Lorenzo Stoakes wrote:

[snip]

>> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>>> Summary
>>> ==
>>> This patchset reuses page_type to store migrate entry count during the
>>> period from migrate entry setup to removal, enabling accelerated VMA
>>> traversal when removing migrate entries, following a similar principle to
>>> early termination when folio is unmapped in try_to_migrate.
>>>
>>> In my self-constructed test scenario, the migration time can be reduced
>>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>>> improvement. Additionally, the flame graph shows that the proportion of
>>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>> This sounds completely contrived. I don't even know if you have a use case
>> here.
>
> The test case I provided does have an amplified effect, but the
> optimization it demonstrates is real. It's just that when scaled up to
> the system level, the effect becomes difficult to observe.
>

It's more important to sell your problems than selling your code :-)

If you cannot prove that the optimization has some practical effect,
it's hard to persuade others for increased complexity.

---
Best Regards,
Huang, Ying

Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by Huan Yang 2 months, 1 week ago

On 2025/7/25 09:37, Huang, Ying wrote:
> Huan Yang <link@vivo.com> writes:
>
>> On 2025/7/24 17:15, Lorenzo Stoakes wrote:
> [snip]
>
>>> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>>>> Summary
>>>> ==
>>>> This patchset reuses page_type to store migrate entry count during the
>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>> traversal when removing migrate entries, following a similar principle to
>>>> early termination when folio is unmapped in try_to_migrate.
>>>>
>>>> In my self-constructed test scenario, the migration time can be reduced
>>>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>>>> improvement. Additionally, the flame graph shows that the proportion of
>>>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>>> This sounds completely contrived. I don't even know if you have a use case
>>> here.
>> The test case I provided does have an amplified effect, but the
>> optimization it demonstrates is real. It's just that when scaled up to
>> the system level, the effect becomes difficult to observe.
>>
> It's more important to sell your problems than selling your code :-)
I'll remember it. Thanks. :)
>
> If you cannot prove that the optimization has some practical effect,
> it's hard to persuade others for increased complexity.

To be honest, this patch stems from an issue I noticed during code review.

When this patchset was completed, I did put in some effort to find its
benefits, and it was only under such an exaggeratedly constructed test
scenario that the effect could be demonstrated. :(

The actual problem I'm facing has been described in other replies.

It's actually about some anonymous pages and fully COW-ed pages whose
avcs haven't been removed from the anon_vma's RB tree, resulting in
inefficient traversal.

Lorenzo has mentioned that he has some bold ideas regarding this; let's
look forward to them. :)

Thanks.

>
> ---
> Best Regards,
> Huang, Ying

Re: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Posted by David Hildenbrand 2 months, 1 week ago

>> If you cannot prove that the optimization has some practical effect,
>> it's hard to persuade others for increased complexity.
> 
> To be honest, this patch stems from an issue I noticed during code review.
> 
> When this patchset was completed, I did put in some effort to find its
> benefits, and it was only
> 
> under such an exaggeratedly constructed test scenario that the effect
> could be demonstrated. :(

I mean, thanks for looking into that and trying to find a way to improve 
it. :)


That VMA walk is the real problem, stopping earlier is just an 
optimization that works in some cases. I guess on average it will 
improve things, although probably really hard to quantify in reality.

I think tracking the #migration entries might be a very good debugging tool.

A cleaner and more reliable solution for what you tried to implement
would be to

(a) track it in a separate counter, at the time we establish/remove a
migration entry, not once the mapcount is already 0. With "struct folio"
getting allocated separately in the future this could maybe be feasible
(and putting it under a config knob).

(b) do it for large folios as well

(b) might be tricky with migration entries being used for THP splits,
but probably it could be special-cased somehow, I am sure.


-- 
Cheers,

David / dhildenb