Summary
==
This patchset reuses page_type to store the migrate entry count during the
period from migrate entry setup to removal. This enables accelerated VMA
traversal when removing migrate entries, following a principle similar to
the early termination in try_to_migrate once the folio is unmapped.

In my self-constructed test scenario, the migration time is reduced from
over 150 ms to around 30 ms, a reduction of roughly 70%. Additionally, the
flame graph shows that the proportion of remove_migration_ptes drops from
above 80% to above 60%.

Notice: "migrate entry" here specifically refers to a migrate PTE entry,
because large folios do not support reusing page_type while the mapcount
is 0.

Principle
==
When try_to_migrate removes all PTEs of a page and sets up migrate PTE
entries, it can decide whether the traversal of the remaining VMAs can be
terminated early by checking whether the mapcount is zero. This
optimization improves performance during migration.

However, when remove_migration_ptes removes the migrate PTE entries and
sets up PTEs for the destination folio, no such information is available
to decide whether the traversal of the remaining VMAs can end early.
Therefore, it has to traverse all VMAs associated with this folio.

In reality, from the point a folio is fully unmapped until all migrate PTE
entries are removed, the mapcount is always zero. Since page_type and
mapcount share a union, and given how folio_mapcount handles this case, we
can reuse page_type to record the number of migrate PTE entries of the
current folio in the system, as long as it is not a large folio. This
reuse does not affect calls to folio_mapcount, which will still return
zero.

Therefore, we can set the folio's page_type to PGTY_mgt_entry when
try_to_migrate completes, the folio is already unmapped, and it is not a
large folio. The remaining 24 bits can then be used to record the number
of migrate PTE entries generated by try_to_migrate.

Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
zero, we can terminate the VMA traversal early.

It's important to note that we need to initialize the folio's page_type
to PGTY_mgt_entry and set the migrate entry count only while holding the
rmap walk lock. This is because holding the lock prevents new VMA forks
(which would increase migrate entries) and VMA unmaps (which would
decrease migrate entries).

However, I suspect there is actually an additional critical section here,
for example with anon:

    Process Parent                   fork
    try_to_migrate
                                     anon_vma_clone
                                       write_lock
                                       avc_insert_tree tail
                                     ....
    folio_lock_anon_vma_read         copy_pte_range
      vma_iter                         pte_lock
      ....                             pte_present copy
                                     ...
      pte_lock
        new forked pte clean
    ....
    remove_migration_ptes
      rmap_walk_anon_lock

If my understanding is correct and such a critical section exists, it
shouldn't cause any issues: newly added PTEs can still be properly
removed and converted into migrate entries.

But in this case:

    Process Parent                   fork
    try_to_migrate
                                     anon_vma_clone
                                       write_lock
                                       avc_insert_tree
                                     ....
    folio_lock_anon_vma_read         copy_pte_range
      vma_iter
      pte_lock
        migrate entry set
      ....                             pte_lock
                                       pte_nonpresent copy
                                     ....
    ....
    remove_migration_ptes
      rmap_walk_anon_lock

If the parent process first acquires the pte_lock to set a migrate entry,
the child process will then directly copy the non-present migrate entry,
resulting in an increase in migrate entries.

However, since the newly added VMA is positioned later in the rb tree of
the folio's anon_vma, the walk will still reach this migrate entry added
by the child process, so the count of migrate entries is still recorded
correctly and this does not cause any issues.

If I misunderstand, please correct me. :)

After a folio exits try_to_migrate and before remove_migration_ptes
acquires the rmap lock, the system can perform normal fork and unmap
operations.
Therefore, we need to increment or decrement the migrate entry count
recorded in the folio (if it is of type PGTY_mgt_entry) when handling
copy/zap_nonpresent_pte.

When remove_migration_ptes starts removing migrate entries during
migration, we need to dynamically decrement the recorded migrate entry
count. Once this count reaches zero, there are no remaining migrate
entries in the associated VMAs that need to be cleared and replaced with
the destination PFN. This allows us to safely terminate the VMA traversal
early.

However, it's important to note that if a problem during migration
requires an undo operation, PGTY_mgt_entry can no longer be used. This is
because the dst needs to be set back to the src, and the presence of
PGTY_mgt_entry would interfere with the normal usage of mapcount when
setting up the rmap info.

Test
==
I set up a 2-node test environment using QEMU, and used mbind to trigger
page migration between nodes for the specified VMA.

The core idea of the test scenario is to create a situation where the
number of VMAs that need to be iterated in the anon_vma is significantly
larger than the folio's mapcount. To achieve this, I constructed an
exaggerated scenario: the parent process allocates 5MB of memory and
binds it to node0, then immediately forks 1000 child processes. Each
child process runs and immediately memsets all of this memory, so that
every page is COW-ed. Afterwards, the parent process calls mbind to
migrate the memory from node0 to node1, while recording the time consumed
during this period. Additionally, perf is used to capture a flame graph
during the mbind execution.

The time cost results are as follows:

    Patch1-9    Normal(f817b6d)
    18ms        197ms
    58ms        152ms
    40ms        120ms

The hot path shown in the flame graph:

    Patch1-9
      move_to_new_folio       38.89%
      remove_migration_ptes   61.11%
      ---------------------
      move_to_new_folio       32.76%
      remove_migration_ptes   67.24%
      ---------------------
      move_to_new_folio       37.50%
      remove_migration_ptes   62.50%

    Normal(f817b6d)
      move_to_new_folio       11.43%
      remove_migration_ptes   87.43%
      ---------------------
      move_to_new_folio       13.91%
      remove_migration_ptes   86.09%
      ---------------------
      move_to_new_folio       12.50%
      remove_migration_ptes   85.83%

It is easy to see that the time cost is reduced by approximately 75.3% on
average (a mean of 38.7 ms versus 156.3 ms over the three runs), and that
the proportion of the remove_migration_ptes path decreases by roughly 20
percentage points.

Simplified test code:

```c
#include <numaif.h>	/* mbind(), MPOL_*; link with -lnuma */
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define size (5 << 20)
#define CHILD_COUNT 1000

int main(void)
{
	int *buffer = (int *)mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	unsigned long mask = 1UL << 0;
	mbind(buffer, size, MPOL_BIND, &mask, 2, 0);
	// fault in all pages on node 0
	memset(buffer, 0, size);

	// fork the children
	pid_t children[CHILD_COUNT];
	for (int i = 0; i < CHILD_COUNT; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			// COW every page in the child
			memset(buffer, 0, size);
			sleep(100000);
			_exit(0);	// never fall back into the fork loop
		} else {
			children[i] = pid;
		}
	}

	// you may need to sleep until all children have finished their COW
	sleep(10);

	// you can attach perf here
	mask = 1UL << 1;
	// migrate this buffer from node 0 -> node 1
	mbind(buffer, size, MPOL_BIND, &mask, 4, MPOL_MF_MOVE);
	return 0;
}
```

Notice: this code omits many error checks, the resource cleanup, the time
recording, and so on.

Why RFC
==
Memory migration is one of the most general-purpose modules. My own tests
cannot cover all system scenarios, and there may be omissions or
misunderstandings in the code modifications.

If this is good enough, I will send the formal patch.

Patches 1-7 do some code cleanup.
Patch 8 prepares the infrastructure for PGTY_mgt_entry.
Patch 9 applies it.

Huan Yang (9):
  mm: introduce PAGE_TYPE_SHIFT
  mm: add page_type value helper
  mm/rmap: simplify rmap_walk invoke
  mm/rmap: add args in rmap_walk_control done hook
  mm/rmap: introduce exit hook
  mm/rmap: introduce migrate_walk_arg
  mm/migrate: rename rmap_walk_arg folio
  mm/migrate: infrastructure for migrate entry page_type.
  mm/migrate: apply migrate entry page_type

 include/linux/page-flags.h | 106 +++++++++++++++++++++++++++++++++++--
 include/linux/rmap.h       |   7 ++-
 mm/ksm.c                   |   2 +-
 mm/memory.c                |   2 +
 mm/migrate.c               |  38 ++++++++++---
 mm/rmap.c                  |  85 ++++++++++++++++++-----------
 6 files changed, 193 insertions(+), 47 deletions(-)

-- 
2.34.1

On 24.07.25 10:44, Huan Yang wrote:
> Summary
> ==
> This patchset reuses page_type to store migrate entry count during the
> period from migrate entry setup to removal, enabling accelerated VMA
> traversal when removing migrate entries, following a similar principle to
> early termination when folio is unmapped in try_to_migrate.

I absolutely detest (ab)using page types for that, so no from my side
unless I am missing something important.

>
> In my self-constructed test scenario, the migration time can be reduced

How relevant is that in practice?

> from over 150+ms to around 30+ms, achieving nearly a 70% performance
> improvement. Additionally, the flame graph shows that the proportion of
> remove_migration_ptes can be reduced from 80%+ to 60%+.
>
> Notice: migrate entry specifically refers to migrate PTE entry, as large
> folio are not supported page type and 0 mapcount reuse.
>
> Principle
> ==
> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
> entry, we can determine whether the traversal of remaining VMAs can be
> terminated early by checking if mapcount is zero. This optimization
> helps improve performance during migration.
>
> However, when removing migrate PTE entries and setting up PTEs for the
> destination folio in remove_migration_ptes, there is no such information
> available to assist in deciding whether the traversal of remaining VMAs
> can be ended early. Therefore, it is necessary to traversal all VMAs
> associated with this folio.

Yes, we don't know how many migration entries are still pointing at the
page.

>
> In reality, when a folio is fully unmapped and before all migrate PTE
> entries are removed, the mapcount will always be zero. Since page_type
> and mapcount share a union, and referring to folio_mapcount, we can
> reuse page_type to record the number of migrate PTE entries of the
> current folio in the system as long as it's not a large folio. This
> reuse does not affect calls to folio_mapcount, which will always return
> zero.
>
> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
> try_to_migrate completes, the folio is already unmapped, and it's not a
> large folio. The remaining 24 bits can then be used to record the number
> of migrate PTE entries generated by try_to_migrate.

In the future the page type will no longer overlay the mapcount and,
consequently, be sticky.

>
> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
> zero, we can terminate the VMA traversal early.
>
> It's important to note that we need to initialize the folio's page_type
> to PGTY_mgt_entry and set the migrate entry count only while holding the
> rmap walk lock.This is because during the lock period, we can prevent
> new VMA fork (which would increase migrate entries) and VMA unmap
> (which would decrease migrate entries).

The more I read about PGTY_mgt_entry, the more I hate it.

>
> However, I doubt there is actually an additional critical section here, for
> example anon:
>
> Process Parent                   fork
> try_to_migrate
>                                  anon_vma_clone
>                                    write_lock
>                                    avc_inster_tree tail
>                                  ....
> folio_lock_anon_vma_read         copy_pte_range
>   vma_iter                         pte_lock
>   ....                             pte_present copy
>                                  ...
>   pte_lock
>     new forked pte clean
> ....
> remove_migration_ptes
>   rmap_walk_anon_lock
>
> If my understanding is correct and such a critical section exists, it
> shouldn't cause any issues - newly added PTEs can still be properly
> removed and converted into migrate entries.
>
> But in this:
>
> Process Parent                   fork
> try_to_migrate
>                                  anon_vma_clone
>                                    write_lock
>                                    avc_inster_tree
>                                  ....
> folio_lock_anon_vma_read         copy_pte_range
>   vma_iter
>   pte_lock
>     migrate entry set
>   ....                             pte_lock
>                                    pte_nonpresent copy
>                                  ....
> ....
> remove_migration_ptes
>   rmap_walk_anon_lock

Just a note: migration entries also apply to non-anon folios.

-- 
Cheers,

David / dhildenb

NAK. This series is completely un-upstreamable in any form.

David has responded to you already, but to underline.

The lesson here is that you really ought to discuss things with people in
the subsystem you are changing in advance of spending a lot of time doing
work like this which you intend to upstream.

On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
> Summary
> ==
> This patchset reuses page_type to store migrate entry count during the
> period from migrate entry setup to removal, enabling accelerated VMA
> traversal when removing migrate entries, following a similar principle to
> early termination when folio is unmapped in try_to_migrate.
>
> In my self-constructed test scenario, the migration time can be reduced
> from over 150+ms to around 30+ms, achieving nearly a 70% performance
> improvement. Additionally, the flame graph shows that the proportion of
> remove_migration_ptes can be reduced from 80%+ to 60%+.

This sounds completely contrived. I don't even know if you have a use case
here.

>
> Notice: migrate entry specifically refers to migrate PTE entry, as large
> folio are not supported page type and 0 mapcount reuse.
>
> Principle
> ==
> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
> entry, we can determine whether the traversal of remaining VMAs can be
> terminated early by checking if mapcount is zero. This optimization
> helps improve performance during migration.
>
> However, when removing migrate PTE entries and setting up PTEs for the
> destination folio in remove_migration_ptes, there is no such information
> available to assist in deciding whether the traversal of remaining VMAs
> can be ended early. Therefore, it is necessary to traversal all VMAs
> associated with this folio.
>
> In reality, when a folio is fully unmapped and before all migrate PTE
> entries are removed, the mapcount will always be zero. Since page_type
> and mapcount share a union, and referring to folio_mapcount, we can
> reuse page_type to record the number of migrate PTE entries of the
> current folio in the system as long as it's not a large folio. This
> reuse does not affect calls to folio_mapcount, which will always return
> zero.

OK so - if you ever find yourself thinking this way, please stop. We are in
the midst of fundamentally changing how folios and pages work.

There is absolutely ZERO room for reusing arbitrary fields in this way. Any
series that attempts to do this will be rejected.

Again, I must say - if you had raised this ahead of time we could have
saved you some effort.

>
> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
> try_to_migrate completes, the folio is already unmapped, and it's not a
> large folio. The remaining 24 bits can then be used to record the number
> of migrate PTE entries generated by try_to_migrate.

I mean there's so much wrong here. The future is large folios. Making some
fundamental change that relies on not-large folio is a mistake. 24
bits... I mean no.

>
> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
> zero, we can terminate the VMA traversal early.
>
> It's important to note that we need to initialize the folio's page_type
> to PGTY_mgt_entry and set the migrate entry count only while holding the
> rmap walk lock.This is because during the lock period, we can prevent
> new VMA fork (which would increase migrate entries) and VMA unmap
> (which would decrease migrate entries).

No, no no. NO.

You are not introducing new locking complexity for this.

I could go on, but there's no point.

This series is not upstreamable, NAK.

On 2025/7/24 17:15, Lorenzo Stoakes wrote:
> NAK. This series is completely un-upstreamable in any form.
>
> David has responded to you already, but to underline.
>
> The lesson here is that you really ought to discuss things with people in
> the subsystem you are changing in advance of spending a lot of time doing
> work like this which you intend to upstream.

Yes, this is a very useful lesson. :)

In the future, when I have ideas in this area, I will bring them up for
discussion first, especially when they involve folios or pages.

>
> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>> Summary
>> ==
>> This patchset reuses page_type to store migrate entry count during the
>> period from migrate entry setup to removal, enabling accelerated VMA
>> traversal when removing migrate entries, following a similar principle to
>> early termination when folio is unmapped in try_to_migrate.
>>
>> In my self-constructed test scenario, the migration time can be reduced
>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>> improvement. Additionally, the flame graph shows that the proportion of
>> remove_migration_ptes can be reduced from 80%+ to 60%+.
> This sounds completely contrived. I don't even know if you have a use case
> here.

The test case I provided does have an amplified effect, but the
optimization it demonstrates is real. It's just that when scaled up to
the system level, the effect becomes difficult to observe.

>
>> Notice: migrate entry specifically refers to migrate PTE entry, as large
>> folio are not supported page type and 0 mapcount reuse.
>>
>> Principle
>> ==
>> When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
>> entry, we can determine whether the traversal of remaining VMAs can be
>> terminated early by checking if mapcount is zero. This optimization
>> helps improve performance during migration.
>>
>> However, when removing migrate PTE entries and setting up PTEs for the
>> destination folio in remove_migration_ptes, there is no such information
>> available to assist in deciding whether the traversal of remaining VMAs
>> can be ended early. Therefore, it is necessary to traversal all VMAs
>> associated with this folio.
>>
>> In reality, when a folio is fully unmapped and before all migrate PTE
>> entries are removed, the mapcount will always be zero. Since page_type
>> and mapcount share a union, and referring to folio_mapcount, we can
>> reuse page_type to record the number of migrate PTE entries of the
>> current folio in the system as long as it's not a large folio. This
>> reuse does not affect calls to folio_mapcount, which will always return
>> zero.
> OK so - if you ever find yourself thinking this way, please stop. We are in
> the midst of fundamentally changing how folios and pages work.
>
> There is absolutely ZERO room for reusing arbitrary fields in this way. Any
> series that attempts to do this will be rejected.
>
> Again, I must say - if you had raised this ahead of time we could have
> saved you some effort.
>
>> Therefore, we can set the folio's page_type to PGTY_mgt_entry when
>> try_to_migrate completes, the folio is already unmapped, and it's not a
>> large folio. The remaining 24 bits can then be used to record the number
>> of migrate PTE entries generated by try_to_migrate.
> I mean there's so much wrong here. The future is large folios. Making some
> fundamental change that relies on not-large folio is a mistake. 24
> bits... I mean no.

Thanks, I understand.

>
>> Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
>> zero, we can terminate the VMA traversal early.
>>
>> It's important to note that we need to initialize the folio's page_type
>> to PGTY_mgt_entry and set the migrate entry count only while holding the
>> rmap walk lock.This is because during the lock period, we can prevent
>> new VMA fork (which would increase migrate entries) and VMA unmap
>> (which would decrease migrate entries).
> No, no no. NO.
>
> You are not introducing new locking complexity for this.
>
> I could go on, but there's no point.
>
> This series is not upstreamable, NAK.
>

Huan Yang <link@vivo.com> writes:

> On 2025/7/24 17:15, Lorenzo Stoakes wrote:

[snip]

>> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>>> Summary
>>> ==
>>> This patchset reuses page_type to store migrate entry count during the
>>> period from migrate entry setup to removal, enabling accelerated VMA
>>> traversal when removing migrate entries, following a similar principle to
>>> early termination when folio is unmapped in try_to_migrate.
>>>
>>> In my self-constructed test scenario, the migration time can be reduced
>>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>>> improvement. Additionally, the flame graph shows that the proportion of
>>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>> This sounds completely contrived. I don't even know if you have a use case
>> here.
>
> The test case I provided does have an amplified effect, but the
> optimization it demonstrates is real. It's just that when scaled up to
> the system level, the effect becomes difficult to observe.
>

It's more important to sell your problems than selling your code :-)

If you cannot prove that the optimization has some practical effect,
it's hard to persuade others to accept the increased complexity.

---
Best Regards,
Huang, Ying

On 2025/7/25 09:37, Huang, Ying wrote:
> Huan Yang <link@vivo.com> writes:
>
>> On 2025/7/24 17:15, Lorenzo Stoakes wrote:
> [snip]
>
>>> On Thu, Jul 24, 2025 at 04:44:28PM +0800, Huan Yang wrote:
>>>> Summary
>>>> ==
>>>> This patchset reuses page_type to store migrate entry count during the
>>>> period from migrate entry setup to removal, enabling accelerated VMA
>>>> traversal when removing migrate entries, following a similar principle to
>>>> early termination when folio is unmapped in try_to_migrate.
>>>>
>>>> In my self-constructed test scenario, the migration time can be reduced
>>>> from over 150+ms to around 30+ms, achieving nearly a 70% performance
>>>> improvement. Additionally, the flame graph shows that the proportion of
>>>> remove_migration_ptes can be reduced from 80%+ to 60%+.
>>> This sounds completely contrived. I don't even know if you have a use case
>>> here.
>> The test case I provided does have an amplified effect, but the
>> optimization it demonstrates is real. It's just that when scaled up to
>> the system level, the effect becomes difficult to observe.
>>
> It's more important to sell your problems than selling your code :-)

I'll remember it. Thanks. :)

>
> If you cannot prove that the optimization has some practical effect,
> it's hard to persuade others for increased complexity.

To be honest, this patch stems from an issue I noticed during code review.

When this patchset was completed, I did put some effort into finding its
benefits, and the effect could only be demonstrated under such an
exaggerated, constructed test scenario. :(

The actual problem I'm facing has been described in other replies. It
concerns anonymous pages that are fully COW-ed but whose avcs have not
been removed from the anon_vma's RB tree, which results in inefficient
traversal.

Lorenzo has mentioned that he has some bold ideas regarding this, so let's
look forward to it. :)

Thanks.

>
> ---
> Best Regards,
> Huang, Ying

>> If you cannot prove that the optimization has some practical effect,
>> it's hard to persuade others for increased complexity.
>
> To be honest, this patch stems from an issue I noticed during code review.
>
> When this patchset was completed, I did put in some effort to find its
> benefits, and it was only
>
> under such an exaggeratedly constructed test scenario that the effect
> could be demonstrated. :(

I mean, thanks for looking into that and trying to find a way to improve
it. :)

That VMA walk is the real problem; stopping earlier is just an
optimization that works in some cases. I guess on average it will improve
things, although it is probably really hard to quantify in reality.

I think tracking the number of migration entries might be a very good
debugging tool.

A cleaner and more reliable solution for what you tried to implement
would

(a) track it in a separate counter, at the time we establish/remove a
    migration entry, not once the mapcount is already 0. With "struct
    folio" getting allocated separately in the future this could maybe be
    feasible (and putting it under a config knob).

(b) do it for large folios as well.

(b) might be tricky with migration entries being used for THP splits, but
it could probably be special-cased somehow, I am sure.

-- 
Cheers,

David / dhildenb