TL;DR
-----

This series introduces ANON_VMA_LAZY, which defers anon_vma creation
until it is actually required.

- anon_vma memory reduced by ~92-97%, anon_vma_chain reduced by ~50-57%
- rmap operations on ANON_VMA_LAZY VMAs do not require anon_vma locking

Background
----------

Currently anon_vma structures are created eagerly when anonymous VMAs
are initialized. However, many VMAs never participate in fork or rmap
operations that require anon_vma chains, so the allocated anon_vma and
anon_vma_chain objects are often unnecessary.

Design overview
---------------

ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
(for example during fork). VMAs that never participate in sharing can
avoid creating anon_vma structures entirely.

Before an anon_vma exists, rmap operations rely directly on VMA
information, so no anon_vma locking is required. An anon_vma is created
and linked only when sharing semantics are required.

This series introduces anon_rmap helpers to make rmap less dependent on
direct anon_vma access. It also introduces anon_vma_tree_t as a container
to support both the lazy and the existing anon_vma layouts.

Once a VMA becomes associated with an anon_vma, the normal behavior
remains unchanged.

Memory impact
-------------

Preliminary measurements show significant reductions in anon_vma-related
slab allocations.

After boot:

Object         | Before (active KB) | After (active KB) | Change
vm_area_struct | 117035             | 118176            | +1.0%
anon_vma_chain | 18865.8            | 8112.06           | -57.0%
anon_vma       | 20426.4            | 613.75            | -97.0%

After launching 24 apps:

Object         | Before (active KB) | After (active KB) | Change
vm_area_struct | 196873             | 197345            | +0.2%
anon_vma_chain | 31477.1            | 15576.8           | -50.5%
anon_vma       | 33280              | 2648.12           | -92.0%

Simple fork microbenchmarks also show a slight improvement in fork
performance, since child VMAs do not need to allocate anon_vma
structures during fork.

Feedback and suggestions are welcome.

tao (15):
  mm/rmap: introduce anon_rmap APIs for anonymous folios
  mm: convert anon_vma rmap APIs to anon_rmap
  mm: introduce anon_vma_tree_t for multiple anon_vma topologies
  mm: switch to anon_vma_tree_t APIs in preparation for ANON_VMA_LAZY
  mm: add CONFIG_ANON_VMA_LAZY and folio helpers
  mm: add CONFIG_VMA_REF and VMA helpers
  mm: replace direct FOLIO_MAPPING_ANON usage with helpers
  mm: prepare rmap infrastructure for ANON_VMA_LAZY
  mm: implement ANON_VMA_LAZY rmap semantics
  mm: defer anon_vma creation with ANON_VMA_LAZY
  mm: handle ANON_VMA_LAZY in huge page operations
  mm: handle ANON_VMA_LAZY during migration
  mm: support setup and upgrade of ANON_VMA_LAZY folios
  mm: support merging of ANON_VMA_LAZY VMAs
  mm: enable CONFIG_ANON_VMA_LAZY on arm64 and x86_64

 arch/arm64/Kconfig         |   1 +
 arch/x86/Kconfig           |   1 +
 fs/proc/page.c             |   6 +-
 include/linux/mm.h         |  38 ++
 include/linux/mm_types.h   |   9 +-
 include/linux/page-flags.h |  34 +-
 include/linux/pagemap.h    |   2 +-
 include/linux/rmap.h       | 165 ++++++++-
 mm/Kconfig                 |  22 ++
 mm/damon/ops-common.c      |   4 +-
 mm/debug.c                 |   2 +-
 mm/debug_vm_pgtable.c      |   2 +-
 mm/gup.c                   |   6 +-
 mm/huge_memory.c           |  16 +-
 mm/internal.h              | 171 +++++++++
 mm/khugepaged.c            |  13 +-
 mm/ksm.c                   |  43 ++-
 mm/memory-failure.c        |  11 +-
 mm/memory.c                |  19 +-
 mm/migrate.c               | 126 ++++---
 mm/mmap.c                  |  15 +-
 mm/mremap.c                |   4 +-
 mm/page_idle.c             |   2 +-
 mm/rmap.c                  | 690 ++++++++++++++++++++++++++++++++++---
 mm/vma.c                   |  76 ++--
 mm/vma.h                   |   4 +-
 mm/vma_exec.c              |   2 +-
 mm/vma_init.c              |   1 +
 28 files changed, 1279 insertions(+), 206 deletions(-)

--
2.17.1

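The "Change" column in the cover letter's memory tables can be re-derived from the before/after figures. The following is a minimal sketch in Python, not part of the series: the numbers are copied from the tables above, while the names `measurements`, `percent_change` and `summarize` are made up for illustration.

```python
# Active slab usage in KB, copied from the cover letter's two tables.
# Each entry is (before, after).
measurements = {
    "after boot": {
        "vm_area_struct": (117035, 118176),
        "anon_vma_chain": (18865.8, 8112.06),
        "anon_vma": (20426.4, 613.75),
    },
    "after launching 24 apps": {
        "vm_area_struct": (196873, 197345),
        "anon_vma_chain": (31477.1, 15576.8),
        "anon_vma": (33280, 2648.12),
    },
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def summarize(data):
    """Return {scenario: {object: percent change rounded to one decimal}}."""
    return {
        scenario: {
            name: round(percent_change(before, after), 1)
            for name, (before, after) in objects.items()
        }
        for scenario, objects in data.items()
    }


if __name__ == "__main__":
    for scenario, changes in summarize(measurements).items():
        print(scenario)
        for name, change in changes.items():
            print(f"  {name:<15} {change:+.1f}%")
```

Rounded to one decimal, the recomputed values match the posted column, which is where the "~92-97%" and "~50-57%" figures in the TL;DR come from.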
On 5/27/26 13:01, tao wrote:
> TL;DR
> -----
>
> This series introduces ANON_VMA_LAZY, which defers anon_vma creation
> until it is actually required.
>
> - anon_vma memory reduced by ~92-97%, anon_vma_chain reduced by ~50-57%
> - rmap operations on ANON_VMA_LAZY VMAs do not require anon_vma locking
>
> Background
> ----------
>
> Currently anon_vma structures are created eagerly when anonymous VMAs
> are initialized. However, many VMAs never participate in fork or rmap
> operations that require anon_vma chains, so the allocated anon_vma and
> anon_vma_chain objects are often unnecessary.
>
> Design overview
> ---------------
>
> ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> (for example during fork). VMAs that never participate in sharing can
> avoid creating anon_vma structures entirely.
>
> Before an anon_vma exists, rmap operations rely directly on VMA
> information, so no anon_vma locking is required. An anon_vma is created
> and linked only when sharing semantics are required.
>
> This series introduces anon_rmap helpers to make rmap less dependent on
> direct anon_vma access. It also introduces anon_vma_tree_t as a container
> to support both the lazy and the existing anon_vma layouts.
>
> Once a VMA becomes associated with an anon_vma, the normal behavior
> remains unchanged.
>
> Memory impact
> -------------
>
> Preliminary measurements show significant reductions in anon_vma-related
> slab allocations.
>
> After boot:
>
> Object         | Before (active KB) | After (active KB) | Change
> vm_area_struct | 117035             | 118176            | +1.0%
> anon_vma_chain | 18865.8            | 8112.06           | -57.0%
> anon_vma       | 20426.4            | 613.75            | -97.0%
>
> After launching 24 apps:
>
> Object         | Before (active KB) | After (active KB) | Change
> vm_area_struct | 196873             | 197345            | +0.2%
> anon_vma_chain | 31477.1            | 15576.8           | -50.5%
> anon_vma       | 33280              | 2648.12           | -92.0%
>
> Simple fork microbenchmarks also show a slight improvement in fork
> performance, since child VMAs do not need to allocate anon_vma
> structures during fork.
>
> Feedback and suggestions are welcome.
>
>
> tao (15):
>   mm/rmap: introduce anon_rmap APIs for anonymous folios
>   mm: convert anon_vma rmap APIs to anon_rmap
>   mm: introduce anon_vma_tree_t for multiple anon_vma topologies
>   mm: switch to anon_vma_tree_t APIs in preparation for ANON_VMA_LAZY
>   mm: add CONFIG_ANON_VMA_LAZY and folio helpers
>   mm: add CONFIG_VMA_REF and VMA helpers
>   mm: replace direct FOLIO_MAPPING_ANON usage with helpers
>   mm: prepare rmap infrastructure for ANON_VMA_LAZY
>   mm: implement ANON_VMA_LAZY rmap semantics
>   mm: defer anon_vma creation with ANON_VMA_LAZY
>   mm: handle ANON_VMA_LAZY in huge page operations
>   mm: handle ANON_VMA_LAZY during migration
>   mm: support setup and upgrade of ANON_VMA_LAZY folios
>   mm: support merging of ANON_VMA_LAZY VMAs
>   mm: enable CONFIG_ANON_VMA_LAZY on arm64 and x86_64
>
>  arch/arm64/Kconfig         |   1 +
>  arch/x86/Kconfig           |   1 +
>  fs/proc/page.c             |   6 +-
>  include/linux/mm.h         |  38 ++
>  include/linux/mm_types.h   |   9 +-
>  include/linux/page-flags.h |  34 +-
>  include/linux/pagemap.h    |   2 +-
>  include/linux/rmap.h       | 165 ++++++++-
>  mm/Kconfig                 |  22 ++
>  mm/damon/ops-common.c      |   4 +-
>  mm/debug.c                 |   2 +-
>  mm/debug_vm_pgtable.c      |   2 +-
>  mm/gup.c                   |   6 +-
>  mm/huge_memory.c           |  16 +-
>  mm/internal.h              | 171 +++++++++
>  mm/khugepaged.c            |  13 +-
>  mm/ksm.c                   |  43 ++-
>  mm/memory-failure.c        |  11 +-
>  mm/memory.c                |  19 +-
>  mm/migrate.c               | 126 ++++---
>  mm/mmap.c                  |  15 +-
>  mm/mremap.c                |   4 +-
>  mm/page_idle.c             |   2 +-
>  mm/rmap.c                  | 690 ++++++++++++++++++++++++++++++++++---
>  mm/vma.c                   |  76 ++--
>  mm/vma.h                   |   4 +-
>  mm/vma_exec.c              |   2 +-
>  mm/vma_init.c              |   1 +
>  28 files changed, 1279 insertions(+), 206 deletions(-)

Hi!

When I saw the diffstat I was concerned. Going through the patches made me ...
more concerned :)

This is a lot of complexity. On top of something that is already so complicated
that I fail to grasp most details without regularly taking a look at the nice
figures Lorenzo created recently.

For example, I read above "since child VMAs do not need to allocate anon_vma"
and wondered how that could be part of something that is just done lazily. Then
I had to learn in the patches that there is some additional "Child VMAs
are created as ANON_VMA_TREE_PARENT and do not allocate anon_vma" -- excuse me,
what? :)

Reading about VMA refcounts made me shiver. Reading "Holding only
folio_lock(folio) cannot guarantee that the split operation completes
atomically." confused me. Learning that we have to invent interesting ways to
make page migration mutually exclusive to free_pgtables() concerned me.
Figuring out that there are arch-specific config options and runtime toggles
is a clear warning sign.

Seeing test_folio_unmapped() was funny, though (why?! :)).

I think this patch set has a noble goal of reducing anon_vma overhead when anon
pages are not shared during fork. However, using anon_vma for them actually
makes the overall implementation (e.g., rmap walks, locking) more consistent and
simpler.

Even if we could be convinced that most of this here is correct, how should we
reasonably maintain this increasing level of complexity here?

I won't echo what has already been said in this thread (and I didn't manage to
read all, unfortunately), but for such big and invasive work it's often best to
get in touch with the community earlier. Otherwise, you might end up wasting
your time.

Ok, arguably, someone who writes that code learns a lot on the way. And if this
code really was written by one developer only, I tip my hat! I'd be curious if
that code already ran somewhere on some Android kernel out there?

But adding more complexity on top of something that's already extremely
complicated to save some memory looks like the wrong direction, really.

I was excited when Lorenzo started working on a completely new approach that
would focus on improving the common cases while trying to reduce the overall
complexity. Because I think most of us really dislike anon_vma. It's still work
in progress, and I am sure there are some rough edges.

But fundamentally, I think we want to find a new design that is just naturally
simpler.

Lorenzo has been hard at work exploring various design options (and I'm afraid
he might be one of the 3 people on this planet that understand anon_vma in full
detail), so I suggest we wait for a redesign proposal from him and see if that
is doable?

--
Cheers,

David

On Wed, Jun 03, 2026 at 10:25:05PM +0200, David Hildenbrand (Arm) wrote:
> I was excited when Lorenzo started working on a completely new approach that
> would focus on improving the common cases while trying to reduce the overall
> complexity. Because I think most of us really dislike anon_vma. It's still work
> in progress, and I am sure there are some rough edges.
>
> But fundamentally, I think we want to find a new design that is just naturally
> simpler.
>
> Lorenzo has been hard at work exploring various design options (and I'm afraid
> he might be one of the 3 people on this planet that understand anon_vma in full
> detail), so I suggest we wait for a redesign proposal from him and see if that
> is doable?

I'll spend the next month focusing on shipping something as a priority, even if
partial.

Thanks, Lorenzo

> >
> >  arch/arm64/Kconfig         |   1 +
> >  arch/x86/Kconfig           |   1 +
> >  fs/proc/page.c             |   6 +-
> >  include/linux/mm.h         |  38 ++
> >  include/linux/mm_types.h   |   9 +-
> >  include/linux/page-flags.h |  34 +-
> >  include/linux/pagemap.h    |   2 +-
> >  include/linux/rmap.h       | 165 ++++++++-
> >  mm/Kconfig                 |  22 ++
> >  mm/damon/ops-common.c      |   4 +-
> >  mm/debug.c                 |   2 +-
> >  mm/debug_vm_pgtable.c      |   2 +-
> >  mm/gup.c                   |   6 +-
> >  mm/huge_memory.c           |  16 +-
> >  mm/internal.h              | 171 +++++++++
> >  mm/khugepaged.c            |  13 +-
> >  mm/ksm.c                   |  43 ++-
> >  mm/memory-failure.c        |  11 +-
> >  mm/memory.c                |  19 +-
> >  mm/migrate.c               | 126 ++++---
> >  mm/mmap.c                  |  15 +-
> >  mm/mremap.c                |   4 +-
> >  mm/page_idle.c             |   2 +-
> >  mm/rmap.c                  | 690 ++++++++++++++++++++++++++++++++++---
> >  mm/vma.c                   |  76 ++--
> >  mm/vma.h                   |   4 +-
> >  mm/vma_exec.c              |   2 +-
> >  mm/vma_init.c              |   1 +
> >  28 files changed, 1279 insertions(+), 206 deletions(-)
>
> Hi!
>
> When I saw the diffstat I was concerned. Going through the patches made me ...
> more concerned :)
>
> This is a lot of complexity. On top of something that is already so complicated
> that I fail to grasp most details without regularly taking a look at the nice
> figures Lorenzo created recently.
>
> For example, I read above "since child VMAs do not need to allocate anon_vma"
> and wondered how that could be part of something that is just done lazily. Then
> I had to learn in the patches that there is some additional "Child VMAs
> are created as ANON_VMA_TREE_PARENT and do not allocate anon_vma" -- excuse me,
> what? :)
>
> Reading about VMA refcounts made me shiver. Reading "Holding only
> folio_lock(folio) cannot guarantee that the split operation completes
> atomically." confused me. Learning that we have to invent interesting ways to
> make page migration mutually exclusive to free_pgtables() concerned me.
> Figuring out that there are arch-specific config options and runtime toggles
> is a clear warning sign.
>
> Seeing test_folio_unmapped() was funny, though (why?! :)).
>
> I think this patch set has a noble goal of reducing anon_vma overhead when anon
> pages are not shared during fork. However, using anon_vma for them actually
> makes the overall implementation (e.g., rmap walks, locking) more consistent and
> simpler.
>
> Even if we could be convinced that most of this here is correct, how should we
> reasonably maintain this increasing level of complexity here?

Indeed, it's very complex, but having the changes of 15 patches scattered across
various subsystems is really frustrating for reviewers. It took me a whole day to
read through the entire patch set, which made an already complicated matter even
more complex (maintaining such complex code in the future will be a pain).

However, overall, I think the original intention behind Tao's patch is innovative
and valuable, and Tao could definitely make this patch set simpler and more
readable, because the core changes actually start from PATCH 10.

I believe that if Tao had done the following, things might have gone better and
easier for reviewing. In fact, I understand the motivation behind the patch is
quite simple at its core (just wanting to avoid allocating the anon_vma structure
when a VMA hasn't been truly forked, and instead put the VMA information directly
into folio->mapping):

1) You could actually simplify your patch significantly, without adding a lot of
   wrappers and helper functions that introduce extra review overhead, and keep
   only the most essential elements.

2) Provide complete test code (in tools/testing/selftests) that covers the
   affected functionality, such as VMA, huge pages, KSM, etc.

3) Use the RFC tag to start a discussion.


I would be very glad to see if Tao could post a simpler v2 version that does not
alter the rmap core data structures too much and does not introduce excessive
complexity, no matter whether it can be merged finally.

On 6/4/26 05:10, xu.xin16@zte.com.cn wrote:
>>>
>>>  arch/arm64/Kconfig         |   1 +
>>>  arch/x86/Kconfig           |   1 +
>>>  fs/proc/page.c             |   6 +-
>>>  include/linux/mm.h         |  38 ++
>>>  include/linux/mm_types.h   |   9 +-
>>>  include/linux/page-flags.h |  34 +-
>>>  include/linux/pagemap.h    |   2 +-
>>>  include/linux/rmap.h       | 165 ++++++++-
>>>  mm/Kconfig                 |  22 ++
>>>  mm/damon/ops-common.c      |   4 +-
>>>  mm/debug.c                 |   2 +-
>>>  mm/debug_vm_pgtable.c      |   2 +-
>>>  mm/gup.c                   |   6 +-
>>>  mm/huge_memory.c           |  16 +-
>>>  mm/internal.h              | 171 +++++++++
>>>  mm/khugepaged.c            |  13 +-
>>>  mm/ksm.c                   |  43 ++-
>>>  mm/memory-failure.c        |  11 +-
>>>  mm/memory.c                |  19 +-
>>>  mm/migrate.c               | 126 ++++---
>>>  mm/mmap.c                  |  15 +-
>>>  mm/mremap.c                |   4 +-
>>>  mm/page_idle.c             |   2 +-
>>>  mm/rmap.c                  | 690 ++++++++++++++++++++++++++++++++++---
>>>  mm/vma.c                   |  76 ++--
>>>  mm/vma.h                   |   4 +-
>>>  mm/vma_exec.c              |   2 +-
>>>  mm/vma_init.c              |   1 +
>>>  28 files changed, 1279 insertions(+), 206 deletions(-)
>>
>> Hi!
>>
>> When I saw the diffstat I was concerned. Going through the patches made me ...
>> more concerned :)
>>
>> This is a lot of complexity. On top of something that is already so complicated
>> that I fail to grasp most details without regularly taking a look at the nice
>> figures Lorenzo created recently.
>>
>> For example, I read above "since child VMAs do not need to allocate anon_vma"
>> and wondered how that could be part of something that is just done lazily. Then
>> I had to learn in the patches that there is some additional "Child VMAs
>> are created as ANON_VMA_TREE_PARENT and do not allocate anon_vma" -- excuse me,
>> what? :)
>>
>> Reading about VMA refcounts made me shiver. Reading "Holding only
>> folio_lock(folio) cannot guarantee that the split operation completes
>> atomically." confused me. Learning that we have to invent interesting ways to
>> make page migration mutually exclusive to free_pgtables() concerned me.
>> Figuring out that there are arch-specific config options and runtime toggles
>> is a clear warning sign.
>>
>> Seeing test_folio_unmapped() was funny, though (why?! :)).
>>
>> I think this patch set has a noble goal of reducing anon_vma overhead when anon
>> pages are not shared during fork. However, using anon_vma for them actually
>> makes the overall implementation (e.g., rmap walks, locking) more consistent and
>> simpler.
>>
>> Even if we could be convinced that most of this here is correct, how should we
>> reasonably maintain this increasing level of complexity here?
>
> Indeed, it's very complex, but having the changes of 15 patches scattered across
> various subsystems is really frustrating for reviewers. It took me a whole day to
> read through the entire patch set, which made an already complicated matter even
> more complex (maintaining such complex code in the future will be a pain).
>
> However, overall, I think the original intention behind Tao's patch is innovative
> and valuable, and Tao could definitely make this patch set simpler and more
> readable, because the core changes actually start from PATCH 10.
>
> I believe that if Tao had done the following, things might have gone better and
> easier for reviewing. In fact, I understand the motivation behind the patch is
> quite simple at its core (just wanting to avoid allocating the anon_vma structure
> when a VMA hasn't been truly forked, and instead put the VMA information directly
> into folio->mapping):
>
> 1) You could actually simplify your patch significantly, without adding a lot of
>    wrappers and helper functions that introduce extra review overhead, and keep
>    only the most essential elements.
>
> 2) Provide complete test code (in tools/testing/selftests) that covers the
>    affected functionality, such as VMA, huge pages, KSM, etc.
>
> 3) Use the RFC tag to start a discussion.
>
>
> I would be very glad to see if Tao could post a simpler v2 version that does not
> alter the rmap core data structures too much and does not introduce excessive
> complexity, no matter whether it can be merged finally.

I'm afraid, the overall complexity would increase in any case.

So to quote myself "But fundamentally, I think we want to find a new design that
is just naturally simpler."

--
Cheers,

David

On Fri, Jun 05, 2026 at 11:38:52AM +0200, David Hildenbrand (Arm) wrote:
> On 6/4/26 05:10, xu.xin16@zte.com.cn wrote:
> >>>
> >>>  arch/arm64/Kconfig         |   1 +
> >>>  arch/x86/Kconfig           |   1 +
> >>>  fs/proc/page.c             |   6 +-
> >>>  include/linux/mm.h         |  38 ++
> >>>  include/linux/mm_types.h   |   9 +-
> >>>  include/linux/page-flags.h |  34 +-
> >>>  include/linux/pagemap.h    |   2 +-
> >>>  include/linux/rmap.h       | 165 ++++++++-
> >>>  mm/Kconfig                 |  22 ++
> >>>  mm/damon/ops-common.c      |   4 +-
> >>>  mm/debug.c                 |   2 +-
> >>>  mm/debug_vm_pgtable.c      |   2 +-
> >>>  mm/gup.c                   |   6 +-
> >>>  mm/huge_memory.c           |  16 +-
> >>>  mm/internal.h              | 171 +++++++++
> >>>  mm/khugepaged.c            |  13 +-
> >>>  mm/ksm.c                   |  43 ++-
> >>>  mm/memory-failure.c        |  11 +-
> >>>  mm/memory.c                |  19 +-
> >>>  mm/migrate.c               | 126 ++++---
> >>>  mm/mmap.c                  |  15 +-
> >>>  mm/mremap.c                |   4 +-
> >>>  mm/page_idle.c             |   2 +-
> >>>  mm/rmap.c                  | 690 ++++++++++++++++++++++++++++++++++---
> >>>  mm/vma.c                   |  76 ++--
> >>>  mm/vma.h                   |   4 +-
> >>>  mm/vma_exec.c              |   2 +-
> >>>  mm/vma_init.c              |   1 +
> >>>  28 files changed, 1279 insertions(+), 206 deletions(-)
> >>
> >> Hi!
> >>
> >> When I saw the diffstat I was concerned. Going through the patches made me ...
> >> more concerned :)
> >>
> >> This is a lot of complexity. On top of something that is already so complicated
> >> that I fail to grasp most details without regularly taking a look at the nice
> >> figures Lorenzo created recently.
> >>
> >> For example, I read above "since child VMAs do not need to allocate anon_vma"
> >> and wondered how that could be part of something that is just done lazily. Then
> >> I had to learn in the patches that there is some additional "Child VMAs
> >> are created as ANON_VMA_TREE_PARENT and do not allocate anon_vma" -- excuse me,
> >> what? :)
> >>
> >> Reading about VMA refcounts made me shiver. Reading "Holding only
> >> folio_lock(folio) cannot guarantee that the split operation completes
> >> atomically." confused me. Learning that we have to invent interesting ways to
> >> make page migration mutually exclusive to free_pgtables() concerned me.
> >> Figuring out that there are arch-specific config options and runtime toggles
> >> is a clear warning sign.
> >>
> >> Seeing test_folio_unmapped() was funny, though (why?! :)).
> >>
> >> I think this patch set has a noble goal of reducing anon_vma overhead when anon
> >> pages are not shared during fork. However, using anon_vma for them actually
> >> makes the overall implementation (e.g., rmap walks, locking) more consistent and
> >> simpler.
> >>
> >> Even if we could be convinced that most of this here is correct, how should we
> >> reasonably maintain this increasing level of complexity here?
> >
> > Indeed, it's very complex, but having the changes of 15 patches scattered across
> > various subsystems is really frustrating for reviewers. It took me a whole day to
> > read through the entire patch set, which made an already complicated matter even
> > more complex (maintaining such complex code in the future will be a pain).
> >
> > However, overall, I think the original intention behind Tao's patch is innovative
> > and valuable, and Tao could definitely make this patch set simpler and more
> > readable, because the core changes actually start from PATCH 10.
> >
> > I believe that if Tao had done the following, things might have gone better and
> > easier for reviewing. In fact, I understand the motivation behind the patch is
> > quite simple at its core (just wanting to avoid allocating the anon_vma structure
> > when a VMA hasn't been truly forked, and instead put the VMA information directly
> > into folio->mapping):
> >
> > 1) You could actually simplify your patch significantly, without adding a lot of
> >    wrappers and helper functions that introduce extra review overhead, and keep
> >    only the most essential elements.
> >
> > 2) Provide complete test code (in tools/testing/selftests) that covers the
> >    affected functionality, such as VMA, huge pages, KSM, etc.
> >
> > 3) Use the RFC tag to start a discussion.
> >
> >
> > I would be very glad to see if Tao could post a simpler v2 version that does not
> > alter the rmap core data structures too much and does not introduce excessive
> > complexity, no matter whether it can be merged finally.
>
> I'm afraid, the overall complexity would increase in any case.
>
> So to quote myself "But fundamentally, I think we want to find a new design that
> is just naturally simpler."

In general I'm prioritising my work, and hope to ship something within the next
month, even if it might just be some early bits of work on this to plumb stuff
in, and certainly the main thing will be an RFC :)

So keep an eye out for this which will hopefully address people's needs!

>
> --
> Cheers,
>
> David

Thanks, Lorenzo

On 6/5/26 12:07, Lorenzo Stoakes wrote:
> On Fri, Jun 05, 2026 at 11:38:52AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/4/26 05:10, xu.xin16@zte.com.cn wrote:
>>>
>>> Indeed, it's very complex, but having the changes of 15 patches scattered across
>>> various subsystems is really frustrating for reviewers. It took me a whole day to
>>> read through the entire patch set, which made an already complicated matter even
>>> more complex (maintaining such complex code in the future will be a pain).
>>>
>>> However, overall, I think the original intention behind Tao's patch is innovative
>>> and valuable, and Tao could definitely make this patch set simpler and more
>>> readable, because the core changes actually start from PATCH 10.
>>>
>>> I believe that if Tao had done the following, things might have gone better and
>>> easier for reviewing. In fact, I understand the motivation behind the patch is
>>> quite simple at its core (just wanting to avoid allocating the anon_vma structure
>>> when a VMA hasn't been truly forked, and instead put the VMA information directly
>>> into folio->mapping):
>>>
>>> 1) You could actually simplify your patch significantly, without adding a lot of
>>>    wrappers and helper functions that introduce extra review overhead, and keep
>>>    only the most essential elements.
>>>
>>> 2) Provide complete test code (in tools/testing/selftests) that covers the
>>>    affected functionality, such as VMA, huge pages, KSM, etc.
>>>
>>> 3) Use the RFC tag to start a discussion.
>>>
>>>
>>> I would be very glad to see if Tao could post a simpler v2 version that does not
>>> alter the rmap core data structures too much and does not introduce excessive
>>> complexity, no matter whether it can be merged finally.
>>
>> I'm afraid, the overall complexity would increase in any case.
>>
>> So to quote myself "But fundamentally, I think we want to find a new design that
>> is just naturally simpler."
>
> In general I'm prioritising my work, and hope to ship something within the next
> month, even if it might just be some early bits of work on this to plumb stuff
> in, and certainly the main thing will be an RFC :)

Cool, I'll cover your back on upstream review as much as you need :)

--
Cheers,

David

> >
> > Even if we could be convinced that most of this here is correct, how
> > should we reasonably maintain this increasing level of complexity here?
>
> Indeed, it's very complex, but having the changes of 15 patches scattered
> across various subsystems is really frustrating for reviewers. It took me a
> whole day to read through the entire patch set, which made an already
> complicated matter even more complex (maintaining such complex code in
> the future will be a pain).
>
> However, overall, I think the original intention behind Tao's patch is
> innovative and valuable, and Tao could definitely make this patch set simpler
> and more readable, because the core changes actually start from PATCH 10.
>

Yes, initially it was basically patches 9, 10, and 13. Some fundamental code
logic is hard to understand without separating the anon_rmap layer from the
anon_vma topology layer, so I added these two logical layers when preparing the
series for submission to the community. However, this also significantly
increased the amount of code, which is not ideal.

> I believe that if Tao had done the following, things might have gone better
> and easier for reviewing. In fact, I understand the motivation behind the
> patch is quite simple at its core (just wanting to avoid allocating the
> anon_vma structure when a VMA hasn't been truly forked, and instead put
> the VMA information directly into folio->mapping):
>
> 1) You could actually simplify your patch significantly, without adding a lot
>    of wrappers and helper functions that introduce extra review overhead,
>    and keep only the most essential elements.
>
> 2) Provide complete test code (in tools/testing/selftests) that covers the
>    affected functionality, such as VMA, huge pages, KSM, etc.
>
> 3) Use the RFC tag to start a discussion.
>
>
> I would be very glad to see if Tao could post a simpler v2 version that does
> not alter the rmap core data structures too much and does not introduce
> excessive complexity, no matter whether it can be merged finally.

On Thu, Jun 4, 2026 at 4:25 AM David Hildenbrand (Arm) <david@kernel.org> wrote: [...] > > > > arch/arm64/Kconfig | 1 + > > arch/x86/Kconfig | 1 + > > fs/proc/page.c | 6 +- > > include/linux/mm.h | 38 ++ > > include/linux/mm_types.h | 9 +- > > include/linux/page-flags.h | 34 +- > > include/linux/pagemap.h | 2 +- > > include/linux/rmap.h | 165 ++++++++- > > mm/Kconfig | 22 ++ > > mm/damon/ops-common.c | 4 +- > > mm/debug.c | 2 +- > > mm/debug_vm_pgtable.c | 2 +- > > mm/gup.c | 6 +- > > mm/huge_memory.c | 16 +- > > mm/internal.h | 171 +++++++++ > > mm/khugepaged.c | 13 +- > > mm/ksm.c | 43 ++- > > mm/memory-failure.c | 11 +- > > mm/memory.c | 19 +- > > mm/migrate.c | 126 ++++--- > > mm/mmap.c | 15 +- > > mm/mremap.c | 4 +- > > mm/page_idle.c | 2 +- > > mm/rmap.c | 690 ++++++++++++++++++++++++++++++++++--- > > mm/vma.c | 76 ++-- > > mm/vma.h | 4 +- > > mm/vma_exec.c | 2 +- > > mm/vma_init.c | 1 + > > 28 files changed, 1279 insertions(+), 206 deletions(-) > > Hi! > > When I saw the diffsat I was concerned. Going through the patches made me ... > more concerned :) > > This is a lot of complexity. On top of something that is already so complicated > that I fail to grasp most details without regularly taking a look at the nice > figures Lorenzo created recently. > > For example, I read above "since child VMAs do not need to allocate anon_vma" > and wondered how that could be part of something that is just done lazily. Then > I had to learn in the patches that there is some additional "Child VMAs > are created as ANON_VMA_TREE_PARENT and do not allocate anon_vma" -- excuse me, > what? :) Yes, that part is quite complicated here. There are two cases here: 1. A forks B, and B inherits a VMA from A. In this case, B's VMA gets ANON_VMA_TREE_PARENT. 2. A forks B, and B later creates a new VMA via mmap(). If a page fault occurs in this new VMA, it gets ANON_VMA_TREE_VMA. In both cases, we need to upgrade B to a regular anon_vma when B becomes a parent and performs a fork(). 
This may be a bit off-topic, but I'm also considering whether there is a chance to work with Suren to support case 2 via a GKI hook in the Android kernel before Lorenzo's work is ready. Even then, the optimization would apply only to the case where B never forks, allowing us to skip the anon_vma "upgrade" entirely. That assumption holds for most applications, although there are a few cases where it does not. I'm actually hoping Android could eventually disable forking for UI applications altogether. From what I've heard, some applications use fork() primarily to evade LMKD (the Android low-memory killer daemon). For example, a child process may monitor the main process, and if the main process is killed, detect that event and request a relaunch. This is one way some applications attempt to keep themselves alive indefinitely. But even if we limit the optimization to the subset of case 2 where B never forks, we still need to handle mremap(), VMA merges, VMA splits, and similar cases. That starts to become quite a headache. So please just ignore my rambling if it turns out to be nonsense :-) > > Reading about VMA refcounts made me shiver. Reading "Holding only > folio_lock(folio) cannot guarantee that the split > operation completes atomically." confused me. Learning that we have to invent > interesting ways to make page migration mutually exclusive to free_pgtables() > concerned me. Figuring out that there are arch-specific config options and > runtime toggles is a clear warning sign. > > Seeing test_folio_unmapped() was funny, though (why?! :)). > > I think this patch set has a noble goal of reducing anon_vma overhead when anon > pages are not shared during fork. However, using anon_vma for them actually > makes the overall implementation (e.g., rmap walks, locking) more consistent and > simpler. > > Even if we could be convinced that most of this here is correct, how should we > reasonably maintain this increasing level of complexity here? 
>
> I won't echo what has already been said in this thread (and I didn't manage to
> read all, unfortunately), but for such big and invasive work it's often best to
> get in touch with the community earlier. Otherwise, you might end up wasting
> your time.
>
> Ok, arguably, someone who writes that code learns a lot on the way. And if this
> code really was written by one developer only, I tip my hat! I'd be curious if
> that code already ran somewhere on some Android kernel out there?

I heard from Zicheng that they have been running this for months and it
seems reasonably stable. Please correct me if I'm wrong, Zicheng :-). This
really should have been discussed with the community earlier.

>
> But adding more complexity on top of something that's already extremely
> complicated to save some memory looks like the wrong direction, really.
>
> I was excited when Lorenzo started working on a completely new approach that
> would focus on improving the common cases while trying to reduce the overall
> complexity. Because I think most of us really dislike anon_vma. It's still work
> in progress, and I am sure there are some rough edges.
>
> But fundamentally, I think we want to find a new design that is just naturally
> simpler.
>

+1

> Lorenzo has been hard at work exploring various design options (and I'm afraid
> he might be one of the 3 people on this planet that understand anon_vma in full
> detail), so I suggest we wait for a redesign proposal from him and see if that
> is doable?
>

Thanks
Barry
> > > > Hi! > > > > When I saw the diffsat I was concerned. Going through the patches made > me ... > > more concerned :) > > > > This is a lot of complexity. On top of something that is already so > > complicated that I fail to grasp most details without regularly taking > > a look at the nice figures Lorenzo created recently. > > > > For example, I read above "since child VMAs do not need to allocate > anon_vma" > > and wondered how that could be part of something that is just done > > lazily. Then I had to learn in the patches that there is some > > additional "Child VMAs are created as ANON_VMA_TREE_PARENT and do > not > > allocate anon_vma" -- excuse me, what? :) > > Yes, that part is quite complicated here. There are two cases here: > 1. A forks B, and B inherits a VMA from A. In this case, B's VMA gets > ANON_VMA_TREE_PARENT. > > 2. A forks B, and B later creates a new VMA via mmap(). > If a page fault occurs in this new VMA, it gets ANON_VMA_TREE_VMA. > > In both cases, we need to upgrade B to a regular anon_vma when B becomes > a parent and performs a fork(). > Thank you for helping explain. > This may be a bit off-topic, but I'm also considering whether there is a chance > to work with Suren to support case 2 via a GKI hook in the Android kernel > before Lorenzo's work is ready. > > Even then, the optimization would apply only to the case where B never > forks, allowing us to skip the anon_vma "upgrade" entirely. That assumption > holds for most applications, although there are a few cases where it does not. > > I'm actually hoping Android could eventually disable forking for UI > applications altogether. From what I've heard, some applications use fork() > primarily to evade LMKD (the Android low-memory killer daemon). For > example, a child process may monitor the main process, and if the main > process is killed, detect that event and request a relaunch. This is one way > some applications attempt to keep themselves alive indefinitely. 
> > But even if we limit the optimization to the subset of case 2 where B never > forks, we still need to handle mremap(), VMA merges, VMA splits, and > similar cases. > That starts to become quite a headache. > > So please just ignore my rambling if it turns out to be nonsense :-) > > > > > Reading about VMA refcounts made me shiver. Reading "Holding only > > folio_lock(folio) cannot guarantee that the split operation completes > > atomically." confused me. Learning that we have to invent interesting > > ways to make page migration mutually exclusive to free_pgtables() > > concerned me. Figuring out that there are arch-specific config options > > and runtime toggles is a clear warning sign. > > > > Seeing test_folio_unmapped() was funny, though (why?! :)). > > > > I think this patch set has a noble goal of reducing anon_vma overhead > > when anon pages are not shared during fork. However, using anon_vma > > for them actually makes the overall implementation (e.g., rmap walks, > > locking) more consistent and simpler. > > > > Even if we could be convinced that most of this here is correct, how > > should we reasonably maintain this increasing level of complexity here? > > > > I won't echo what has already been said in this thread (and I didn't > > manage to read all, unfortunately), but for such big and invasive work > > it's often best to get in touch with the community earlier. Otherwise, > > you might end up wasting your time. > > > > Ok, arguably, someone who writes that code learns a lot on the way. > > And if this code really was written by one developer only, I tip my > > hat! I'd be curious if that code already ran somewhere on some Android > kernel out there? > > I heard from Zicheng that they have been running this for months and it > seems reasonably stable. Please correct me if I'm wrong, Zicheng :-). This > really should have been discussed with the community earlier. 
> I initially developed and debugged this based on the Android GKI branch and did some preliminary testing on an Android phone. Since GKI generally only accepts features merged from the upstream community, and this memory saving could also benefit the community, I ported the patch to the Linux master branch. Because my English is not very good and I rarely participate in the community, I am not familiar with the community workflow. I did not send an email for discussion in advance with an RFC tag. I apologize again. > > > > But adding more complexity on top of something that's already > > extremely complicated to save some memory looks like the wrong > direction, really. > > > > I was excited when Lorenzo started working on a completely new > > approach that would focus on improving the common cases while trying > > to reduce the overall complexity. Because I think most of us really > > dislike anon_vma. It's still work in progress, and I am sure there are some > rough edges. > > > > But fundamentally, I think we want to find a new design that is just > > naturally simpler. > > > > +1 > > > Lorenzo has been hard at work exploring various design options (and > > I'm afraid he might be one of the 3 people on this planet that > > understand anon_vma in full detail), so I suggest we wait for a > > redesign proposal from him and see if that is doable? > > > > Thanks > Barry
On Thu, Jun 4, 2026 at 12:03 PM wangtao <tao.wangtao@honor.com> wrote:
[...]
> > > I won't echo what has already been said in this thread (and I didn't
> > > manage to read all, unfortunately), but for such big and invasive work
> > > it's often best to get in touch with the community earlier. Otherwise,
> > > you might end up wasting your time.
> > >
> > > Ok, arguably, someone who writes that code learns a lot on the way.
> > > And if this code really was written by one developer only, I tip my
> > > hat! I'd be curious if that code already ran somewhere on some Android
> > > kernel out there?
> >
> > I heard from Zicheng that they have been running this for months and it
> > seems reasonably stable. Please correct me if I'm wrong, Zicheng :-). This
> > really should have been discussed with the community earlier.
> >
> I initially developed and debugged this based on the Android GKI branch
> and did some preliminary testing on an Android phone.
>
> Since GKI generally only accepts features merged from the upstream
> community, and this memory saving could also benefit the community, I
> ported the patch to the Linux master branch.
>
> Because my English is not very good and I rarely participate in the
> community, I am not familiar with the community workflow. I did not send
> an email for discussion in advance with an RFC tag. I apologize again.
>

No worries. I know someone who has worked on the Linux kernel for many, many
years and has excellent kernel expertise, yet has never submitted a patch
throughout his career for various reasons.

I heard from Zicheng that you are HONOR's key MM expert and have been
guiding them on memory-management related work. That's really impressive.
Personally, I'd love to see more ideas and contributions from you in the
linux-mm community.
BTW, regarding my earlier suggestion about using GKI hooks (limiting the
optimization to newly created VMAs and to applications that never call
fork()), do you have any ideas on what the smallest possible hook change
would look like?

Thanks
Barry
> [...] > > > > I won't echo what has already been said in this thread (and I > > > > didn't manage to read all, unfortunately), but for such big and > > > > invasive work it's often best to get in touch with the community > > > > earlier. Otherwise, you might end up wasting your time. > > > > > > > > Ok, arguably, someone who writes that code learns a lot on the way. > > > > And if this code really was written by one developer only, I tip > > > > my hat! I'd be curious if that code already ran somewhere on some > > > > Android > > > kernel out there? > > > > > > I heard from Zicheng that they have been running this for months and > > > it seems reasonably stable. Please correct me if I'm wrong, Zicheng > > > :-). This really should have been discussed with the community earlier. > > > > > I initially developed and debugged this based on the Android GKI > > branch and did some preliminary testing on an Android phone. > > > > Since GKI generally only accepts features merged from the upstream > > community, and this memory saving could also benefit the community, I > > ported the patch to the Linux master branch. > > > > Because my English is not very good and I rarely participate in the > > community, I am not familiar with the community workflow. I did not > > send an email for discussion in advance with an RFC tag. I apologize again. > > > > No worries. I know someone who has worked on the Linux kernel for many, > many years and has excellent kernel expertise, yet has never submitted a > patch throughout his career for various reasons. > > I heard from Zicheng that you are HONOR's key MM expert and have been > guiding them on memory-management related work. That's really impressive. > Personally, I'd love to see more ideas and contributions from you in the linux- > mm community. 
>
> BTW, regarding my earlier suggestion about using GKI hooks (limiting the
> optimization to newly created VMAs and to applications that never call
> fork()), do you have any ideas on what the smallest possible hook change
> would look like?

Even if Android does not consider handling for 32-bit, KSM, or
memory-failure, a large number of hooks are still required. A different
implementation approach is also needed.

For example, one or a set of anon_vma_locks could be used to mark whether a
VMA is lazy, so that we can avoid modifying all places that dereference
anon_vma. Reserved fields in vma could be used to record the root_vma and a
reference count respectively.

Roughly, hooks would be needed in the following places:

1. __vmf_anon_prepare: mark lazy_vma in the fault path.
2. __folio_set_anon and folio_move_anon_rmap: set mapping to lazy_vma.
3. rmap_walk_anon: handle rmap_walk for lazy_vma.
4. folio_get_anon_vma and put_anon_vma: add hooks to distinguish lazy_vma
   handling.
5. anon_vma_fork: decide whether to upgrade pvma->anon_vma.
6. Add a hook in try_dup_anon_rmap to upgrade folio->mapping.
7. anon_vma_clone: handle links when src is a lazy_vma.
8. vm_area_alloc / vm_area_dup / vm_area_free: add VMA reference counting.
9. mremap: handle upgrades in copy_vma and move_page_tables.

If this can be restricted to apps only and the app never calls fork(), then
5 and 6 could also be removed. If each page's mapping in the page tables is
synchronously updated when upgrading lazy_vma, then 4 and 5 could be merged
into one.

If lazy_vma is restricted to anonymous VMAs where vma_mapping_base(vma) = 0,
then root_vma and the VMA reference count could also be removed. With this
restriction, we could directly use folio->mapping to record mm, and locate
the vma using folio->index, which would simplify this patch series.

>
> Thanks
> Barry
On 5/27/26 8:01 PM, tao wrote:
> Design overview
> ---------------
>
> ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> (for example during fork). VMAs that never participate in sharing can
> avoid creating anon_vma structures entirely.
>
> Before an anon_vma exists, rmap operations rely directly on VMA
> information, so no anon_vma locking is required. An anon_vma is created
> and linked only when sharing semantics are required.

It is unfortunate that the design overview doesn't cover the correctness
aspect at all. VMAs are subject to change (even before being shared with
other processes), and rmap needs something that doesn't go away across VMA
merging, split, etc.

I'm not sure how the idea is supposed to work correctly.

--
Cheers,
Harry / Hyeonggon
> On 5/27/26 8:01 PM, tao wrote:
> > Design overview
> > ---------------
> >
> > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> > (for example during fork). VMAs that never participate in sharing can
> > avoid creating anon_vma structures entirely.
> >
> > Before an anon_vma exists, rmap operations rely directly on VMA
> > information, so no anon_vma locking is required. An anon_vma is
> > created and linked only when sharing semantics are required.
>
> It is unfortunate that the design overview doesn't cover the correctness aspect
> at all. VMAs are subject to change (even before being shared with other
> processes), and rmap needs something that doesn't go away across VMA
> merging, split, etc.
>
> I'm not sure how the idea is supposed to work correctly.
>
> --
> Cheers,
> Harry / Hyeonggon
VMA operations can be roughly divided into three categories. The handling
of ANON_VMA_LAZY is briefly described below.
1. fork
fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and is
not involved here.) This can be viewed as copying the VMAs with identical
virtual addresses into a new address space.
If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
regular anon_vma. The corresponding folio->mapping is then fixed in
try_dup_anon_rmap().
2. mmap / brk / mprotect / munmap
These operations create, modify, or remove VMAs in the current mm. They
may split existing VMAs, merge adjacent VMAs, or remove a VMA from mm_mt.
When a new VMA is created, vm_start, vm_end and vm_pgoff are initialized
and the VMA is inserted into mm_mt. Although these fields may later be
modified, the following value remains invariant:
    (vm_start - vm_pgoff * PAGE_SIZE)
We refer to this value as:
    vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE
This value also remains unchanged when the VMA is removed from mm_mt.
If a VMA is split and produces new_vma, the following holds:
    vma_mapping_base(new_vma) == vma_mapping_base(vma)
If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
    vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
    vma_mapping_base(vma_x)
Assume the VMA where the first page fault occurs is called root_vma, and
ensure that any VMA produced by split or merge holds a reference to
root_vma.
During rmap we can compute the folio address using root_vma:
    vma_address(vma, pgoff, 1) =
        vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
      = vma_mapping_base(vma) + pgoff * PAGE_SIZE
      = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
We can then use folio_addr to locate the VMA covering this folio.
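
The split and merge identities in point 2 can be checked with a small
userspace sketch. This is only an illustration under stated assumptions:
struct toy_vma, toy_mapping_base(), toy_vma_address(), toy_split() and
toy_can_merge() are hypothetical stand-ins, not kernel code, and a 4 KiB
page size is assumed. It covers split and merge only and says nothing
about locking or VMA lifetime.

```c
#include <assert.h>

/* Assumed 4 KiB pages; the real PAGE_SHIFT is architecture-dependent. */
#define TOY_PAGE_SHIFT 12UL
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Hypothetical stand-in for the three vm_area_struct fields used above. */
struct toy_vma {
    unsigned long vm_start;  /* first byte of the range */
    unsigned long vm_end;    /* one past the last byte */
    unsigned long vm_pgoff;  /* page offset corresponding to vm_start */
};

/* vm_start - vm_pgoff * PAGE_SIZE, in (wrapping) unsigned arithmetic. */
unsigned long toy_mapping_base(const struct toy_vma *vma)
{
    return vma->vm_start - vma->vm_pgoff * TOY_PAGE_SIZE;
}

/* Address of page offset pgoff inside vma. */
unsigned long toy_vma_address(const struct toy_vma *vma, unsigned long pgoff)
{
    return vma->vm_start + ((pgoff - vma->vm_pgoff) << TOY_PAGE_SHIFT);
}

/* Split at addr: vma keeps [vm_start, addr), the returned VMA is [addr, vm_end). */
struct toy_vma toy_split(struct toy_vma *vma, unsigned long addr)
{
    struct toy_vma tail;

    assert(addr > vma->vm_start && addr < vma->vm_end);
    assert((addr & (TOY_PAGE_SIZE - 1)) == 0);

    tail.vm_start = addr;
    tail.vm_end = vma->vm_end;
    /* The tail's pgoff advances by the number of pages left in the head. */
    tail.vm_pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> TOY_PAGE_SHIFT);
    vma->vm_end = addr;
    return tail;
}

/* Adjacent VMAs are mergeable here only if their page offsets are contiguous. */
int toy_can_merge(const struct toy_vma *a, const struct toy_vma *b)
{
    return a->vm_end == b->vm_start &&
           b->vm_pgoff == a->vm_pgoff +
                          ((a->vm_end - a->vm_start) >> TOY_PAGE_SHIFT);
}
```

In this toy model the pgoff-contiguity check in toy_can_merge() is the same
condition as the two VMAs having equal toy_mapping_base() values.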
3. mremap / uffd_move
If only the size changes and the start address remains the same, there
is no impact.
If the start address changes, the page is moved from (vma, addr) to
(new_vma, new_addr). In this case:
    vma_mapping_base(new_vma) =
        vma_mapping_base(vma) + new_addr - old_addr
We first upgrade the VMA, and then fix folio->mapping in move_ptes().
If performance becomes a concern, ANON_VMA_LAZY can be enabled only for
relatively small VMAs.
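
The mremap case in point 3 can be illustrated the same way. The sketch
below is a userspace toy under stated assumptions: struct toy_vma,
toy_mapping_base(), toy_move() and toy_needs_upgrade() are hypothetical
names, a 4 KiB page size is assumed, and the move is modelled as "vm_start
changes, vm_pgoff stays the same". It only shows that the base shifts by
new_addr - old_addr; it does not model the upgrade or the folio->mapping
fixup.

```c
#include <assert.h>

/* Assumed 4 KiB pages; the real PAGE_SHIFT is architecture-dependent. */
#define TOY_PAGE_SHIFT 12UL
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Hypothetical stand-in for the three vm_area_struct fields used above. */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_pgoff;
};

/* vm_start - vm_pgoff * PAGE_SIZE, in (wrapping) unsigned arithmetic. */
unsigned long toy_mapping_base(const struct toy_vma *vma)
{
    return vma->vm_start - vma->vm_pgoff * TOY_PAGE_SIZE;
}

/* Model of moving a whole VMA to new_addr: the size and vm_pgoff are kept. */
struct toy_vma toy_move(const struct toy_vma *vma, unsigned long new_addr)
{
    struct toy_vma moved;

    assert((new_addr & (TOY_PAGE_SIZE - 1)) == 0);
    moved.vm_start = new_addr;
    moved.vm_end = new_addr + (vma->vm_end - vma->vm_start);
    moved.vm_pgoff = vma->vm_pgoff;
    return moved;
}

/*
 * Nonzero when an address derived from the old base would be stale for the
 * new VMA, i.e. when the lazy state could no longer be used as-is.
 */
int toy_needs_upgrade(const struct toy_vma *old_vma, const struct toy_vma *new_vma)
{
    return toy_mapping_base(old_vma) != toy_mapping_base(new_vma);
}
```

A resize that keeps vm_start (only vm_end changes) leaves the base
untouched in this model, which matches the "no impact" case above.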
> During rmap we can compute the folio address using root_vma:
>
> vma_address(vma, pgoff, 1) =
>     vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
>   = vma_mapping_base(vma) + pgoff * PAGE_SIZE
>   = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
>

It is inconsistent here. The offset should remain pgoff throughout. It
should be:

vma_address(vma, pgoff, 1) =
    vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
  = vma_mapping_base(vma) + pgoff * PAGE_SIZE
  = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE
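
As a rough illustration of the lookup this formula is meant to feed
(compute an address from the root base and pgoff, then find the VMA that
covers it), here is a userspace sketch. Everything in it is an assumption
made for illustration: struct toy_vma, toy_find_vma() and
toy_rmap_lookup() are hypothetical, a sorted array with binary search
stands in for the mm_mt lookup, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 4 KiB pages; the real PAGE_SHIFT is architecture-dependent. */
#define TOY_PAGE_SHIFT 12UL

/* Hypothetical stand-in for the three vm_area_struct fields used above. */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_pgoff;
};

/*
 * Find the VMA covering addr in an array sorted by vm_start with
 * non-overlapping ranges. Returns NULL when addr falls in a hole.
 */
const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, size_t n,
                                   unsigned long addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (addr < vmas[mid].vm_start)
            hi = mid;
        else if (addr >= vmas[mid].vm_end)
            lo = mid + 1;
        else
            return &vmas[mid];
    }
    return NULL;
}

/* address = root_base + pgoff * PAGE_SIZE, then look up the covering VMA. */
const struct toy_vma *toy_rmap_lookup(unsigned long root_base,
                                      unsigned long pgoff,
                                      const struct toy_vma *vmas, size_t n)
{
    unsigned long addr = root_base + (pgoff << TOY_PAGE_SHIFT);

    return toy_find_vma(vmas, n, addr);
}
```

A NULL result corresponds to the range having been unmapped since the page
was faulted in. The sketch ignores locking and moved VMAs entirely.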
On Wed, Jun 03, 2026 at 02:59:04AM +0000, wangtao wrote:
> > On 5/27/26 8:01 PM, tao wrote:
> > > Design overview
> > > ---------------
> > >
> > > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> > > (for example during fork). VMAs that never participate in sharing can
> > > avoid creating anon_vma structures entirely.
> > >
> > > Before an anon_vma exists, rmap operations rely directly on VMA
> > > information, so no anon_vma locking is required. An anon_vma is
> > > created and linked only when sharing semantics are required.
> >
> > It is unfortunate that the design overview doesn't cover the correctness aspect
> > at all. VMAs are subject to change (even before being shared with other
> > processes), and rmap needs something that doesn't go away across VMA
> > merging, split, etc.
> >
> > I'm not sure how the idea is supposed to work correctly.
> >
> > --
> > Cheers,
> > Harry / Hyeonggon
>

Against my better judgment I'll address the stuff here...

> VMA operations can be roughly divided into three categories. The handling
> of ANON_VMA_LAZY is briefly described below.

I don't agree, there are plenty more VMA operations. But with respect to
anon rmap there are:

- fork
- merge/split
- remap

Your approach seems to completely ignore VMA split and the need to maintain
an interval tree to _multiple_ VMAs from a single anon_vma.

You may also actually split a VMA against a single large folio (waiting on
the deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is
mapped in two places.

The lazy approach doesn't seem to address this properly. And fatally it ties
an actual VMA afaict to the folio and has to implement a VMA reference count
mechanism which interferes with the ordinary VMA lifecycle to do it.

The fact of us taking advantage of most stuff being AnonExclusive, i.e.
'leaves' is something that my approach is exactly taking into account.

Of course also extending anon_vma is a real non-starter.
Also the below + the series ignores MAP_PRIVATE file-backed mappings which
is a pretty fatal flaw.

It also, as Harry says, has zero description of correctness in a way we'd
want and no tests.

>
> 1. fork
>
> fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and is
> not involved here.) This can be viewed as copying the VMAs with identical
> virtual addresses into a new address space.
>
> If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> regular anon_vma. The corresponding folio->mapping is then fixed in
> try_dup_anon_rmap().

And so we make fork, a very sensitive path in the kernel, more expensive.

I also question the locking situation with the conversion mentioned,
updating folios in this manner is extremely difficult.

>
> 2. mmap / brk / mprotect / munmap
>
> These operations create, modify, or remove VMAs in the current mm. They
> may split existing VMAs, merge adjacent VMAs, or remove a VMA from mm_mt.

mmap and brk are not at all relevant to anon_vma, as no anon_vma is assigned
upon mapping. It's on fault.

mprotect/mlock/munmap/etc. might split, but I don't see how the lazy
approach in any way addresses any of that.

>
> When a new VMA is created, vm_start, vm_end and vm_pgoff are initialized
> and the VMA is inserted into mm_mt. Although these fields may later be
> modified, the following value remains invariant:
>
> (vm_start - vm_pgoff * PAGE_SIZE)

Err no it doesn't at all?

If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.

Then if I remap it, vm_start changes, vm_pgoff stays the same, so:

vm_start - vm_pgoff * PAGE_SIZE

Changes right? And then that becomes essentially the offset from where it
was faulted in.

>
> We refer to this value as:
>
> vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE

This is mysteriously close to being the offset I mention in my CoW context
work...

I'm not sure what 'mapping base' means here.

>
> This value also remains unchanged when the VMA is removed from mm_mt.
Why does it matter what this value is on unmap?

>
> If a VMA is split and produces new_vma, the following holds:
>
> vma_mapping_base(new_vma) == vma_mapping_base(vma)

This is a roundabout way of saying we offset the vma->vm_pgoff after split.

>
> If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
>
> vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
> vma_mapping_base(vma_x)

This is just a roundabout way of saying the pgoff has to be aligned.

>
> Assume the VMA where the first page fault occurs is called root_vma, and
> ensure that any VMA produced by split or merge holds a reference to
> root_vma.

But this VMA can be unmapped later? Or remapped?

Holding on to a VMA and treating it as some kind of canonical reference with
a reference count completely changes what VMAs are, impacts the VMA
lifecycle, and produces unwanted memory overhead in itself.

It also raises concerns and issues around lock order which is very
sensitive.

>
> During rmap we can compute the folio address using root_vma:
>
> vma_address(vma, pgoff, 1) =

What are the parameters here? What's 1?

> vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
>
> We can then use folio_addr to locate the VMA covering this folio.

I'm really confused by this, you're kind of mixing and matching parameters
here.

What I think you're saying is that, if a folio hasn't been remapped, you can
figure out its address based on page offset.

That's completely broken for MAP_PRIVATE file-backed mappings which also use
anon_vma and also have to keep on working.

It seems that for the lazy approach what you are doing is essentially
caching the 'root' VMA in the folio. But this doesn't account for large
folios and split VMAs.
Even if you disabled it for those cases (which adds a ton of complexity in
itself), you then have issues with locking - the anon_vma lock has to take a
lock (that cannot be a VMA-level lock - results in lock inversion) even on
these leaf entries, or you break locking.

And we can't reasonably start pinning VMAs and using them as a sort of proto
cached thing on top of the existing anon_vma logic.

You also then need to, on remap, undo all this, which requires updating
folio->mapping on remap, something I tried doing previously myself, but
that's fraught with issues around lock inversion itself.

>
> 3. mremap / uffd_move

userfaultfd moving is not relevant as it actually updates the folio
correctly.

>
> If only the size changes and the start address remains the same, there
> is no impact.
>
> If the start address changes, the page is moved from (vma, addr) to
> (new_vma, new_addr). In this case:
>
> vma_mapping_base(new_vma) =
> vma_mapping_base(vma) + new_addr - old_addr

You say above that the mapping base never changes? But here it changes?

>
> We first upgrade the VMA, and then fix folio->mapping in move_ptes().

What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a
'normal' one. As above, this is fraught with lock inversion issues.

>
> If performance becomes a concern, ANON_VMA_LAZY can be enabled only for
> relatively small VMAs.

I think you've got serious correctness, lock management and complexity
issues and it's all a non-starter as the costs deeply exceed the benefits.

This is one of the fundamental, frustrating aspects of the anon rmap - you
keep thinking that 'surely' you can do sensible thing X, but it turns out
you can't for various annoying reasons.

It's one of the reasons it's really fraught for somebody coming to make
changes, and one of the reasons why I am very keen on fundamentally changing
it.
And also on a not-wasting-time basis - I was already working in parallel on
a rework here, so I think the civil thing is to at least wait for my work
before issuing alternative solutions.

Thanks, Lorenzo
> Against my better judgment I'll address the stuff here...
>
> > VMA operations can be roughly divided into three categories. The
> > handling of ANON_VMA_LAZY is briefly described below.
>
> I don't agree, there are plenty more VMA operations. But with respect to
> anon rmap there are:
>
> - fork
> - merge/split
> - remap
>

Yes, these are the three categories. I originally intended to explain them
by classifying based on system calls; I should have used mremap instead of
move_vma.

> Your approach seems to completely ignore VMA split and the need to
> maintain an interval tree to _multiple_ VMAs from a single anon_vma.
>

The folio uses vma->root_vma to compute folio_address. A VMA split from it,
vma_a, also uses vma_a->root_vma = vma->root_vma to compute folio_address.
During rmap, once folio_address is obtained, the VMA can be found through
mm_mt. Without fork, there is no need to maintain the interval tree.

> You may also actually split a VMA against a single large folio (waiting on the
> deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped
> in two places.
>
> The lazy approach doesn't seem to address this properly. And fatally it ties an
> actual VMA afaict to the folio and has to implement a VMA reference count
> mechanism which interferes with the ordinary VMA lifecycle to do it.
>
> The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> 'leaves' is something that my approach is exactly taking into account.
>
> Of course also extending anon_vma is a real non-starter.
>
> Also the below + the series ignores MAP_PRIVATE file-backed mappings
> which is a pretty fatal flaw.
>
> It also, as Harry says, has zero description of correctness in a way we'd want
> and no tests.
>

It can correctly handle the case where a VMA is split within a large page.
The address of a sub_page in the split VMA (vma_a or vma_b) is computed
using the following method. For COW anonymous pages originating from file
VMAs, the page/folio address is also computed using the same method.

subpage_address = vma_address(vma_a, subpage_pgoff, 1)
  = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
  = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
  = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
  = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE

> >
> > 1. fork
> >
> > fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and
> > is not involved here.) This can be viewed as copying the VMAs with
> > identical virtual addresses into a new address space.
> >
> > If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> > regular anon_vma. The corresponding folio->mapping is then fixed in
> > try_dup_anon_rmap().
>
> And so we make fork, a very sensitive path in the kernel more expensive.
>
> I also question the locking situation with the conversion mentioned, updating
> folios in this manner is extremely difficult.
>

rmap takes the PTE lock, while fork takes the mmap write lock, the VMA
write lock, and the PTE lock. Given the rule that folio->mapping can only
transition in one direction, from lazy_vma to a regular anon_vma, the
situation can be handled correctly even without taking the folio_lock.

When rmap and fork run concurrently:

If rmap observes folio->mapping as a regular anon_vma, there is obviously
no issue.

If rmap observes folio->mapping as lazy_vma, then rmap only processes the
parent's pvma. At the end of rmap_walk_anon(), if we see that
folio->mapping has changed to a regular anon_vma, we simply process it once
more. The various rmap_one implementations are idempotent anyway.
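
To make the check-and-retry idea concrete, here is a minimal userspace
sketch. It is not the code from the series and makes no claim about
whether the scheme is sufficient in the kernel: struct toy_mapping, struct
toy_folio, toy_rmap_walk() and the racing callback are all hypothetical,
C11 atomics stand in for whatever ordering the real code would need, and
the "concurrent fork" is simulated from inside the callback so the example
stays single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum toy_kind { TOY_LAZY_VMA, TOY_ANON_VMA };

/* Hypothetical stand-in for what folio->mapping points at. */
struct toy_mapping {
    enum toy_kind kind;
};

struct toy_folio {
    /* One-way transition only: lazy -> regular, never back. */
    _Atomic(const struct toy_mapping *) mapping;
};

typedef void (*toy_rmap_one)(struct toy_folio *folio,
                             const struct toy_mapping *seen, void *arg);

/*
 * Walk once with the mapping observed at entry, then re-read it. If it was
 * lazy and has since been upgraded, walk again with the new mapping.
 * Returns the number of passes (1 or 2 given the one-way transition).
 */
int toy_rmap_walk(struct toy_folio *folio, toy_rmap_one fn, void *arg)
{
    const struct toy_mapping *seen = atomic_load(&folio->mapping);
    int passes = 0;

    for (;;) {
        const struct toy_mapping *now;

        fn(folio, seen, arg);   /* assumed idempotent, as stated above */
        passes++;

        now = atomic_load(&folio->mapping);
        if (now == seen || seen->kind == TOY_ANON_VMA)
            break;
        seen = now;             /* upgraded underneath us: process once more */
    }
    return passes;
}

/* Test helper: simulates a fork upgrading the mapping during the first pass. */
struct toy_race {
    const struct toy_mapping *upgraded;
    int calls;
};

void toy_racing_rmap_one(struct toy_folio *folio,
                         const struct toy_mapping *seen, void *arg)
{
    struct toy_race *race = arg;

    race->calls++;
    if (seen->kind == TOY_LAZY_VMA)
        atomic_store(&folio->mapping, race->upgraded);
}
```

The loop terminates after at most two passes only because the transition
is assumed to be strictly one-directional.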
BTW: the commit message of patch 13 says a retry is needed, but the retry
handling was accidentally omitted in the posted patch.

> >
> > 2. mmap / brk / mprotect / munmap
> >
> > These operations create, modify, or remove VMAs in the current mm.
> > They may split existing VMAs, merge adjacent VMAs, or remove a VMA
> > from mm_mt.
>
> mmap and brk are not at all relevant to anon_vma, as no anon_vma is
> assigned upon mapping. It's on fault.
>

mmap()/brk() with a specified address may cause anonymous VMA merge or
split.

> mprotect/mlock/munmap/etc. might split, but I don't see how the lazy
> approach in any way addresses any of that.
>

As mentioned above, after the split, rmap still uses root_vma to compute
folio_address or page_address.

> >
> > When a new VMA is created, vm_start, vm_end and vm_pgoff are
> > initialized and the VMA is inserted into mm_mt. Although these fields
> > may later be modified, the following value remains invariant:
> >
> > (vm_start - vm_pgoff * PAGE_SIZE)
>
> Err no it doesn't at all?
>
> If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.
>
> Then if I remap it, vm_start changes, vm_pgoff stays the same, so:
>
> vm_start - vm_pgoff * PAGE_SIZE
>
> Changes right? And then that becomes essentially the offset from where it
> was faulted in.
>

If mremap modifies vm_start, i.e., move_vma, a new VMA will be created. This
corresponds exactly to the third point mentioned later: upgrading
anon_vma_lazy to a regular anon_vma and updating folio->mapping.
mremap时如果修改vm_start,即move_vma则创建新的vma,这正是我后边第三点说的: 将anon_vma_lazy升级成regular anon_vma并修改folio->mapping。 > > > > We refer to this value as: > > > > vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE > > This is mysteriously close to being the offset I mention in my CoW context > work... > > I'm not sure what 'mapping base' means here. > vma_addrss(vma, pgoff, nr_pages) = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE = vma_mapping_base(vma) + pgoff * PAGE_SIZE vma_mapping_base depends only on the VMA and is independent of the page. Alternatively, we could also call it vma_rmap_base. vma_mapping_base只和vma相关,和page无关,或者我们也可以叫他vma_rmap_base? > > > > This value also remains unchanged when the VMA is removed from > mm_mt. > > Why does it matter what this value is on unmap? > If root_vma is removed from mm_mt due to munmap, it will still remain valid as long as other VMAs hold references to it. root_vma如果被munmap从mm_mt中删除。其他vma持有引用,就仍有效。 > > > > If a VMA is split and produces new_vma, the following holds: > > > > vma_mapping_base(new_vma) == vma_mapping_base(vma) > > This is a roundabout way of saying we offset the vma->vm_pgoff after split. > > > > > If two adjacent VMAs vma_a and vma_b are merged into vma_x, then: > > > > vma_mapping_base(vma_a) == vma_mapping_base(vma_b) == > > vma_mapping_base(vma_x) > > This is just a roundabout way of saying the pgoff has to be aligned. > > > > > Assume the VMA where the first page fault occurs is called root_vma, > > and ensure that any VMA produced by split or merge holds a reference > > to root_vma. > > But this VMA can be unmapped later? Or remapped? > It can be unmapped. As mentioned earlier, if mremap modifies vm_start, a new VMA will be created. 
可以被munmap。前边说了mremap如果修改vm_start则创建新的vma。 > Holding on to a VMA and treating it as some kind of canonical reference with > a reference count completely changes what VMAs are, impacts the VMA > lifecycle, and produces unwanted memory overhead in itself. > During split/merge operations, we can try to preferentially use root_vma so as to avoid deleting it. 在split/merge时,我们可以尽量优先使用root_vma,避免删除root_vma。 > It also raises concerns and issues around lock order which is very sensitive. > Both rmap and fork acquire the PTE lock, which ensures that handling a page with respect to a particular VMA is atomic. There is no need to add folio_lock. When fork converts folio->mapping into a regular anon_vma, rmap_walk_anon can simply check and retry. rmap和fork时都要获取pte锁,可以确保rmap/fork在处理page的某个vma是原子的。 不需要增加folio_lock,当fork将folio->mapping变成regular anon_vma后,rmap_walk_anon检查retry即可。 > > > > During rmap we can compute the folio address using root_vma: > > > > vma_address(vma, pgoff, 1) = > > What's the parameters here? What's 1? > > > vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) > > = vma_mapping_base(vma) + pgoff * PAGE_SIZE > > = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE > > > > We can then use folio_addr to locate the VMA covering this folio. > I overlooked this earlier. We can unify it by using pgoff as follows. page_addr = vma_address(vma, pgoff, nr_pages) = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE = vma_mapping_base(vma) + pgoff * PAGE_SIZE = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE > I'm really confused by this, you're kind of mixing and match parameters here. > > What I think you're saying is that, if a folio hasn't been remapped, you can > figure out its address based on page offset. > > That's completely broken for MAP_PRIVATE file-backed mappings which also > use anon_vma and also have to keep on working. 
> > It seems that for the lazy approach what you are doing is essentially caching > the 'root' VMA in the folio. But this doesn't account for large folios and split > VMAs. > As mentioned earlier: subpage_address = vma_address(vma_a, subpage_pgoff, 1) = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE > Even if you disabled it for those cases (which adds a ton of complexity in > itself), you then have issues with locking - the anon_vma lock has to take a > lock (that cannot be a VMA-level lock - results in lock inversion) even on > these leaf entries, or you break locking. > When there is no fork/mremap, we do not need the interval tree or the anon_vma lock. 不fork/mremap时我们不需要interval tree,不需要anon_vma锁。 > And we can't reasonably start pinning VMAs and using them as a sort of > proto cached thing on top of the existing anon_vma logic. > In most cases, root_vma is actively used. Although it may be removed by munmap, overall it still saves memory. 大部分情况下root_vma都是在被使用的,当然可能被munmap删除,但是整体上节省内存的。 > You also then need to, on remap, undo all this, which requires updating > folio->mapping on remap, something I tried doing previously myself, but > that's fraught with issues around lock inversion itself. > > > > > 3. mremap / uffd_move > > userfaultfd moving is not relevant as it actually updates the folio correctly. > These two operations are different from the previous two types, as they modify the virtual address of the page/folio. 这两个操作和前两类不同,修改page/folio的虚拟地址。 > > > > If only the size changes and the start address remains the same, there > > is no impact. > > > > If the start address changes, the page is moved from (vma, addr) to > > (new_vma, new_addr). In this case: > > > > vma_mapping_base(new_vma) = > > vma_mapping_base(vma) + new_addr - old_addr > > You say above that the mapping base never changes? But here it changes? 
> For the newly created new_vma, vma_mapping_base(new_vma) is not equal to vma_mapping_base(vma), while vma_mapping_base(vma) itself remains unchanged. 新创建的new_vma的vma_mapping_base(new_vma) 不等于vma_mapping_base(vma),但是vma_mapping_base(vma)不变。 > > > > We first upgrade the VMA, and then fix folio->mapping in move_ptes(). > > What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a > 'normal' one. > > As above, this is fraught with lock inversion issues. > Yes, it upgrades from a lazy_vma to a regular anon_vma. As mentioned earlier, during this process we hold the mmap write lock, the vma write lock, and the pte lock, so acquiring the folio_lock is unnecessary. 是的,从lazy_vma升级成regular anon_vma。 如前边所说,这个过程中我们有mmap写锁、vma写锁和pte锁,可以不获取folio_lock。 > > > > If performance becomes a concern, ANON_VMA_LAZY can be enabled > only > > for relatively small VMAs. > > I think you've got serious correctness, lock management and complexity > issues and it's all a non-starter as the costs deeply exceed the benefits. > I think the approach is feasible: 1. During merge/split, the newly created vma_a satisfies vma_mapping_base(vma_a) == vma_mapping_base(vma) == vma_mapping_base(root_vma). Therefore, we can use root_vma to compute the virtual address of the folio/page mapped by vma_a. 2. During fork and mremap, we hold the mmap write lock, the vma write lock, and the pte lock. In particular, the pte lock ensures that rmap and fork operations on a folio/page within a specific vma are atomic. If folio->mapping is upgraded during rmap_walk_anon(folio), we can simply let rmap_walk_anon retry once. 我认为方案可行: 1. merge/split时新创建的vma_a有vma_mapping_base(vma_a) == vma_mapping_base(vma) == vma_mapping_base(root_vma) 所以我们可利用root_vma计算vma_a映射的folio/page的虚拟地址。 2. 
fork和mremap时我们持有mmap写锁、vma写锁和pte锁。 特别的pte锁能确保rmap和fork在folio/page在某个vma上的操作是原子的。 如果rmap_walk_anon(folio)过程中folio->mapping有升级变化,我们让rmap_walk_anon retry一次即可。 > This is one of the fundamental, frustrating aspects of the anon rmap - you > keep thinking that 'surely' you can do sensible thing X, but it turns out you > can't for various annoying reasons. > > It's one of the reasons it's really fraught for somebody coming to make > changes, and one of the reasons why I am very keen on fundamentally > changing it. > > And also on a not-wasting-time basis - I was already working in parallel on a > rework here, so I think the civil thing is to at least wait for my work before > issuing alternative solutions. > > Thanks, Lorenzo >
Thanks for your replies, but I really have to stop doing deeper analyses like
these for time management purposes.

I did this more so to make the point from [0] as to why, in lower trust
environments, this is just not feasible.

We could loop around for hours and hours and hours here.

In general as before, even if all worked perfectly (I'm very much not at all
convinced), extending anon_vma and pinning VMAs is simply a no-go for
architectural and complexity reasons.

I also find the locking story dubious and the lack of tests or anything
corroborating correctness is additionally fatal.

And finally, I was already working on a replacement for anon_vma, and the
generally done thing in these situations is for my work to take precedence.

So I'm going to bail out on further deeper analyses here as otherwise I simply
can't work on anything else :)

Thanks, Lorenzo

[0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/

On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > >
> >
> > Against my better judgment I'll address the stuff here...
> >
> > > VMA operations can be roughly divided into three categories. The
> > > handling of ANON_VMA_LAZY is briefly described below.
> >
> > I don't agree, there are plenty more VMA operations. But with respect to
> > anon rmap there are:
> >
> > - fork
> > - merge/split
> > - remap
> >
>
> Yes, these are the three categories. I originally intended to explain them
> by classifying based on system calls; I should have used mremap instead of
> move_vma.

I don't think you mentioned move_vma()? Maybe I missed it.

The categorisation is most usefully based on callers of anon_vma_clone().

>
> > Your approach seems to completely ignore VMA split and the need to
> > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> >
>
> The folio uses vma->root_vma to compute folio_address. A VMA split from it,
> vma_a, also uses vma_a->root_vma = vma->root_vma to compute folio_address.
> During rmap, once folio_address is obtained, the VMA can be found through > mm_mt. Without fork, there is no need to maintain the interval tree. Well you need to search for every possible split VMA in mm_mt now, so you have to go page-by-page searching for each page for the rmap walked range. You're also potentially racing against a remap, as you say below you don't folio lock on remap so concurrent rmap walkers can be present, the VMA can already be copied. We already have VMA lifecycle state around detached VMAs, so a VMA could be in a detached state, assumed by the existing logic to be entirely unavailable for use, out of the maple tree altogether but kept around in a zombie state. We'd then have lifecycle issues and races and edge cases around process teardown otherwise we might leak memory. Also, presumably you set vma->anon_vma to some lazy sentinel value so that mremap doesn't change vma->vm_pgoff when unfaulted? You would need to update any path that manipulates vma->anon_vma also so it doesn't incorrectly dereference it. > > folio使用vma->root_vma 计算folio_address;从vma拆分出的vma_a,使用vma_a->root_vma = vma->root_vma计算folio_address。 > rmap时得到folio_address就可以通过mm_mt查找到vma。 > 不fork就不需要维护interval tree。 > > > You may also actually split a VMA against a single large folio (waiting on the > > deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped > > in two places. > > > > The lazy approach doesn't seem to address this properly. And fatally it ties an > > actual VMA afaict to the folio and has to implement a VMA reference count > > mechanism which interferes with the ordinarily VMA lifecycle to do it. > > > > The fact of us taking advantage of most stuff being AnonExclusive, i.e. > > 'leaves' is something that my approach is exactly taking into account. > > > > Of course also extending anon_vma is a real non-starter. > > > > Also the below + the series ignores MAP_PRIVATE file-backed mappings > > which is a pretty fatal flaw. 
> > > > It also, as Harry says, has zero description of correctness in a way we'd want > > and no tests. > > > > 可以正确处理拆分vma在一个大页。拆分的vma_a或vma_b上的sub_page使用如下方式计算地址。 > 对于文件vma的cow 匿名页,也用同样方式计算page/folio地址。 > > It can correctly handle the case where a VMA is split within a large > page. The address of a sub_page in the split VMA (vma_a or vma_b) is > computed using the following method. > > For COW anonymous pages originating from file VMAs, the page/folio > address is also computed using the same method. > > subpage_address = vma_address(vma_a, subpage_pgoff, 1) > = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE > = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE > = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE > = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE OK but you want to walk entries in a _range_ in the interval tree. So you are then now looking up VMAs (in a racey way) using mm_mt (which is the whole basis of my work actually) which could change under you. I guess what you're doing is using the pinned 'root' VMA as the basis of everything, and the second a VMA is moved you (somehow) walk the page tables to update the folio->mapping. Again pinning the VMA like this and putting it in a folio is really not something we want to do. It adds a ton of complexity and also impacts VMA lifecycle which is already fairly fraught. It makes the VMA no longer just a VMA but rather also a 'memory' of where something was first faulted in as a hack more or less. > > > > > > > 1. fork > > > > > > fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap > > and > > > is not involved here.) This can be viewed as copying the VMAs with > > > identical virtual addresses into a new address space. > > > > > > If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a > > > regular anon_vma. The corresponding folio->mapping is then fixed in > > > try_dup_anon_rmap(). 
> > > > And so we make fork, a very sensitive path in the kernel more expensive. > > > > I also question the locking situation with the conversion mentioned, updating > > folios in this manner is extremely difficult. > > > > Because rmap takes the PTE lock, while fork takes the mmap write lock, > the VMA write lock, and the PTE lock. The PTE lock is not held for the duration of an anon_vma lock. You will break anything that needs to hold the anon_vma lock for the duration, e.g. migration. This is substantively the issue I am working on in my approach and as per https://ljs.io/scalable-cow-lsf.pdf you can see that's an open question that I am currently researching. > > Given the rule that folio->mapping can only transition in one direction > from lazy_vma to a regular anon_vma, the situation can be handled > correctly even without taking the folio_lock. Folio lock serialises against concurrent rmap walks, and you can end up reading a lazy_vma that later gets converted into an anon_vma concurrently. > > When rmap and fork run concurrently: > If rmap observes folio->mapping as a regular anon_vma, there is > obviously no issue. > If rmap observes folio->mapping as lazy_vma, then rmap only processes > the parent's pvma. At the end of rmap_walk_anon(), if we see that folio->mapping has > changed to a regular anon_vma, we simply process it once more. The > various rmap_one implementations are idempotent anyway. Hm this all seems very racey. > > BTW: the commit message of patch 13 says a retry is needed, but the > retry handling was accidentally omitted in the posted patch. 
:)) > > 因为rmap获取pte锁;fork时获取mmap写锁、vma写锁、pte锁。 > 只允许folio->mapping从lazy_vma单向变成regular anon_vma的原则,不获取folio_lock也可以正确处理。 > 当rmap和fork并发处理时: > 假如rmap看到的folio->mapping是regular anon_vma,显然没有问题。 > 假如rmap看到的folio->mapping是lazy_vma,则rmap只处理了父进程的pvma; > 我们在rmap_walk_anon结束时如果看到folio->mapping变成了regular anon_vma,则再来一次处理即可,毕竟各种rmap_one实现是幂等的。 > btw:patch 13的commit msg说要retry,但是发送的patch由于操作失误漏掉了重试处理。 > > > > > > > 2. mmap / brk / mprotect / munmap > > > > > > These operations create, modify, or remove VMAs in the current mm. > > > They may split existing VMAs, merge adjacent VMAs, or remove a VMA > > from mm_mt. > > > > mmap and brk are not at all relevant to anon_vma, as no anon_vma is > > assigned upon mapping. It's on fault. > > > mmap/brk 指定地址时可能导致匿名 VMA merge 或 split。 > > mmap()/brk() with a specified address may cause anonymous VMA merge or split. > > > mprotect/mlock/munmap/etc. might split, but I don't see how the lazy > > approach in any way addresses any of that. > > > 上边说了,split后rmap仍使用root_vma计算folio_address或page_address。 > > As mentioned above, after the split, rmap still uses root_vma to compute > folio_address or page_address. > > > > > > > When a new VMA is created, vm_start, vm_end and vm_pgoff are > > > initialized and the VMA is inserted into mm_mt. Although these fields > > > may later be modified, the following value remains invariant: > > > > > > (vm_start - vm_pgoff * PAGE_SIZE) > > > > Err no it doesn't at all? > > > > If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT. > > > > Then if I remap it, vm_start changes, vm_pgoff stays the same, so: > > > > vm_start - vm_pgoff * PAGE_SIZE > > > > Changes right? And then that becomes essentially the offset from where it > > was faulted in. > > > If mremap modifies vm_start, i.e., move_vma, a new VMA will be created. > This corresponds exactly to the third point mentioned later: upgrading > anon_vma_lazy to a regular anon_vma and updating folio->mapping. 
I think updating folio->mapping here is problematic, I know this because I worked on this very probably a year or so ago and found locking issues prevented this from being workable. > > mremap时如果修改vm_start,即move_vma则创建新的vma,这正是我后边第三点说的: > 将anon_vma_lazy升级成regular anon_vma并修改folio->mapping。 > > > > > > > We refer to this value as: > > > > > > vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE > > > > This is mysteriously close to being the offset I mention in my CoW context > > work... > > > > I'm not sure what 'mapping base' means here. > > > > vma_addrss(vma, pgoff, nr_pages) > = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) > = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE) > = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE > = vma_mapping_base(vma) + pgoff * PAGE_SIZE > > vma_mapping_base depends only on the VMA and is independent of the page. > Alternatively, we could also call it vma_rmap_base. > > vma_mapping_base只和vma相关,和page无关,或者我们也可以叫他vma_rmap_base? > > > > > > > This value also remains unchanged when the VMA is removed from > > mm_mt. > > > > Why does it matter what this value is on unmap? > > > If root_vma is removed from mm_mt due to munmap, it will still remain > valid as long as other VMAs hold references to it. Yeah this is something we don't want. > > root_vma如果被munmap从mm_mt中删除。其他vma持有引用,就仍有效。 > > > > > > > If a VMA is split and produces new_vma, the following holds: > > > > > > vma_mapping_base(new_vma) == vma_mapping_base(vma) > > > > This is a roundabout way of saying we offset the vma->vm_pgoff after split. > > > > > > > > If two adjacent VMAs vma_a and vma_b are merged into vma_x, then: > > > > > > vma_mapping_base(vma_a) == vma_mapping_base(vma_b) == > > > vma_mapping_base(vma_x) > > > > This is just a roundabout way of saying the pgoff has to be aligned. 
> > > > > > > > Assume the VMA where the first page fault occurs is called root_vma, > > > and ensure that any VMA produced by split or merge holds a reference > > > to root_vma. > > > > But this VMA can be unmapped later? Or remapped? > > > It can be unmapped. As mentioned earlier, if mremap modifies vm_start, > a new VMA will be created. But everything's racey? > > 可以被munmap。前边说了mremap如果修改vm_start则创建新的vma。 > > > > Holding on to a VMA and treating it as some kind of canonical reference with > > a reference count completely changes what VMAs are, impacts the VMA > > lifecycle, and produces unwanted memory overhead in itself. > > > During split/merge operations, we can try to preferentially use root_vma > so as to avoid deleting it. Adding yet more complexity and edge cases, we really cannot do that, sorry. > > 在split/merge时,我们可以尽量优先使用root_vma,避免删除root_vma。 > > > It also raises concerns and issues around lock order which is very sensitive. > > > Both rmap and fork acquire the PTE lock, which ensures that handling a page > with respect to a particular VMA is atomic. The PTE lock only locks the PTE. That wasn't the issue I was raising at all. See the top of rmap.c for lock ordering. There's substantial complexity there. > > There is no need to add folio_lock. > When fork converts folio->mapping into a regular anon_vma, > rmap_walk_anon can simply check and retry. This seems like it won't work. And again you're adding a lot of new complexity. > > rmap和fork时都要获取pte锁,可以确保rmap/fork在处理page的某个vma是原子的。 > 不需要增加folio_lock,当fork将folio->mapping变成regular anon_vma后,rmap_walk_anon检查retry即可。 > > > > > > > > During rmap we can compute the folio address using root_vma: > > > > > > vma_address(vma, pgoff, 1) = > > > > What's the parameters here? What's 1? 
> > > > > vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) > > > = vma_mapping_base(vma) + pgoff * PAGE_SIZE > > > = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE > > > > > > We can then use folio_addr to locate the VMA covering this folio. > > > > I overlooked this earlier. We can unify it by using pgoff as follows. > > page_addr = vma_address(vma, pgoff, nr_pages) > = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) > = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE) > = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE > = vma_mapping_base(vma) + pgoff * PAGE_SIZE > = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE It's really just saying > > > > I'm really confused by this, you're kind of mixing and match parameters here. > > > > What I think you're saying is that, if a folio hasn't been remapped, you can > > figure out its address based on page offset. > > > > That's completely broken for MAP_PRIVATE file-backed mappings which also > > use anon_vma and also have to keep on working. > > > > It seems that for the lazy approach what you are doing is essentially caching > > the 'root' VMA in the folio. But this doesn't account for large folios and split > > VMAs. > > > As mentioned earlier: > subpage_address = vma_address(vma_a, subpage_pgoff, 1) > = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE > = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE I'm not sure what these > > > Even if you disabled it for those cases (which adds a ton of complexity in > > itself), you then have issues with locking - the anon_vma lock has to take a > > lock (that cannot be a VMA-level lock - results in lock inversion) even on > > these leaf entries, or you break locking. > > > When there is no fork/mremap, we do not need the interval tree or the anon_vma lock. We need to stabilise across the VMAs. 
> > 不fork/mremap时我们不需要interval tree,不需要anon_vma锁。 > > > And we can't reasonably start pinning VMAs and using them as a sort of > > proto cached thing on top of the existing anon_vma logic. > > > > In most cases, root_vma is actively used. > Although it may be removed by munmap, overall it still saves memory. For what workloads? Where? How? It's adding complexity we can't have. > > 大部分情况下root_vma都是在被使用的,当然可能被munmap删除,但是整体上节省内存的。 > > > You also then need to, on remap, undo all this, which requires updating > > folio->mapping on remap, something I tried doing previously myself, but > > that's fraught with issues around lock inversion itself. > > > > > > > > 3. mremap / uffd_move > > > > userfaultfd moving is not relevant as it actually updates the folio correctly. > > > These two operations are different from the previous two types, > as they modify the virtual address of the page/folio. > > 这两个操作和前两类不同,修改page/folio的虚拟地址。 > > > > > > > If only the size changes and the start address remains the same, there > > > is no impact. > > > > > > If the start address changes, the page is moved from (vma, addr) to > > > (new_vma, new_addr). In this case: > > > > > > vma_mapping_base(new_vma) = > > > vma_mapping_base(vma) + new_addr - old_addr > > > > You say above that the mapping base never changes? But here it changes? > > > > For the newly created new_vma, vma_mapping_base(new_vma) is not equal to vma_mapping_base(vma), > while vma_mapping_base(vma) itself remains unchanged. > > 新创建的new_vma的vma_mapping_base(new_vma) 不等于vma_mapping_base(vma),但是vma_mapping_base(vma)不变。 > > > > > > > We first upgrade the VMA, and then fix folio->mapping in move_ptes(). > > > > What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a > > 'normal' one. > > > > As above, this is fraught with lock inversion issues. > > > Yes, it upgrades from a lazy_vma to a regular anon_vma. 
> As mentioned earlier, during this process we hold the mmap write lock, the vma write lock, > and the pte lock, so acquiring the folio_lock is unnecessary. What's preventing a concurrent rmap walk? > > 是的,从lazy_vma升级成regular anon_vma。 > 如前边所说,这个过程中我们有mmap写锁、vma写锁和pte锁,可以不获取folio_lock。 > > > > > > > If performance becomes a concern, ANON_VMA_LAZY can be enabled > > only > > > for relatively small VMAs. > > > > I think you've got serious correctness, lock management and complexity > > issues and it's all a non-starter as the costs deeply exceed the benefits. > > > > I think the approach is feasible: For complexity and architectural reasons it's not. > > 1. During merge/split, the newly created vma_a satisfies > vma_mapping_base(vma_a) == vma_mapping_base(vma) == > vma_mapping_base(root_vma). Therefore, we can use root_vma to > compute the virtual address of the folio/page mapped by vma_a. I don't love these formulas. You're storing the originally faulted-in address in a VMA that you've pinned for the purpose of that. If you happen to merge right a lot of times you keep around dead VMAs just for this purpose. We're not having VMAs have a dual-role as a 'store of the address first faulted in' as well as being a virtual memory range. > > 2. During fork and mremap, we hold the mmap write lock, the vma > write lock, and the pte lock. In particular, the pte lock ensures > that rmap and fork operations on a folio/page within a specific > vma are atomic. If folio->mapping is upgraded during How does a lock exclude something that doesn't also hold that lock? This is also adding _yet more_ complexity and subtlety. It's really a hack. > rmap_walk_anon(folio), we can simply let rmap_walk_anon retry > once. Again, 'just repeating' if something changes like this without proper serialisation is not sufficient. > > > 我认为方案可行: > 1. merge/split时新创建的vma_a有vma_mapping_base(vma_a) == vma_mapping_base(vma) == vma_mapping_base(root_vma) > 所以我们可利用root_vma计算vma_a映射的folio/page的虚拟地址。 > 2. 
fork和mremap时我们持有mmap写锁、vma写锁和pte锁。 > 特别的pte锁能确保rmap和fork在folio/page在某个vma上的操作是原子的。 > 如果rmap_walk_anon(folio)过程中folio->mapping有升级变化,我们让rmap_walk_anon retry一次即可。 > > > This is one of the fundamental, frustrating aspects of the anon rmap - you > > keep thinking that 'surely' you can do sensible thing X, but it turns out you > > can't for various annoying reasons. > > > > It's one of the reasons it's really fraught for somebody coming to make > > changes, and one of the reasons why I am very keen on fundamentally > > changing it. > > > > And also on a not-wasting-time basis - I was already working in parallel on a > > rework here, so I think the civil thing is to at least wait for my work before > > issuing alternative solutions. > > > > Thanks, Lorenzo > >
>
> Thanks for your replies, but I really have to stop doing deeper analyses like
> these for time management purposes.
Of course I will respond to technical discussions.
>
> I did this more so to make the point from [0] as to why, in lower trust
> environments, this is just not feasible.
>
> We could loop around for hours and hours and hours here.
>
> In general as before, even if all worked perfectly (I'm very much not at all
> convinced), extending anon_vma and pinning VMAs is simply a no-go for
> architectural and complexity reasons.
>
> I also find the locking story dubious and the lack of tests or anything
> corroborating correctness is additionally fatal.
>
During rmap, the anon_vma provides a superset of the VMAs that may map
the folio. We first filter with vma_address(), and then each rmap_one
callback further checks whether the VMA needs to be processed through
page_vma_mapped_walk() and check_pte().

The lazy VMA used by ANON_VMA_LAZY provides only one VMA: if there is
no fork or mremap, this single VMA is sufficient. To avoid taking
the folio_lock during fork and mremap, we retry once after
rmap_walk_anon() if folio->mapping has been upgraded to an anon_vma.

If your concern is about the lack of locking during rmap, we could
also add a set of anon_vma_locks modelled on folio_wait_table. That
was how I handled it during my initial debugging. Later, after
reviewing the code flow, I found that the lock might not be necessary,
so I removed it.
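
To make the check-and-retry flow above concrete, here is a minimal userspace sketch. It models control flow only and is not code from the series: `folio_model`, `MAPPING_LAZY`, `walk_with_retry()` and the two example walkers are hypothetical names, and the sketch says nothing about whether a retry is sufficient serialisation in the kernel, which is the open question in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAPPING_LAZY 0x1UL	/* hypothetical tag bit: mapping is still lazy */

/* Stand-in for the one folio field this sketch cares about. */
struct folio_model {
	_Atomic uintptr_t mapping;
};

typedef void (*walk_fn)(struct folio_model *folio, uintptr_t mapping, void *arg);

static bool mapping_is_lazy(uintptr_t mapping)
{
	return (mapping & MAPPING_LAZY) != 0;
}

/*
 * Walk once using the mapping value sampled up front. If that value was
 * lazy and has changed by the end of the walk, walk once more with the
 * new value. Returns the number of passes made (1 or 2).
 */
static int walk_with_retry(struct folio_model *folio, walk_fn walk, void *arg)
{
	uintptr_t seen = atomic_load(&folio->mapping);
	int passes = 1;

	walk(folio, seen, arg);

	if (mapping_is_lazy(seen)) {
		uintptr_t now = atomic_load(&folio->mapping);

		if (now != seen) {
			/* The transition is assumed to be one-way: lazy -> regular. */
			assert(!mapping_is_lazy(now));
			walk(folio, now, arg);
			passes++;
		}
	}
	return passes;
}

/* Example walker: counts calls, as if nothing ran concurrently. */
static void walk_quiet(struct folio_model *folio, uintptr_t mapping, void *arg)
{
	(void)folio;
	(void)mapping;
	++*(int *)arg;
}

/* Example walker: counts calls and simulates a fork upgrading the mapping. */
static void walk_racing_fork(struct folio_model *folio, uintptr_t mapping, void *arg)
{
	++*(int *)arg;
	if (mapping_is_lazy(mapping))
		atomic_store(&folio->mapping, mapping & ~MAPPING_LAZY);
}
```

The second pass is only safe if every rmap_one callback is idempotent, which is the assumption stated earlier in this thread.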
> And finally, I was already working on a replacement for anon_vma, and the
> generally done thing in these situations is for my work to take precedence.
>
> So I'm going to bail out on further deeper analyses here as otherwise I simply
> can't work on anything else :)
>
> Thanks, Lorenzo
>
> [0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/
>
>
> On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > > >
> > >
> > > Against my better judgment I'll address the stuff here...
> > >
> > > > VMA operations can be roughly divided into three categories. The
> > > > handling of ANON_VMA_LAZY is briefly described below.
> > >
> > > I don't agree, there are plenty more VMA operations. But with
> > > respect to anon rmap there are:
> > >
> > > - fork
> > > - merge/split
> > > - remap
> > >
> >
> > Yes, these are the three categories. I originally intended to explain
> > them by classifying based on system calls; I should have used mremap
> > instead of move_vma.
>
> I don't think you mentioned move_vma()? Maybe I missed it.
>
> The categorisation is most usefully based on callers of anon_vma_clone().
>
> >
> > > Your approach seems to completely ignore VMA split and the need to
> > > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> > >
> >
> > The folio uses vma->root_vma to compute folio_address. A VMA split
> > from it, vma_a, also uses vma_a->root_vma = vma->root_vma to compute
> > folio_address.
> > During rmap, once folio_address is obtained, the VMA can be found
> > through mm_mt. Without fork, there is no need to maintain the interval
> > tree.
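
As an aside on the quoted claim, the arithmetic behind it can be checked in a small userspace sketch: splitting a VMA adjusts vm_pgoff so that vm_start - vm_pgoff * PAGE_SIZE is the same for both halves. `vma_model`, `split_vma_model()` and `page_address_from_base()` are hypothetical stand-ins, 4 KiB pages are assumed, and the sketch covers only split, not mremap or any of the locking concerns raised in this thread.

```c
#include <assert.h>

#define PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Stand-in for the three vm_area_struct fields used in the formula. */
struct vma_model {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE */
static unsigned long vma_mapping_base(const struct vma_model *vma)
{
	return vma->vm_start - (vma->vm_pgoff << PAGE_SHIFT);
}

/*
 * Split at a page-aligned address inside the VMA: vma keeps
 * [vm_start, addr) and tail receives [addr, vm_end) with vm_pgoff
 * advanced by the number of pages that stay in vma.
 */
static void split_vma_model(struct vma_model *vma, unsigned long addr,
			    struct vma_model *tail)
{
	assert(addr > vma->vm_start && addr < vma->vm_end);
	assert((addr & ((1UL << PAGE_SHIFT) - 1)) == 0);

	tail->vm_start = addr;
	tail->vm_end = vma->vm_end;
	tail->vm_pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
	vma->vm_end = addr;
}

/* Address of the page at pgoff, computed from a mapping base alone. */
static unsigned long page_address_from_base(unsigned long base,
					    unsigned long pgoff)
{
	return base + (pgoff << PAGE_SHIFT);
}
```

The subtraction may wrap in unsigned arithmetic, which is harmless here because adding pgoff * PAGE_SIZE back wraps the other way.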
>
> Well you need to search for every possible split VMA in mm_mt now, so you
> have to go page-by-page searching for each page for the rmap walked range.
>
ANON_VMA_LAZY has only one VMA. When I first looked at
rmap_walk_ksm, I also thought it would need to search page by page,
which seemed unacceptable. Later I realized that it only needs to
check whether this VMA falls within the rmap walk range.
@@ -3173,20 +3171,20 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
-		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
+		anon_rmap_foreach_vma(vma, vmac, anon_rmap,
 					       0, ULONG_MAX) {
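
A rough userspace illustration of what "check whether this VMA falls within the rmap walk range" means when there is exactly one candidate VMA: a single interval-overlap test on page offsets replaces the tree lookup. `vma_model` and `vma_overlaps_pgoff_range()` are hypothetical names, 4 KiB pages are assumed, and this is not the body of anon_rmap_foreach_vma() from the series.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Stand-in for the three vm_area_struct fields used here. */
struct vma_model {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* Page offset of the last page covered by the VMA. */
static unsigned long vma_last_pgoff(const struct vma_model *vma)
{
	assert(vma->vm_end > vma->vm_start);
	return vma->vm_pgoff +
	       ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
}

/* True if the VMA covers at least one page offset in [first, last]. */
static bool vma_overlaps_pgoff_range(const struct vma_model *vma,
				     unsigned long first, unsigned long last)
{
	return vma->vm_pgoff <= last && vma_last_pgoff(vma) >= first;
}
```

With the 0..ULONG_MAX range used in the hunk above, the test is trivially true, so the single lazy VMA is always visited.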
> You're also potentially racing against a remap, as you say below you don't
> folio lock on remap so concurrent rmap walkers can be present, the VMA can
> already be copied.
>
> We already have VMA lifecycle state around detached VMAs, so a VMA
> could be in a detached state, assumed by the existing logic to be entirely
> unavailable for use, out of the maple tree altogether but kept around in a
> zombie state.
>
> We'd then have lifecycle issues and races and edge cases around process
> teardown otherwise we might leak memory.
>
> Also, presumably you set vma->anon_vma to some lazy sentinel value so
> that mremap doesn't change vma->vm_pgoff when unfaulted?
>
> You would need to update any path that manipulates vma->anon_vma also
> so it doesn't incorrectly dereference it.
>
Yes, most of the code in this patch series is intended to prevent
incorrect dereferencing of anon_vma. If we assume it will not be
misused, some of the code could be simplified or removed.
> >
> > > You may also actually split a VMA against a single large folio
> > > (waiting on the deferred shrinker) and have a SINGLE _leaf_
> > > anonymous folio that is mapped in two places.
> > >
> > > The lazy approach doesn't seem to address this properly. And fatally
> > > it ties an actual VMA afaict to the folio and has to implement a VMA
> > > reference count mechanism which interferes with the ordinarily VMA
> > > lifecycle to do it.
> > >
> > > The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> > > 'leaves' is something that my approach is exactly taking into account.
> > >
> > > Of course also extending anon_vma is a real non-starter.
> > >
> > > Also the below + the series ignores MAP_PRIVATE file-backed mappings
> > > which is a pretty fatal flaw.
> > >
> > > It also, as Harry says, has zero description of correctness in a way
> > > we'd want and no tests.
> > >
> >
> > It can correctly handle the case where a VMA is split within a large
> > page. The address of a sub_page in the split VMA (vma_a or vma_b) is
> > computed using the following method.
> >
> > For COW anonymous pages originating from file VMAs, the page/folio
> > address is also computed using the same method.
> >
> > subpage_address
> >   = vma_address(vma_a, subpage_pgoff, 1)
> >   = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
> >   = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE
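
The last step of the derivation above relies on a split leaving `vm_start - vm_pgoff * PAGE_SIZE` unchanged. A minimal userspace sketch of that invariant follows. `struct toy_vma`, `toy_mapping_base()`, `toy_address()` and `toy_split()` are hypothetical names for illustration (the thread's `vma_mapping_base` is modelled, not quoted), and the range checks done by the kernel's `vma_address()` are omitted.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* vm_start - vm_pgoff * PAGE_SIZE, the "mapping base" of the derivation. */
static unsigned long toy_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - (vma->vm_pgoff << PAGE_SHIFT);
}

/* Address of the page at pgoff, computed from the mapping base alone. */
static unsigned long toy_address(const struct toy_vma *vma, unsigned long pgoff)
{
	return toy_mapping_base(vma) + (pgoff << PAGE_SHIFT);
}

/* Split at addr: *vma keeps [vm_start, addr), *tail gets [addr, vm_end). */
static void toy_split(struct toy_vma *vma, unsigned long addr,
		      struct toy_vma *tail)
{
	assert(addr > vma->vm_start && addr < vma->vm_end);
	assert((addr & (PAGE_SIZE - 1)) == 0);

	tail->vm_start = addr;
	tail->vm_end = vma->vm_end;
	/* The tail's pgoff advances by the pages left in the head. */
	tail->vm_pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
	vma->vm_end = addr;
}
```

Because `vm_start` and `vm_pgoff` advance together at a split, both halves report the same base, so an address computed through either half (or through the unsplit original) is identical. The sketch says nothing about the locking and lifecycle concerns raised in the reply.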
>
> OK but you want to walk entries in a _range_ in the interval tree.
>
> So you are then now looking up VMAs (in a racey way) using mm_mt (which
> is the whole basis of my work actually) which could change under you.
>
> I guess what you're doing is using the pinned 'root' VMA as the basis of
> everything, and the second a VMA is moved you (somehow) walk the page
> tables to update the folio->mapping.
>
> Again pinning the VMA like this and putting it in a folio is really not something
> we want to do.
>
> It adds a ton of complexity and also impacts VMA lifecycle which is already
> fairly fraught.
>
> It makes the VMA no longer just a VMA but rather also a 'memory' of where
> something was first faulted in as a hack more or less.
>
Maybe you're right. mm/mm_mt/vma/pagetable each have their own roles
in implementing VM. Perhaps considering them together could lead to
better ideas.
OK I've had a look through more thoroughly now and:

NAK and NAK any approach like this.

Not only is this structurally all wrong, it does some insane stuff (pinning
VMAs - no), the RCU usage is highly dubious and I suspect you've completely
broken the anon rmap for things like migration, or have at least added very
dubious edge cases.

You've added insane complexity, and also have failed to add even
perfunctory tests, which is also totally unacceptable.

The implementation is wrong, and the approach is wrong - we do not want to
extend or build on anon_vma. So this is unmergeable, or any approach like it.

I also, unfortunately, strongly suspect AI here. The turn of phrase, and poor
commit messages, you doing this out of nowhere with absolutely no rmap
experience before, your total lack of communication before.

Claude puts the probability of heavy AI usage at 85-90%, and I'm pretty
convinced. Either way it's utterly unmergeable but that you (likely) used AI to
generate this much work for us makes me actually pretty annoyed.

As a result, I would strongly suggest you no longer submit patches for the
reverse mapping part of mm, as there is now a real lack of trust.

If you wish to rebuild that, I suggest you _discuss_ concepts and ideas, e.g.
send stuff on-list with a [DISCUSSION] tag, and engage with the community,
and go from there.

It's also important to synchronise - I'm working on an anon rmap replacement
that I'm more than happy to discuss with you or anybody else which should
achieve the same numbers in an architecturally sound way.

You going off and, in a vacuum, generating a bunch of code with an
unacceptable approach is not a civil way of engaging nor is it a good use of
your time, or maintainer time looking at it.

Thanks, Lorenzo
> Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred
> anon_vma creation
>
> [...]
>
> Thanks, Lorenzo

Your email is very unfriendly. I hope you can point out the specific
problems so we can discuss how to solve them.

I am not good at English and need to use AI to translate commit
messages and comments. This reply email is also translated with AI.
However, the code is written by me. I do not know which AI you are
referring to, but the AI tools I use currently cannot effectively
write kernel code.
On Thu, May 28, 2026 at 07:57:31AM +0000, wangtao wrote:
> > Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred
> > anon_vma creation
> >
> > [...]
>
> Your email is very unfriendly. I hope you can point out the specific
> problems so we can discuss how to solve them.

I already did, you've not responded to any of them, and I'm simply not
spending any more time on this.

The series is totally unmergeable, please do not make further rmap
submissions.

>
> I am not good at English and need to use AI to translate commit
> messages and comments. This reply email is also translated with AI.
> However, the code is written by me. I do not know which AI you are
> referring to, but the AI tools I use currently cannot effectively
> write kernel code.
>

We're fine with using AI for language, or in general as long as there's a
clear understanding of what's being submitted.

However I'm very unconvinced that this series wasn't generated.

You have 2 patches in the kernel for the entirety of 2026. One in bluetooth
and one in the scheduler.

Prior to that you have patches from 2018 in device tree drivers.

You have exactly 0 contributions to mm.

Out of nowhere this year you have a big series for DMA, this series for
anon_vma, having done no work or any contributions to rmap, let alone one
of the trickiest and most complicated areas of mm.

You have a total of 39 mails on the linux-mm mailing list.

Suddenly doing a giant bit of work like this using code that looks entirely
like it's AI-generated, and which after assessment by AI gives an 85-90%
probability of AI generation is really suspicious.

Now, if I'm mistaken, and you have a different name/email/identity I missed
with many mm contributions - I will eat my words here (the series is still
unmergeable either way though).

So sorry, there's simply no trust and as a maintainer of rmap again I must
strongly suggest that you no longer submit patches for this part of the
kernel.

If you wish to build trust up again, begin with discussions, and maybe try
some smaller patches in mm to demonstrate that you're genuinely acting in
good faith?

Thanks, Lorenzo
On Thu, May 28, 2026 at 4:15 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Thu, May 28, 2026 at 07:57:31AM +0000, wangtao wrote:
> > [...]
> >
> > Your email is very unfriendly. I hope you can point out the specific
> > problems so we can discuss how to solve them.

Hi Tao,

Lorenzo had a discussion about rmap in Zagreb here:
https://lore.kernel.org/linux-mm/aec533b2-37a7-4f44-a279-c4aa604206ac@lucifer.local/

He also shared the PoC code here:
https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context

and the slides were shared as well. In case you can't find them on linux-mm (I
actually couldn't find them myself), I am attaching them again here -
"scalable-cow-lsf-longer-version.pdf"

After coming back from Zagreb, I kept trying to find one or two full days to
read Lorenzo's code and slides carefully and write a blog about them.
Unfortunately, I have been completely busy with other work. Sigh... we
always seem to have too many non-upstream tasks.

If possible, I'd really appreciate it if you could take a deep dive into it and
write a detailed blog post. I'd be very eager to read it and better understand
the overall design.
Otherwise, I'll try to find some time next week or later to go through it
myself.

> [...]
>
> If you wish to build trust up again, begin with discussions, and maybe try
> some smaller patches in mm to demonstrate that you're genuinely acting in
> good faith?

Hi Lorenzo,

I truly believe Tao is acting with good intentions, although the way this is
being done is quite messy.

Memory costs are increasing significantly these days, and as I understand the
patchset, he is trying to save memory.

However, I don't think this is being done at the right time or in the right way.
This may also be due to cultural differences, language barriers, information
gaps, and a lack of familiarity with the mm community.
As a non-native speaker, I can see how difficult this can sometimes be.

I would really ask you to give Tao more chances to build trust step by step.

Best Regards
Barry
> Hi Tao,
>
> [...]
>
> If possible, I'd really appreciate it if you could take a deep dive into it and
> write a detailed blog post. I'd be very eager to read it and better understand
> the overall design.
> Otherwise, I'll try to find some time next week or later to go through it
> myself.
>

Hi Barry,

Thank you very much for your reply.

I took an initial look at the cow-context code, and a few points
might be worth noting:

1. cow_context_walk currently assumes that the rmap walk runs
   under RCU protection. This may need to be adjusted early,
   since paths such as try_to_unmap_one, page_vma_mkclean_one,
   and try_to_migrate_one may involve task switching.

2. In cow_context_walk, traverse_contexts appears to involve
   multiple nested loops. When there are many child processes
   across several fork layers, it may not be as simple or
   efficient as the current anon_vma approach.

   It needs to traverse all child cow_ctx, and within each
   cow_ctx, remaps_for_each() has two levels of iteration:
   remaps_for_each_entry and remaps_for_each_entry_offset.

   In other words, it first iterates over cow_ctx and then
   traverses rmap_mt inside each one. The rough complexity
   seems to be O(#proc * log(#rmap_entries_in_cow)), which
   may be somewhat higher than anon_vma's
   O(#vmas_in_anon_vma). However, in most cases the number
   of processes is not large, so the impact may be limited.

Previously, I also considered converting anon_vma's rb_tree
to a mapletree. If one entry records a single VMA, the
average overhead could be less than two longs per VMA.

However, unlike rb_tree, mapletree does not support storing
multiple elements under a single key. The key would need to
look like (vma_id/mm_id + pgoff). On 32-bit platforms, since
64-bit mapletree keys are not supported yet, the remaining
12 bits are not enough for vma_id/mm_id.

Because of this limitation, I later started thinking about
ways to reduce anon_vma allocations instead.

I will try to find some time next week to analyze the
cow-context design and code more thoroughly, and then
write up a summary.

Thanks,
Tao
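
The 32-bit key-width limitation raised in Tao's reply above (a maple tree key combining vma_id/mm_id with pgoff) comes down to simple bit arithmetic. The sketch below assumes 4 KiB pages, a 32-bit address space and 32-bit keys, so that an anonymous pgoff needs 20 bits. The `toy_*` names and the packing layout are hypothetical and are not taken from any posted code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12u	/* assumed 4 KiB pages */
#define TOY_ADDR_BITS	32u	/* assumed 32-bit address space */
#define TOY_KEY_BITS	32u	/* assumed 32-bit tree index */
#define TOY_PGOFF_BITS	(TOY_ADDR_BITS - TOY_PAGE_SHIFT)	/* 20 */
#define TOY_ID_BITS	(TOY_KEY_BITS - TOY_PGOFF_BITS)		/* 12 */

/* Pack an id into the high bits and the page offset into the low bits. */
static uint32_t toy_pack_key(uint32_t id, uint32_t pgoff)
{
	assert(id < (1u << TOY_ID_BITS));	/* at most 4096 distinct ids */
	assert(pgoff < (1u << TOY_PGOFF_BITS));
	return (id << TOY_PGOFF_BITS) | pgoff;
}

/* Recover the id from a packed key. */
static uint32_t toy_key_id(uint32_t key)
{
	return key >> TOY_PGOFF_BITS;
}

/* Recover the page offset from a packed key. */
static uint32_t toy_key_pgoff(uint32_t key)
{
	return key & ((1u << TOY_PGOFF_BITS) - 1u);
}
```

Under these assumptions only 12 bits, that is 4096 values, are left for the identifier, which is the constraint the message describes.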
On Fri, May 29, 2026 at 09:41:20AM +0000, wangtao wrote:
> Hi Barry,
>
> Thank you very much for your reply.
>
> I took an initial look at the cow-context code, and a few points
> might be worth noting:
>
> [...]

Tao,

This response is so full of misunderstandings it's not really worth me
responding to any of it. You've even hallucinated an imaginary field which is
REALLY suspicious.

You've no mm expertise or history and came up with this in a few hours. I
asked Claude to analyse it and it puts it at 75-80% chance of being solely
LLM-generated from cow_context.c.

I simply don't have the time to deal with this, so unfortunately I'm going to
have to withdraw the suggestion of further discussion with you on this topic.

I am working on the scalable CoW project and will solicit opinions of those
with relevant expertise.

We are not interested in your approach or analysis.

Thanks, Lorenzo
> [...]
>
> We are not interested in your approach or analysis.
>
> Thanks, Lorenzo

You said discussion was welcome, yet when someone offered even a
small comment, you refused to continue the discussion.

If I had known you would be this inconsistent, I would not have
replied to you in the first place.

This will be my last reply to you. I will not respond again.

Consider the following test case:

Process P creates 1000 VMAs with mmap, named vma_1, vma_2, ..., vma_1000.
Then it forks child processes C_1, C_2, ..., C_1000. Each child process C_k
keeps only vma_k and munmaps all other vma_i.

With the current anon_vma, reclaim walking each page only needs to handle
two VMAs (vma_k in process P and vma_k in process C_k).

But under the CoW approach, reclaiming each page needs to walk 1000
processes, then spend O(log(#remap_entries)) time to check whether a
remap_entry exists, and then O(log(#vmas)) time to locate the VMA.

Both the code complexity and the time complexity of the reverse walk are
much higher than the current anon_vma approach.
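
The counting argument in the test case above can be restated as a toy model. It only counts candidates under the scenario as the message describes it. It does not model the cow-context code, and the per-process walk is the message's own characterisation, which is disputed elsewhere in this thread. All `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Scenario from the message: after fork, child j keeps only vma_j. */
static bool toy_child_keeps(unsigned int child, unsigned int vma)
{
	return child == vma;
}

/*
 * VMAs still linked to vma_k's anon_vma: the parent's vma_k plus the
 * copy in every child that did not munmap it.
 */
static unsigned int toy_anon_vma_candidates(unsigned int nr, unsigned int k)
{
	unsigned int count = 1;	/* the parent's vma_k */

	assert(k >= 1 && k <= nr);
	for (unsigned int child = 1; child <= nr; child++)
		if (toy_child_keeps(child, k))
			count++;
	return count;
}

/*
 * Per-process walk as characterised in the message: every child is
 * visited, and *hits counts the children that still map vma_k.
 */
static unsigned int toy_per_process_visits(unsigned int nr, unsigned int k,
					   unsigned int *hits)
{
	unsigned int visits = 0;

	assert(k >= 1 && k <= nr);
	*hits = 0;
	for (unsigned int child = 1; child <= nr; child++) {
		visits++;
		if (toy_child_keeps(child, k))
			(*hits)++;
	}
	return visits;
}
```

With `nr = 1000`, the first function yields two candidate VMAs for any `k`, while the second visits all 1000 children to find a single match. Whether a real implementation would have to visit every child is exactly the point under dispute.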
On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote:
[...]
>
> You said discussion was welcome, yet when someone offered even a
> small comment, you refused to continue the discussion.
>
> If I had known you would be this inconsistent, I would not have
> replied to you in the first place.
>
> This will be my last reply to you. I will not respond again.
>

Hi Tao,

Please don't walk away from the linux-mm community. I read your
patchset and found it quite valuable. It not only reduces memory
overhead, but also eliminates rmap costs for exclusive folios.

Since I'm not very confident discussing technical topics in English,
I wrote a blog post in Chinese about your patchset:

https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA

I have to admit that I found the implementation quite complex and
in need of significant improvement.
However, I think the underlying idea is very interesting and worth
exploring further.

I'm looking forward to seeing a v2 RFC with a cleaner and simpler
implementation while preserving the core concept.

Regardless of whether it ultimately gets merged, I hope the discussion
can continue.

Best regards,
Barry
On 6/2/26 11:15 AM, Barry Song wrote: > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > [...] >> >> You said discussion was welcome, yet when someone offered even a >> small comment, you refused to continue the discussion. >> >> If I had known you would be this inconsistent, I would not have >> replied to you in the first place. >> >> This will be my last reply to you. I will not respond again. > > Hi Tao, > > Please don't walk away from the linux-mm community. I read your > patchset and found it quite valuable. It not only reduces memory > overhead, but also eliminates rmap costs for exclusive folios. > > Since I'm not very confident discussing technical topics in English, > I wrote a blog post in Chinese about your patchset: > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA The cover letter and commit messages should have been elaborated to a much greater degree instead of making people guess the design and intent from the code. > I have to admit that I found the implementation quite complex and > in need of significant improvement. > However, I think the underlying idea is very interesting and worth > exploring further. No. What it is trying to achieve is ambitious, but the idea itself is not worth exploring further as-is unless the correctness and complexity concerns are addressed. > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > implementation while preserving the core concept. I'm afraid this encouragement would mislead us in the wrong direction, where all of us end up wasting time. There isn't much point in posting v2 without addressing fundamental questions about the design. > Regardless of whether it ultimately gets merged, I hope the discussion > can continue. Regarding the "improving the reverse mapping subsystem" topic, a more constructive direction would be to carefully revisit the design decisions and discuss what we can do about them (that's exactly what Lorenzo has been doing). 
But that's not the first thing I would recommend to a relatively new contributor given that it's really complicated and even the people who have designed and reworked the reverse mapping subsystem over the past 20+ years haven't come up with a fundamentally better design. Reverse mapping is a frustratingly complicated subsystem. Without carefully revisiting the current design, there is not much hope of improving things at the design level, even slightly. What I would recommend to new people instead is: 1) starting by reviewing other people's work, so that you have enough time to learn the historical context and subtleties of the subsystem without making intrusive changes (which also keeps in touch with the community), and 2) making progress on smaller tasks with less intrusive changes, to gradually build trust and be able to do more valuable work. Unfortunately, looking at how this thread went, I see that the author is now in a worse position than an entirely new contributor. -- Cheers, Harry / Hyeonggon
On Wed, Jun 3, 2026 at 3:57 AM Harry Yoo <harry@kernel.org> wrote: > > > > On 6/2/26 11:15 AM, Barry Song wrote: > > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > > [...] > >> > >> You said discussion was welcome, yet when someone offered even a > >> small comment, you refused to continue the discussion. > >> > >> If I had known you would be this inconsistent, I would not have > >> replied to you in the first place. > >> > >> This will be my last reply to you. I will not respond again. > > > > Hi Tao, > > > > Please don't walk away from the linux-mm community. I read your > > patchset and found it quite valuable. It not only reduces memory > > overhead, but also eliminates rmap costs for exclusive folios. > > > > Since I'm not very confident discussing technical topics in English, > > I wrote a blog post in Chinese about your patchset: > > > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA > The cover letter and commit messages should have been elaborated to a > much greater degree instead of making people guess the design and intent > from the code. Indeed. The cover letter does not clearly tell the story, and yesterday I needed quite some time to understand what the patchset was trying to achieve. > > > I have to admit that I found the implementation quite complex and > > in need of significant improvement. > > > > However, I think the underlying idea is very interesting and worth > > exploring further. > > No. What it is trying to achieve is ambitious, but the idea itself is > not worth exploring further as-is unless the correctness and complexity > concerns are addressed. Can we give Tao more time to address the concerns and explain the correctness of the approach? That said, I don't think the patchset is entirely without merit. The idea that caught my attention is whether knowing that a process is guaranteed to be a leaf process could allow us to simplify parts of the rmap machinery and reduce some of the associated overhead. 
Assuming that a fork server (e.g. systemd or zygote) is preferable to having each application perform its own fork(), Linux already largely relies on fork servers in practice. Matthew also pointed out that calling fork() in multithreaded applications is a terrible idea [1]. This may suggest that, in general, processes outside of a fork-server model should avoid using fork(). If we were to introduce an API such as prctl(PR_SET_NOFORK) or something similar, could we eliminate a significant portion of the rmap-related overhead for such leaf processes, while still avoiding the complexity of the lazy allocation scheme proposed by Tao? I assume that the vast majority of processes in a real system are leaf processes? It also seems somewhat unusual that a few Android applications invoke fork() directly in a multithreaded context, while most use the zygote to create multiple processes for an app. Perhaps the Android framework should discourage this pattern entirely, and require applications to create child processes via the zygote? If, in real-world systems, more than 95% of processes are leaf processes, could that imply that the rmap design might be reconsidered for a different optimization path? [1] https://marc.info/?l=linuxppc-embedded&m=177912107460825&w=2 > > > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > > implementation while preserving the core concept. > > I'm afraid this encouragement would mislead us in the wrong direction, > where all of us end up wasting time. > > There isn't much point in posting v2 without addressing fundamental > questions about the design. I suggested a v2 because the current patchset does not clearly state what it is trying to achieve. A revised version might help clarify the intent and make it easier to understand. Even if the overall complexity (such as lazy allocation) makes it hard to move forward, we may still be able to learn from it and gain some useful inspiration. 
> > > Regardless of whether it ultimately gets merged, I hope the discussion > > can continue. > > Regarding the "improving the reverse mapping subsystem" topic, a more > constructive direction would be to carefully revisit the design > decisions and discuss what we can do about them (that's exactly what > Lorenzo has been doing). I have no doubt at all about Lorenzo’s expertise in rmap and many other mm areas. That is well understood and widely recognized. I just think that hearing more perspectives could help us gain additional insight and inspiration. > > But that's not the first thing I would recommend to a relatively new > contributor given that it's really complicated and even the people who > have designed and reworked the reverse mapping subsystem over the past > 20+ years haven't come up with a fundamentally better design. > > Reverse mapping is a frustratingly complicated subsystem. Without > carefully revisiting the current design, there is not much hope of > improving things at the design level, even slightly. > > What I would recommend to new people instead is: > > 1) starting by reviewing other people's work, so that you have enough > time to learn the historical context and subtleties of the subsystem > without making intrusive changes (which also keeps in touch with the > community), and > > 2) making progress on smaller tasks with less intrusive changes, to > gradually build trust and be able to do more valuable work. > Yes, that is a good approach for new contributors. > Unfortunately, looking at how this thread went, I see that the author is > now in a worse position than an entirely new contributor. > > -- > Cheers, > Harry / Hyeonggon Thanks Barry
On 2026/6/2 10:15, Barry Song wrote: > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > [...] >> >> You said discussion was welcome, yet when someone offered even a >> small comment, you refused to continue the discussion. >> >> If I had known you would be this inconsistent, I would not have >> replied to you in the first place. >> >> This will be my last reply to you. I will not respond again. >> > > Hi Tao, > > Please don't walk away from the linux-mm community. I read your > patchset and found it quite valuable. It not only reduces memory > overhead, but also eliminates rmap costs for exclusive folios. > > Since I'm not very confident discussing technical topics in English, > I wrote a blog post in Chinese about your patchset: > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA > > I have to admit that I found the implementation quite complex and > in need of significant improvement. However, I think the underlying > idea is very interesting and worth exploring further. > > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > implementation while preserving the core concept. > > Regardless of whether it ultimately gets merged, I hope the discussion > can continue. Same here :) Tao, please don't let this thread get you down. No first RFC is perfect, and the idea still looks worth discussing :) Thanks for working on this! Cheers, Lance
On Tue, Jun 02, 2026 at 10:46:35AM +0800, Lance Yang wrote: > > > On 2026/6/2 10:15, Barry Song wrote: > > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > > [...] > > > > > > You said discussion was welcome, yet when someone offered even a > > > small comment, you refused to continue the discussion. > > > > > > If I had known you would be this inconsistent, I would not have > > > replied to you in the first place. > > > > > > This will be my last reply to you. I will not respond again. > > > > > > > Hi Tao, > > > > Please don't walk away from the linux-mm community. I read your > > patchset and found it quite valuable. It not only reduces memory > > overhead, but also eliminates rmap costs for exclusive folios. > > > > Since I'm not very confident discussing technical topics in English, > > I wrote a blog post in Chinese about your patchset: > > > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA > > > > I have to admit that I found the implementation quite complex and > > in need of significant improvement. However, I think the underlying > > idea is very interesting and worth exploring further. > > > > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > > implementation while preserving the core concept. > > > > Regardless of whether it ultimately gets merged, I hope the discussion > > can continue. > > Same here :) > > Tao, please don't let this thread get you down. No first RFC is > perfect, and the idea still looks worth discussing :) > > Thanks for working on this! Guys, this isn't helpful. We aren't extending anon_vma, and I am working on replacing it, that's the bottom line. I have presented compelling evidence suggesting this is AI generated. In response I got more AI-generated nonsense. There's no trust, the code and analysis are all wrong, end of discussion. > > Cheers, Lance > Thanks, Lorenzo P.S. 
maintainership is utterly thankless, and I don't really expect much in return, but honestly reading this, given the case I've made here, was really quite disappointing.
On Tue, Jun 2, 2026 at 11:37 PM Lorenzo Stoakes <ljs@kernel.org> wrote: > > On Tue, Jun 02, 2026 at 10:46:35AM +0800, Lance Yang wrote: > > > > > > On 2026/6/2 10:15, Barry Song wrote: > > > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > > > [...] > > > > > > > > You said discussion was welcome, yet when someone offered even a > > > > small comment, you refused to continue the discussion. > > > > > > > > If I had known you would be this inconsistent, I would not have > > > > replied to you in the first place. > > > > > > > > This will be my last reply to you. I will not respond again. > > > > > > > > > > Hi Tao, > > > > > > Please don't walk away from the linux-mm community. I read your > > > patchset and found it quite valuable. It not only reduces memory > > > overhead, but also eliminates rmap costs for exclusive folios. > > > > > > Since I'm not very confident discussing technical topics in English, > > > I wrote a blog post in Chinese about your patchset: > > > > > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA > > > > > > I have to admit that I found the implementation quite complex and > > > in need of significant improvement. However, I think the underlying > > > idea is very interesting and worth exploring further. > > > > > > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > > > implementation while preserving the core concept. > > > > > > Regardless of whether it ultimately gets merged, I hope the discussion > > > can continue. > > > > Same here :) > > > > Tao, please don't let this thread get you down. No first RFC is > > perfect, and the idea still looks worth discussing :) > > > > Thanks for working on this! > > Guys, this isn't helpful. > > We aren't extending anon_vma, and I am working on replacing it, that's the > bottom line. Not trying to challenge your bottom line. As explained to Harry, I have no doubt about your expertise in rmap and many other mm areas, and I deeply respect your work on rmap. 
With more discussion, we might gain additional insight and inspiration. What Tao has inspired me with is the idea that if we assume most real-world processes are leaf processes, could we simplify parts of the design? This is why I suggested a v2, to improve the clarity of the cover letter and make the code easier to understand, and to see whether there is something worth considering further, even if it is not suitable for merging. > > I have presented compelling evidence suggesting this is AI generated. In > response I got more AI-generated nonsense. There's no trust, the code and > analysis are all wrong, end of discussion. I am not an AI expert, and I do not really use AI in kernel work, so I am not really sure what counts as AI versus non-AI. Sorry. > > > > > Cheers, Lance > > > > Thanks, Lorenzo > > P.S. maintainership is utterly thankless, and I don't really expect much in > return, but honestly reading this, given the case I've made here, was > really quite disappointing. Understood. I see your position, and I personally have great respect and appreciation for your work on maintenance. Sorry if my words came across as disappointing. Best Regards Barry
On Wed, Jun 03, 2026 at 07:03:53AM +0800, Barry Song wrote: > On Tue, Jun 2, 2026 at 11:37 PM Lorenzo Stoakes <ljs@kernel.org> wrote: > > > > On Tue, Jun 02, 2026 at 10:46:35AM +0800, Lance Yang wrote: > > > > > > > > > On 2026/6/2 10:15, Barry Song wrote: > > > > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > > > > [...] > > > > > > > > > > You said discussion was welcome, yet when someone offered even a > > > > > small comment, you refused to continue the discussion. > > > > > > > > > > If I had known you would be this inconsistent, I would not have > > > > > replied to you in the first place. > > > > > > > > > > This will be my last reply to you. I will not respond again. > > > > > > > > > > > > > Hi Tao, > > > > > > > > Please don't walk away from the linux-mm community. I read your > > > > patchset and found it quite valuable. It not only reduces memory > > > > overhead, but also eliminates rmap costs for exclusive folios. > > > > > > > > Since I'm not very confident discussing technical topics in English, > > > > I wrote a blog post in Chinese about your patchset: > > > > > > > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA > > > > > > > > I have to admit that I found the implementation quite complex and > > > > in need of significant improvement. However, I think the underlying > > > > idea is very interesting and worth exploring further. > > > > > > > > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > > > > implementation while preserving the core concept. > > > > > > > > Regardless of whether it ultimately gets merged, I hope the discussion > > > > can continue. > > > > > > Same here :) > > > > > > Tao, please don't let this thread get you down. No first RFC is > > > perfect, and the idea still looks worth discussing :) > > > > > > Thanks for working on this! > > > > Guys, this isn't helpful. > > > > We aren't extending anon_vma, and I am working on replacing it, that's the > > bottom line. 
> > Not trying to challenge your bottom line. As explained to Harry, I > have no doubt about your expertise in rmap and many other mm > areas, and I deeply respect your work on rmap. Thanks I appreciate that. I don't mean to be 'mean' here, I'm only acting in what I feel are the best interests of mm and the kernel. > > With more discussion, we might gain additional insight and > inspiration. What Tao has inspired me with is the idea that if we > assume most real-world processes are leaf processes, could we > simplify parts of the design? Maybe I didn't express it clearly enough at LSF, but this is entirely a key point of my CoW context design :) It's true most stuff is leaf, and yes we can take advantage of this, and CoW context allows us to do it while also unravelling the issues with anon_vma. I am actually thinking of doing some incremental changes as part of my work possibly if I can. I maybe need to expedite that to bring some clarity to things here... > > This is why I suggested a v2, to improve the clarity of the cover > letter and make the code easier to understand, and to see whether > there is something worth considering further, even if it is not > suitable for merging. Right, I see. Again I'm really trying to tread a fine line here between the technical discussion and not pouring more and more time into a discussion that's not useful to me or the community. See [0] as to my reasoning on this :) [0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/ > > > > > I have presented compelling evidence suggesting this is AI generated. In > > response I got more AI-generated nonsense. There's no trust, the code and > > analysis are all wrong, end of discussion. > > I am not an AI expert, and I do not really use AI in kernel work, > so I am not really sure what counts as AI versus non-AI. Sorry. No worries! > > > > > > > > > Cheers, Lance > > > > > > > Thanks, Lorenzo > > > > P.S. 
maintainership is utterly thankless, and I don't really expect much in > > return, but honestly reading this, given the case I've made here, was > > really quite disappointing. > > Understood. I see your position, and I personally have great > respect and appreciation for your work on maintenance. Sorry if > my words came across as disappointing. Thanks, appreciate it. And no worries! > > Best Regards > Barry Cheers, Lorenzo
On Tue, Jun 02, 2026 at 04:37:14PM +0100, Lorenzo Stoakes wrote: > On Tue, Jun 02, 2026 at 10:46:35AM +0800, Lance Yang wrote: > > > > > > On 2026/6/2 10:15, Barry Song wrote: > > > On Mon, Jun 1, 2026 at 9:46 AM wangtao <tao.wangtao@honor.com> wrote: > > > [...] > > > > > > > > You said discussion was welcome, yet when someone offered even a > > > > small comment, you refused to continue the discussion. > > > > > > > > If I had known you would be this inconsistent, I would not have > > > > replied to you in the first place. > > > > > > > > This will be my last reply to you. I will not respond again. > > > > > > > > > > Hi Tao, > > > > > > Please don't walk away from the linux-mm community. I read your > > > patchset and found it quite valuable. It not only reduces memory > > > overhead, but also eliminates rmap costs for exclusive folios. > > > > > > Since I'm not very confident discussing technical topics in English, > > > I wrote a blog post in Chinese about your patchset: > > > > > > https://mp.weixin.qq.com/s/k00tzhTl8HbL3k4G6ev4SA > > > > > > I have to admit that I found the implementation quite complex and > > > in need of significant improvement. However, I think the underlying > > > idea is very interesting and worth exploring further. > > > > > > I'm looking forward to seeing a v2 RFC with a cleaner and simpler > > > implementation while preserving the core concept. > > > > > > Regardless of whether it ultimately gets merged, I hope the discussion > > > can continue. > > > > Same here :) > > > > Tao, please don't let this thread get you down. No first RFC is > > perfect, and the idea still looks worth discussing :) > > > > Thanks for working on this! > > Guys, this isn't helpful. > > We aren't extending anon_vma, and I am working on replacing it, that's the > bottom line. > > I have presented compelling evidence suggesting this is AI generated. In > response I got more AI-generated nonsense. 
There's no trust, the code and > analysis are all wrong, end of discussion. 100% agree. I think plenty of technical/process/etc reasons as to why this idea/contribution is not mergeable have been listed. Overriding this with "keep it up!!!111!11!!" is not helpful. -- Pedro
On Fri, May 29, 2026 at 01:04:08PM +0100, Lorenzo Stoakes wrote: > On Fri, May 29, 2026 at 09:41:20AM +0000, wangtao wrote: > > > Hi Tao, > > > > > > Lorenzo had a discussion about rmap in Zagreb here: > > > https://lore.kernel.org/linux-mm/aec533b2-37a7-4f44-a279- > > > c4aa604206ac@lucifer.local/ > > > > > > He also shared the PoC code here: > > > https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/ > > > cow-context > > > > > > and the slides were shared as well. In case you can't find them on linux-mm (I > > > actually couldn't find them myself), I am attaching them again here - > > > "scalable-cow-lsf-longer-version.pdf" > > > > > > After coming back from Zagreb, I kept trying to find one or two full days to > > > read Lorenzo's code and slides carefully and write a blog about them. > > > Unfortunately, I have been completely busy with other work. Sigh... we > > > always seem to have too many non-upstream tasks. > > > > > > If possible, I'd really appreciate it if you could take a deep dive into it and > > > write a detailed blog post. I'd be very eager to read it and better understand > > > the overall design. > > > Otherwise, I'll try to find some time next week or later to go through it > > > myself. > > > > > Hi Barry, > > > > Thank you very much for your reply. > > > > I took an initial look at the cow-context code, and a few points > > might be worth noting: > > > > 1. cow_context_walk currently assumes that the rmap walk runs > > under RCU protection. This may need to be adjusted early, > > since paths such as try_to_unmap_one, page_vma_mkclean_one, > > and try_to_migrate_one may involve task switching. > > > > 2. In cow_context_walk, traverse_contexts appears to involve > > multiple nested loops. When there are many child processes > > across several fork layers, it may not be as simple or > > efficient as the current anon_vma approach. 
> > > > It needs to traverse all child cow_ctx, and within each > > cow_ctx, remaps_for_each() has two levels of iteration: > > remaps_for_each_entry and remaps_for_each_entry_offset. > > > > In other words, it first iterates over cow_ctx and then > > traverses rmap_mt inside each one. The rough complexity > > > seems to be O(#proc * log(#rmap_entries_in_cow)), which > > may be somewhat higher than anon_vma's > > O(#vmas_in_anon_vma). However, in most cases the number > > of processes is not large, so the impact may be limited. > > > > Previously, I also considered converting anon_vma's rb_tree > > to a mapletree. If one entry records a single VMA, the > > average overhead could be less than two longs per VMA. > > > > However, unlike rb_tree, mapletree does not support storing > > multiple elements under a single key. The key would need to > > look like (vma_id/mm_id + pgoff). On 32-bit platforms, since > > 64-bit mapletree keys are not supported yet, the remaining > > 12 bits are not enough for vma_id/mm_id. > > > > Because of this limitation, I later started thinking about > > ways to reduce anon_vma allocations instead. > > > > I will try to find some time next week to analyze the > > cow-context design and code more thoroughly, and then > > write up a summary. > > Tao, > > This response is so full of misunderstandings it's not really worth me > responding to any of it. You've even hallucinated an imaginary field which > is REALLY suspicious. > > You've no mm expertise or history and came up with this in a few hours. I > asked Claude to analyse it and it puts it at 75-80% chance of being solely > LLM-generated from cow_context.c. > > I simply don't have the time to deal with this, so unfortunately I'm going > to have to withdraw the suggestion of further discussion with you on this > topic. > > I am working on the scalable CoW project and will solicit opinions of those > with relevant expertise. > > We are not interested in your approach or analysis. 
> > Thanks, Lorenzo Apparently there's some misunderstanding about this situation here, sigh. So for avoidance of doubt - I've now spent many hours on this, and unfortunately (as I've already said in multiple places) this series has serious architectural and code flaws. And unfortunately, the anon_vma approach is not something we wish to extend, for reasons I've gone into elsewhere - but broadly because it's a broken abstraction, that uses lots of memory and causes lock contention. The approach here has multiple technical issues, so many that getting into each one would require hours more of my time to analyse, maybe all week? And then if there were further replies and replies to the replies and respins... However, I also feel there's substantive, overlapping, evidence of the _logic_ (not the text, we are FINE with using AI to assist text for non-native speakers) being LLM-generated. However you can never prove this for 100% certain. But you can certainly be more or less sure. I would never suggest this unless I was really pretty certain. I am very keen to avoid 'witch hunts', or rash accusations. This is not that. It's a _carefully considered_ opinion, based on evidence. But of course - I do not know for SURE. You can never know. The big problem here is asymmetry of maintainer resource. I simply _cannot_ respond to every single issue here. And when the architecture is something we don't want, then it's not really necessary to. And my big deep underlying concern with all this is - people can generate a very significant amount of this kind of work, and we have limited reviewer time. I've already dealt with burnout recently that I'm thankfully recovering from. I'm not really keen to go back to that. I really truly worry that if we don't have a means by which we can quickly dismiss/deprioritise things when we have a _significant_ evidence of wholesale AI generation, then maintainer overload will increase exponentially. And that's really a serious problem. 
If we treat it like simply a technically incorrect solution, then it means we open it up to further discussion on and on, as we're actually observing here. If the responses are also LLM-generated then it's even more problematic. This is why I bring it up, and proactively say it's led to a real loss in trust in this case, and why, after there was a response that included a hallucinated field in it, I went further and said that I really don't want to have a discussion either. It's because of this asymmetry. And even this reply, written at 9.45pm at night, after several hours of discussion about this off-list, is evidence of the problem we have with this kind of asymmetry. It's nothing personal, it's about managing time and resources. Thanks, Lorenzo
Barry Song <baohua@kernel.org> writes: > After coming back from Zagreb, I kept trying to find one or > two full days to read Lorenzo's code and slides carefully and > write a blog about them. Unfortunately, I have been completely > busy with other work. Sigh... we always seem to have too many > non-upstream tasks. > > If possible, I'd really appreciate it if you could take a > deep dive into it and write a detailed blog post. I'd be > very eager to read it and better understand the overall design. > Otherwise, I'll try to find some time next week or later to > go through it myself. It's still somewhat superficial, but in case it's helpful: https://lwn.net/Articles/1072378/ jon
On Fri, May 29, 2026 at 09:07:30AM -0600, Jonathan Corbet wrote: > Barry Song <baohua@kernel.org> writes: > > > After coming back from Zagreb, I kept trying to find one or > > two full days to read Lorenzo's code and slides carefully and > > write a blog about them. Unfortunately, I have been completely > > busy with other work. Sigh... we always seem to have too many > > non-upstream tasks. > > > > If possible, I'd really appreciate it if you could take a > > deep dive into it and write a detailed blog post. I'd be > > very eager to read it and better understand the overall design. > > Otherwise, I'll try to find some time next week or later to > > go through it myself. > > It's still somewhat superficial, but in case it's helpful: > > https://lwn.net/Articles/1072378/ > I found it to be great as usual :) > jon Cheers, Lorenzo
On Fri, May 29, 2026 at 11:40 PM Lorenzo Stoakes <ljs@kernel.org> wrote: > > On Fri, May 29, 2026 at 09:07:30AM -0600, Jonathan Corbet wrote: > > Barry Song <baohua@kernel.org> writes: > > > > > After coming back from Zagreb, I kept trying to find one or > > > two full days to read Lorenzo's code and slides carefully and > > > write a blog about them. Unfortunately, I have been completely > > > busy with other work. Sigh... we always seem to have too many > > > non-upstream tasks. > > > > > > If possible, I'd really appreciate it if you could take a > > > deep dive into it and write a detailed blog post. I'd be > > > very eager to read it and better understand the overall design. > > > Otherwise, I'll try to find some time next week or later to > > > go through it myself. > > > > It's still somewhat superficial, but in case it's helpful: > > > > https://lwn.net/Articles/1072378/ > > > > I found it to be great as usual :) +1 Thanks very much, Jon. That's really helpful. > > > jon > > Cheers, Lorenzo Best Regards Barry
> -----Original Message----- > From: Barry Song <baohua@kernel.org> > Sent: Friday, May 29, 2026 7:31 AM > To: Lorenzo Stoakes <ljs@kernel.org> > Cc: wangtao <tao.wangtao@honor.com>; catalin.marinas@arm.com; > will@kernel.org; tglx@kernel.org; mingo@redhat.com; bp@alien8.de; > dave.hansen@linux.intel.com; x86@kernel.org; akpm@linux-foundation.org; > david@kernel.org; willy@infradead.org; sj@kernel.org; kees@kernel.org; > luizcap@redhat.com; zhangjiao2@cmss.chinamobile.com; kas@kernel.org; > hpa@zytor.com; liam@infradead.org; vbabka@kernel.org; rppt@kernel.org; > surenb@google.com; mhocko@suse.com; jack@suse.cz; riel@surriel.com; > harry@kernel.org; jannh@google.com; jgg@ziepe.ca; jhubbard@nvidia.com; > peterx@redhat.com; ziy@nvidia.com; baolin.wang@linux.alibaba.com; > npache@redhat.com; ryan.roberts@arm.com; dev.jain@arm.com; > lance.yang@linux.dev; xu.xin16@zte.com.cn; chengming.zhou@linux.dev; > nao.horiguchi@gmail.com; matthew.brost@intel.com; > joshua.hahnjy@gmail.com; rakie.kim@sk.com; byungchul@sk.com; > gourry@gourry.net; ying.huang@linux.alibaba.com; apopple@nvidia.com; > pfalcato@suse.de; linux-arm-kernel@lists.infradead.org; linux- > kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux- > mm@kvack.org; damon@lists.linux.dev; shakeel.butt@linux.dev; > ryncsn@gmail.com; jparsana@google.com; dvander@google.com; zhangji > <zhangji1@honor.com>; wangzicheng <wangzicheng@honor.com> > Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred > anon_vma creation > > On Thu, May 28, 2026 at 4:15 PM Lorenzo Stoakes <ljs@kernel.org> wrote: > > > > On Thu, May 28, 2026 at 07:57:31AM +0000, wangtao wrote: > > > > Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for > deferred > > > > anon_vma creation > > > > > > > > OK I've had a look through more thoroughly now and: > > > > > > > > NAK and NAK any approach like this. 
> > > > > > > > > > > > Not only is this structurally all wrong, it does some insane stuff > > > > (pinning VMAs - no), the RCU usage is highly dubious and I suspect > > > > you've completely broken the anon rmap for things like migration, > > > > or have at least added very dubious edge cases. > > > > > > > > You've added insane complexity, and also have failed to add even > > > > perfunctory tests, which is also totally unacceptable. > > > > > > > > The implementation is wrong, and the approach is wrong - we do not > > > > want to extend or build on anon_vma. So this is unmergeable, or any > approach like it. > > > > > > > > I also, unfortunately, strongly suspect AI here. The turn of > > > > phrase, and poor commit messages, you doing this out of nowhere > > > > with absolutely no rmap experience before, your total lack of > communication before. > > > > > > > > Claude puts the probability of heavy AI usage at 85-90%, and I'm > > > > pretty convinced. Either way it's utterly unmergeable but that you > > > > (likely) used AI to generate this much work for us makes me actually > pretty annoyed. > > > > > > > > As a result, I would strongly suggest you no longer submit patches > > > > for the reverse mapping part of mm, as there is now a real lack of trust. > > > > > > > > If you wish to rebuild that, I suggest you _discuss_ concepts and ideas, > e.g. > > > > send stuff on-list with a [DISCUSSION] tag, and engage with the > > > > community, and go from there. > > > > > > > > It's also important to synchronise - I'm working on an anon rmap > > > > replacement that I'm more than happy to discuss with you or > > > > anybody else which should achieve the same numbers in an > architecturally sound way. > > > > > > > > You going off and, in a vacuum, generating a bunch of code with an > > > > unacceptable approach is not a civil way of engaging nor is it a > > > > good use of your time, or maintainer time looking at it. 
> > > > > > > > Thanks, Lorenzo > > > > > > Your email is very unfriendly. I hope you can point out the specific > > > problems so we can discuss how to solve them. > > Hi Tao, > > Lorenzo had a discussion about rmap in Zagreb here: > https://lore.kernel.org/linux-mm/aec533b2-37a7-4f44-a279- > c4aa604206ac@lucifer.local/ > > He also shared the PoC code here: > https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/ > cow-context > > and the slides were shared as well. In case you can't find them on linux-mm (I > actually couldn't find them myself), I am attaching them again here - > "scalable-cow-lsf-longer-version.pdf" > > After coming back from Zagreb, I kept trying to find one or two full days to > read Lorenzo's code and slides carefully and write a blog about them. > Unfortunately, I have been completely busy with other work. Sigh... we > always seem to have too many non-upstream tasks. > > If possible, I'd really appreciate it if you could take a deep dive into it and > write a detailed blog post. I'd be very eager to read it and better understand > the overall design. > Otherwise, I'll try to find some time next week or later to go through it > myself. > Hi Barry, Thank you for your guidance, it is very much appreciated. I work with Tao at Honor. The motivation behind this work is genuine and practical. The memory cost has increased significantly, and we have spent real effort investigating and prototyping solutions to reduce it. We're happy to join "constructive" discussions and learn from the community. Thanks, Zicheng > > > > I already did, you've not responded to any of them, and I'm simply not > > spending any more time on this. > > > > The series is totally unmergeable, please do not make further rmap > > submissions. > > > > > > > > I am not good at English and need to use AI to translate commit > > > messages and comments. This reply email is also translated with AI. > > > However, the code is written by me. 
I do not know which AI you are > > > referring to, but the AI tools I use currently cannot effectively > > > write kernel code. > > > > > > > We're fine with using AI for language, or in general as long as > > there's a clear understanding of what's being submitted. > > > > However I'm very unconvinced that this series wasn't generated. > > > > You have 2 patches in the kernel for the entirety of 2026. One in > > bluetooth and one in the scheduler. > > > > Prior to that you have patches from 2018 in device tree drivers. > > > > You have exactly 0 contributions to mm. > > > > Out of nowhere this year you have a big series for DMA, this series > > for anon_vma, having done no work or any contributions to rmap, let > > alone one of the trickiest and most complicated areas of mm. > > > > You have a total of 39 mails on the linux-mm mailing list. > > > > Suddenly doing a giant bit of work like this using code that looks > > entirely like it's AI-generated, and which after assessment by AI > > gives an 85-90% probability of AI generation is really suspicious. > > > > Now, if I'm mistaken, and you have a different name/email/identity I > > missed with many mm contributes - I will eat my words here (the series > > is still unmergeable either way though). > > > > So sorry, there's simply no trust and as a maintainer of rmap again I > > must strongly suggest that you no longer submit patches for this part > > of the kernel. > > > > If you wish to build trust up again, begin with discussions, and maybe > > try some smaller patches in mm to demonstrate that you're genuinely > > acting in good faith? > > Hi Lorenzo, > > I truly believe Tao is acting with good intentions, although the way this is > being done is quite messy. > > Memory costs are increasing significantly these days, and as I understand the > patchset, he is trying to save memory. > > However, I don't think this is being done at the right time or in the right way. 
> This may also be due to cultural differences, language barriers, information > gaps, and a lack of familiarity with the mm community. > As a non-native speaker, I can see how difficult this can sometimes be. > > I would really ask you to give Tao more chances to build trust step by step. > > Best Regards > Barry
On Fri, May 29, 2026 at 02:20:38AM +0000, wangzicheng wrote:
> Hi Barry,
>
> Thank you for your guidance, it is very much appreciated.
>
> I work with Tao at Honor. The motivation behind this work is genuine and practical.
> The memory cost has increased significantly, and we have spent real effort investigating
> and prototyping solutions to reduce it.

Thanks, appreciate the input from your side Zicheng.

The series is unmergeable as-is, regardless of provenance. What's unfortunate
is that early discussion could have saved effort (and/or tokens :).

This is often the case with firms that develop something in-house in isolation
then present it to the community suddenly.

And as I said to Barry (+ Tao previously), the circumstances surrounding this
series are additionally very suspicious, and while we are fine with LLM
assistance where the authors fully understand it, it feels that this is not the
case here.

>
> We're happy to join "constructive" discussions and learn from the community.

So with the negativity above said, I'd really like us to move to a more
positive and constructive situation :)

One thing that is clear is that - we all want the same thing. Reduced memory
usage, reduced lock contention in the anon rmap.

So one thing that could be very useful is for you guys to help assist me with
testing of my anon rmap approach as it develops, and also provide input,
critique, and review.

I'd also love to be made aware of any testing you guys have done or input on
that, also any workloads where you have observed particularly problematic
memory usage or lock contention.

Please note that my work is currently under heavy development so the
proof-of-concept code provided is incomplete and not yet functional (it's just
there to give a sense of the shape).

I will likely provide a pre-RFC series to interested parties prior to sending
an RFC on-list. I'd be more than happy to include you guys in that.
And as I said previously, I'm more than happy to engage in discussion on-list
or privately regarding this work :)

>
> Thanks,
> Zicheng

Cheers, Lorenzo
On Fri, May 29, 2026 at 07:31:12AM +0800, Barry Song wrote: > Hi Tao, > > Lorenzo had a discussion about rmap in Zagreb here: > https://lore.kernel.org/linux-mm/aec533b2-37a7-4f44-a279-c4aa604206ac@lucifer.local/ > > He also shared the PoC code here: > https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context > > and the slides were shared as well. In case you can't find > them on linux-mm (I actually couldn't find them myself), I am > attaching them again here - > "scalable-cow-lsf-longer-version.pdf" > > After coming back from Zagreb, I kept trying to find one or > two full days to read Lorenzo's code and slides carefully and > write a blog about them. Unfortunately, I have been completely > busy with other work. Sigh... we always seem to have too many > non-upstream tasks. > > If possible, I'd really appreciate it if you could take a > deep dive into it and write a detailed blog post. I'd be > very eager to read it and better understand the overall design. > Otherwise, I'll try to find some time next week or later to > go through it myself. Not sure if you're asking Tao or me about a blog post here? :) > Hi Lorenzo, > > I truly believe Tao is acting with good intentions, although > the way this is being done is quite messy. > > Memory costs are increasing significantly these days, and as I > understand the patchset, he is trying to save memory. I think there's broad awareness (from myself in particular...!) of this. > > However, I don't think this is being done at the right time > or in the right way. This may also be due to cultural > differences, language barriers, information gaps, and a lack > of familiarity with the mm community. > As a non-native speaker, I can see how difficult this can > sometimes be. > > I would really ask you to give Tao more chances to build > trust step by step. > > Best Regards > Barry I understand and empathise with language difficulties - I have zero objection to using LLMs to assist with that. 
But none of my objections relate to this.

We have received a huge, invasive, unmergeable series with code that reads
exactly as you'd expect from LLM-generated code, that Claude assigns a high
probability of being AI generated, from somebody with:

- 0 previous mm contributions
- 0 interactions in rmap
- 2 patches in 2026 (neither mm)
- prior to that only devicetree contributions from 8 years ago

What would you have me do under those circumstances?

Unfortunately this means I have very little trust in Tao, and given limited
maintainership resource, as I said, I suggest he attempts no further code
contributions to rmap.

And as I said elsewhere, he can rebuild trust through constructive discussion.
Also perhaps building up credibility in mm through smaller series showing
understanding?

Thanks, Lorenzo
I'm sorry but this is not how kernel development is done.

You're sending a series that's very invasive, that you've not coordinated with
anybody else, nor have you mentioned it at a conference, nor engaged with in
discussion with anybody else in the community in any way.

And you've sent it without an RFC, at -rc5 is... quite something.

We do NOT want to extend or expand or hack in anything like this on top of the
existing anon_vma machinery. It's a mess that requires replacement, not more
hacks or expansion.

I've been working on a replacement for the anonymous rmap, recently presenting
at LSF/MM, and all of that has been very public.

In fact I have engaged in recent work which reduced lock contention in
anon_vma, it's really quite discourteous for you not to have contacted me or
the community in addition to the above.

On Wed, May 27, 2026 at 07:01:32PM +0800, tao wrote:
> TL;DR
> -----
>
> This series introduces ANON_VMA_LAZY, which defers anon_vma creation
> until it is actually required.
>
> - anon_vma memory reduced by ~92-97%, anon_vma_chain reduced by ~50-57%
> - rmap operations on ANON_VMA_LAZY VMAs do not require anon_vma locking
>
> Background
> ----------
>
> Currently anon_vma structures are created eagerly when anonymous VMAs
> are initialized. However, many VMAs never participate in fork or rmap

What are you talking about? 'Initialized'? They are created when memory is
faulted in, and we explicitly need to know that that's the case. Also the
folio->mapping is required to point to something to allow for anon rmap...

> operations that require anon_vma chains, so the allocated anon_vma and
> anon_vma_chain objects are often unnecessary.

Right, because we never split or merge VMAs nor require anon rmap?

>
> Design overview
> ---------------
>
> ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> (for example during fork). VMAs that never participate in sharing can
> avoid creating anon_vma structures entirely.
Well, it's needed the second something's faulted in so you can perform anon
rmap.

>
> Before an anon_vma exists, rmap operations rely directly on VMA
> information, so no anon_vma locking is required. An anon_vma is created
> and linked only when sharing semantics are required.

Err 'directly on VMA information'... a VMA pointer? That can change at any
point? What about remaps?... I guess I'll see in the code.

>
> This series introduces anon_rmap helpers to make rmap less dependent on
> direct anon_vma access. It also introduces anon_vma_tree_t as a container
> to support both the lazy and the existing anon_vma layouts.

Super invasive, extending the already broken abstraction further. We don't want
this.

>
> Once a VMA becomes associated with an anon_vma, the normal behavior
> remains unchanged.
>
> Memory impact
> -------------
>
> Preliminary measurements show significant reductions in anon_vma-related
> slab allocations.
>
> After boot:
>
> Object         | Before (active KB) | After (active KB) | Change
> vm_area_struct | 117035             | 118176            | +1.0%
> anon_vma_chain | 18865.8            | 8112.06           | -57.0%
> anon_vma       | 20426.4            | 613.75            | -97.0%
>
> After launching 24 apps:
>
> Object         | Before (active KB) | After (active KB) | Change
> vm_area_struct | 196873             | 197345            | +0.2%
> anon_vma_chain | 31477.1            | 15576.8           | -50.5%
> anon_vma       | 33280              | 2648.12           | -92.0%
>
> Simple fork microbenchmarks also show a slight improvement in fork
> performance, since child VMAs do not need to allocate anon_vma
> structures during fork.

This seems completely broken too re: anon_vma propagation on fork?

The above is only meaningful if you're not fundamentally breaking anon rmap
which is very easily done, but in addition, I'm not interested in seeing the
anon_vma machinery extended further.

>
> Feedback and suggestions are welcome.

This is what you should have sought AHEAD of sending this.

I'll look at the code, but in general you've gone about this in a really
unfortunate way with respect to the community.
This is not to collaborate.

>
>
> tao (15):
>   mm/rmap: introduce anon_rmap APIs for anonymous folios
>   mm: convert anon_vma rmap APIs to anon_rmap
>   mm: introduce anon_vma_tree_t for multiple anon_vma topologies
>   mm: switch to anon_vma_tree_t APIs in preparation for ANON_VMA_LAZY
>   mm: add CONFIG_ANON_VMA_LAZY and folio helpers
>   mm: add CONFIG_VMA_REF and VMA helpers
>   mm: replace direct FOLIO_MAPPING_ANON usage with helpers
>   mm: prepare rmap infrastructure for ANON_VMA_LAZY
>   mm: implement ANON_VMA_LAZY rmap semantics
>   mm: defer anon_vma creation with ANON_VMA_LAZY
>   mm: handle ANON_VMA_LAZY in huge page operations
>   mm: handle ANON_VMA_LAZY during migration
>   mm: support setup and upgrade of ANON_VMA_LAZY folios
>   mm: support merging of ANON_VMA_LAZY VMAs
>   mm: enable CONFIG_ANON_VMA_LAZY on arm64 and x86_64
>
>  arch/arm64/Kconfig         |   1 +
>  arch/x86/Kconfig           |   1 +
>  fs/proc/page.c             |   6 +-
>  include/linux/mm.h         |  38 ++
>  include/linux/mm_types.h   |   9 +-
>  include/linux/page-flags.h |  34 +-
>  include/linux/pagemap.h    |   2 +-
>  include/linux/rmap.h       | 165 ++++++++-
>  mm/Kconfig                 |  22 ++
>  mm/damon/ops-common.c      |   4 +-
>  mm/debug.c                 |   2 +-
>  mm/debug_vm_pgtable.c      |   2 +-
>  mm/gup.c                   |   6 +-
>  mm/huge_memory.c           |  16 +-
>  mm/internal.h              | 171 +++++++++
>  mm/khugepaged.c            |  13 +-
>  mm/ksm.c                   |  43 ++-
>  mm/memory-failure.c        |  11 +-
>  mm/memory.c                |  19 +-
>  mm/migrate.c               | 126 ++++---
>  mm/mmap.c                  |  15 +-
>  mm/mremap.c                |   4 +-
>  mm/page_idle.c             |   2 +-
>  mm/rmap.c                  | 690 ++++++++++++++++++++++++++++++++++---
>  mm/vma.c                   |  76 ++--
>  mm/vma.h                   |   4 +-
>  mm/vma_exec.c              |   2 +-
>  mm/vma_init.c              |   1 +
>  28 files changed, 1279 insertions(+), 206 deletions(-)
>
> --
> 2.17.1
>
>

Lorenzo
> Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred > anon_vma creation > > I'm sorry but this is not how kernel development is done. > > You're sending a series that's very invasive, that you've not coordinated with > anybody else, nor have you mentioned it at a conference, nor engaged with > in discussion with anybody else in the community in any way. > > And you've sent it without an RFC, at -rc5 is... quite something. > > We do NOT want to extend or expand or hack in anything like this on top of > the existing anon_vma machinery. It's a mess that requires replacement, not > more hacks or expansion. > > I've been working on a replacement for the anonymous rmap, recently > presenting at LSF/MM, and all of that has been very public. > > In fact I have engaged in recent work which reduced lock contention in > anon_vma, it's really quite discourteous for you not to have contacted me or > the community in addition to the above. > First of all, thank you very much for your reply. I'm also glad to learn that you have been working on optimizations related to anon_vma. I noticed your work from another email in the thread. I apologize if my approach caused any inconvenience. My English is not very good, and I have rarely participated in community discussions before, so I'm still learning how things are usually done in the kernel community. If anything I did came across as discourteous, please understand that it was not my intention. Recently, due to increasing memory costs, I revisited the memory usage of anon_vma and found that it might be possible to reduce its overhead. My original intention was simply to experiment with an anon_vma_lazy mechanism to reduce the memory footprint of anon_vma. However, while working on it I realized that anon_vma handling is quite complex. 
In particular, after multiple levels of fork the topology of anonymous
pages can become quite complicated, and it also interacts with other
subsystems such as reclaim, KSM, and migration.

Because of this, I tried separating the functionality of anon_vma
into two parts: anon_rmap_t for anonymous page reverse mapping, and
anon_vma_tree_t for topology management.

anon_rmap_t provides the reverse mapping interface used by reclaim,
KSM, migration, and similar components.

anon_vma_tree_t manages the topology internally for operations such
as fork, clone, split, and merge, and can also indicate whether a
VMA has experienced page faults.

My hope is that by separating these responsibilities, it may become
easier to reason about and further improve the anon_vma design in the
future.
On Thu, May 28, 2026 at 07:11:19AM +0000, wangtao wrote: > > > > Subject: Re: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred > > anon_vma creation > > > > I'm sorry but this is not how kernel development is done. > > > > You're sending a series that's very invasive, that you've not coordinated with > > anybody else, nor have you mentioned it at a conference, nor engaged with > > in discussion with anybody else in the community in any way. > > > > And you've sent it without an RFC, at -rc5 is... quite something. > > > > We do NOT want to extend or expand or hack in anything like this on top of > > the existing anon_vma machinery. It's a mess that requires replacement, not > > more hacks or expansion. > > > > I've been working on a replacement for the anonymous rmap, recently > > presenting at LSF/MM, and all of that has been very public. > > > > In fact I have engaged in recent work which reduced lock contention in > > anon_vma, it's really quite discourteous for you not to have contacted me or > > the community in addition to the above. > > > First of all, thank you very much for your reply. > > I'm also glad to learn that you have been working on optimizations > related to anon_vma. I noticed your work from another email in the > thread. > > I apologize if my approach caused any inconvenience. My English is not > very good, and I have rarely participated in community discussions > before, so I'm still learning how things are usually done in the > kernel community. If anything I did came across as discourteous, please > understand that it was not my intention. > > Recently, due to increasing memory costs, I revisited the memory usage > of anon_vma and found that it might be possible to reduce its > overhead. > > My original intention was simply to experiment with an anon_vma_lazy > mechanism to reduce the memory footprint of anon_vma. However, while > working on it I realized that anon_vma handling is quite complex. 
In > particular, after multiple levels of fork the topology of anonymous > pages can become quite complicated, and it also interacts with other > subsystems such as reclaim, KSM, and migration. > > Because of this, I tried separating the functionality of anon_vma > into two parts: anon_rmap_t for anonymous page reverse mapping, and > anon_vma_tree_t for topology management. > > anon_rmap_t provides the reverse mapping interface used by reclaim, > KSM, migration, and similar components. > > anon_vma_tree_t manages the topology internally for operations such > as fork, clone, split, and merge, and can also indicate whether a > VMA has experienced page faults. > > My hope is that by separating these responsibilities, it may become > easier to reason about and further improve the anon_vma design in the > future. I understand your approach, as discussed it's not viable. As mentioned elsewhere, please refrain from further code contributions to rmap, as there is now a trust issue due to a high likelihood of undeclared AI-generated code. You are, however, welcome to engage in discussion, and I'm happy to discuss the approach publicly on list or via private email whichever you prefer :) You are also welcome to engage in discussion/review/critique once I produce my RFC for this. Thanks, Lorenzo
On Wed, May 27, 2026 at 07:01:32PM +0800, tao wrote: > TL;DR > ----- > > This series introduces ANON_VMA_LAZY, which defers anon_vma creation > until it is actually required. > > - anon_vma memory reduced by ~92-97%, anon_vma_chain reduced by ~50-57% > - rmap operations on ANON_VMA_LAZY VMAs do not require anon_vma locking > > Background > ---------- > > Currently anon_vma structures are created eagerly when anonymous VMAs > are initialized. However, many VMAs never participate in fork or rmap This is not true, they are created on fault + a few other places. > operations that require anon_vma chains, so the allocated anon_vma and > anon_vma_chain objects are often unnecessary. > > Design overview > --------------- > > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed > (for example during fork). VMAs that never participate in sharing can > avoid creating anon_vma structures entirely. > > Before an anon_vma exists, rmap operations rely directly on VMA > information, so no anon_vma locking is required. An anon_vma is created > and linked only when sharing semantics are required. > > This series introduces anon_rmap helpers to make rmap less dependent on > direct anon_vma access. It also introduces anon_vma_tree_t as a container > to support both the lazy and the existing anon_vma layouts. > > Once a VMA becomes associated with an anon_vma, the normal behavior > remains unchanged. > > Memory impact > ------------- > > Preliminary measurements show significant reductions in anon_vma-related > slab allocations. 
> > After boot: > > Object | Before (active KB) | After (active KB) | Change > vm_area_struct | 117035 | 118176 | +1.0% > anon_vma_chain | 18865.8 | 8112.06 | -57.0% > anon_vma | 20426.4 | 613.75 | -97.0% > > After launching 24 apps: > > Object | Before (active KB) | After (active KB) | Change > vm_area_struct | 196873 | 197345 | +0.2% > anon_vma_chain | 31477.1 | 15576.8 | -50.5% > anon_vma | 33280 | 2648.12 | -92.0% > > Simple fork microbenchmarks also show a slight improvement in fork > performance, since child VMAs do not need to allocate anon_vma > structures during fork. > > Feedback and suggestions are welcome. I'm afraid, per previous discussions[1], that no one is really willing to maintain extra complexity for the current state of anon rmap and anon vmas. Sorry :/ Also, please don't send series this large without previous discussion and _at least_ an RFC tag. [1] https://lore.kernel.org/all/aec533b2-37a7-4f44-a279-c4aa604206ac@lucifer.local/ -- Pedro
> >
> > Feedback and suggestions are welcome.
>
> I'm afraid, per previous discussions[1], that no one is really willing to maintain
> extra complexity for the current state of anon rmap and anon vmas.
> Sorry :/
>
> Also, please don't send series this large without previous discussion and _at
> least_ an RFC tag.
>
> [1] https://lore.kernel.org/all/aec533b2-37a7-4f44-a279-c4aa604206ac@lucifer.local/
>
> --
> Pedro

Thank you very much for your reply.

As I am not very good at English, I haven't participated much in community discussions before and I'm still not very familiar with the usual process.
I realize now that I should probably have started with a discussion thread first, and that this patch series would have been more appropriate with an RFC tag.
I apologize for that.

I will read the discussion in [1] more carefully. I also noticed the related code here:
https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context

Many years ago I had already noticed that data structures such as vma, page_table, and anon_vma consume a significant amount of memory.
However, since the mm subsystem is quite complex, I didn't look into it in depth at the time.
Recently, with memory costs increasing, I revisited these structures and analyzed their memory usage again.
Since anon_vma seems to have a relatively smaller impact compared to vma and page tables, I started by exploring possible optimizations for anon_vma first.

Although anon_vma is relatively simple, there are still quite a few uncertainties.
So I waited until the basic functionality was implemented before sending the patches for discussion.

Thanks again for taking the time to reply.

--
Tao
On Thu, May 28, 2026 at 06:45:07AM +0000, wangtao wrote: > > > > > > Feedback and suggestions are welcome. > > > > I'm afraid, per previous discussions[1], that no one is really willing to maintain > > extra complexity for the current state of anon rmap and anon vmas. > > Sorry :/ > > > > Also, please don't send series this large without previous discussion and _at > > least_ an RFC tag. > > > > [1] https://lore.kernel.org/all/aec533b2-37a7-4f44-a279- > > c4aa604206ac@lucifer.local/ > > > > -- > > Pedro > > Thank you very much for your reply. > > As I am not very good at english, I haven't participated much in community discussions before and I'm still not very familiar with the usual process. > I realize now that I should probably have started with a discussion thread first, and that this patch series would have been more appropriate with an RFC tag. > I apologize for that. Thanks, appreciate it. It's also for your benefit - regardless of AI usage or not, you've spent time on this needlessly, which a discussion could have avoided. Also as I said, going this way has damaged trust, which also doesn't benefit anybody. mm is a welcoming and open community, the best approach when looking at something like this is to engage with us :) > > I will read the discussion in [1] more carefully. I also noticed the related code here: > https://git.kernel.org/pub/scm/linux/kernel/git/ljs/linux.git/log/?h=project/cow-context > > Many years ago I had already noticed that data structures such as vma, page_table, and anon_vma consume a significant amount of memory. > However, since the mm subsystem is quite complex, I didn't look into it in depth at the time. > Recently, with memory costs increasing, I revisited these structures and analyzed their memory usage again. > Since anon_vma seems to have a relatively smaller impact compared to vma and page tables, I started by exploring possible optimizations for anon_vma first. 
>
> Although anon_vma is relatively simple, there are still quite a few uncertainties.
> So I waited until the basic functionality was implemented before sending the patches for discussion.

Thanks, I am more than happy to discuss my approach. You can also see slides
from my talk on this at LSF/MM at https://ljs.io/talks

(Note that the code linked is an incomplete implementation, simply some early
code to give a sense of the approach taken!)

I would ask, however, in general that you hold off on anything code-wise before
I am able to issue my own RFC implementing this approach so we can avoid any
overlap/confusion.

>
> Thanks again for taking the time to reply.
>
> --
> Tao
>

Cheers, Lorenzo