RE: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation

wangtao posted 15 patches 4 days, 4 hours ago
Posted by wangtao 4 days, 4 hours ago
> 
> Thanks for your replies, but I really have to stop doing deeper analyses like
> these for time management purposes.
Of course I will respond to technical discussions.

> 
> I did this more so to make the point from [0] as to why, in lower trust
> environments, this is just not feasible.
> 
> We could loop around for hours and hours and hours here.
> 
> In general as before, even if all worked perfectly (I'm very much not at all
> convinced), extending anon_vma and pinning VMAs is simply a no-go for
> architectural and complexity reasons.
> 
> I also find the locking story dubious and the lack of tests or anything
> corroborating correctness is additionally fatal.
> 
During rmap, the anon_vma yields a superset of the VMAs that may map the
folio. Each candidate is first filtered with vma_address(); then, in each
rmap_one callback, page_vma_mapped_walk() and check_pte() decide whether
the VMA actually needs to be processed.
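
The first filtering step can be modelled in user space. This is only a
sketch: struct vma_model, ADDR_NONE and both function names are
hypothetical stand-ins, and the address check is a simplified version of
what vma_address() does for a single page.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define ADDR_NONE ((unsigned long)-1) /* stand-in for the kernel's -EFAULT */

/* Simplified stand-in for struct vm_area_struct (hypothetical). */
struct vma_model {
	unsigned long vm_start; /* first mapped address */
	unsigned long vm_end;   /* one past the last mapped address */
	unsigned long vm_pgoff; /* page offset that vm_start corresponds to */
};

/* Address at which page offset pgoff is mapped in vma, or ADDR_NONE. */
unsigned long vma_address_model(const struct vma_model *vma,
				unsigned long pgoff)
{
	unsigned long addr;

	assert(vma->vm_start < vma->vm_end);
	if (pgoff < vma->vm_pgoff)
		return ADDR_NONE;
	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
	if (addr < vma->vm_start || addr >= vma->vm_end)
		return ADDR_NONE;
	return addr;
}

/* Count the candidate VMAs that survive the address check for pgoff. */
size_t count_mapping_candidates(const struct vma_model *vmas, size_t n,
				unsigned long pgoff)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (vma_address_model(&vmas[i], pgoff) != ADDR_NONE)
			hits++;
	return hits;
}
```

A candidate that passes this check may still not map the folio, which is
why the page table walk remains the final authority.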

The lazy VMA used by ANON_VMA_LAZY provides only one VMA: as long as
there is no fork or mremap, this single VMA is sufficient. To avoid
taking the folio lock during fork and mremap, folio->mapping is rechecked
after rmap_walk_anon(), and if it has been upgraded to an anon_vma, we
retry once.
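
The control flow of that retry can be sketched as follows. This is a
user-space model under stated assumptions, not code from the series: the
MAPPING_LAZY tag bit, struct folio_model and all function names are
hypothetical, and the sketch says nothing about whether the memory
ordering is sufficient in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAPPING_LAZY 0x1UL /* hypothetical tag: mapping points at a lazy VMA */

struct folio_model {
	_Atomic uintptr_t mapping; /* tagged pointer, modelled on folio->mapping */
};

typedef void (*walk_fn)(struct folio_model *folio, uintptr_t mapping,
			void *arg);

/*
 * Walk with a snapshot of the mapping. If the snapshot was lazy and the
 * mapping changed while walking, walk once more with the new value.
 * Returns the number of walks performed (1 or 2).
 */
int rmap_walk_retry_once(struct folio_model *folio, walk_fn walk, void *arg)
{
	uintptr_t seen = atomic_load(&folio->mapping);
	uintptr_t now;

	assert(walk != NULL);
	walk(folio, seen, arg);
	if (!(seen & MAPPING_LAZY))
		return 1;
	now = atomic_load(&folio->mapping);
	if (now == seen)
		return 1;
	walk(folio, now, arg);
	return 2;
}

/* Demo callback: only counts how often it was invoked. */
void count_walk(struct folio_model *folio, uintptr_t mapping, void *arg)
{
	(void)folio;
	(void)mapping;
	++*(int *)arg;
}

/* Demo callback: simulates a fork upgrading the mapping mid-walk. */
void upgrade_during_walk(struct folio_model *folio, uintptr_t mapping,
			 void *arg)
{
	if (mapping & MAPPING_LAZY)
		atomic_store(&folio->mapping, (uintptr_t)arg);
}
```

The second walk never triggers a third, so the loop is bounded even if
the mapping keeps changing.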

If your concern is the lack of locking during rmap, one option is to
add a set of anon_vma locks modelled on folio_wait_table. That is how I
handled it during my initial debugging. Later, after reviewing the code
flow, I concluded that the lock might not be necessary, so I removed it.
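
For illustration, a hashed lock table of that kind might look like the
sketch below. It is a user-space model, assuming a fixed-size table
indexed by a pointer hash in the style of folio_wait_table; the table
name, its size and the three functions are hypothetical, and a plain
atomic flag stands in for a real kernel lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_TABLE_BITS 8
#define LOCK_TABLE_SIZE (1U << LOCK_TABLE_BITS)
#define GOLDEN_RATIO_64 0x61C8864680B583EBull /* multiplicative hash constant */

/* Static storage is zero-initialised, so every slot starts unlocked. */
static atomic_int anon_vma_lock_table[LOCK_TABLE_SIZE];

/* Map a pointer (e.g. a lazy VMA) to a slot in the shared table. */
unsigned int lock_index(const void *key)
{
	uint64_t v = (uint64_t)(uintptr_t)key;

	return (unsigned int)((v * GOLDEN_RATIO_64) >> (64 - LOCK_TABLE_BITS));
}

/* Try to take the slot lock for key; returns true on success. */
bool anon_lock_trylock(const void *key)
{
	int unlocked = 0;

	return atomic_compare_exchange_strong(
		&anon_vma_lock_table[lock_index(key)], &unlocked, 1);
}

/* Release the slot lock for key; it must currently be held. */
void anon_lock_unlock(const void *key)
{
	atomic_int *slot = &anon_vma_lock_table[lock_index(key)];

	assert(atomic_load(slot) == 1);
	atomic_store(slot, 0);
}
```

Unrelated keys can hash to the same slot, so such a table trades some
false contention for not storing a lock in every object.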

> And finally, I was already working on a replacement for anon_vma, and the
> generally done thing in these situations is for my work to take precedence.
> 
> So I'm going to bail out on further deeper analyses here as otherwise I simply
> can't work on anything else :)
> 
> Thanks, Lorenzo
> 
> [0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/
> 
> 
> On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > > >
> > >
> > > Against my better judgment I'll address the stuff here...
> > >
> > > > VMA operations can be roughly divided into three categories. The
> > > > handling of ANON_VMA_LAZY is briefly described below.
> > >
> > > I don't agree, there are plenty more VMA operations. But with
> > > respect to anon rmap there are:
> > >
> > > - fork
> > > - merge/split
> > > - remap
> > >
> >
> > Yes, these are the three categories. I originally intended to explain
> > them by classifying based on system calls; I should have used mremap
> > instead of move_vma.
> 
> I don't think you mentioned move_vma()? Maybe I missed it.
> 
> The categorisation is most usefully based on callers of anon_vma_clone().
> 
> >
> > > Your approach seems to completely ignore VMA split and the need to
> > > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> > >
> >
> > The folio uses vma->root_vma to compute folio_address. A VMA split
> > from it, vma_a, also uses vma_a->root_vma = vma->root_vma to compute
> > folio_address.
> > During rmap, once folio_address is obtained, the VMA can be found
> > through mm_mt. Without fork, there is no need to maintain the interval
> > tree.
> 
> Well you need to search for every possible split VMA in mm_mt now, so you
> have to go page-by-page searching for each page for the rmap walked range.
> 
ANON_VMA_LAZY has only one VMA. When I first looked at
rmap_walk_ksm, I also thought it would need to search page by page,
which seemed unacceptable. Later I realized that it only needs to
check whether this VMA falls within the rmap walk range.

@@ -3173,20 +3171,20 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
-		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
+		anon_rmap_foreach_vma(vma, vmac, anon_rmap,
 					       0, ULONG_MAX) {
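
A minimal user-space sketch of that single-VMA range check, assuming the
walk range is expressed as an inclusive page-offset interval as in the
interval tree iteration above. struct lazy_vma_model and both function
names are hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12

/* Simplified stand-in for the single lazy VMA (hypothetical). */
struct lazy_vma_model {
	unsigned long vm_start; /* first mapped address */
	unsigned long vm_end;   /* one past the last mapped address */
	unsigned long vm_pgoff; /* page offset that vm_start corresponds to */
};

/* Last page offset covered by the VMA (inclusive). */
unsigned long lazy_vma_last_pgoff(const struct lazy_vma_model *vma)
{
	assert(vma->vm_end > vma->vm_start);
	return vma->vm_pgoff +
	       ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) - 1;
}

/* True if the VMA overlaps the inclusive walk range [first, last]. */
bool lazy_vma_in_walk_range(const struct lazy_vma_model *vma,
			    unsigned long first, unsigned long last)
{
	return vma->vm_pgoff <= last && lazy_vma_last_pgoff(vma) >= first;
}
```

With a range of 0 to ULONG_MAX, as in the rmap_walk_ksm hunk, the check
always passes and the single VMA is always visited.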

> You're also potentially racing against a remap, as you say below you don't
> folio lock on remap so concurrent rmap walkers can be present, the VMA can
> already be copied.
> 
> We already have VMA lifecycle state around detached VMAs, so a VMA
> could be in a detached state, assumed by the existing logic to be entirely
> unavailable for use, out of the maple tree altogether but kept around in a
> zombie state.
> 
> We'd then have lifecycle issues and races and edge cases around process
> teardown otherwise we might leak memory.
> 
> Also, presumably you set vma->anon_vma to some lazy sentinel value so
> that mremap doesn't change vma->vm_pgoff when unfaulted?
> 
> You would need to update any path that manipulates vma->anon_vma also
> so it doesn't incorrectly dereference it.
> 
Yes, most of the code in this patch series is intended to prevent
incorrect dereferencing of anon_vma. If we assume it will not be
misused, some of the code could be simplified or removed.

> >
> > > You may also actually split a VMA against a single large folio
> > > (waiting on the deferred shrinker) and have a SINGLE _leaf_
> > > anonymous folio that is mapped in two places.
> > >
> > > The lazy approach doesn't seem to address this properly. And fatally
> > > it ties an actual VMA afaict to the folio and has to implement a VMA
> > > reference count mechanism which interferes with the ordinary VMA
> > > lifecycle to do it.
> > >
> > > The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> > > 'leaves' is something that my approach is exactly taking into account.
> > >
> > > Of course also extending anon_vma is a real non-starter.
> > >
> > > Also the below + the series ignores MAP_PRIVATE file-backed mappings
> > > which is a pretty fatal flaw.
> > >
> > > It also, as Harry says, has zero description of correctness in a way
> > > we'd want and no tests.
> > >
> >
> > It can correctly handle the case where a VMA is split within a large
> > page. The address of a sub_page in the split VMA (vma_a or vma_b) is
> > computed using the following method.
> >
> > For COW anonymous pages originating from file VMAs, the page/folio
> > address is also computed using the same method.
> >
> > subpage_address
> >   = vma_address(vma_a, subpage_pgoff, 1)
> >   = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
> >   = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE
> >     + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE
> 
> OK but you want to walk entries in a _range_ in the interval tree.
> 
> So you are then now looking up VMAs (in a racey way) using mm_mt (which
> is the whole basis of my work actually) which could change under you.
> 
> I guess what you're doing is using the pinned 'root' VMA as the basis of
> everything, and the second a VMA is moved you (somehow) walk the page
> tables to update the folio->mapping.
> 
> Again pinning the VMA like this and putting it in a folio is really not something
> we want to do.
> 
> It adds a ton of complexity and also impacts VMA lifecycle which is already
> fairly fraught.
> 
> It makes the VMA no longer just a VMA but rather also a 'memory' of where
> something was first faulted in as a hack more or less.
> 
Maybe you're right. mm, mm_mt, the VMA and the page tables each play
their own role in implementing virtual memory. Perhaps considering them
together could lead to better ideas.