> On 5/27/26 8:01 PM, tao wrote:
> > Design overview
> > ---------------
> >
> > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> > (for example during fork). VMAs that never participate in sharing can
> > avoid creating anon_vma structures entirely.
> >
> > Before an anon_vma exists, rmap operations rely directly on VMA
> > information, so no anon_vma locking is required. An anon_vma is
> > created and linked only when sharing semantics are required.
>
> It is unfortunate that the design overview doesn't cover correctness aspect
> at all. VMAs are subject to change (even before being shared with other
> processes), and rmap needs something that doesn't go away across VMA
> merging, split, etc.
>
> I'm not sure how the idea is supposed work correctly.
>
> --
> Cheers,
> Harry / Hyeonggon

VMA operations can be roughly divided into three categories. The handling
of ANON_VMA_LAZY is briefly described below.

1. fork

fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and is
not involved here.) This can be viewed as copying the VMAs with identical
virtual addresses into a new address space.

If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
regular anon_vma. The corresponding folio->mapping is then fixed in
try_dup_anon_rmap().

2. mmap / brk / mprotect / munmap

These operations create, modify, or remove VMAs in the current mm. They
may split existing VMAs, merge adjacent VMAs, or remove a VMA from mm_mt.

When a new VMA is created, vm_start, vm_end and vm_pgoff are initialized
and the VMA is inserted into mm_mt. Although these fields may later be
modified, the following value remains invariant:

    (vm_start - vm_pgoff * PAGE_SIZE)

We refer to this value as:

    vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE

This value also remains unchanged when the VMA is removed from mm_mt.

If a VMA is split and produces new_vma, the following holds:

    vma_mapping_base(new_vma) == vma_mapping_base(vma)

If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:

    vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
    vma_mapping_base(vma_x)
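
To make the split/merge claim above concrete, here is a small user-space
model. It is only a sketch, not kernel code: struct toy_vma, toy_split() and
toy_merge() are made-up names, the field values used with it are arbitrary,
and it assumes that a split advances vm_pgoff by the number of pages cut off
the front.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in holding only the three fields discussed above. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

static unsigned long vma_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - vma->vm_pgoff * PAGE_SIZE;
}

/*
 * Split vma at addr: vma keeps [vm_start, addr) and the returned VMA
 * covers [addr, old vm_end). Assumption: the new vm_pgoff is the old one
 * plus the number of pages between vm_start and addr.
 */
static struct toy_vma toy_split(struct toy_vma *vma, unsigned long addr)
{
	struct toy_vma new_vma = {
		.vm_start = addr,
		.vm_end   = vma->vm_end,
		.vm_pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT),
	};

	vma->vm_end = addr;
	return new_vma;
}

/* Merge two adjacent VMAs (a directly below b) whose offsets line up. */
static struct toy_vma toy_merge(const struct toy_vma *a, const struct toy_vma *b)
{
	struct toy_vma x = { a->vm_start, b->vm_end, a->vm_pgoff };

	return x;
}
```

For example, splitting { 0x10000000, 0x10010000, pgoff 0x100 } at 0x10004000
gives the second half vm_pgoff 0x104, and both halves, as well as the result
of merging them again, share the same vma_mapping_base().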

Assume the VMA where the first page fault occurs is called root_vma, and
ensure that any VMA produced by split or merge holds a reference to
root_vma.

During rmap we can compute the folio address using root_vma:

    vma_address(vma, pgoff, 1) =
        vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
        = vma_mapping_base(vma) + pgoff * PAGE_SIZE
        = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE

We can then use folio_addr to locate the VMA covering this folio.
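
The lookup described above can be modeled in user space as follows. This is
a sketch under assumptions, not the patch: toy_vma_address() keeps only the
arithmetic of the expression above and drops the page-count argument and any
range checking, and toy_find_vma() is a linear search standing in for an
mm_mt lookup. All toy_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in holding only the three fields discussed above. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

static unsigned long vma_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - vma->vm_pgoff * PAGE_SIZE;
}

/* First line of the derivation: address of pgoff as seen through vma. */
static unsigned long toy_vma_address(const struct toy_vma *vma,
				     unsigned long pgoff)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

/* Last line of the derivation: only a mapping base is needed. */
static unsigned long toy_address_from_base(unsigned long base,
					   unsigned long pgoff)
{
	return base + pgoff * PAGE_SIZE;
}

/* Linear search standing in for a lookup by address in mm_mt. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, int n,
					  unsigned long addr)
{
	for (int i = 0; i < n; i++)
		if (addr >= vmas[i].vm_start && addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}
```

With two VMAs produced by a split, { 0x10000000, 0x10004000, pgoff 0x100 }
and { 0x10004000, 0x10010000, pgoff 0x104 }, the base of the original VMA
plus pgoff 0x106 yields 0x10006000, which falls in the second VMA and matches
toy_vma_address() on that VMA. The model says nothing about locking or about
the VMA set changing during the lookup.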

3. mremap / uffd_move

If only the size changes and the start address remains the same, there
is no impact.

If the start address changes, the page is moved from (vma, addr) to
(new_vma, new_addr). In this case:

    vma_mapping_base(new_vma) =
        vma_mapping_base(vma) + new_addr - old_addr

We first upgrade the VMA, and then fix folio->mapping in move_ptes().

If performance becomes a concern, ANON_VMA_LAZY can be enabled only for
relatively small VMAs.
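
The move case can be checked with the same kind of user-space model. This is
a sketch that assumes the moved VMA keeps its vm_pgoff and only its start and
end change, which is what the relation above implies; toy_move() and
toy_move_shifts_base() are made-up names, not kernel functions.

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical stand-in holding only the three fields discussed above. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

static unsigned long vma_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - vma->vm_pgoff * PAGE_SIZE;
}

/*
 * Model of a move to new_addr. Assumption: the size and vm_pgoff are
 * carried over unchanged and only the address range differs.
 */
static struct toy_vma toy_move(const struct toy_vma *vma, unsigned long new_addr)
{
	struct toy_vma new_vma = {
		.vm_start = new_addr,
		.vm_end   = new_addr + (vma->vm_end - vma->vm_start),
		.vm_pgoff = vma->vm_pgoff,
	};

	return new_vma;
}

/* Returns 1 when base(new_vma) == base(vma) + new_addr - old_addr. */
static int toy_move_shifts_base(const struct toy_vma *vma, unsigned long new_addr)
{
	struct toy_vma new_vma = toy_move(vma, new_addr);

	return vma_mapping_base(&new_vma) ==
	       vma_mapping_base(vma) + new_addr - vma->vm_start;
}
```

In this model an address computed from the old base no longer points at the
moved page, so whatever the folio refers to has to be updated on a move.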

> During rmap we can compute the folio address using root_vma:
>
> vma_address(vma, pgoff, 1) =
> vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
>

It is inconsistent here. The offset should remain pgoff throughout. It
should be:

    vma_address(vma, pgoff, 1) =
        vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
        = vma_mapping_base(vma) + pgoff * PAGE_SIZE
        = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE

On Wed, Jun 03, 2026 at 02:59:04AM +0000, wangtao wrote:
> > On 5/27/26 8:01 PM, tao wrote:
> > > Design overview
> > > ---------------
> > >
> > > ANON_VMA_LAZY defers anon_vma allocation until it is actually needed
> > > (for example during fork). VMAs that never participate in sharing can
> > > avoid creating anon_vma structures entirely.
> > >
> > > Before an anon_vma exists, rmap operations rely directly on VMA
> > > information, so no anon_vma locking is required. An anon_vma is
> > > created and linked only when sharing semantics are required.
> >
> > It is unfortunate that the design overview doesn't cover correctness aspect
> > at all. VMAs are subject to change (even before being shared with other
> > processes), and rmap needs something that doesn't go away across VMA
> > merging, split, etc.
> >
> > I'm not sure how the idea is supposed work correctly.
> >
> > --
> > Cheers,
> > Harry / Hyeonggon

Against my better judgment I'll address the stuff here...

> VMA operations can be roughly divided into three categories. The handling
> of ANON_VMA_LAZY is briefly described below.

I don't agree, there are plenty more VMA operations. But with respect to anon
rmap there are:

- fork
- merge/split
- remap

Your approach seems to completely ignore VMA split and the need to maintain an
interval tree to _multiple_ VMAs from a single anon_vma.

You may also actually split a VMA against a single large folio (waiting on the
deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped in
two places.

The lazy approach doesn't seem to address this properly. And fatally it ties an
actual VMA afaict to the folio and has to implement a VMA reference count
mechanism which interferes with the ordinarily VMA lifecycle to do it.

The fact of us taking advantage of most stuff being AnonExclusive, i.e. 'leaves'
is something that my approach is exactly taking into account.

Of course also extending anon_vma is a real non-starter.

Also the below + the series ignores MAP_PRIVATE file-backed mappings which is a
pretty fatal flaw.

It also, as Harry says, has zero description of correctness in a way we'd want
and no tests.

>
> 1. fork
>
> fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and is
> not involved here.) This can be viewed as copying the VMAs with identical
> virtual addresses into a new address space.
>
> If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> regular anon_vma. The corresponding folio->mapping is then fixed in
> try_dup_anon_rmap().

And so we make fork, a very sensitive path in the kernel more expensive.

I also question the locking situation with the conversion mentioned, updating
folios in this manner is extremely difficult.

>
> 2. mmap / brk / mprotect / munmap
>
> These operations create, modify, or remove VMAs in the current mm. They
> may split existing VMAs, merge adjacent VMAs, or remove a VMA from mm_mt.

mmap and brk are not at all relevant to anon_vma, as no anon_vma is assigned
upon mapping. It's on fault.

mprotect/mlock/munmap/etc. might split, but I don't see how the lazy approach in
any way addresses any of that.

>
> When a new VMA is created, vm_start, vm_end and vm_pgoff are initialized
> and the VMA is inserted into mm_mt. Although these fields may later be
> modified, the following value remains invariant:
>
> (vm_start - vm_pgoff * PAGE_SIZE)

Err no it doesn't at all?

If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.

Then if I remap it, vm_start changes, vm_pgoff stays the same, so:

vm_start - vm_pgoff * PAGE_SIZE

Changes right? And then that becomes essentially the offset from where it was
faulted in.

>
> We refer to this value as:
>
> vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE

This is mysteriously close to being the offset I mention in my CoW context
work...

I'm not sure what 'mapping base' means here.

>
> This value also remains unchanged when the VMA is removed from mm_mt.

Why does it matter what this value is on unmap?

>
> If a VMA is split and produces new_vma, the following holds:
>
> vma_mapping_base(new_vma) == vma_mapping_base(vma)

This is a roundabout way of saying we offset the vma->vm_pgoff after split.

>
> If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
>
> vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
> vma_mapping_base(vma_x)

This is just a roundabout way of saying the pgoff has to be aligned.

>
> Assume the VMA where the first page fault occurs is called root_vma, and
> ensure that any VMA produced by split or merge holds a reference to
> root_vma.

But this VMA can be unmapped later? Or remapped?

Holding on to a VMA and treating it as some kind of canonical reference with a
reference count completely changes what VMAs are, impacts the VMA lifecycle,
and produces unwanted memory overhead in itself.

It also raises concerns and issues around lock order which is very sensitive.

>
> During rmap we can compute the folio address using root_vma:
>
> vma_address(vma, pgoff, 1) =

What's the parameters here? What's 1?

> vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
>
> We can then use folio_addr to locate the VMA covering this folio.

I'm really confused by this, you're kind of mixing and match parameters here.

What I think you're saying is that, if a folio hasn't been remapped, you can
figure out its address based on page offset.

That's completely broken for MAP_PRIVATE file-backed mappings which also use
anon_vma and also have to keep on working.

It seems that for the lazy approach what you are doing is essentially caching
the 'root' VMA in the folio. But this doesn't account for large folios and split
VMAs.

Even if you disabled it for those cases (which adds a ton of complexity in
itself), you then have issues with locking - the anon_vma lock has to take a
lock (that cannot be a VMA-level lock - results in lock inversion) even on
these leaf entries, or you break locking.

And we can't reasonably start pinning VMAs and using them as a sort of proto
cached thing on top of the existing anon_vma logic.

You also then need to, on remap, undo all this, which requires updating
folio->mapping on remap, something I tried doing previously myself, but that's
fraught with issues around lock inversion itself.

>
> 3. mremap / uffd_move

userfaultfd moving is not relevant as it actually updates the folio correctly.

>
> If only the size changes and the start address remains the same, there
> is no impact.
>
> If the start address changes, the page is moved from (vma, addr) to
> (new_vma, new_addr). In this case:
>
> vma_mapping_base(new_vma) =
> vma_mapping_base(vma) + new_addr - old_addr

You say above that the mapping base never changes? But here it changes?

>
> We first upgrade the VMA, and then fix folio->mapping in move_ptes().

What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a 'normal'
one.

As above, this is fraught with lock inversion issues.

>
> If performance becomes a concern, ANON_VMA_LAZY can be enabled only for
> relatively small VMAs.

I think you've got serious correctness, lock management and complexity issues
and it's all a non-starter as the costs deeply exceed the benefits.

This is one of the fundamental, frustrating aspects of the anon rmap - you keep
thinking that 'surely' you can do sensible thing X, but it turns out you can't
for various annoying reasons.

It's one of the reasons it's really fraught for somebody coming to make
changes, and one of the reasons why I am very keen on fundamentally changing
it.

And also on a not-wasting-time basis - I was already working in parallel on a
rework here, so I think the civil thing is to at least wait for my work before
issuing alternative solutions.

Thanks, Lorenzo

> Against my better judgment I'll address the stuff here...
>
> > VMA operations can be roughly divided into three categories. The
> > handling of ANON_VMA_LAZY is briefly described below.
>
> I don't agree, there are plenty more VMA operations. But with respect to
> anon rmap there are:
>
> - fork
> - merge/split
> - remap
>

Yes, these are the three categories. I originally intended to explain them by
classifying based on system calls; I should have used mremap instead of
move_vma.

> Your approach seems to completely ignore VMA split and the need to
> maintain an interval tree to _multiple_ VMAs from a single anon_vma.
>

The folio uses vma->root_vma to compute folio_address. A VMA split from it,
vma_a, also uses vma_a->root_vma = vma->root_vma to compute folio_address.
During rmap, once folio_address is obtained, the VMA can be found through
mm_mt. Without fork, there is no need to maintain the interval tree.

> You may also actually split a VMA against a single large folio (waiting on the
> deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped
> in two places.
>
> The lazy approach doesn't seem to address this properly. And fatally it ties an
> actual VMA afaict to the folio and has to implement a VMA reference count
> mechanism which interferes with the ordinarily VMA lifecycle to do it.
>
> The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> 'leaves' is something that my approach is exactly taking into account.
>
> Of course also extending anon_vma is a real non-starter.
>
> Also the below + the series ignores MAP_PRIVATE file-backed mappings
> which is a pretty fatal flaw.
>
> It also, as Harry says, has zero description of correctness in a way we'd want
> and no tests.
>

It can correctly handle the case where a VMA is split within a large page. The
address of a sub_page in the split VMA (vma_a or vma_b) is computed using the
following method.

For COW anonymous pages originating from file VMAs, the page/folio address is
also computed using the same method.

    subpage_address = vma_address(vma_a, subpage_pgoff, 1)
        = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
        = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
        = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
        = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE

> >
> > 1. fork
> >
> > fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and
> > is not involved here.) This can be viewed as copying the VMAs with
> > identical virtual addresses into a new address space.
> >
> > If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> > regular anon_vma. The corresponding folio->mapping is then fixed in
> > try_dup_anon_rmap().
>
> And so we make fork, a very sensitive path in the kernel more expensive.
>
> I also question the locking situation with the conversion mentioned, updating
> folios in this manner is extremely difficult.
>

Because rmap takes the PTE lock, while fork takes the mmap write lock, the VMA
write lock, and the PTE lock.

Given the rule that folio->mapping can only transition in one direction from
lazy_vma to a regular anon_vma, the situation can be handled correctly even
without taking the folio_lock.

When rmap and fork run concurrently:

If rmap observes folio->mapping as a regular anon_vma, there is obviously no
issue.

If rmap observes folio->mapping as lazy_vma, then rmap only processes the
parent's pvma. At the end of rmap_walk_anon(), if we see that folio->mapping
has changed to a regular anon_vma, we simply process it once more. The various
rmap_one implementations are idempotent anyway.
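
The recheck described above can be written down as a single-threaded toy
model. It is only a sketch of the control flow: struct toy_folio, the
toy_* functions and the hook that stands in for a concurrent fork are all
made up for illustration, and the model cannot show whether the real locking
makes such a recheck sufficient.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the two states of folio->mapping discussed above. */
enum toy_mapping_kind { TOY_LAZY_VMA, TOY_ANON_VMA };

struct toy_folio {
	enum toy_mapping_kind mapping;
	bool parent_visited;
	bool child_visited;
	int walks;
};

/*
 * One pass over the mappings. A lazy mapping only reaches the parent's
 * VMA; a regular one reaches parent and child. Visiting is idempotent.
 */
static void toy_walk_once(struct toy_folio *folio, enum toy_mapping_kind seen)
{
	folio->walks++;
	folio->parent_visited = true;
	if (seen == TOY_ANON_VMA)
		folio->child_visited = true;
}

/* Hook modeling a fork that upgrades the mapping while a walk runs. */
typedef void (*toy_race_fn)(struct toy_folio *folio);

static void toy_fork_upgrade(struct toy_folio *folio)
{
	folio->mapping = TOY_ANON_VMA;
}

/* Walk, then walk once more if the mapping was upgraded in the meantime. */
static void toy_rmap_walk(struct toy_folio *folio, toy_race_fn race)
{
	enum toy_mapping_kind seen = folio->mapping;

	toy_walk_once(folio, seen);
	if (race)
		race(folio);	/* the "concurrent" fork happens here */
	if (seen == TOY_LAZY_VMA && folio->mapping == TOY_ANON_VMA)
		toy_walk_once(folio, TOY_ANON_VMA);
}
```

Without the hook a lazy folio is walked once and only the parent is visited;
with toy_fork_upgrade as the hook it is walked twice and the child is visited
on the second pass.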

BTW: the commit message of patch 13 says a retry is needed, but the retry
handling was accidentally omitted in the posted patch.

> >
> > 2. mmap / brk / mprotect / munmap
> >
> > These operations create, modify, or remove VMAs in the current mm.
> > They may split existing VMAs, merge adjacent VMAs, or remove a VMA
> > from mm_mt.
>
> mmap and brk are not at all relevant to anon_vma, as no anon_vma is
> assigned upon mapping. It's on fault.
>

mmap()/brk() with a specified address may cause anonymous VMA merge or split.

> mprotect/mlock/munmap/etc. might split, but I don't see how the lazy
> approach in any way addresses any of that.
>

As mentioned above, after the split, rmap still uses root_vma to compute
folio_address or page_address.

> >
> > When a new VMA is created, vm_start, vm_end and vm_pgoff are
> > initialized and the VMA is inserted into mm_mt. Although these fields
> > may later be modified, the following value remains invariant:
> >
> > (vm_start - vm_pgoff * PAGE_SIZE)
>
> Err no it doesn't at all?
>
> If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.
>
> Then if I remap it, vm_start changes, vm_pgoff stays the same, so:
>
> vm_start - vm_pgoff * PAGE_SIZE
>
> Changes right? And then that becomes essentially the offset from where it
> was faulted in.
>

If mremap modifies vm_start, i.e., move_vma, a new VMA will be created. This
corresponds exactly to the third point mentioned later: upgrading
anon_vma_lazy to a regular anon_vma and updating folio->mapping.

> >
> > We refer to this value as:
> >
> > vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE
>
> This is mysteriously close to being the offset I mention in my CoW context
> work...
>
> I'm not sure what 'mapping base' means here.
>

    vma_address(vma, pgoff, nr_pages)
        = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
        = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE)
        = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE
        = vma_mapping_base(vma) + pgoff * PAGE_SIZE

vma_mapping_base depends only on the VMA and is independent of the page.
Alternatively, we could also call it vma_rmap_base.

> >
> > This value also remains unchanged when the VMA is removed from
> > mm_mt.
>
> Why does it matter what this value is on unmap?
>

If root_vma is removed from mm_mt due to munmap, it will still remain valid as
long as other VMAs hold references to it.

> >
> > If a VMA is split and produces new_vma, the following holds:
> >
> > vma_mapping_base(new_vma) == vma_mapping_base(vma)
>
> This is a roundabout way of saying we offset the vma->vm_pgoff after split.
>
> >
> > If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
> >
> > vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
> > vma_mapping_base(vma_x)
>
> This is just a roundabout way of saying the pgoff has to be aligned.
>
> >
> > Assume the VMA where the first page fault occurs is called root_vma,
> > and ensure that any VMA produced by split or merge holds a reference
> > to root_vma.
>
> But this VMA can be unmapped later? Or remapped?
>

It can be unmapped. As mentioned earlier, if mremap modifies vm_start, a new
VMA will be created.

> Holding on to a VMA and treating it as some kind of canonical reference with
> a reference count completely changes what VMAs are, impacts the VMA
> lifecycle, and produces unwanted memory overhead in itself.
>

During split/merge operations, we can try to preferentially use root_vma so as
to avoid deleting it.

> It also raises concerns and issues around lock order which is very sensitive.
>

Both rmap and fork acquire the PTE lock, which ensures that handling a page
with respect to a particular VMA is atomic. There is no need to add
folio_lock. When fork converts folio->mapping into a regular anon_vma,
rmap_walk_anon can simply check and retry.

> >
> > During rmap we can compute the folio address using root_vma:
> >
> > vma_address(vma, pgoff, 1) =
>
> What's the parameters here? What's 1?
>
> > vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> > = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> > = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
> >
> > We can then use folio_addr to locate the VMA covering this folio.
>

I overlooked this earlier. We can unify it by using pgoff as follows.

    page_addr = vma_address(vma, pgoff, nr_pages)
        = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
        = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE)
        = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE
        = vma_mapping_base(vma) + pgoff * PAGE_SIZE
        = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE

> I'm really confused by this, you're kind of mixing and match parameters here.
>
> What I think you're saying is that, if a folio hasn't been remapped, you can
> figure out its address based on page offset.
>
> That's completely broken for MAP_PRIVATE file-backed mappings which also
> use anon_vma and also have to keep on working.
>
> It seems that for the lazy approach what you are doing is essentially caching
> the 'root' VMA in the folio. But this doesn't account for large folios and split
> VMAs.
>

As mentioned earlier:

    subpage_address = vma_address(vma_a, subpage_pgoff, 1)
        = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
        = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE

> Even if you disabled it for those cases (which adds a ton of complexity in
> itself), you then have issues with locking - the anon_vma lock has to take a
> lock (that cannot be a VMA-level lock - results in lock inversion) even on
> these leaf entries, or you break locking.
>

When there is no fork/mremap, we do not need the interval tree or the anon_vma
lock.

> And we can't reasonably start pinning VMAs and using them as a sort of
> proto cached thing on top of the existing anon_vma logic.
>

In most cases, root_vma is actively used. Although it may be removed by
munmap, overall it still saves memory.

> You also then need to, on remap, undo all this, which requires updating
> folio->mapping on remap, something I tried doing previously myself, but
> that's fraught with issues around lock inversion itself.
>
> >
> > 3. mremap / uffd_move
>
> userfaultfd moving is not relevant as it actually updates the folio correctly.
>

These two operations are different from the previous two types, as they modify
the virtual address of the page/folio.

> >
> > If only the size changes and the start address remains the same, there
> > is no impact.
> >
> > If the start address changes, the page is moved from (vma, addr) to
> > (new_vma, new_addr). In this case:
> >
> > vma_mapping_base(new_vma) =
> > vma_mapping_base(vma) + new_addr - old_addr
>
> You say above that the mapping base never changes? But here it changes?
>

For the newly created new_vma, vma_mapping_base(new_vma) is not equal to
vma_mapping_base(vma), while vma_mapping_base(vma) itself remains unchanged.

> >
> > We first upgrade the VMA, and then fix folio->mapping in move_ptes().
>
> What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a
> 'normal' one.
>
> As above, this is fraught with lock inversion issues.
>

Yes, it upgrades from a lazy_vma to a regular anon_vma. As mentioned earlier,
during this process we hold the mmap write lock, the vma write lock, and the
pte lock, so acquiring the folio_lock is unnecessary.

> >
> > If performance becomes a concern, ANON_VMA_LAZY can be enabled only
> > for relatively small VMAs.
>
> I think you've got serious correctness, lock management and complexity
> issues and it's all a non-starter as the costs deeply exceed the benefits.
>

I think the approach is feasible:

1. During merge/split, the newly created vma_a satisfies
   vma_mapping_base(vma_a) == vma_mapping_base(vma) ==
   vma_mapping_base(root_vma). Therefore, we can use root_vma to compute the
   virtual address of the folio/page mapped by vma_a.

2. During fork and mremap, we hold the mmap write lock, the vma write lock,
   and the pte lock. In particular, the pte lock ensures that rmap and fork
   operations on a folio/page within a specific vma are atomic. If
   folio->mapping is upgraded during rmap_walk_anon(folio), we can simply let
   rmap_walk_anon retry once.

> This is one of the fundamental, frustrating aspects of the anon rmap - you
> keep thinking that 'surely' you can do sensible thing X, but it turns out you
> can't for various annoying reasons.
>
> It's one of the reasons it's really fraught for somebody coming to make
> changes, and one of the reasons why I am very keen on fundamentally
> changing it.
>
> And also on a not-wasting-time basis - I was already working in parallel on a
> rework here, so I think the civil thing is to at least wait for my work before
> issuing alternative solutions.
>
> Thanks, Lorenzo
>

Thanks for your replies, but I really have to stop doing deeper analyses like
these for time management purposes. I did this more so to make the point from
[0] as to why, in lower trust environments, this is just not feasible.

We could loop around for hours and hours and hours here.

In general as before, even if all worked perfectly (I'm very much not at all
convinced), extending anon_vma and pinning VMAs is simply a no-go for
architectural and complexity reasons.

I also find the locking story dubious and the lack of tests or anything
corroborating correctness is additionally fatal.

And finally, I was already working on a replacement for anon_vma, and the
generally done thing in these situations is for my work to take precedence.

So I'm going to bail out on further deeper analyses here as otherwise I simply
can't work on anything else :)

Thanks, Lorenzo

[0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/

On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > Against my better judgment I'll address the stuff here...
> >
> > > VMA operations can be roughly divided into three categories. The
> > > handling of ANON_VMA_LAZY is briefly described below.
> >
> > I don't agree, there are plenty more VMA operations. But with respect to
> > anon rmap there are:
> >
> > - fork
> > - merge/split
> > - remap
> >
>
> Yes, these are the three categories. I originally intended to explain them
> by classifying based on system calls; I should have used mremap instead of
> move_vma.

I don't think you mentioned move_vma()? Maybe I missed it.

The categorisation is most usefully based on callers of anon_vma_clone().

>
> > Your approach seems to completely ignore VMA split and the need to
> > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> >
>
> The folio uses vma->root_vma to compute folio_address. A VMA split from it,
> vma_a, also uses vma_a->root_vma = vma->root_vma to compute folio_address.
> During rmap, once folio_address is obtained, the VMA can be found through
> mm_mt. Without fork, there is no need to maintain the interval tree.

Well you need to search for every possible split VMA in mm_mt now, so you have
to go page-by-page searching for each page for the rmap walked range.

You're also potentially racing against a remap, as you say below you don't
folio lock on remap so concurrent rmap walkers can be present, the VMA can
already be copied.

We already have VMA lifecycle state around detached VMAs, so a VMA could be in
a detached state, assumed by the existing logic to be entirely unavailable for
use, out of the maple tree altogether but kept around in a zombie state.

We'd then have lifecycle issues and races and edge cases around process
teardown otherwise we might leak memory.

Also, presumably you set vma->anon_vma to some lazy sentinel value so that
mremap doesn't change vma->vm_pgoff when unfaulted? You would need to update
any path that manipulates vma->anon_vma also so it doesn't incorrectly
dereference it.

>
> > You may also actually split a VMA against a single large folio (waiting on the
> > deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped
> > in two places.
> >
> > The lazy approach doesn't seem to address this properly. And fatally it ties an
> > actual VMA afaict to the folio and has to implement a VMA reference count
> > mechanism which interferes with the ordinarily VMA lifecycle to do it.
> >
> > The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> > 'leaves' is something that my approach is exactly taking into account.
> >
> > Of course also extending anon_vma is a real non-starter.
> >
> > Also the below + the series ignores MAP_PRIVATE file-backed mappings
> > which is a pretty fatal flaw.
> >
> > It also, as Harry says, has zero description of correctness in a way we'd want
> > and no tests.
> >
>
> It can correctly handle the case where a VMA is split within a large
> page. The address of a sub_page in the split VMA (vma_a or vma_b) is
> computed using the following method.
>
> For COW anonymous pages originating from file VMAs, the page/folio
> address is also computed using the same method.
>
> subpage_address = vma_address(vma_a, subpage_pgoff, 1)
>   = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
>   = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
>   = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
>   = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE

OK but you want to walk entries in a _range_ in the interval tree.

So you are then now looking up VMAs (in a racey way) using mm_mt (which is the
whole basis of my work actually) which could change under you.

I guess what you're doing is using the pinned 'root' VMA as the basis of
everything, and the second a VMA is moved you (somehow) walk the page tables
to update the folio->mapping.

Again pinning the VMA like this and putting it in a folio is really not
something we want to do.

It adds a ton of complexity and also impacts VMA lifecycle which is already
fairly fraught.

It makes the VMA no longer just a VMA but rather also a 'memory' of where
something was first faulted in as a hack more or less.

>
> > >
> > > 1. fork
> > >
> > > fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and
> > > is not involved here.) This can be viewed as copying the VMAs with
> > > identical virtual addresses into a new address space.
> > >
> > > If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> > > regular anon_vma. The corresponding folio->mapping is then fixed in
> > > try_dup_anon_rmap().
> >
> > And so we make fork, a very sensitive path in the kernel more expensive.
> >
> > I also question the locking situation with the conversion mentioned, updating
> > folios in this manner is extremely difficult.
> >
>
> Because rmap takes the PTE lock, while fork takes the mmap write lock,
> the VMA write lock, and the PTE lock.

The PTE lock is not held for the duration of an anon_vma lock. You will break
anything that needs to hold the anon_vma lock for the duration, e.g.
migration.

This is substantively the issue I am working on in my approach and as per
https://ljs.io/scalable-cow-lsf.pdf you can see that's an open question that I
am currently researching.

>
> Given the rule that folio->mapping can only transition in one direction
> from lazy_vma to a regular anon_vma, the situation can be handled
> correctly even without taking the folio_lock.

Folio lock serialises against concurrent rmap walks, and you can end up
reading a lazy_vma that later gets converted into an anon_vma concurrently.

>
> When rmap and fork run concurrently:
> If rmap observes folio->mapping as a regular anon_vma, there is
> obviously no issue.
> If rmap observes folio->mapping as lazy_vma, then rmap only processes
> the parent's pvma. At the end of rmap_walk_anon(), if we see that
> folio->mapping has changed to a regular anon_vma, we simply process it
> once more. The various rmap_one implementations are idempotent anyway.

Hm this all seems very racey.

>
> BTW: the commit message of patch 13 says a retry is needed, but the
> retry handling was accidentally omitted in the posted patch.
:))

>
> > >
> > > 2. mmap / brk / mprotect / munmap
> > >
> > > These operations create, modify, or remove VMAs in the current mm.
> > > They may split existing VMAs, merge adjacent VMAs, or remove a VMA
> > > from mm_mt.
> >
> > mmap and brk are not at all relevant to anon_vma, as no anon_vma is
> > assigned upon mapping. It's on fault.
> >
>
> mmap()/brk() with a specified address may cause anonymous VMA merge or split.
>
> > mprotect/mlock/munmap/etc. might split, but I don't see how the lazy
> > approach in any way addresses any of that.
> >
>
> As mentioned above, after the split, rmap still uses root_vma to compute
> folio_address or page_address.
>
> > >
> > > When a new VMA is created, vm_start, vm_end and vm_pgoff are
> > > initialized and the VMA is inserted into mm_mt. Although these fields
> > > may later be modified, the following value remains invariant:
> > >
> > > (vm_start - vm_pgoff * PAGE_SIZE)
> >
> > Err no it doesn't at all?
> >
> > If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.
> >
> > Then if I remap it, vm_start changes, vm_pgoff stays the same, so:
> >
> > vm_start - vm_pgoff * PAGE_SIZE
> >
> > Changes right? And then that becomes essentially the offset from where it
> > was faulted in.
> >
> If mremap modifies vm_start, i.e., move_vma, a new VMA will be created.
> This corresponds exactly to the third point mentioned later: upgrading
> anon_vma_lazy to a regular anon_vma and updating folio->mapping.
I think updating folio->mapping here is problematic, I know this because I
worked on this very probably a year or so ago and found locking issues
prevented this from being workable.

>
> > >
> > > We refer to this value as:
> > >
> > > vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE
> >
> > This is mysteriously close to being the offset I mention in my CoW context
> > work...
> >
> > I'm not sure what 'mapping base' means here.
> >
>
> vma_address(vma, pgoff, nr_pages)
>   = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
>   = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE)
>   = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE
>   = vma_mapping_base(vma) + pgoff * PAGE_SIZE
>
> vma_mapping_base depends only on the VMA and is independent of the page.
> Alternatively, we could also call it vma_rmap_base.
>
> > >
> > > This value also remains unchanged when the VMA is removed from
> > > mm_mt.
> >
> > Why does it matter what this value is on unmap?
> >
> If root_vma is removed from mm_mt due to munmap, it will still remain
> valid as long as other VMAs hold references to it.

Yeah this is something we don't want.

>
> > >
> > > If a VMA is split and produces new_vma, the following holds:
> > >
> > > vma_mapping_base(new_vma) == vma_mapping_base(vma)
> >
> > This is a roundabout way of saying we offset the vma->vm_pgoff after split.
> >
> > >
> > > If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
> > >
> > > vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
> > > vma_mapping_base(vma_x)
> >
> > This is just a roundabout way of saying the pgoff has to be aligned.
> >
> > >
> > > Assume the VMA where the first page fault occurs is called root_vma,
> > > and ensure that any VMA produced by split or merge holds a reference
> > > to root_vma.
> >
> > But this VMA can be unmapped later? Or remapped?
> >
> It can be unmapped. As mentioned earlier, if mremap modifies vm_start,
> a new VMA will be created.

But everything's racey?

>
> > Holding on to a VMA and treating it as some kind of canonical reference with
> > a reference count completely changes what VMAs are, impacts the VMA
> > lifecycle, and produces unwanted memory overhead in itself.
> >
> During split/merge operations, we can try to preferentially use root_vma
> so as to avoid deleting it.

Adding yet more complexity and edge cases, we really cannot do that, sorry.

>
> > It also raises concerns and issues around lock order which is very sensitive.
> >
> Both rmap and fork acquire the PTE lock, which ensures that handling a page
> with respect to a particular VMA is atomic.

The PTE lock only locks the PTE. That wasn't the issue I was raising at all.

See the top of rmap.c for lock ordering. There's substantial complexity there.

>
> There is no need to add folio_lock.
> When fork converts folio->mapping into a regular anon_vma,
> rmap_walk_anon can simply check and retry.

This seems like it won't work. And again you're adding a lot of new
complexity.

>
> >
> > >
> > > During rmap we can compute the folio address using root_vma:
> > >
> > > vma_address(vma, pgoff, 1) =
> >
> > What's the parameters here? What's 1?
> >
> > > vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> > > = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> > > = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
> > >
> > > We can then use folio_addr to locate the VMA covering this folio.
> >
>
> I overlooked this earlier. We can unify it by using pgoff as follows.
>
> page_addr = vma_address(vma, pgoff, nr_pages)
>   = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
>   = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE)
>   = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE
>   = vma_mapping_base(vma) + pgoff * PAGE_SIZE
>   = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE

It's really just saying

> >
> > I'm really confused by this, you're kind of mixing and match parameters here.
> >
> > What I think you're saying is that, if a folio hasn't been remapped, you can
> > figure out its address based on page offset.
> >
> > That's completely broken for MAP_PRIVATE file-backed mappings which also
> > use anon_vma and also have to keep on working.
> >
> > It seems that for the lazy approach what you are doing is essentially caching
> > the 'root' VMA in the folio. But this doesn't account for large folios and split
> > VMAs.
> >
> As mentioned earlier:
> subpage_address = vma_address(vma_a, subpage_pgoff, 1)
>   = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
>   = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE

I'm not sure what these

>
> > Even if you disabled it for those cases (which adds a ton of complexity in
> > itself), you then have issues with locking - the anon_vma lock has to take a
> > lock (that cannot be a VMA-level lock - results in lock inversion) even on
> > these leaf entries, or you break locking.
> >
> When there is no fork/mremap, we do not need the interval tree or the
> anon_vma lock.

We need to stabilise across the VMAs.
>
> > And we can't reasonably start pinning VMAs and using them as a sort of
> > proto cached thing on top of the existing anon_vma logic.
> >
>
> In most cases, root_vma is actively used.
> Although it may be removed by munmap, overall it still saves memory.

For what workloads? Where? How? It's adding complexity we can't have.

>
> > You also then need to, on remap, undo all this, which requires updating
> > folio->mapping on remap, something I tried doing previously myself, but
> > that's fraught with issues around lock inversion itself.
> >
> > >
> > > 3. mremap / uffd_move
> >
> > userfaultfd moving is not relevant as it actually updates the folio correctly.
> >
> These two operations are different from the previous two types,
> as they modify the virtual address of the page/folio.
>
> > >
> > > If only the size changes and the start address remains the same, there
> > > is no impact.
> > >
> > > If the start address changes, the page is moved from (vma, addr) to
> > > (new_vma, new_addr). In this case:
> > >
> > > vma_mapping_base(new_vma) =
> > > vma_mapping_base(vma) + new_addr - old_addr
> >
> > You say above that the mapping base never changes? But here it changes?
> >
>
> For the newly created new_vma, vma_mapping_base(new_vma) is not equal to
> vma_mapping_base(vma), while vma_mapping_base(vma) itself remains unchanged.
>
> > >
> > > We first upgrade the VMA, and then fix folio->mapping in move_ptes().
> >
> > What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a
> > 'normal' one.
> >
> > As above, this is fraught with lock inversion issues.
> >
> Yes, it upgrades from a lazy_vma to a regular anon_vma.
> As mentioned earlier, during this process we hold the mmap write lock, the
> vma write lock, and the pte lock, so acquiring the folio_lock is unnecessary.

What's preventing a concurrent rmap walk?

>
> > >
> > > If performance becomes a concern, ANON_VMA_LAZY can be enabled only
> > > for relatively small VMAs.
> >
> > I think you've got serious correctness, lock management and complexity
> > issues and it's all a non-starter as the costs deeply exceed the benefits.
> >
>
> I think the approach is feasible:

For complexity and architectural reasons it's not.

>
> 1. During merge/split, the newly created vma_a satisfies
> vma_mapping_base(vma_a) == vma_mapping_base(vma) ==
> vma_mapping_base(root_vma). Therefore, we can use root_vma to
> compute the virtual address of the folio/page mapped by vma_a.

I don't love these formulas. You're storing the originally faulted-in address
in a VMA that you've pinned for the purpose of that.

If you happen to merge right a lot of times you keep around dead VMAs just for
this purpose.

We're not having VMAs have a dual-role as a 'store of the address first
faulted in' as well as being a virtual memory range.

>
> 2. During fork and mremap, we hold the mmap write lock, the vma
> write lock, and the pte lock. In particular, the pte lock ensures
> that rmap and fork operations on a folio/page within a specific
> vma are atomic. If folio->mapping is upgraded during

How does a lock exclude something that doesn't also hold that lock?

This is also adding _yet more_ complexity and subtlety. It's really a hack.

> rmap_walk_anon(folio), we can simply let rmap_walk_anon retry
> once.

Again, 'just repeating' if something changes like this without proper
serialisation is not sufficient.

>
>
> > This is one of the fundamental, frustrating aspects of the anon rmap - you
> > keep thinking that 'surely' you can do sensible thing X, but it turns out you
> > can't for various annoying reasons.
> >
> > It's one of the reasons it's really fraught for somebody coming to make
> > changes, and one of the reasons why I am very keen on fundamentally
> > changing it.
> >
> > And also on a not-wasting-time basis - I was already working in parallel on a
> > rework here, so I think the civil thing is to at least wait for my work before
> > issuing alternative solutions.
> >
> > Thanks, Lorenzo
> >
>
> Thanks for your replies, but I really have to stop doing deeper analyses like
> these for time management purposes.
Of course I will respond to technical discussions.
>
> I did this more so to make the point from [0] as to why, in lower trust
> environments, this is just not feasible.
>
> We could loop around for hours and hours and hours here.
>
> In general as before, even if all worked perfectly (I'm very much not at all
> convinced), extending anon_vma and pinning VMAs is simply a no-go for
> architectural and complexity reasons.
>
> I also find the locking story dubious and the lack of tests or anything
> corroborating correctness is additionally fatal.
>
During rmap, anon_vma provides a superset of VMAs. We first confirm
with vma_address(), and then in each rmap_one we further check whether
the VMA needs to be processed through page_vma_mapped_walk() and
check_pte().
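
To make the vma_address() filtering step concrete, here is a minimal
userspace sketch. It is not kernel code: struct model_vma, the model_*
helpers and MODEL_EFAULT are hypothetical stand-ins, the real vma_address()
also takes nr_pages, and a 4 KiB page size is assumed. It only shows the
arithmetic: the mapping base is preserved by a split, a candidate VMA that
does not cover the page is rejected, and a move that keeps vm_pgoff changes
the base by new_addr - old_addr.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12UL
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_EFAULT     (~0UL)	/* "not mapped by this VMA" marker */

/* Hypothetical stand-in for the three vm_area_struct fields used below. */
struct model_vma {
	unsigned long vm_start;	/* first mapped byte */
	unsigned long vm_end;	/* one past the last mapped byte */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* vm_start - vm_pgoff * PAGE_SIZE, as defined earlier in the thread. */
static unsigned long model_mapping_base(const struct model_vma *vma)
{
	return vma->vm_start - (vma->vm_pgoff << MODEL_PAGE_SHIFT);
}

/* Address of page offset pgoff in vma, or MODEL_EFAULT if outside it. */
static unsigned long model_vma_address(const struct model_vma *vma,
				       unsigned long pgoff)
{
	unsigned long addr = model_mapping_base(vma) +
			     (pgoff << MODEL_PAGE_SHIFT);

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return MODEL_EFAULT;
	return addr;
}

/* Split at addr: *vma keeps the low part, the high part is returned. */
static struct model_vma model_split(struct model_vma *vma, unsigned long addr)
{
	struct model_vma high;

	assert(addr > vma->vm_start && addr < vma->vm_end);
	assert((addr & (MODEL_PAGE_SIZE - 1)) == 0);

	high.vm_start = addr;
	high.vm_end = vma->vm_end;
	high.vm_pgoff = vma->vm_pgoff +
			((addr - vma->vm_start) >> MODEL_PAGE_SHIFT);
	vma->vm_end = addr;
	return high;
}

/* Move to new_start keeping vm_pgoff, as for an already-faulted VMA. */
static struct model_vma model_move(const struct model_vma *vma,
				   unsigned long new_start)
{
	struct model_vma moved = *vma;

	moved.vm_start = new_start;
	moved.vm_end = new_start + (vma->vm_end - vma->vm_start);
	return moved;
}
```

The model says nothing about locking or VMA lifetime; it covers only the
address computation.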
The lazy VMA used by ANON_VMA_LAZY provides only one VMA: if there is
no fork or mremap, then this single VMA is sufficient. To avoid taking
the folio_lock during fork and mremap, after rmap_walk_anon() we check
whether folio->mapping has been upgraded to a regular anon_vma and, if
so, retry once.
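
The retry-once control flow can be modelled in userspace as follows. This is
a single-threaded sketch under stated assumptions, not the patch: the tagged
pointer encoding (MODEL_LAZY_TAG), struct model_folio and all model_* names
are hypothetical, and the "fork" is simulated by a hook that runs between
the walk and the re-check. It shows only that a walker which saw the lazy
mapping reaches the child's VMA on the second pass, relying on rmap_one being
idempotent. It does not address memory ordering, or walkers that need the
mapping to stay stable for the whole walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LAZY_TAG ((uintptr_t)1)	/* low bit set: mapping is a lazy VMA */
#define MODEL_MAX_VMAS 4

struct model_vma {
	int unmapped;	/* idempotent effect of rmap_one */
	int calls;	/* how often rmap_one ran on this VMA */
};

struct model_anon_vma {
	struct model_vma *vmas[MODEL_MAX_VMAS];
	size_t nr;
};

struct model_folio {
	uintptr_t mapping;	/* tagged pointer, read as a single word */
};

/* Set by the caller: the anon_vma a simulated fork will install. */
static struct model_anon_vma *model_pending_upgrade;

static void model_rmap_one(struct model_vma *vma)
{
	vma->unmapped = 1;	/* repeating this is harmless */
	vma->calls++;
}

/* Visit every VMA reachable from one snapshot of folio->mapping. */
static void model_walk(uintptr_t mapping)
{
	if (mapping & MODEL_LAZY_TAG) {
		model_rmap_one((struct model_vma *)(mapping & ~MODEL_LAZY_TAG));
	} else {
		struct model_anon_vma *av = (struct model_anon_vma *)mapping;

		assert(av->nr <= MODEL_MAX_VMAS);
		for (size_t i = 0; i < av->nr; i++)
			model_rmap_one(av->vmas[i]);
	}
}

/* Stand-in for a fork that upgrades lazy -> regular, never the reverse. */
static void model_fork_upgrade(struct model_folio *folio)
{
	if (model_pending_upgrade && (folio->mapping & MODEL_LAZY_TAG))
		folio->mapping = (uintptr_t)model_pending_upgrade;
}

/* Returns the number of passes made (1, or 2 if an upgrade was seen). */
static int model_rmap_walk(struct model_folio *folio, int simulate_fork)
{
	uintptr_t seen = folio->mapping;
	int passes = 1;

	model_walk(seen);
	if (simulate_fork)
		model_fork_upgrade(folio);

	/* Re-check: only a lazy snapshot can have been upgraded. */
	if ((seen & MODEL_LAZY_TAG) && folio->mapping != seen) {
		model_walk(folio->mapping);
		passes++;
	}
	return passes;
}
```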
If your concern is about the lack of locking during rmap, you could
also refer to folio_wait_table and add a set of anon_vma_locks. That
was how I handled it during my initial debugging. Later, after
reviewing the code flow, I found that the lock might not be necessary,
so I removed it.
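
For reference, the hashed lock table idea could look roughly like the sketch
below. It is a userspace illustration, not the code I used while debugging:
the table size, the multiplicative pointer hash and the model_* names are
all assumptions, and C11 atomics stand in for kernel locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LOCK_TABLE_BITS 8
#define MODEL_LOCK_TABLE_SIZE (1u << MODEL_LOCK_TABLE_BITS)

struct model_lock {
	atomic_int held;	/* 0 = free, 1 = held */
};

/* Fixed table shared by all keys; zero-initialised means all free. */
static struct model_lock model_lock_table[MODEL_LOCK_TABLE_SIZE];

/* Map a pointer (for example a VMA) to its slot via a multiplicative hash. */
static struct model_lock *model_lock_for(const void *key)
{
	uint64_t h;

	assert(key != NULL);
	h = (uint64_t)(uintptr_t)key * 0x61C8864680B583EBull;
	return &model_lock_table[h >> (64 - MODEL_LOCK_TABLE_BITS)];
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int model_lock_try(struct model_lock *lock)
{
	return atomic_exchange_explicit(&lock->held, 1,
					memory_order_acquire) == 0;
}

static void model_lock_acquire(struct model_lock *lock)
{
	while (!model_lock_try(lock))
		;	/* spin; a real implementation would sleep or back off */
}

static void model_lock_release(struct model_lock *lock)
{
	atomic_store_explicit(&lock->held, 0, memory_order_release);
}
```

Two unrelated keys can hash to the same slot, so the table trades some false
contention for a fixed memory cost.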
> And finally, I was already working on a replacement for anon_vma, and the
> generally done thing in these situations is for my work to take precedence.
>
> So I'm going to bail out on further deeper analyses here as otherwise I simply
> can't work on anything else :)
>
> Thanks, Lorenzo
>
> [0]:https://lore.kernel.org/all/ah887A5VkXOcmq-g@lucifer/
>
>
> On Wed, Jun 03, 2026 at 11:05:28AM +0000, wangtao wrote:
> > > >
> > >
> > > Against my better judgment I'll address the stuff here...
> > >
> > > > VMA operations can be roughly divided into three categories. The
> > > > handling of ANON_VMA_LAZY is briefly described below.
> > >
> > > I don't agree, there are plenty more VMA operations. But with
> > > respect to anon rmap there are:
> > >
> > > - fork
> > > - merge/split
> > > - remap
> > >
> >
> > Yes, these are the three categories. I originally intended to explain
> > them by classifying based on system calls; I should have used mremap
> > instead of move_vma.
>
> I don't think you mentioned move_vma()? Maybe I missed it.
>
> The categorisation is most usefully based on callers of anon_vma_clone().
>
> >
> >
> > > Your approach seems to completely ignore VMA split and the need to
> > > maintain an interval tree to _multiple_ VMAs from a single anon_vma.
> > >
> >
> > The folio uses vma->root_vma to compute folio_address. A VMA split
> > from it, vma_a, also uses vma_a->root_vma = vma->root_vma to compute
> > folio_address.
> > During rmap, once folio_address is obtained, the VMA can be found
> > through mm_mt. Without fork, there is no need to maintain the interval
> > tree.
>
> Well you need to search for every possible split VMA in mm_mt now, so you
> have to go page-by-page searching for each page for the rmap walked range.
>
ANON_VMA_LAZY has only one VMA. When I first looked at
rmap_walk_ksm, I also thought it would need to search page by page,
which seemed unacceptable. Later I realized that it only needs to
check whether this VMA falls within the rmap walk range.
@@ -3173,20 +3171,20 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
-		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
+		anon_rmap_foreach_vma(vma, vmac, anon_rmap,
 					       0, ULONG_MAX) {
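
The "falls within the rmap walk range" check for the single lazy VMA is an
ordinary interval-overlap test on page offsets. A minimal userspace sketch,
assuming 4 KiB pages; struct model_vma and the model_* helpers are
hypothetical and this is not the body of anon_rmap_foreach_vma():

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12UL

/* Hypothetical stand-in for the vm_area_struct fields used below. */
struct model_vma {
	unsigned long vm_start;	/* first mapped byte */
	unsigned long vm_end;	/* one past the last mapped byte */
	unsigned long vm_pgoff;	/* page offset corresponding to vm_start */
};

/* Last page offset covered by vma (inclusive). */
static unsigned long model_vma_last_pgoff(const struct model_vma *vma)
{
	assert(vma->vm_end > vma->vm_start);
	return vma->vm_pgoff +
	       ((vma->vm_end - vma->vm_start) >> MODEL_PAGE_SHIFT) - 1;
}

/* Nonzero if vma covers any page offset in [first, last] (inclusive). */
static int model_vma_in_walk_range(const struct model_vma *vma,
				   unsigned long first, unsigned long last)
{
	return vma->vm_pgoff <= last && first <= model_vma_last_pgoff(vma);
}
```

With the range 0 to ULONG_MAX used in the hunk above, the test is true for
every non-empty VMA, so the lazy case visits exactly one VMA without any
per-page search.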
> You're also potentially racing against a remap, as you say below you don't
> folio lock on remap so concurrent rmap walkers can be present, the VMA can
> already be copied.
>
> We already have VMA lifecycle state around detached VMAs, so a VMA
> could be in a detached state, assumed by the existing logic to be entirely
> unavailable for use, out of the maple tree altogether but kept around in a
> zombie state.
>
> We'd then have lifecycle issues and races and edge cases around process
> teardown otherwise we might leak memory.
>
> Also, presumably you set vma->anon_vma to some lazy sentinel value so
> that mremap doesn't change vma->vm_pgoff when unfaulted?
>
> You would need to update any path that manipulates vma->anon_vma also
> so it doesn't incorrectly dereference it.
>
Yes, most of the code in this patch series is intended to prevent
incorrect dereferencing of anon_vma. If we assume it will not be
misused, some of the code could be simplified or removed.
> >
> >
> > > You may also actually split a VMA against a single large folio
> > > (waiting on the deferred shrinker) and have a SINGLE _leaf_
> > > anonymous folio that is mapped in two places.
> > >
> > > The lazy approach doesn't seem to address this properly. And fatally
> > > it ties an actual VMA afaict to the folio and has to implement a VMA
> > > reference count mechanism which interferes with the ordinary VMA
> > > lifecycle to do it.
> > >
> > > The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> > > 'leaves' is something that my approach is exactly taking into account.
> > >
> > > Of course also extending anon_vma is a real non-starter.
> > >
> > > Also the below + the series ignores MAP_PRIVATE file-backed mappings
> > > which is a pretty fatal flaw.
> > >
> > > It also, as Harry says, has zero description of correctness in a way
> > > we'd want and no tests.
> > >
> >
> >
> > It can correctly handle the case where a VMA is split within a large
> > page. The address of a sub_page in the split VMA (vma_a or vma_b) is
> > computed using the following method.
> >
> > For COW anonymous pages originating from file VMAs, the page/folio
> > address is also computed using the same method.
> >
> > subpage_address = vma_address(vma_a, subpage_pgoff, 1)
> >   = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
> >   = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
> >   = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE
>
> OK but you want to walk entries in a _range_ in the interval tree.
>
> So you are then now looking up VMAs (in a racey way) using mm_mt (which
> is the whole basis of my work actually) which could change under you.
>
> I guess what you're doing is using the pinned 'root' VMA as the basis of
> everything, and the second a VMA is moved you (somehow) walk the page
> tables to update the folio->mapping.
>
> Again pinning the VMA like this and putting it in a folio is really not something
> we want to do.
>
> It adds a ton of complexity and also impacts VMA lifecycle which is already
> fairly fraught.
>
> It makes the VMA no longer just a VMA but rather also a 'memory' of where
> something was first faulted in as a hack more or less.
>
Maybe you're right. mm/mm_mt/vma/pagetable each have their own roles
in implementing VM. Perhaps considering them together could lead to
better ideas.