When an mTHP folio is allocated in do_anonymous_page() and the target
pte range is not fully empty, the current code releases the folio
and returns.

This gives the illusion that the page fault has been handled even
though the pte at vmf->address itself is still pte_none(), so another
page fault is triggered for the same address.

The race looks like this, using a 64KB mTHP as an example: two threads
of the same process, 4KB base pages, R = [X, X + 64KB), and
X < Y < X + 64KB.

CPU 0 (writer, faults at X)            CPU 1 (reader, faults at Y)
--------------------------------       -----------------------------
do_anonymous_page()                    do_anonymous_page()
  alloc_anon_folio()
    pte_range_none(R) --> true
    vma_alloc_folio() --> 64KB
                                         pte_offset_map_lock(Y)
                                         install zero_pfn PTE at Y
                                         pte_unmap_unlock()
  pte_offset_map_lock(X)
  pte_range_none(R) -> false, Y is populated
  /* but pte at X is still none */
  goto release
  return 0

To avoid this, check whether vmf->address has been mapped. If it has
not, retry alloc_anon_folio() and the subsequent steps. On retry,
alloc_anon_folio() re-checks pte_range_none() and falls back to a
smaller order, so the retry cannot loop forever.

Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
---
Reproducer (not included in the patch, available on request):
two threads hammer the same 64K mTHP range, writer at offset 0,
reader at offset 32K, per-round barrier, 1024 rounds.
Minor faults before: writer=1951 reader=973 (927 extra faults)
Minor faults after: writer=1024 reader=1022
I'm not sure how often this situation occurs in real workloads.
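
For illustration only, the check-and-retry decision described in the
commit message can be modelled in userspace. The sketch below is neither
kernel code nor the reproducer mentioned above. Every model_* name and
MODEL_* constant is hypothetical, ptes are reduced to a bool array
indexed from X, and the order selection ignores everything the real
alloc_anon_folio() considers apart from pte_range_none().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12UL                       /* assumes 4KB base pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)
#define MODEL_MAX_ORDER  4                          /* 16 pages = 64KB mTHP */
#define MODEL_NR_PTES    (1UL << MODEL_MAX_ORDER)

/* Possible outcomes of the check done under the PTE lock. */
enum model_action { MODEL_INSTALL, MODEL_RELEASE, MODEL_RETRY };

/* Stand-in for pte_range_none(): true if all nr entries are empty. */
static bool model_range_none(const bool *populated, size_t start, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (populated[start + i])
			return false;
	return true;
}

/*
 * Simplified stand-in for the order selection in alloc_anon_folio():
 * pick the largest order whose naturally aligned range around the
 * faulting index is entirely empty, falling back to order 0.
 */
static unsigned int model_pick_order(const bool *populated, size_t fault_idx)
{
	for (unsigned int order = MODEL_MAX_ORDER; order > 0; order--) {
		size_t nr = (size_t)1 << order;
		size_t start = fault_idx & ~(nr - 1);

		if (model_range_none(populated, start, nr))
			return order;
	}
	return 0;
}

/*
 * Decision following the patched branch. "address" is relative to X and
 * must lie inside the modelled 64KB range.
 */
static enum model_action model_decide(const bool *populated,
				      unsigned long address,
				      unsigned int order)
{
	unsigned long nr_pages = 1UL << order;
	/* Align the faulting address down to the folio-sized range. */
	unsigned long addr = address & ~(nr_pages * MODEL_PAGE_SIZE - 1);
	/* Index of the faulting page within that range, as in the patch. */
	unsigned long fault_offset = (address - addr) >> MODEL_PAGE_SHIFT;
	size_t base = addr >> MODEL_PAGE_SHIFT;

	if (model_range_none(populated, base, nr_pages))
		return MODEL_INSTALL;	/* range still empty: map the folio */
	if (populated[base + fault_offset])
		return MODEL_RELEASE;	/* someone already mapped our address */
	return MODEL_RETRY;		/* only a neighbour is populated */
}
```

With the entry for Y = X + 32KB (index 8) populated after an order-4
folio was chosen for a fault at X, model_decide() returns MODEL_RETRY,
and a second model_pick_order() call selects order 3, whose range
[X, X + 32KB) is still empty. Once the entry at the faulting address
itself is populated, the result is MODEL_RELEASE, which is why the
retry terminates in this model.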
---
mm/memory.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 0c9d9c2cbf0e..104f5be1de36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5339,10 +5339,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
+	unsigned long fault_offset;
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	int nr_pages;
 	pte_t entry;
+	bool should_retry = false;

 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -5389,6 +5391,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
+retry:
 	/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
 	folio = alloc_anon_folio(vmf);
 	if (IS_ERR(folio))
@@ -5413,14 +5416,26 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		update_mmu_tlb(vma, addr, vmf->pte);
 		goto release;
 	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
-		update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
-		goto release;
+		fault_offset = (vmf->address - addr) >> PAGE_SHIFT;
+		if (!pte_none(ptep_get(vmf->pte + fault_offset))) {
+			update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
+			goto release;
+		}
+
+		should_retry = true;
 	}

 	ret = check_stable_address_space(vma->vm_mm);
 	if (ret)
 		goto release;

+	if (should_retry) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		should_retry = false;
+		goto retry;
+	}
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.43.0

On 5/12/26 11:50, Wandun Chen wrote:
> When an mTHP folio is allocated in do_anonymous_page() and the target
> pte range is not fully empty, current code would release the folio
> and return.
>
> This results an illusion that a page fault has already been processed
> even if the fact is vmf->address itself is still pte_none(). Another
> page fault will be triggered again.

Yes. Why is that a problem?

--
Cheers,

David

On 5/12/26 18:35, David Hildenbrand (Arm) wrote:
> On 5/12/26 11:50, Wandun Chen wrote:
>> When an mTHP folio is allocated in do_anonymous_page() and the target
>> pte range is not fully empty, current code would release the folio
>> and return.
>>
>> This results an illusion that a page fault has already been processed
>> even if the fact is vmf->address itself is still pte_none(). Another
>> page fault will be triggered again.
> Yes. Why is that a problem?

Honestly, the only data I have is the reproducer; I haven't been able to
show a measurable impact on a real workload.

The motivation was "we did the work of allocating an mTHP folio and then
throw it away just to redo it from #PF, and this #PF can be avoided by
adding a small check + retry". But as you point out, the second fault
path handles this case properly, so the behaviour is already correct.

Please drop the patch. I'll come back with numbers if I run into a
workload where this race is actually hot.

Thanks for taking a look.

Best regards,
Wandun