[v2] Misc rework on hugetlb faulting path

[PATCH v2 2/5] mm,hugetlb: Sort out folio locking in the faulting path

Posted by Oscar Salvador 3 months, 2 weeks ago

Recent conversations showed that there was a misunderstanding about why we
were locking the folio prior to calling hugetlb_wp().
In fact, as soon as we have the folio mapped into the pagetables, we no longer
need to hold it locked, because we know that no concurrent truncation could have
happened.
There is only one case where the folio needs to be locked, and that is when we
are handling an anonymous folio, because hugetlb_wp() will check whether it can
re-use it exclusively for the process that is faulting it in.

So, pass the folio locked to hugetlb_wp() when that is the case.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 175edafeec67..1a5f713c1e4c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6437,6 +6437,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	pte_t new_pte;
 	bool new_folio, new_pagecache_folio = false;
 	u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff);
+	bool folio_locked = true;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -6602,6 +6603,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+		/* No need to lock file folios. See comment in hugetlb_fault() */
+		if (!anon_rmap) {
+			folio_locked = false;
+			folio_unlock(folio);
+		}
 		/* Optimization, do the COW without a second fault */
 		ret = hugetlb_wp(vmf);
 	}
@@ -6616,7 +6622,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	if (new_folio)
 		folio_set_hugetlb_migratable(folio);
 
-	folio_unlock(folio);
+	if (folio_locked)
+		folio_unlock(folio);
 out:
 	hugetlb_vma_unlock_read(vma);
 
@@ -6636,7 +6643,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	if (new_folio && !new_pagecache_folio)
 		restore_reserve_on_error(h, vma, vmf->address, folio);
 
-	folio_unlock(folio);
+	if (folio_locked)
+		folio_unlock(folio);
 	folio_put(folio);
 	goto out;
 }
@@ -6670,7 +6678,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	vm_fault_t ret;
 	u32 hash;
-	struct folio *folio;
+	struct folio *folio = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	struct vm_fault vmf = {
@@ -6687,6 +6695,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * be hard to debug if called functions make assumptions
 		 */
 	};
+	bool folio_locked = false;
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
@@ -6801,13 +6810,24 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Fallthrough to CoW */
 	}
 
-	/* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
-	folio = page_folio(pte_page(vmf.orig_pte));
-	folio_lock(folio);
-	folio_get(folio);
-
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
+			/*
+			 * Anonymous folios need to be locked since hugetlb_wp()
+			 * checks whether we can re-use the folio exclusively
+			 * for us in case we are the only user of it.
+			 */
+			folio = page_folio(pte_page(vmf.orig_pte));
+			folio_get(folio);
+			if (folio_test_anon(folio)) {
+				spin_unlock(vmf.ptl);
+				folio_lock(folio);
+				folio_locked = true;
+				spin_lock(vmf.ptl);
+				if (unlikely(!pte_same(vmf.orig_pte, huge_ptep_get(mm,
+						   vmf.address, vmf.pte))))
+					goto out_put_page;
+			}
 			ret = hugetlb_wp(&vmf);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
@@ -6819,8 +6839,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, vmf.address, vmf.pte);
 out_put_page:
-	folio_unlock(folio);
-	folio_put(folio);
+	if (folio) {
+		if (folio_locked)
+			folio_unlock(folio);
+		folio_put(folio);
+	}
 out_ptl:
 	spin_unlock(vmf.ptl);
 out_mutex:
-- 
2.50.0
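
The relock step in the hugetlb_fault() hunk above (drop the PTL, take the folio lock, retake the PTL, then compare the PTE against the snapshot) can be modelled outside the kernel. What follows is a minimal single-threaded user-space sketch, not kernel code. Every name prefixed with model_ is hypothetical, the "locks" are plain flags that only check acquire/release pairing, and the concurrent PTE update is simulated by a flag. It assumes the PTL is dropped because the folio lock may sleep while the PTL is a spinlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded stand-ins; none of these are kernel APIs. */
struct model_lock { bool held; };

static void model_acquire(struct model_lock *l) { assert(!l->held); l->held = true; }
static void model_release(struct model_lock *l) { assert(l->held); l->held = false; }

struct model_folio {
	struct model_lock lock;	/* stands in for the folio lock */
	int refcount;
	bool anon;
};

struct model_fault {
	struct model_lock ptl;	/* stands in for the page table lock */
	unsigned long pte;	/* live entry, may change while ptl is dropped */
	unsigned long orig_pte;	/* snapshot taken while ptl was held */
	struct model_folio *folio;
	bool racer;		/* simulate a concurrent update of pte */
};

enum model_result { MODEL_RETRY, MODEL_WP_DONE };

/* Called with ptl held; returns with ptl held. */
static enum model_result model_wp_fault(struct model_fault *f)
{
	struct model_folio *folio = f->folio;
	bool folio_locked = false;
	enum model_result ret = MODEL_RETRY;

	folio->refcount++;			/* pin the folio */
	if (folio->anon) {
		model_release(&f->ptl);		/* never sleep under the ptl */
		if (f->racer)
			f->pte++;		/* entry changed in the window */
		model_acquire(&folio->lock);
		folio_locked = true;
		model_acquire(&f->ptl);
		if (f->pte != f->orig_pte)
			goto out;		/* stale snapshot: back off */
	}
	ret = MODEL_WP_DONE;			/* write-protect handling would run here */
out:
	if (folio_locked)
		model_release(&folio->lock);
	folio->refcount--;
	return ret;
}

/* Drives one simulated fault and checks that lock and refcount are balanced. */
enum model_result model_run(bool anon, bool racer)
{
	struct model_folio folio = { .refcount = 1, .anon = anon };
	struct model_fault f = { .pte = 42, .orig_pte = 42,
				 .folio = &folio, .racer = racer };
	enum model_result ret;

	model_acquire(&f.ptl);
	ret = model_wp_fault(&f);
	model_release(&f.ptl);
	assert(!folio.lock.held && folio.refcount == 1);
	return ret;
}
```

In this model a file folio never opens the unlocked window, so a simulated racer cannot invalidate its snapshot. Only the anonymous case pays for the recheck.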

Re: [PATCH v2 2/5] mm,hugetlb: Sort out folio locking in the faulting path

Posted by David Hildenbrand 3 months, 2 weeks ago

On 20.06.25 14:30, Oscar Salvador wrote:
> Recent conversations showed that there was a misunderstanding about why we
> were locking the folio prior to calling hugetlb_wp().
> In fact, as soon as we have the folio mapped into the pagetables, we no longer
> need to hold it locked, because we know that no concurrent truncation could have
> happened.
> There is only one case where the folio needs to be locked, and that is when we
> are handling an anonymous folio, because hugetlb_wp() will check whether it can
> re-use it exclusively for the process that is faulting it in.
> 
> So, pass the folio locked to hugetlb_wp() when that is the case.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++----------
>   1 file changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 175edafeec67..1a5f713c1e4c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6437,6 +6437,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   	pte_t new_pte;
>   	bool new_folio, new_pagecache_folio = false;
>   	u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff);
> +	bool folio_locked = true;
>   
>   	/*
>   	 * Currently, we are forced to kill the process in the event the
> @@ -6602,6 +6603,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   
>   	hugetlb_count_add(pages_per_huge_page(h), mm);
>   	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> +		/* No need to lock file folios. See comment in hugetlb_fault() */
> +		if (!anon_rmap) {
> +			folio_locked = false;
> +			folio_unlock(folio);
> +		}
>   		/* Optimization, do the COW without a second fault */
>   		ret = hugetlb_wp(vmf);
>   	}
> @@ -6616,7 +6622,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   	if (new_folio)
>   		folio_set_hugetlb_migratable(folio);
>   
> -	folio_unlock(folio);
> +	if (folio_locked)
> +		folio_unlock(folio);
>   out:
>   	hugetlb_vma_unlock_read(vma);
>   
> @@ -6636,7 +6643,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   	if (new_folio && !new_pagecache_folio)
>   		restore_reserve_on_error(h, vma, vmf->address, folio);
>   
> -	folio_unlock(folio);
> +	if (folio_locked)
> +		folio_unlock(folio);
>   	folio_put(folio);
>   	goto out;
>   }
> @@ -6670,7 +6678,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   {
>   	vm_fault_t ret;
>   	u32 hash;
> -	struct folio *folio;
> +	struct folio *folio = NULL;
>   	struct hstate *h = hstate_vma(vma);
>   	struct address_space *mapping;
>   	struct vm_fault vmf = {
> @@ -6687,6 +6695,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   		 * be hard to debug if called functions make assumptions
>   		 */
>   	};
> +	bool folio_locked = false;
>   
>   	/*
>   	 * Serialize hugepage allocation and instantiation, so that we don't
> @@ -6801,13 +6810,24 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   		/* Fallthrough to CoW */
>   	}
>   
> -	/* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
> -	folio = page_folio(pte_page(vmf.orig_pte));
> -	folio_lock(folio);
> -	folio_get(folio);
> -
>   	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>   		if (!huge_pte_write(vmf.orig_pte)) {
> +			/*
> +			 * Anonymous folios need to be locked since hugetlb_wp()
> +			 * checks whether we can re-use the folio exclusively
> +			 * for us in case we are the only user of it.
> +			 */

Should we move that comment to hugetlb_wp() instead? And if we are 
already doing this PTL unlock dance now, why not do it in hugetlb_wp() 
instead so we can simplify this code?

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 2/5] mm,hugetlb: Sort out folio locking in the faulting path

Posted by Oscar Salvador 3 months, 2 weeks ago

On Mon, Jun 23, 2025 at 04:11:38PM +0200, David Hildenbrand wrote:
> > @@ -6801,13 +6810,24 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >   		/* Fallthrough to CoW */
> >   	}
> > -	/* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
> > -	folio = page_folio(pte_page(vmf.orig_pte));
> > -	folio_lock(folio);
> > -	folio_get(folio);
> > -
> >   	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
> >   		if (!huge_pte_write(vmf.orig_pte)) {
> > +			/*
> > +			 * Anonymous folios need to be locked since hugetlb_wp()
> > +			 * checks whether we can re-use the folio exclusively
> > +			 * for us in case we are the only user of it.
> > +			 */
> 
> Should we move that comment to hugetlb_wp() instead? And if we are already
> doing this PTL unlock dance now, why not do it in hugetlb_wp() instead so we
> can simplify this code?

Yes, probably we can move it further up.
Let me see how it would look.

thanks! 

-- 
Oscar Salvador
SUSE Labs

Re: [PATCH v2 2/5] mm,hugetlb: Sort out folio locking in the faulting path

Posted by Oscar Salvador 3 months, 2 weeks ago

On Wed, Jun 25, 2025 at 09:47:00AM +0200, Oscar Salvador wrote:
> On Mon, Jun 23, 2025 at 04:11:38PM +0200, David Hildenbrand wrote:
> > > @@ -6801,13 +6810,24 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > >   		/* Fallthrough to CoW */
> > >   	}
> > > -	/* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
> > > -	folio = page_folio(pte_page(vmf.orig_pte));
> > > -	folio_lock(folio);
> > > -	folio_get(folio);
> > > -
> > >   	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
> > >   		if (!huge_pte_write(vmf.orig_pte)) {
> > > +			/*
> > > +			 * Anonymous folios need to be locked since hugetlb_wp()
> > > +			 * checks whether we can re-use the folio exclusively
> > > +			 * for us in case we are the only user of it.
> > > +			 */
> > 
> > Should we move that comment to hugetlb_wp() instead? And if we are already
> > doing this PTL unlock dance now, why not do it in hugetlb_wp() instead so we
> > can simplify this code?
> 
> Yes, probably we can move it further up.
> Let me see how it would look.

So, I've been thinking about this, and I'm not so sure.
By default, the state of the folio in hugetlb_no_page() and hugetlb_fault() is
different.

hugetlb_no_page() already has the folio locked, and hugetlb_fault() doesn't, which
means that if we want to move this further up, either 1) hugetlb_no_page() would have to
unlock the folio so that hugetlb_wp() can lock it again in case it's anonymous, or
2) we would have to pass a parameter to hugetlb_wp() to let it know whether the folio is
already locked.

I don't really like either of them. Case 1) seems suboptimal, as right now (with this patch)
we only unlock the folio in the !anon case in hugetlb_no_page(). If we want to move the 'dance'
from hugetlb_fault() to hugetlb_wp(), we'd have to unlock it and then lock it again.

-- 
Oscar Salvador
SUSE Labs
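
The trade-off between the two options discussed in this reply can be made concrete by counting lock operations. This is a rough user-space sketch under stated assumptions, not the kernel implementation: all model_ names are hypothetical, the folio lock is a flag with a counter, and the only path modelled is the one from hugetlb_no_page(), where the folio arrives locked as the thread describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; lock_ops counts lock plus unlock calls. */
struct model_folio {
	bool locked;
	bool anon;
	int lock_ops;
};

static void model_folio_lock(struct model_folio *f)
{
	assert(!f->locked);
	f->locked = true;
	f->lock_ops++;
}

static void model_folio_unlock(struct model_folio *f)
{
	assert(f->locked);
	f->locked = false;
	f->lock_ops++;
}

/* Option 1: the callee always takes the lock itself for anonymous folios. */
static void model_wp_option1(struct model_folio *f)
{
	if (f->anon) {
		model_folio_lock(f);
		/* the exclusive re-use check would run here */
		model_folio_unlock(f);
	}
}

/* Option 2: the caller says whether the folio is already locked. */
static void model_wp_option2(struct model_folio *f, bool already_locked)
{
	bool locked_here = false;

	if (f->anon && !already_locked) {
		model_folio_lock(f);
		locked_here = true;
	}
	/* the exclusive re-use check would run here */
	if (locked_here)
		model_folio_unlock(f);
}

/* No-page path under option 1: must unlock first so the callee can lock. */
int model_no_page_option1(bool anon)
{
	struct model_folio f = { .locked = true, .anon = anon, .lock_ops = 0 };

	model_folio_unlock(&f);
	model_wp_option1(&f);
	return f.lock_ops;
}

/* No-page path under option 2: only file folios are unlocked early. */
int model_no_page_option2(bool anon)
{
	struct model_folio f = { .locked = true, .anon = anon, .lock_ops = 0 };

	if (!anon)
		model_folio_unlock(&f);
	model_wp_option2(&f, f.locked);
	if (f.locked)
		model_folio_unlock(&f);
	return f.lock_ops;
}
```

In this model the anonymous case costs three lock operations under option 1 (unlock, lock, unlock) against one under option 2, while file folios cost one either way. That extra unlock/lock pair is the overhead the reply objects to. The price of option 2 is the additional parameter.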

[PATCH v2 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping
[PATCH v2 2/5] mm,hugetlb: Sort out folio locking in the faulting path
[PATCH v2 3/5] mm,hugetlb: Rename anon_rmap to new_anon_folio and make it boolean
[PATCH v2 4/5] mm,hugetlb: Drop obsolete comment about non-present pte and second faults
[PATCH v2 5/5] mm,hugetlb: Drop unlikelys from hugetlb_fault