[PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()

David Hildenbrand (Red Hat) posted 4 patches 1 week, 3 days ago
There is a newer version of this series
[PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
Posted by David Hildenbrand (Red Hat) 1 week, 3 days ago
Ever since we stopped using the page count to detect shared PMD
page tables, these comments are outdated.

The only reason we have to flush the TLB early is because once we drop
the i_mmap_rwsem, the previously shared page table could get freed (to
then get reallocated and used for another purpose). So we really have to
flush the TLB before that could happen.

So let's simplify the comments a bit.

The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
correctly after huge_pmd_unshare") was confusing: of course it is recorded
in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
anything. So let's drop that comment while at it as well.

We'll centralize these comments in a single helper as we rework the code
next.

Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Cc: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 mm/hugetlb.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51273baec9e5d..3c77cdef12a32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_end_vma(tlb, vma);
 
 	/*
-	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
-	 * could defer the flush until now, since by holding i_mmap_rwsem we
-	 * guaranteed that the last reference would not be dropped. But we must
-	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
-	 * dropped and the last reference to the shared PMDs page might be
-	 * dropped as well.
-	 *
-	 * In theory we could defer the freeing of the PMD pages as well, but
-	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
-	 * detect sharing, so we cannot defer the release of the page either.
-	 * Instead, do flush now.
+	 * There is nothing protecting a previously-shared page table that we
+	 * unshared through huge_pmd_unshare() from getting freed after we
+	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
+	 * succeeded, flush the range corresponding to the pud.
 	 */
 	if (force_flush)
 		tlb_flush_mmu_tlbonly(tlb);
@@ -6536,11 +6529,10 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 		cond_resched();
 	}
 	/*
-	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
-	 * may have cleared our pud entry and done put_page on the page table:
-	 * once we release i_mmap_rwsem, another task can do the final put_page
-	 * and that page table be reused and filled with junk.  If we actually
-	 * did unshare a page of pmds, flush the range corresponding to the pud.
+	 * There is nothing protecting a previously-shared page table that we
+	 * unshared through huge_pmd_unshare() from getting freed after we
+	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
+	 * succeeded, flush the range corresponding to the pud.
 	 */
 	if (shared_pmd)
 		flush_hugetlb_tlb_range(vma, range.start, range.end);
-- 
2.52.0
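For readers following along, the locking/flush ordering that both reworked
comments describe boils down to the following shape (a condensed,
illustrative sketch loosely based on the hugetlb_change_protection() flow,
not the verbatim kernel code; huge_pmd_unshare(), i_mmap_lock_write() and
flush_hugetlb_tlb_range() are the real helper names, everything else is
simplified for illustration):

```c
/*
 * Simplified sketch (not the actual mm/hugetlb.c code) of the ordering
 * the reworked comments describe. The helper names are real kernel
 * functions; the surrounding structure is condensed.
 */
i_mmap_lock_write(vma->vm_file->f_mapping);

shared_pmd = false;
for (address = range.start; address < range.end; address += psize) {
	ptep = hugetlb_walk(vma, address, psize);
	if (!ptep)
		continue;
	if (huge_pmd_unshare(mm, vma, address, ptep)) {
		/* pud entry cleared; other CPUs may still hold stale TLB entries */
		shared_pmd = true;
		continue;
	}
	/* ... regular per-PTE processing ... */
}

/*
 * Flush before dropping i_mmap_rwsem: the lock is all that keeps the
 * previously-shared PMD page table from being freed and reused, so any
 * stale translations must be gone before we release it.
 */
if (shared_pmd)
	flush_hugetlb_tlb_range(vma, range.start, range.end);

i_mmap_unlock_write(vma->vm_file->f_mapping);
```

In __unmap_hugepage_range() the same ordering is achieved differently:
the unshare sets force_flush, and tlb_flush_mmu_tlbonly() performs the
flush recorded in the mmu_gather before the caller can drop the lock.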
Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
Posted by Oscar Salvador 5 days, 1 hour ago
On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
> Ever since we stopped using the page count to detect shared PMD
> page tables, these comments are outdated.
> 
> The only reason we have to flush the TLB early is because once we drop
> the i_mmap_rwsem, the previously shared page table could get freed (to
> then get reallocated and used for another purpose). So we really have to
> flush the TLB before that could happen.
> 
> So let's simplify the comments a bit.
> 
> The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
> part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
> correctly after huge_pmd_unshare") was confusing: of course it is recorded
> in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
> anything. So let's drop that comment while at it as well.
> 
> We'll centralize these comments in a single helper as we rework the code
> next.
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Acked-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
Posted by Lorenzo Stoakes 5 days, 20 hours ago
On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
> Ever since we stopped using the page count to detect shared PMD
> page tables, these comments are outdated.
>
> The only reason we have to flush the TLB early is because once we drop
> the i_mmap_rwsem, the previously shared page table could get freed (to
> then get reallocated and used for another purpose). So we really have to
> flush the TLB before that could happen.
>
> So let's simplify the comments a bit.
>
> The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
> part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
> correctly after huge_pmd_unshare") was confusing: of course it is recorded
> in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
> anything. So let's drop that comment while at it as well.
>
> We'll centralize these comments in a single helper as we rework the code
> next.
>
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/hugetlb.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 51273baec9e5d..3c77cdef12a32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	tlb_end_vma(tlb, vma);
>
>  	/*
> -	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
> -	 * could defer the flush until now, since by holding i_mmap_rwsem we
> -	 * guaranteed that the last reference would not be dropped. But we must
> -	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
> -	 * dropped and the last reference to the shared PMDs page might be
> -	 * dropped as well.
> -	 *
> -	 * In theory we could defer the freeing of the PMD pages as well, but
> -	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
> -	 * detect sharing, so we cannot defer the release of the page either.

Was it this comment that led you to question the page_count issue? :)

> -	 * Instead, do flush now.
> +	 * There is nothing protecting a previously-shared page table that we
> +	 * unshared through huge_pmd_unshare() from getting freed after we
> +	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> +	 * succeeded, flush the range corresponding to the pud.
>  	 */
>  	if (force_flush)
>  		tlb_flush_mmu_tlbonly(tlb);
> @@ -6536,11 +6529,10 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  		cond_resched();
>  	}
>  	/*
> -	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
> -	 * may have cleared our pud entry and done put_page on the page table:
> -	 * once we release i_mmap_rwsem, another task can do the final put_page
> -	 * and that page table be reused and filled with junk.  If we actually
> -	 * did unshare a page of pmds, flush the range corresponding to the pud.
> +	 * There is nothing protecting a previously-shared page table that we
> +	 * unshared through huge_pmd_unshare() from getting freed after we
> +	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> +	 * succeeded, flush the range corresponding to the pud.
>  	 */
>  	if (shared_pmd)
>  		flush_hugetlb_tlb_range(vma, range.start, range.end);
> --
> 2.52.0
>
Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
Posted by David Hildenbrand (Red Hat) 5 days, 5 hours ago
On 12/10/25 12:22, Lorenzo Stoakes wrote:
> On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
>> Ever since we stopped using the page count to detect shared PMD
>> page tables, these comments are outdated.
>>
>> The only reason we have to flush the TLB early is because once we drop
>> the i_mmap_rwsem, the previously shared page table could get freed (to
>> then get reallocated and used for another purpose). So we really have to
>> flush the TLB before that could happen.
>>
>> So let's simplify the comments a bit.
>>
>> The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
>> part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
>> correctly after huge_pmd_unshare") was confusing: of course it is recorded
>> in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
>> anything. So let's drop that comment while at it as well.
>>
>> We'll centralize these comments in a single helper as we rework the code
>> next.
>>
>> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
>> Cc: Liu Shixin <liushixin2@huawei.com>
>> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> 
> LGTM, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> 
>> ---
>>   mm/hugetlb.c | 24 ++++++++----------------
>>   1 file changed, 8 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 51273baec9e5d..3c77cdef12a32 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>   	tlb_end_vma(tlb, vma);
>>
>>   	/*
>> -	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
>> -	 * could defer the flush until now, since by holding i_mmap_rwsem we
>> -	 * guaranteed that the last reference would not be dropped. But we must
>> -	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
>> -	 * dropped and the last reference to the shared PMDs page might be
>> -	 * dropped as well.
>> -	 *
>> -	 * In theory we could defer the freeing of the PMD pages as well, but
>> -	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
>> -	 * detect sharing, so we cannot defer the release of the page either.
> 
> Was it this comment that led you to question the page_count issue? :)

Heh, no, I know about the changed handling already. I stumbled over the 
remaining page_count() usage while working on some cleanups I previously 
had as part of this series :)

-- 
Cheers

David
Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
Posted by Lorenzo Stoakes 21 hours ago
On Thu, Dec 11, 2025 at 02:58:51AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/10/25 12:22, Lorenzo Stoakes wrote:
> > On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
> > > Ever since we stopped using the page count to detect shared PMD
> > > page tables, these comments are outdated.
> > >
> > > The only reason we have to flush the TLB early is because once we drop
> > > the i_mmap_rwsem, the previously shared page table could get freed (to
> > > then get reallocated and used for another purpose). So we really have
> > > flush the TLB before that could happen.
> > >
> > > So let's simplify the comments a bit.
> > >
> > > The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
> > > part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
> > > correctly after huge_pmd_unshare") was confusing: of course it is recorded
> > > in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
> > > anything. So let's drop that comment while at it as well.
> > >
> > > We'll centralize these comments in a single helper as we rework the code
> > > next.
> > >
> > > Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> > > Cc: Liu Shixin <liushixin2@huawei.com>
> > > Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> >
> > LGTM, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks!
>
> >
> > > ---
> > >   mm/hugetlb.c | 24 ++++++++----------------
> > >   1 file changed, 8 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 51273baec9e5d..3c77cdef12a32 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > >   	tlb_end_vma(tlb, vma);
> > >
> > >   	/*
> > > -	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
> > > -	 * could defer the flush until now, since by holding i_mmap_rwsem we
> > > -	 * guaranteed that the last reference would not be dropped. But we must
> > > -	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
> > > -	 * dropped and the last reference to the shared PMDs page might be
> > > -	 * dropped as well.
> > > -	 *
> > > -	 * In theory we could defer the freeing of the PMD pages as well, but
> > > -	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
> > > -	 * detect sharing, so we cannot defer the release of the page either.
> >
> > Was it this comment that led you to question the page_count issue? :)
>
> Heh, no, I know about the changed handling already. I stumbled over the
> remaining page_count() usage while working on some cleanups I previously had
> as part of this series :)

Ah :)

>
> --
> Cheers
>
> David
Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
Posted by Rik van Riel 1 week, 3 days ago
On Fri, 2025-12-05 at 22:35 +0100, David Hildenbrand (Red Hat) wrote:
> 
> So let's simplify the comments a bit.
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared
> count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.