Some of zap_page_range_single() callers such as [process_]madvise() with
MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
zap_page_range_single() does tlb flushing for each invocation. Split
out the body of zap_page_range_single() except mmu_gather object
initialization and gathered tlb entries flushing parts for such batched
tlb flushing usage.
Signed-off-by: SeongJae Park <sj@kernel.org>
---
mm/memory.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 78c7ee62795e..88c478e2ed1a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
mmu_notifier_invalidate_range_end(&range);
}
-/**
- * zap_page_range_single - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @address: starting address of pages to zap
- * @size: number of bytes to zap
- * @details: details of shared cache invalidation
- *
- * The range must fit into one VMA.
- */
-void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+static void unmap_vma_single(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
{
const unsigned long end = address + size;
struct mmu_notifier_range range;
- struct mmu_gather tlb;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
address, end);
hugetlb_zap_begin(vma, &range.start, &range.end);
- tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
/*
* unmap 'address-end' not 'range.start-range.end' as range
* could have been expanded for hugetlb pmd sharing.
*/
- unmap_single_vma(&tlb, vma, address, end, details, false);
+ unmap_single_vma(tlb, vma, address, end, details, false);
mmu_notifier_invalidate_range_end(&range);
- tlb_finish_mmu(&tlb);
hugetlb_zap_end(vma, details);
}
+/**
+ * zap_page_range_single - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
+ * @details: details of shared cache invalidation
+ *
+ * The range must fit into one VMA.
+ */
+void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
+ unsigned long size, struct zap_details *details)
+{
+ struct mmu_gather tlb;
+
+ tlb_gather_mmu(&tlb, vma->vm_mm);
+ unmap_vma_single(&tlb, vma, address, size, details);
+ tlb_finish_mmu(&tlb);
+}
+
/**
* zap_vma_ptes - remove ptes mapping the vma
* @vma: vm_area_struct holding ptes to be zapped
--
2.39.5
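
For context, a minimal sketch of the batched usage this split is meant to
enable, assuming a later patch makes unmap_vma_single() callable from madvise
code; 'struct vma_range', 'zap_ranges_batched()' and the caller loop are
illustrative assumptions of this sketch, not part of this series:

/*
 * Hypothetical caller sketch: zap several ranges, each fitting within one
 * vma, while flushing the gathered tlb entries only once at the end.
 */
struct vma_range {
	struct vm_area_struct *vma;
	unsigned long start;
	unsigned long size;
};

static void zap_ranges_batched(struct mm_struct *mm,
			       struct vma_range *ranges, int nr)
{
	struct mmu_gather tlb;
	int i;

	tlb_gather_mmu(&tlb, mm);
	for (i = 0; i < nr; i++)
		unmap_vma_single(&tlb, ranges[i].vma, ranges[i].start,
				 ranges[i].size, NULL);
	tlb_finish_mmu(&tlb);	/* single flush for the whole batch */
}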
* SeongJae Park <sj@kernel.org> [250310 13:24]:
> Some of zap_page_range_single() callers such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> zap_page_range_single() does tlb flushing for each invocation. Split
> out the body of zap_page_range_single() except mmu_gather object
> initialization and gathered tlb entries flushing parts for such batched
> tlb flushing usage.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/memory.c | 36 ++++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 78c7ee62795e..88c478e2ed1a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> mmu_notifier_invalidate_range_end(&range);
> }
>
> -/**
> - * zap_page_range_single - remove user pages in a given range
> - * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> - * @details: details of shared cache invalidation
> - *
> - * The range must fit into one VMA.
> - */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void unmap_vma_single(struct mmu_gather *tlb,
I could not, for the life of me, figure out what was going on here until
I realised that it is a new function name and not unmap_single_vma(),
which is called below.
Can we name this differently somehow? notify_unmap_single_vma() or
something better?
Also, maybe add a description of the function to this patch vs the next
patch?
> + struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details)
> {
> const unsigned long end = address + size;
> struct mmu_notifier_range range;
> - struct mmu_gather tlb;
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> address, end);
> hugetlb_zap_begin(vma, &range.start, &range.end);
> - tlb_gather_mmu(&tlb, vma->vm_mm);
> update_hiwater_rss(vma->vm_mm);
> mmu_notifier_invalidate_range_start(&range);
> /*
> * unmap 'address-end' not 'range.start-range.end' as range
> * could have been expanded for hugetlb pmd sharing.
> */
> - unmap_single_vma(&tlb, vma, address, end, details, false);
> + unmap_single_vma(tlb, vma, address, end, details, false);
> mmu_notifier_invalidate_range_end(&range);
> - tlb_finish_mmu(&tlb);
> hugetlb_zap_end(vma, details);
> }
>
> +/**
> + * zap_page_range_single - remove user pages in a given range
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of pages to zap
> + * @size: number of bytes to zap
> + * @details: details of shared cache invalidation
> + *
> + * The range must fit into one VMA.
> + */
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> + unsigned long size, struct zap_details *details)
> +{
> + struct mmu_gather tlb;
> +
> + tlb_gather_mmu(&tlb, vma->vm_mm);
> + unmap_vma_single(&tlb, vma, address, size, details);
> + tlb_finish_mmu(&tlb);
> +}
> +
> /**
> * zap_vma_ptes - remove ptes mapping the vma
> * @vma: vm_area_struct holding ptes to be zapped
> --
> 2.39.5
>
On Mon, 31 Mar 2025 21:45:40 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> * SeongJae Park <sj@kernel.org> [250310 13:24]:
> > Some of zap_page_range_single() callers such as [process_]madvise() with
> > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > zap_page_range_single() does tlb flushing for each invocation. Split
> > out the body of zap_page_range_single() except mmu_gather object
> > initialization and gathered tlb entries flushing parts for such batched
> > tlb flushing usage.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/memory.c | 36 ++++++++++++++++++++++--------------
> > 1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 78c7ee62795e..88c478e2ed1a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > mmu_notifier_invalidate_range_end(&range);
> > }
> >
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > - * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > - * @details: details of shared cache invalidation
> > - *
> > - * The range must fit into one VMA.
> > - */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void unmap_vma_single(struct mmu_gather *tlb,
>
> I could not, for the life of me, figure out what was going on here until
> I realised that it is a new function name and not unmap_single_vma(),
> which is called below.

Agreed, definitely the name is confusing, especially given the existence of
unmap_single_vma().

>
> Can we name this differently somehow? notify_unmap_single_vma() or
> something better?

notify_unmap_single_vma() sounds good to me. I'll use the name in the next
revision unless we find a better one.

>
> Also, maybe add a description of the function to this patch vs the next
> patch?

That makes sense. In the next revision, I will add the kernel-doc comment
here, but not as a valid kernel-doc comment (maybe starts with /* instead of
/**) since this function is a static function as of this patch. On the next
patch that makes this non-static, I will make the comment a valid kernel-doc
comment with a minimum change.

I prefer not having a valid kernel-doc comment for a static function, but
that's just a personal preference and I have no strong reason to object the
other way. Please feel free to let me know if you prefer making it a valid
kernel-doc comment starting from this patch.

Thank you for the nice suggestions, Liam.


Thanks,
SJ

[...]
* SeongJae Park <sj@kernel.org> [250331 22:48]:
> On Mon, 31 Mar 2025 21:45:40 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
>
> > * SeongJae Park <sj@kernel.org> [250310 13:24]:
> > > Some of zap_page_range_single() callers such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > > zap_page_range_single() does tlb flushing for each invocation. Split
> > > out the body of zap_page_range_single() except mmu_gather object
> > > initialization and gathered tlb entries flushing parts for such batched
> > > tlb flushing usage.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > > mm/memory.c | 36 ++++++++++++++++++++++--------------
> > > 1 file changed, 22 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 78c7ee62795e..88c478e2ed1a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > >
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > - * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > - * @details: details of shared cache invalidation
> > > - *
> > > - * The range must fit into one VMA.
> > > - */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void unmap_vma_single(struct mmu_gather *tlb,
> >
> > I could not, for the life of me, figure out what was going on here until
> > I realised that it is a new function name and not unmap_single_vma(),
> > which is called below.
>
> Agreed, definitely the name is confusing, especially given the existence of
> unmap_single_vma().
>
> > Can we name this differently somehow? notify_unmap_single_vma() or
> > something better?
>
> notify_unmap_single_vma() sounds good to me. I'll use the name in the next
> revision unless we find a better one.

Thanks. I don't really mind if you have anything else to name it, as
long as it reduces the confusion.

>
> > Also, maybe add a description of the function to this patch vs the next
> > patch?
>
> That makes sense. In the next revision, I will add the kernel-doc comment
> here, but not as a valid kernel-doc comment (maybe starts with /* instead of
> /**) since this function is a static function as of this patch. On the next
> patch that makes this non-static, I will make the comment a valid kernel-doc
> comment with a minimum change.
>
> I prefer not having a valid kernel-doc comment for a static function, but
> that's just a personal preference and I have no strong reason to object the
> other way. Please feel free to let me know if you prefer making it a valid
> kernel-doc comment starting from this patch.
>

Yes, that was what I was thinking as well.

...

Thanks,
Liam
On Tue, 1 Apr 2025 10:03:17 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> * SeongJae Park <sj@kernel.org> [250331 22:48]:
> > On Mon, 31 Mar 2025 21:45:40 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> >
> > > * SeongJae Park <sj@kernel.org> [250310 13:24]:
[...]
> Thanks. I don't really mind if you have anything else to name it, as
> long as it reduces the confusion.

Fully agreed and thanks again for the nice name suggestion.

[...]
> > That makes sense. In the next revision, I will add the kernel-doc comment
> > here, but not as a valid kernel-doc comment (maybe starts with /* instead of
> > /**) since this function is a static function as of this patch. On the next
> > patch that makes this non-static, I will make the comment a valid kernel-doc
> > comment with a minimum change.
> >
> > I prefer not having a valid kernel-doc comment for a static function, but
> > that's just a personal preference and I have no strong reason to object the
> > other way. Please feel free to let me know if you prefer making it a valid
> > kernel-doc comment starting from this patch.
> >
>
> Yes, that was what I was thinking as well.

Thanks for kindly clarifying, Liam :)


Thanks,
SJ

[...]
On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> Some of zap_page_range_single() callers such as [process_]madvise() with
> MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> zap_page_range_single() does tlb flushing for each invocation. Split
> out the body of zap_page_range_single() except mmu_gather object
> initialization and gathered tlb entries flushing parts for such batched
> tlb flushing usage.
>
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
> mm/memory.c | 36 ++++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 78c7ee62795e..88c478e2ed1a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> mmu_notifier_invalidate_range_end(&range);
> }
>
> -/**
> - * zap_page_range_single - remove user pages in a given range
> - * @vma: vm_area_struct holding the applicable pages
> - * @address: starting address of pages to zap
> - * @size: number of bytes to zap
> - * @details: details of shared cache invalidation
> - *
> - * The range must fit into one VMA.
> - */
> -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> +static void unmap_vma_single(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details)
> {
> const unsigned long end = address + size;
> struct mmu_notifier_range range;
> - struct mmu_gather tlb;
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> address, end);
> hugetlb_zap_begin(vma, &range.start, &range.end);
> - tlb_gather_mmu(&tlb, vma->vm_mm);
> update_hiwater_rss(vma->vm_mm);
> mmu_notifier_invalidate_range_start(&range);
> /*
> * unmap 'address-end' not 'range.start-range.end' as range
> * could have been expanded for hugetlb pmd sharing.
> */
> - unmap_single_vma(&tlb, vma, address, end, details, false);
> + unmap_single_vma(tlb, vma, address, end, details, false);
> mmu_notifier_invalidate_range_end(&range);
> - tlb_finish_mmu(&tlb);
> hugetlb_zap_end(vma, details);
Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
before?
This seems like a major problem with this change. If not you need to explain why
not in the commit message.
> }
>
> +/**
> + * zap_page_range_single - remove user pages in a given range
> + * @vma: vm_area_struct holding the applicable pages
> + * @address: starting address of pages to zap
> + * @size: number of bytes to zap
> + * @details: details of shared cache invalidation
> + *
> + * The range must fit into one VMA.
> + */
> +void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> + unsigned long size, struct zap_details *details)
> +{
> + struct mmu_gather tlb;
> +
> + tlb_gather_mmu(&tlb, vma->vm_mm);
> + unmap_vma_single(&tlb, vma, address, size, details);
> + tlb_finish_mmu(&tlb);
> +}
> +
> /**
> * zap_vma_ptes - remove ptes mapping the vma
> * @vma: vm_area_struct holding ptes to be zapped
> --
> 2.39.5
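
To restate the ordering concern in condensed form (a simplification of the
diff above; the mmu_notifier, hugetlb_zap_begin() and update_hiwater_rss()
calls are omitted):

/* Before this patch, zap_page_range_single() ended with: */
	unmap_single_vma(&tlb, vma, address, end, details, false);
	tlb_finish_mmu(&tlb);		/* flush tlb and free the pages ... */
	hugetlb_zap_end(vma, details);	/* ... then let hugetlb faults retry */

/* With this patch, hugetlb_zap_end() moves into unmap_vma_single(): */
	unmap_single_vma(tlb, vma, address, end, details, false);
	hugetlb_zap_end(vma, details);	/* hugetlb faults may retry here ... */
	/* ... while the caller's tlb_finish_mmu() has not run yet */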
On Tue, 11 Mar 2025 12:45:44 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> > Some of zap_page_range_single() callers such as [process_]madvise() with
> > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > zap_page_range_single() does tlb flushing for each invocation. Split
> > out the body of zap_page_range_single() except mmu_gather object
> > initialization and gathered tlb entries flushing parts for such batched
> > tlb flushing usage.
> >
> > Signed-off-by: SeongJae Park <sj@kernel.org>
> > ---
> > mm/memory.c | 36 ++++++++++++++++++++++--------------
> > 1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 78c7ee62795e..88c478e2ed1a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > mmu_notifier_invalidate_range_end(&range);
> > }
> >
> > -/**
> > - * zap_page_range_single - remove user pages in a given range
> > - * @vma: vm_area_struct holding the applicable pages
> > - * @address: starting address of pages to zap
> > - * @size: number of bytes to zap
> > - * @details: details of shared cache invalidation
> > - *
> > - * The range must fit into one VMA.
> > - */
> > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > +static void unmap_vma_single(struct mmu_gather *tlb,
> > + struct vm_area_struct *vma, unsigned long address,
> > unsigned long size, struct zap_details *details)
> > {
> > const unsigned long end = address + size;
> > struct mmu_notifier_range range;
> > - struct mmu_gather tlb;
> >
> > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > address, end);
> > hugetlb_zap_begin(vma, &range.start, &range.end);
> > - tlb_gather_mmu(&tlb, vma->vm_mm);
> > update_hiwater_rss(vma->vm_mm);
> > mmu_notifier_invalidate_range_start(&range);
> > /*
> > * unmap 'address-end' not 'range.start-range.end' as range
> > * could have been expanded for hugetlb pmd sharing.
> > */
> > - unmap_single_vma(&tlb, vma, address, end, details, false);
> > + unmap_single_vma(tlb, vma, address, end, details, false);
> > mmu_notifier_invalidate_range_end(&range);
> > - tlb_finish_mmu(&tlb);
> > hugetlb_zap_end(vma, details);
>
> Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
> before?
>
> This seems like a major problem with this change.
Oh, you're right. This could re-introduce the racy hugetlb allocation failure
problem that was fixed by commit 2820b0f09be9 ("hugetlbfs: close race between
MADV_DONTNEED and page fault"). That is, this patch can make hugetlb
allocation failures increase while MADV_DONTNEED is going on.
Maybe a straightforward fix of the problem is doing hugetlb_zap_end() for all
vmas in a batched manner, similar to that for tlb flush. For example, add a
list or an array for the vmas in 'struct madvise_behavior', let
'unmap_vma_single()' add each vma there, and call hugetlb_zap_end() for
gathered vmas at vector_madvise() or do_madvise(). Does that make sense?
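
For illustration, a rough sketch of that batching idea; the structure, field
names and helper below only mirror the description above and are hypothetical,
not existing kernel code:

/*
 * Hypothetical sketch: gather the vmas that were zapped in one batch so that
 * hugetlb_zap_end() can run only after the single, batched tlb flush.
 */
struct madvise_batch_sketch {
	struct mmu_gather *tlb;
	struct vm_area_struct *zapped_vmas[16];	/* filled while zapping */
	int nr_zapped;
};

static void madvise_finish_batch_sketch(struct madvise_batch_sketch *batch,
					struct zap_details *details)
{
	int i;

	tlb_finish_mmu(batch->tlb);		/* flush gathered tlb entries first */
	for (i = 0; i < batch->nr_zapped; i++)	/* then re-allow hugetlb faults */
		hugetlb_zap_end(batch->zapped_vmas[i], details);
	batch->nr_zapped = 0;
}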
Also Cc-ing Rik, who is the author of the commit 2820b0f09be9 ("hugetlbfs:
close race between MADV_DONTNEED and page fault") for a case that I'm missing
something important.
> If not you need to explain why
> not in the commit message.
I now think it is a problem. If it turns out I'm wrong, I will of course add
the reason on the commit message.
Thanks,
SJ
[...]
On Tue, 11 Mar 2025 13:58:01 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Tue, 11 Mar 2025 12:45:44 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> > > Some of zap_page_range_single() callers such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > > zap_page_range_single() does tlb flushing for each invocation. Split
> > > out the body of zap_page_range_single() except mmu_gather object
> > > initialization and gathered tlb entries flushing parts for such batched
> > > tlb flushing usage.
> > >
> > > Signed-off-by: SeongJae Park <sj@kernel.org>
> > > ---
> > > mm/memory.c | 36 ++++++++++++++++++++++--------------
> > > 1 file changed, 22 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 78c7ee62795e..88c478e2ed1a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > >
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > - * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > - * @details: details of shared cache invalidation
> > > - *
> > > - * The range must fit into one VMA.
> > > - */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void unmap_vma_single(struct mmu_gather *tlb,
> > > + struct vm_area_struct *vma, unsigned long address,
> > > unsigned long size, struct zap_details *details)
> > > {
> > > const unsigned long end = address + size;
> > > struct mmu_notifier_range range;
> > > - struct mmu_gather tlb;
> > >
> > > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > > address, end);
> > > hugetlb_zap_begin(vma, &range.start, &range.end);
> > > - tlb_gather_mmu(&tlb, vma->vm_mm);
> > > update_hiwater_rss(vma->vm_mm);
> > > mmu_notifier_invalidate_range_start(&range);
> > > /*
> > > * unmap 'address-end' not 'range.start-range.end' as range
> > > * could have been expanded for hugetlb pmd sharing.
> > > */
> > > - unmap_single_vma(&tlb, vma, address, end, details, false);
> > > + unmap_single_vma(tlb, vma, address, end, details, false);
> > > mmu_notifier_invalidate_range_end(&range);
> > > - tlb_finish_mmu(&tlb);
> > > hugetlb_zap_end(vma, details);
> >
> > Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it happens
> > before?
> >
> > This seems like a major problem with this change.
>
> Oh, you're right. This could re-introduce the racy hugetlb allocation failure
> problem that was fixed by commit 2820b0f09be9 ("hugetlbfs: close race between
> MADV_DONTNEED and page fault"). That is, this patch can make hugetlb
> allocation failures increase while MADV_DONTNEED is going on.
>
> Maybe a straightforward fix of the problem is doing hugetlb_zap_end() for all
> vmas in a batched manner, similar to that for tlb flush. For example, add a
> list or an array for the vmas in 'struct madvise_behavior', let
> 'unmap_vma_single()' add each vma there, and call hugetlb_zap_end() for
> gathered vmas at vector_madvise() or do_madvise(). Does that make sense?
>
> Also Cc-ing Rik, who is the author of the commit 2820b0f09be9 ("hugetlbfs:
> close race between MADV_DONTNEED and page fault") for a case that I'm missing
> something important.
I now think the straightforward fix I mentioned in the previous message might
be an unnecessarily big change. Maybe letting the unmap_vma_single() caller do
hugetlb_zap_end() and tlb_finish_mmu() on its own, in the correct sequence, could
be another way? Then zap_page_range_single() can do the calls for each
invocation as it did before. process_madvise() could do batched tlb flushes
only for the non-hugetlb case. That is, do the tlb entry gathering as this
version of the patch series proposes, as usual. But check whether the address
range is for hugetlb and therefore really requires a hugetlb_zap_end() call. If
so, flush the tlb entries gathered so far, call hugetlb_zap_end(), and then
start the next batch?
In other words, I'm proposing to split the batched flushes when a hugetlb is
encountered. This means that tlb flush overhead reduction might be smaller
than expected if process_madvise() for unmapping hugetlb pages is intensively
invoked. But I don't think that's a common use case. Having the benefit for
non-hugetlb pages with a simple change first, and revisiting the hugetlb case later
once the problem comes out might be a way, in my opinion.
For example, my idea could be implemented like below, on top of this entire patch
series.
diff --git a/mm/madvise.c b/mm/madvise.c
index 4021db51aeda..e6a74e7ef864 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -861,6 +861,20 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *behavior,
};
unmap_vma_single(behavior->tlb, vma, start, end - start, &details);
+ /*
+ * hugetlb_zap_end() should be called after tlb_finish_mmu() to avoid
+ * hugetlb faults for the tlb-flushing memory happening before freeing of
+ * the memory. If not, the fault will fail memory allocation.
+ *
+ * If hugetlb_zap_end() really needs to be called, flush the so-far gathered
+ * tlb entries, invoke hugetlb_zap_end(), and start another batch of
+ * tlb flushes for the remaining unmap work.
+ */
+ if (is_vm_hugetlb_page(vma)) {
+ tlb_finish_mmu(behavior->tlb);
+ hugetlb_zap_end(vma, &details);
+ tlb_gather_mmu(behavior->tlb, vma->vm_mm);
+ }
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index add8d540cb63..4431630d3240 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2023,7 +2023,6 @@ void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
unmap_single_vma(tlb, vma, address, end, details, false);
mmu_notifier_invalidate_range_end(&range);
- hugetlb_zap_end(vma, details);
}
/**
@@ -2043,6 +2042,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
tlb_gather_mmu(&tlb, vma->vm_mm);
unmap_vma_single(&tlb, vma, address, size, details);
tlb_finish_mmu(&tlb);
+ hugetlb_zap_end(vma, details);
}
/**
Any concern or something I'm missing?
Thanks,
SJ
[...]