[PATCH] mm: readahead: make thp readahead conditional to mmap_miss logic

Posted by Roman Gushchin 2 months, 2 weeks ago
Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
introduced special handling for VM_HUGEPAGE mappings: even if readahead
is disabled, 1 or 2 HPAGE_PMD_ORDER pages are allocated.

This change causes a significant regression for containers with a
tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
commit, the mmap_miss logic would eventually disable readahead,
effectively reducing the memory pressure in the cgroup. With this
change the kernel tries to allocate 1-2 huge pages for each fault,
whether or not these pages are used before being evicted, increasing
the memory pressure many-fold.
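
For scale: on x86-64 (HPAGE_PMD_NR = 512, i.e. 2MB PMD folios), a single
page fault can now allocate up to 2 * 2MB = 4MB of page cache, where the
old behavior with readahead disabled would have brought in a single 4KB
page. VM_HUGEPAGE is typically set per-VMA from userspace; a minimal
sketch (fd and len are the caller's file descriptor and mapping length):

	#include <sys/mman.h>

	/* Map a file read-only and request THP backing for the mapping. */
	static void *map_file_thp(int fd, size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

		if (p != MAP_FAILED)
			madvise(p, len, MADV_HUGEPAGE); /* sets VM_HUGEPAGE */
		return p;
	}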

To fix the regression, let's make the new VM_HUGEPAGE handling
conditional on the mmap_miss check, but keep it independent of
ra->ra_pages. This way the main intention of commit 4687fdbb805a
("mm/filemap: Support VM_HUGEPAGE for file mappings") stays intact,
but the regression is resolved.

The logic behind this change is simple: even if a user explicitly
requests huge pages to back the file mapping (via the VM_HUGEPAGE
flag), under very strong memory pressure it's better to fall back
to ordinary pages.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-mm@kvack.org
---
 mm/filemap.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a52dd38d2b4a..b67d7981fafb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3235,34 +3235,20 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
 	struct file *fpin = NULL;
 	vm_flags_t vm_flags = vmf->vma->vm_flags;
+	bool force_thp_readahead = false;
 	unsigned short mmap_miss;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Use the readahead code, even if readahead is disabled */
-	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
-		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
-		ra->size = HPAGE_PMD_NR;
-		/*
-		 * Fetch two PMD folios, so we get the chance to actually
-		 * readahead, unless we've been told not to.
-		 */
-		if (!(vm_flags & VM_RAND_READ))
-			ra->size *= 2;
-		ra->async_size = HPAGE_PMD_NR;
-		ra->order = HPAGE_PMD_ORDER;
-		page_cache_ra_order(&ractl, ra);
-		return fpin;
-	}
-#endif
-
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
+		force_thp_readahead = true;
 	/*
 	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
 	 * already intended for random access.
 	 */
 	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
 		return fpin;
-	if (!ra->ra_pages)
+	if (!ra->ra_pages && !force_thp_readahead)
 		return fpin;
 
 	if (vm_flags & VM_SEQ_READ) {
@@ -3283,6 +3269,22 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	if (mmap_miss > MMAP_LOTSAMISS)
 		return fpin;
 
+	if (force_thp_readahead) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
+		ra->size = HPAGE_PMD_NR;
+		/*
+		 * Fetch two PMD folios, so we get the chance to actually
+		 * readahead, unless we've been told not to.
+		 */
+		if (!(vm_flags & VM_RAND_READ))
+			ra->size *= 2;
+		ra->async_size = HPAGE_PMD_NR;
+		ra->order = HPAGE_PMD_ORDER;
+		page_cache_ra_order(&ractl, ra);
+		return fpin;
+	}
+
 	if (vm_flags & VM_EXEC) {
 		/*
 		 * Allow arch to request a preferred minimum folio order for
-- 
2.51.0
Re: [PATCH] mm: readahead: make thp readahead conditional to mmap_miss logic
Posted by Dev Jain 2 months, 1 week ago
On 30/09/25 11:18 am, Roman Gushchin wrote:
> Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> introduced special handling for VM_HUGEPAGE mappings: even if readahead
> is disabled, 1 or 2 HPAGE_PMD_ORDER pages are allocated.
>
> This change causes a significant regression for containers with a
> tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> commit, the mmap_miss logic would eventually disable readahead,
> effectively reducing the memory pressure in the cgroup. With this
> change the kernel tries to allocate 1-2 huge pages for each fault,
> whether or not these pages are used before being evicted, increasing
> the memory pressure many-fold.
>
> To fix the regression, let's make the new VM_HUGEPAGE handling
> conditional on the mmap_miss check, but keep it independent of
> ra->ra_pages. This way the main intention of commit 4687fdbb805a
> ("mm/filemap: Support VM_HUGEPAGE for file mappings") stays intact,
> but the regression is resolved.
>
> The logic behind this change is simple: even if a user explicitly
> requests huge pages to back the file mapping (via the VM_HUGEPAGE
> flag), under very strong memory pressure it's better to fall back
> to ordinary pages.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-mm@kvack.org
> ---
>   mm/filemap.c | 40 +++++++++++++++++++++-------------------
>   1 file changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a52dd38d2b4a..b67d7981fafb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3235,34 +3235,20 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>   	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
>   	struct file *fpin = NULL;
>   	vm_flags_t vm_flags = vmf->vma->vm_flags;
> +	bool force_thp_readahead = false;
>   	unsigned short mmap_miss;
>   
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   	/* Use the readahead code, even if readahead is disabled */
> -	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> -		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> -		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> -		ra->size = HPAGE_PMD_NR;
> -		/*
> -		 * Fetch two PMD folios, so we get the chance to actually
> -		 * readahead, unless we've been told not to.
> -		 */
> -		if (!(vm_flags & VM_RAND_READ))
> -			ra->size *= 2;
> -		ra->async_size = HPAGE_PMD_NR;
> -		ra->order = HPAGE_PMD_ORDER;
> -		page_cache_ra_order(&ractl, ra);
> -		return fpin;
> -	}
> -#endif
> -
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> +	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> +		force_thp_readahead = true;
>   	/*
>   	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
>   	 * already intended for random access.
>   	 */
>   	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
>   		return fpin;
> -	if (!ra->ra_pages)
> +	if (!ra->ra_pages && !force_thp_readahead)
>   		return fpin;
>   
>   	if (vm_flags & VM_SEQ_READ) {
> @@ -3283,6 +3269,22 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>   	if (mmap_miss > MMAP_LOTSAMISS)
>   		return fpin;
>   

You have moved the PMD-THP logic below the VM_SEQ_READ check; is that intentional?
If my understanding is correct, VMAs on which sequential read is expected will now
use the common readahead algorithm instead of always benefiting from the reduced
TLB pressure of PMD mappings.
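
For reference, the VM_SEQ_READ branch such VMAs would now take (quoting
mm/filemap.c as of this patch's base, if I'm reading the tree correctly):

	if (vm_flags & VM_SEQ_READ) {
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
		page_cache_sync_ra(&ractl, ra->ra_pages);
		return fpin;
	}

i.e. the readahead window is sized from ra->ra_pages rather than forced
to PMD order.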

> +	if (force_thp_readahead) {
> +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> +		ra->size = HPAGE_PMD_NR;
> +		/*
> +		 * Fetch two PMD folios, so we get the chance to actually
> +		 * readahead, unless we've been told not to.
> +		 */
> +		if (!(vm_flags & VM_RAND_READ))
> +			ra->size *= 2;
> +		ra->async_size = HPAGE_PMD_NR;
> +		ra->order = HPAGE_PMD_ORDER;
> +		page_cache_ra_order(&ractl, ra);
> +		return fpin;
> +	}
> +
>   	if (vm_flags & VM_EXEC) {
>   		/*
>   		 * Allow arch to request a preferred minimum folio order for
Re: [PATCH] mm: readahead: make thp readahead conditional to mmap_miss logic
Posted by Jan Kara 2 months, 1 week ago
On Sat 04-10-25 18:38:25, Dev Jain wrote:
> 
> On 30/09/25 11:18 am, Roman Gushchin wrote:
> > Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> > introduced special handling for VM_HUGEPAGE mappings: even if readahead
> > is disabled, 1 or 2 HPAGE_PMD_ORDER pages are allocated.
> >
> > This change causes a significant regression for containers with a
> > tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> > commit, the mmap_miss logic would eventually disable readahead,
> > effectively reducing the memory pressure in the cgroup. With this
> > change the kernel tries to allocate 1-2 huge pages for each fault,
> > whether or not these pages are used before being evicted, increasing
> > the memory pressure many-fold.
> >
> > To fix the regression, let's make the new VM_HUGEPAGE handling
> > conditional on the mmap_miss check, but keep it independent of
> > ra->ra_pages. This way the main intention of commit 4687fdbb805a
> > ("mm/filemap: Support VM_HUGEPAGE for file mappings") stays intact,
> > but the regression is resolved.
> >
> > The logic behind this change is simple: even if a user explicitly
> > requests huge pages to back the file mapping (via the VM_HUGEPAGE
> > flag), under very strong memory pressure it's better to fall back
> > to ordinary pages.
> > 
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: linux-mm@kvack.org
> > ---
> >   mm/filemap.c | 40 +++++++++++++++++++++-------------------
> >   1 file changed, 21 insertions(+), 19 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index a52dd38d2b4a..b67d7981fafb 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3235,34 +3235,20 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> >   	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
> >   	struct file *fpin = NULL;
> >   	vm_flags_t vm_flags = vmf->vma->vm_flags;
> > +	bool force_thp_readahead = false;
> >   	unsigned short mmap_miss;
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   	/* Use the readahead code, even if readahead is disabled */
> > -	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> > -		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > -		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> > -		ra->size = HPAGE_PMD_NR;
> > -		/*
> > -		 * Fetch two PMD folios, so we get the chance to actually
> > -		 * readahead, unless we've been told not to.
> > -		 */
> > -		if (!(vm_flags & VM_RAND_READ))
> > -			ra->size *= 2;
> > -		ra->async_size = HPAGE_PMD_NR;
> > -		ra->order = HPAGE_PMD_ORDER;
> > -		page_cache_ra_order(&ractl, ra);
> > -		return fpin;
> > -	}
> > -#endif
> > -
> > +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> > +	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> > +		force_thp_readahead = true;
> >   	/*
> >   	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
> >   	 * already intended for random access.
> >   	 */
> >   	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
> >   		return fpin;
> > -	if (!ra->ra_pages)
> > +	if (!ra->ra_pages && !force_thp_readahead)
> >   		return fpin;
> >   	if (vm_flags & VM_SEQ_READ) {
> > @@ -3283,6 +3269,22 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> >   	if (mmap_miss > MMAP_LOTSAMISS)
> >   		return fpin;
> 
> You have moved the PMD-THP logic below the VM_SEQ_READ check; is that intentional?
> If my understanding is correct, VMAs on which sequential read is expected will now
> use the common readahead algorithm instead of always benefiting from the reduced
> TLB pressure of PMD mappings.

Hum, that's a good point. We should preserve the logic for VM_SEQ_READ
vmas. I missed this during my review. Thanks for catching this.
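
One way a v2 could preserve it (a hypothetical sketch, just to illustrate
the ordering, not code from this thread): take the forced PMD path before
the VM_SEQ_READ branch as well, e.g.:

	/* Hypothetical: keep forced THP readahead unconditional for
	 * VM_SEQ_READ mappings; only the remaining cases go through
	 * the mmap_miss heuristic. */
	if (force_thp_readahead && (vm_flags & VM_SEQ_READ)) {
		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
		ra->size = 2 * HPAGE_PMD_NR;
		ra->async_size = HPAGE_PMD_NR;
		ra->order = HPAGE_PMD_ORDER;
		page_cache_ra_order(&ractl, ra);
		return fpin;
	}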

								Honza

> 
> > +	if (force_thp_readahead) {
> > +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> > +		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> > +		ra->size = HPAGE_PMD_NR;
> > +		/*
> > +		 * Fetch two PMD folios, so we get the chance to actually
> > +		 * readahead, unless we've been told not to.
> > +		 */
> > +		if (!(vm_flags & VM_RAND_READ))
> > +			ra->size *= 2;
> > +		ra->async_size = HPAGE_PMD_NR;
> > +		ra->order = HPAGE_PMD_ORDER;
> > +		page_cache_ra_order(&ractl, ra);
> > +		return fpin;
> > +	}
> > +
> >   	if (vm_flags & VM_EXEC) {
> >   		/*
> >   		 * Allow arch to request a preferred minimum folio order for
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH] mm: readahead: make thp readahead conditional to mmap_miss logic
Posted by Jan Kara 2 months, 2 weeks ago
On Tue 30-09-25 07:48:15, Roman Gushchin wrote:
> Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> introduced special handling for VM_HUGEPAGE mappings: even if readahead
> is disabled, 1 or 2 HPAGE_PMD_ORDER pages are allocated.
>
> This change causes a significant regression for containers with a
> tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> commit, the mmap_miss logic would eventually disable readahead,
> effectively reducing the memory pressure in the cgroup. With this
> change the kernel tries to allocate 1-2 huge pages for each fault,
> whether or not these pages are used before being evicted, increasing
> the memory pressure many-fold.
>
> To fix the regression, let's make the new VM_HUGEPAGE handling
> conditional on the mmap_miss check, but keep it independent of
> ra->ra_pages. This way the main intention of commit 4687fdbb805a
> ("mm/filemap: Support VM_HUGEPAGE for file mappings") stays intact,
> but the regression is resolved.
>
> The logic behind this change is simple: even if a user explicitly
> requests huge pages to back the file mapping (via the VM_HUGEPAGE
> flag), under very strong memory pressure it's better to fall back
> to ordinary pages.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-mm@kvack.org

It would be good to get confirmation from Matthew that this indeed
preserves what he had in mind with commit 4687fdbb805a92, but the change
looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/filemap.c | 40 +++++++++++++++++++++-------------------
>  1 file changed, 21 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a52dd38d2b4a..b67d7981fafb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3235,34 +3235,20 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
>  	struct file *fpin = NULL;
>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
> +	bool force_thp_readahead = false;
>  	unsigned short mmap_miss;
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	/* Use the readahead code, even if readahead is disabled */
> -	if ((vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> -		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> -		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> -		ra->size = HPAGE_PMD_NR;
> -		/*
> -		 * Fetch two PMD folios, so we get the chance to actually
> -		 * readahead, unless we've been told not to.
> -		 */
> -		if (!(vm_flags & VM_RAND_READ))
> -			ra->size *= 2;
> -		ra->async_size = HPAGE_PMD_NR;
> -		ra->order = HPAGE_PMD_ORDER;
> -		page_cache_ra_order(&ractl, ra);
> -		return fpin;
> -	}
> -#endif
> -
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> +	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> +		force_thp_readahead = true;
>  	/*
>  	 * If we don't want any read-ahead, don't bother. VM_EXEC case below is
>  	 * already intended for random access.
>  	 */
>  	if ((vm_flags & (VM_RAND_READ | VM_EXEC)) == VM_RAND_READ)
>  		return fpin;
> -	if (!ra->ra_pages)
> +	if (!ra->ra_pages && !force_thp_readahead)
>  		return fpin;
>  
>  	if (vm_flags & VM_SEQ_READ) {
> @@ -3283,6 +3269,22 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	if (mmap_miss > MMAP_LOTSAMISS)
>  		return fpin;
>  
> +	if (force_thp_readahead) {
> +		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> +		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> +		ra->size = HPAGE_PMD_NR;
> +		/*
> +		 * Fetch two PMD folios, so we get the chance to actually
> +		 * readahead, unless we've been told not to.
> +		 */
> +		if (!(vm_flags & VM_RAND_READ))
> +			ra->size *= 2;
> +		ra->async_size = HPAGE_PMD_NR;
> +		ra->order = HPAGE_PMD_ORDER;
> +		page_cache_ra_order(&ractl, ra);
> +		return fpin;
> +	}
> +
>  	if (vm_flags & VM_EXEC) {
>  		/*
>  		 * Allow arch to request a preferred minimum folio order for
> -- 
> 2.51.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH] mm: readahead: make thp readahead conditional to mmap_miss logic
Posted by Roman Gushchin 2 months, 2 weeks ago
On Wed, Oct 01, 2025 at 01:35:39PM +0200, Jan Kara wrote:
> On Tue 30-09-25 07:48:15, Roman Gushchin wrote:
> > Commit 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings")
> > introduced special handling for VM_HUGEPAGE mappings: even if readahead
> > is disabled, 1 or 2 HPAGE_PMD_ORDER pages are allocated.
> >
> > This change causes a significant regression for containers with a
> > tight memory.max limit when VM_HUGEPAGE is widely used. Prior to this
> > commit, the mmap_miss logic would eventually disable readahead,
> > effectively reducing the memory pressure in the cgroup. With this
> > change the kernel tries to allocate 1-2 huge pages for each fault,
> > whether or not these pages are used before being evicted, increasing
> > the memory pressure many-fold.
> >
> > To fix the regression, let's make the new VM_HUGEPAGE handling
> > conditional on the mmap_miss check, but keep it independent of
> > ra->ra_pages. This way the main intention of commit 4687fdbb805a
> > ("mm/filemap: Support VM_HUGEPAGE for file mappings") stays intact,
> > but the regression is resolved.
> >
> > The logic behind this change is simple: even if a user explicitly
> > requests huge pages to back the file mapping (via the VM_HUGEPAGE
> > flag), under very strong memory pressure it's better to fall back
> > to ordinary pages.
> > 
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: linux-mm@kvack.org
> 
> It would be good to get confirmation from Matthew that this indeed
> preserves what he had in mind with commit 4687fdbb805a92, but the change
> looks good to me.

Hi Jan!

Matthew and I had a chat about this issue last week at the Kernel Recipes
conference and in general agreed on this approach. But of course,
an explicit Ack from him would be appreciated.

Long-term it would be great to use a better metric for memory pressure
here, e.g. PSI. But it's far from trivial.
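
For illustration, PSI already exposes a system-wide memory pressure
signal via /proc/pressure/memory; a minimal userspace sketch of sampling
it (the in-kernel plumbing for gating readahead would of course look
quite different):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/pressure/memory", "r");
		float avg10;

		if (!f)
			return 1;
		/* First line: "some avg10=X.XX avg60=... avg300=... total=..." */
		if (fscanf(f, "some avg10=%f", &avg10) == 1)
			printf("memory 'some' pressure, 10s avg: %.2f%%\n", avg10);
		fclose(f);
		return 0;
	}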

> Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thank you!