mm/readahead: no PG_readahead on EOF

[PATCH] mm/readahead: no PG_readahead on EOF

Posted by Frederick Mayle 1 month ago

When readahead pulls in all the remaining pages for a file, setting the
readahead bit is counter productive. The async readahead it would
trigger would almost certainly be a no-op. Additionally, for mmap'd file
IO, the readahead bit limits the fault around [0], causing an extra
minor fault when the page is accessed.

This was discovered when looking at /sys/kernel/tracing/events/readahead
traces for a simple program. With the patch applied, fewer
page_cache_ra_unbounded calls are observed.

[1] do_fault_around calls filemap_map_pages, which finds eligible pages
    by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
    with PG_readahead set [3].

Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Frederick Mayle <fmayle@google.com>
---
 mm/readahead.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 8c12b63ccd4a..784a3bde82be 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -329,8 +329,11 @@ static void do_page_cache_ra(struct readahead_control *ractl,
 	if (index > end_index)
 		return;
 	/* Don't read past the page containing the last byte of the file */
-	if (nr_to_read > end_index - index)
+	if (nr_to_read > end_index - index) {
 		nr_to_read = end_index - index + 1;
+		/* We've reached the end, so don't set a readahead marker. */
+		lookahead_size = 0;
+	}
 
 	filemap_invalidate_lock_shared(mapping);
 	page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
@@ -474,7 +477,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	pgoff_t index = start;
 	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit;
-	pgoff_t mark = index + ra->size - ra->async_size;
+	pgoff_t mark;
 	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
@@ -488,7 +491,13 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	limit = min(limit, ractl->_max_index);
-	limit = min(limit, index + ra->size - 1);
+	if (limit > index + ra->size - 1) {
+		limit = index + ra->size - 1;
+		mark = index + ra->size - ra->async_size;
+	} else {
+		/* We've reached the end, so don't set a readahead marker. */
+		mark = ULONG_MAX;
+	}
 
 	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));

base-commit: 2d565cbaafd43b10b75da56e43e5db9852a56afd
-- 
2.54.0.563.g4f69b47b94-goog

Re: [PATCH] mm/readahead: no PG_readahead on EOF

Posted by Jan Kara 1 month ago

On Fri 08-05-26 11:12:31, Frederick Mayle wrote:
> When readahead pulls in all the remaining pages for a file, setting the
> readahead bit is counter productive. The async readahead it would
> trigger would almost certainly be a no-op. Additionally, for mmap'd file
> IO, the readahead bit limits the fault around [0], causing an extra
> minor fault when the page is accessed.
> 
> This was discovered when looking at /sys/kernel/tracing/events/readahead
> traces for a simple program. With the patch applied, fewer
> page_cache_ra_unbounded calls are observed.
> 
> [1] do_fault_around calls filemap_map_pages, which finds eligible pages
>     by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
>     with PG_readahead set [3].
> 
> Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
> Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Frederick Mayle <fmayle@google.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

The issue Sashiko pointed out is real but preexisting issue which isn't
really related to your changes and your patch doesn't make it worse so I'd
just ignore it.

								Honza

> ---
>  mm/readahead.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 8c12b63ccd4a..784a3bde82be 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -329,8 +329,11 @@ static void do_page_cache_ra(struct readahead_control *ractl,
>  	if (index > end_index)
>  		return;
>  	/* Don't read past the page containing the last byte of the file */
> -	if (nr_to_read > end_index - index)
> +	if (nr_to_read > end_index - index) {
>  		nr_to_read = end_index - index + 1;
> +		/* We've reached the end, so don't set a readahead marker. */
> +		lookahead_size = 0;
> +	}
>  
>  	filemap_invalidate_lock_shared(mapping);
>  	page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
> @@ -474,7 +477,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  	pgoff_t index = start;
>  	unsigned int min_order = mapping_min_folio_order(mapping);
>  	pgoff_t limit;
> -	pgoff_t mark = index + ra->size - ra->async_size;
> +	pgoff_t mark;
>  	unsigned int nofs;
>  	int err = 0;
>  	gfp_t gfp = readahead_gfp_mask(mapping);
> @@ -488,7 +491,13 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  
>  	limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
>  	limit = min(limit, ractl->_max_index);
> -	limit = min(limit, index + ra->size - 1);
> +	if (limit > index + ra->size - 1) {
> +		limit = index + ra->size - 1;
> +		mark = index + ra->size - ra->async_size;
> +	} else {
> +		/* We've reached the end, so don't set a readahead marker. */
> +		mark = ULONG_MAX;
> +	}
>  
>  	new_order = min(mapping_max_folio_order(mapping), new_order);
>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> 
> base-commit: 2d565cbaafd43b10b75da56e43e5db9852a56afd
> -- 
> 2.54.0.563.g4f69b47b94-goog
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH] mm/readahead: no PG_readahead on EOF

Posted by Andrew Morton 1 month ago

On Fri,  8 May 2026 11:12:31 -0700 Frederick Mayle <fmayle@google.com> wrote:

> When readahead pulls in all the remaining pages for a file, setting the
> readahead bit is counter productive. The async readahead it would
> trigger would almost certainly be a no-op. Additionally, for mmap'd file
> IO, the readahead bit limits the fault around [0],

There is no "[0]".

> causing an extra
> minor fault when the page is accessed.
> 
> This was discovered when looking at /sys/kernel/tracing/events/readahead
> traces for a simple program. With the patch applied, fewer
> page_cache_ra_unbounded calls are observed.
> 
> [1] do_fault_around calls filemap_map_pages, which finds eligible pages
>     by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
>     with PG_readahead set [3].
> 
> Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
> Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Frederick Mayle <fmayle@google.com>

AI review found a little race:
	https://sashiko.dev/#/patchset/20260508181237.670645-1-fmayle@google.com

Re: [PATCH] mm/readahead: no PG_readahead on EOF

Posted by Frederick Mayle 1 month ago

On Fri, May 8, 2026 at 4:53 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  8 May 2026 11:12:31 -0700 Frederick Mayle <fmayle@google.com> wrote:
>
> > When readahead pulls in all the remaining pages for a file, setting the
> > readahead bit is counter productive. The async readahead it would
> > trigger would almost certainly be a no-op. Additionally, for mmap'd file
> > IO, the readahead bit limits the fault around [0],
>
> There is no "[0]".

Sorry, that should be "[1]".

>
> > causing an extra
> > minor fault when the page is accessed.
> >
> > This was discovered when looking at /sys/kernel/tracing/events/readahead
> > traces for a simple program. With the patch applied, fewer
> > page_cache_ra_unbounded calls are observed.
> >
> > [1] do_fault_around calls filemap_map_pages, which finds eligible pages
> >     by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
> >     with PG_readahead set [3].
> >
> > Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
> > Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Frederick Mayle <fmayle@google.com>
>
> AI review found a little race:
>         https://sashiko.dev/#/patchset/20260508181237.670645-1-fmayle@google.com

Tricky. My understanding is that previous code technically has the same issue
sashiko is pointing out. An expression like `min(limit, index + ra->size - 1)`
only mentions the field once but the compiler or hardware can still decide to
perform multiple loads from the field's address, resulting in incoherent
results.

I see from past discussions that the lack of locking is intentional for
file_ra_state fields and also there has been pushback against requiring
READ_ONCE and WRITE_ONCE for accesses, for example in the following recent
thread:

    https://lore.kernel.org/r/2xzc3lp6ehtjwbzip4i5muh4g6oep4l72zh3j6sablfghbvbau@kh7famgorzrh/

There is another one of these just below (copied here) that can result in an
underflow if ra->size gets smaller after the condition is checked. Looks to be
benign right now because do_page_cache_ra puts a limit on the param.

        if (ra->size > index - start)
                do_page_cache_ra(ractl, ra->size - (index - start),
                                 ra->async_size);

I think a fully correct solution requires READ_ONCE, but, in the meantime, I
can make it less wrong by reading ra->size into a variable and reusing that
variable through the function (+ re-reading before the fallback branch).