mm/readahead.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
When readahead pulls in all the remaining pages for a file, setting the
readahead bit is counter productive. The async readahead it would
trigger would almost certainly be a no-op. Additionally, for mmap'd file
IO, the readahead bit limits the fault around [0], causing an extra
minor fault when the page is accessed.
This was discovered when looking at /sys/kernel/tracing/events/readahead
traces for a simple program. With the patch applied, fewer
page_cache_ra_unbounded calls are observed.
[1] do_fault_around calls filemap_map_pages, which finds eligible pages
by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
with PG_readahead set [3].
Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Frederick Mayle <fmayle@google.com>
---
mm/readahead.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 8c12b63ccd4a..784a3bde82be 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -329,8 +329,11 @@ static void do_page_cache_ra(struct readahead_control *ractl,
if (index > end_index)
return;
/* Don't read past the page containing the last byte of the file */
- if (nr_to_read > end_index - index)
+ if (nr_to_read > end_index - index) {
nr_to_read = end_index - index + 1;
+ /* We've reached the end, so don't set a readahead marker. */
+ lookahead_size = 0;
+ }
filemap_invalidate_lock_shared(mapping);
page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
@@ -474,7 +477,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
pgoff_t index = start;
unsigned int min_order = mapping_min_folio_order(mapping);
pgoff_t limit;
- pgoff_t mark = index + ra->size - ra->async_size;
+ pgoff_t mark;
unsigned int nofs;
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);
@@ -488,7 +491,13 @@ void page_cache_ra_order(struct readahead_control *ractl,
limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
limit = min(limit, ractl->_max_index);
- limit = min(limit, index + ra->size - 1);
+ if (limit > index + ra->size - 1) {
+ limit = index + ra->size - 1;
+ mark = index + ra->size - ra->async_size;
+ } else {
+ /* We've reached the end, so don't set a readahead marker. */
+ mark = ULONG_MAX;
+ }
new_order = min(mapping_max_folio_order(mapping), new_order);
new_order = min_t(unsigned int, new_order, ilog2(ra->size));
base-commit: 2d565cbaafd43b10b75da56e43e5db9852a56afd
--
2.54.0.563.g4f69b47b94-goog
On Fri 08-05-26 11:12:31, Frederick Mayle wrote:
> When readahead pulls in all the remaining pages for a file, setting the
> readahead bit is counter productive. The async readahead it would
> trigger would almost certainly be a no-op. Additionally, for mmap'd file
> IO, the readahead bit limits the fault around [0], causing an extra
> minor fault when the page is accessed.
>
> This was discovered when looking at /sys/kernel/tracing/events/readahead
> traces for a simple program. With the patch applied, fewer
> page_cache_ra_unbounded calls are observed.
>
> [1] do_fault_around calls filemap_map_pages, which finds eligible pages
> by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
> with PG_readahead set [3].
>
> Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
> Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Frederick Mayle <fmayle@google.com>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
The issue Sashiko pointed out is real but preexisting issue which isn't
really related to your changes and your patch doesn't make it worse so I'd
just ignore it.
Honza
> ---
> mm/readahead.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 8c12b63ccd4a..784a3bde82be 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -329,8 +329,11 @@ static void do_page_cache_ra(struct readahead_control *ractl,
> if (index > end_index)
> return;
> /* Don't read past the page containing the last byte of the file */
> - if (nr_to_read > end_index - index)
> + if (nr_to_read > end_index - index) {
> nr_to_read = end_index - index + 1;
> + /* We've reached the end, so don't set a readahead marker. */
> + lookahead_size = 0;
> + }
>
> filemap_invalidate_lock_shared(mapping);
> page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
> @@ -474,7 +477,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> pgoff_t index = start;
> unsigned int min_order = mapping_min_folio_order(mapping);
> pgoff_t limit;
> - pgoff_t mark = index + ra->size - ra->async_size;
> + pgoff_t mark;
> unsigned int nofs;
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> @@ -488,7 +491,13 @@ void page_cache_ra_order(struct readahead_control *ractl,
>
> limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> limit = min(limit, ractl->_max_index);
> - limit = min(limit, index + ra->size - 1);
> + if (limit > index + ra->size - 1) {
> + limit = index + ra->size - 1;
> + mark = index + ra->size - ra->async_size;
> + } else {
> + /* We've reached the end, so don't set a readahead marker. */
> + mark = ULONG_MAX;
> + }
>
> new_order = min(mapping_max_folio_order(mapping), new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>
> base-commit: 2d565cbaafd43b10b75da56e43e5db9852a56afd
> --
> 2.54.0.563.g4f69b47b94-goog
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Fri, 8 May 2026 11:12:31 -0700 Frederick Mayle <fmayle@google.com> wrote: > When readahead pulls in all the remaining pages for a file, setting the > readahead bit is counter productive. The async readahead it would > trigger would almost certainly be a no-op. Additionally, for mmap'd file > IO, the readahead bit limits the fault around [0], There is no "[0]". > causing an extra > minor fault when the page is accessed. > > This was discovered when looking at /sys/kernel/tracing/events/readahead > traces for a simple program. With the patch applied, fewer > page_cache_ra_unbounded calls are observed. > > [1] do_fault_around calls filemap_map_pages, which finds eligible pages > by calling next_uptodate_folio [2]. next_uptodate_folio skips pages > with PG_readahead set [3]. > > Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2] > Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3] > Cc: Kalesh Singh <kaleshsingh@google.com> > Cc: Suren Baghdasaryan <surenb@google.com> > Signed-off-by: Frederick Mayle <fmayle@google.com> AI review found a little race: https://sashiko.dev/#/patchset/20260508181237.670645-1-fmayle@google.com
On Fri, May 8, 2026 at 4:53 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 8 May 2026 11:12:31 -0700 Frederick Mayle <fmayle@google.com> wrote:
>
> > When readahead pulls in all the remaining pages for a file, setting the
> > readahead bit is counter productive. The async readahead it would
> > trigger would almost certainly be a no-op. Additionally, for mmap'd file
> > IO, the readahead bit limits the fault around [0],
>
> There is no "[0]".
Sorry, that should be "[1]".
>
> > causing an extra
> > minor fault when the page is accessed.
> >
> > This was discovered when looking at /sys/kernel/tracing/events/readahead
> > traces for a simple program. With the patch applied, fewer
> > page_cache_ra_unbounded calls are observed.
> >
> > [1] do_fault_around calls filemap_map_pages, which finds eligible pages
> > by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
> > with PG_readahead set [3].
> >
> > Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2]
> > Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3]
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Signed-off-by: Frederick Mayle <fmayle@google.com>
>
> AI review found a little race:
> https://sashiko.dev/#/patchset/20260508181237.670645-1-fmayle@google.com
Tricky. My understanding is that previous code technically has the same issue
sashiko is pointing out. An expression like `min(limit, index + ra->size - 1)`
only mentions the field once but the compiler or hardware can still decide to
perform multiple loads from the field's address, resulting in incoherent
results.
I see from past discussions that the lack of locking is intentional for
file_ra_state fields and also there has been pushback against requiring
READ_ONCE and WRITE_ONCE for accesses, for example in the following recent
thread:
https://lore.kernel.org/r/2xzc3lp6ehtjwbzip4i5muh4g6oep4l72zh3j6sablfghbvbau@kh7famgorzrh/
There is another one of these just below (copied here) that can result in an
underflow if ra->size gets smaller after the condition is checked. Looks to be
benign right now because do_page_cache_ra puts a limit on the param.
if (ra->size > index - start)
do_page_cache_ra(ractl, ra->size - (index - start),
ra->async_size);
I think a fully correct solution requires READ_ONCE, but, in the meantime, I
can make it less wrong by reading ra->size into a variable and reusing that
variable through the function (+ re-reading before the fallback branch).
© 2016 - 2026 Red Hat, Inc.