[RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()

Ryan Roberts posted 5 patches 7 months, 3 weeks ago
There is a newer version of this series
[RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Ryan Roberts 7 months, 3 weeks ago
page_cache_ra_order() takes a parameter called new_order, which is
intended to express the preferred order of the folios that will be
allocated for the readahead operation. Most callers indeed call this
with their preferred new order. But page_cache_async_ra() calls it with
the preferred order of the previous readahead request (actually the
order of the folio that had the readahead marker, which may be smaller
when alignment comes into play).

And despite the parameter name, page_cache_ra_order() always treats it
at the old order, adding 2 to it on entry. As a result, a cold readahead
always starts with order-2 folios.

Let's fix this behaviour by always passing in the *new* order.

Worked example:

Prior to the change, mmaping an 8MB file and touching each page
sequentially, resulted in the following, where we start with order-2
folios for the first 128K then ramp up to order-4 for the next 128K,
then get clamped to order-5 for the rest of the file because pa_pages is
limited to 128K:

TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
-----  ----------  ----------  ---------  -------  -------  -----  -----
FOLIO  0x00000000  0x00004000      16384        0        4      4      2
FOLIO  0x00004000  0x00008000      16384        4        8      4      2
FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
FOLIO  0x00010000  0x00014000      16384       16       20      4      2
FOLIO  0x00014000  0x00018000      16384       20       24      4      2
FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
FOLIO  0x00020000  0x00030000      65536       32       48     16      4
FOLIO  0x00030000  0x00040000      65536       48       64     16      4
FOLIO  0x00040000  0x00060000     131072       64       96     32      5
FOLIO  0x00060000  0x00080000     131072       96      128     32      5
FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
...

After the change, the same operation results in the first 128K being
order-0, then we start ramping up to order-2, -4, and finally get
clamped at order-5:

TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
-----  ----------  ----------  ---------  -------  -------  -----  -----
FOLIO  0x00000000  0x00001000       4096        0        1      1      0
FOLIO  0x00001000  0x00002000       4096        1        2      1      0
FOLIO  0x00002000  0x00003000       4096        2        3      1      0
FOLIO  0x00003000  0x00004000       4096        3        4      1      0
FOLIO  0x00004000  0x00005000       4096        4        5      1      0
FOLIO  0x00005000  0x00006000       4096        5        6      1      0
FOLIO  0x00006000  0x00007000       4096        6        7      1      0
FOLIO  0x00007000  0x00008000       4096        7        8      1      0
FOLIO  0x00008000  0x00009000       4096        8        9      1      0
FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
FOLIO  0x00010000  0x00011000       4096       16       17      1      0
FOLIO  0x00011000  0x00012000       4096       17       18      1      0
FOLIO  0x00012000  0x00013000       4096       18       19      1      0
FOLIO  0x00013000  0x00014000       4096       19       20      1      0
FOLIO  0x00014000  0x00015000       4096       20       21      1      0
FOLIO  0x00015000  0x00016000       4096       21       22      1      0
FOLIO  0x00016000  0x00017000       4096       22       23      1      0
FOLIO  0x00017000  0x00018000       4096       23       24      1      0
FOLIO  0x00018000  0x00019000       4096       24       25      1      0
FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
FOLIO  0x00020000  0x00024000      16384       32       36      4      2
FOLIO  0x00024000  0x00028000      16384       36       40      4      2
FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
FOLIO  0x00030000  0x00034000      16384       48       52      4      2
FOLIO  0x00034000  0x00038000      16384       52       56      4      2
FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
FOLIO  0x00040000  0x00050000      65536       64       80     16      4
FOLIO  0x00050000  0x00060000      65536       80       96     16      4
FOLIO  0x00060000  0x00080000     131072       96      128     32      5
FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
...

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/readahead.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 6a4e96b69702..8bb316f5a842 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < mapping_max_folio_order(mapping))
-		new_order += 2;
-
 	new_order = min(mapping_max_folio_order(mapping), new_order);
 	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 	new_order = max(new_order, min_order);
@@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	ra->size = get_next_ra_size(ra, max_pages);
 	ra->async_size = ra->size;
 readit:
+	order += 2;
 	ractl->_index = ra->start;
 	page_cache_ra_order(ractl, ra, order);
 }
-- 
2.43.0
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Chaitanya S Prakash 7 months, 1 week ago
On 30/04/25 20:29, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
>
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
>
> Let's fix this behaviour by always passing in the *new* order.
>
> Worked example:
>
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
>
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> ...
>
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
>
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
> ...
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

This looks good to me.

Tested-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>

> ---
>   mm/readahead.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>   
>   	limit = min(limit, index + ra->size - 1);
>   
> -	if (new_order < mapping_max_folio_order(mapping))
> -		new_order += 2;
> -
>   	new_order = min(mapping_max_folio_order(mapping), new_order);
>   	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>   	new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>   	ra->size = get_next_ra_size(ra, max_pages);
>   	ra->async_size = ra->size;
>   readit:
> +	order += 2;
>   	ractl->_index = ra->start;
>   	page_cache_ra_order(ractl, ra, order);
>   }
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Pankaj Raghav (Samsung) 7 months, 2 weeks ago
Hey Ryan,

On Wed, Apr 30, 2025 at 03:59:14PM +0100, Ryan Roberts wrote:
> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> ...
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/readahead.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  

So we always had a fallback to do_page_cache_ra() if the size of the
readahead is less than 4 pages (16k). I think this was there because we
were adding `2` to the new_order:

unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));

/*
 * Fallback when size < min_nrpages as each folio should be
 * at least min_nrpages anyway.
 */
if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
	goto fallback;

>  	limit = min(limit, index + ra->size - 1);
>  
> -	if (new_order < mapping_max_folio_order(mapping))
> -		new_order += 2;

Now that you have moved this, we could make the lhs of the max to be 2
(8k) instead of 4(16k).

- unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
+ unsigned int min_ra_size = max(2, mapping_min_folio_nrpages(mapping));

I think if we do that, we might ramp up to 8k sooner rather than jumping
from 4k to 16k directly?

> -
>  	new_order = min(mapping_max_folio_order(mapping), new_order);
>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  	new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>  	ra->size = get_next_ra_size(ra, max_pages);
>  	ra->async_size = ra->size;
>  readit:
> +	order += 2;
>  	ractl->_index = ra->start;
>  	page_cache_ra_order(ractl, ra, order);
>  }
> -- 
> 2.43.0
> 

--
Pankaj
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Ryan Roberts 7 months, 2 weeks ago
Hi Pankaj,

Thanks for the review! ...


On 08/05/2025 13:55, Pankaj Raghav (Samsung) wrote:
> Hey Ryan,
> 
> On Wed, Apr 30, 2025 at 03:59:14PM +0100, Ryan Roberts wrote:
>> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
>> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
>> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
>> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
>> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
>> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
>> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
>> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
>> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
>> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
>> ...
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/readahead.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index 6a4e96b69702..8bb316f5a842 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>>  
> 
> So we always had a fallback to do_page_cache_ra() if the size of the
> readahead is less than 4 pages (16k). I think this was there because we
> were adding `2` to the new_order:

If this is the reason for the magic number 4, then it's a bug in itself IMHO. 4
pages is only 16K when the page size is 4K; arm64 supports other page sizes. But
additionally, it's not just ra->size that dictates the final order of the folio;
it also depends on alignment in the file, EOF, etc.

If we remove the fallback condition completely, things will still work out. So
unless someone can explain the reason for that condition (Matthew?), my vote
would be to remove it entirely.

> 
> unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> 
> /*
>  * Fallback when size < min_nrpages as each folio should be
>  * at least min_nrpages anyway.
>  */
> if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> 	goto fallback;
> 
>>  	limit = min(limit, index + ra->size - 1);
>>  
>> -	if (new_order < mapping_max_folio_order(mapping))
>> -		new_order += 2;
> 
> Now that you have moved this, we could make the lhs of the max to be 2
> (8k) instead of 4(16k).

I don't really understand why magic number 2 would now be correct?

> 
> - unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> + unsigned int min_ra_size = max(2, mapping_min_folio_nrpages(mapping));
> 
> I think if we do that, we might ramp up to 8k sooner rather than jumping
> from 4k to 16k directly?

In practice I don't think so; This would only give us order-1 where we didn't
have it before if new_order >= 1 and ra->size is 3 or 4 pages.

But as I said, my vote would be to remove this fallback condition entirely. What
do you think?

Thanks,
Ryan

> 
>> -
>>  	new_order = min(mapping_max_folio_order(mapping), new_order);
>>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>>  	new_order = max(new_order, min_order);
>> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>  	ra->size = get_next_ra_size(ra, max_pages);
>>  	ra->async_size = ra->size;
>>  readit:
>> +	order += 2;
>>  	ractl->_index = ra->start;
>>  	page_cache_ra_order(ractl, ra, order);
>>  }
>> -- 
>> 2.43.0
>>
> 
> --
> Pankaj
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Pankaj Raghav (Samsung) 7 months, 1 week ago
> >>  
> > 
> > So we always had a fallback to do_page_cache_ra() if the size of the
> > readahead is less than 4 pages (16k). I think this was there because we
> > were adding `2` to the new_order:
> 
> If this is the reason for the magic number 4, then it's a bug in itself IMHO. 4
> pages is only 16K when the page size is 4K; arm64 supports other page sizes. But
> additionally, it's not just ra->size that dictates the final order of the folio;
> it also depends on alignment in the file, EOF, etc.
> 

IIRC, initially we were not able to use order-1 folios[1], so we always
did a fallback for any order < 2 using do_page_cache_ra(). I think that
is where the magic order 2 (4 pages) is coming. Please someone can
correct me if I am wrong.

But we don't have that limitation for file-backed folios anymore, so the
fallback for ra->size < 4 is probably not needed. So the only time we do
a fallback is if we don't support large folios.

> If we remove the fallback condition completely, things will still work out. So
> unless someone can explain the reason for that condition (Matthew?), my vote
> would be to remove it entirely.

I am actually fine with removing the first part of this fallback condition.
But as I said, we still need to do a fallback if we don't support large folios.

--
Pankaj

[1] https://lore.kernel.org/all/ZH0GvxAdw1RO2Shr@casper.infradead.org/
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Ryan Roberts 7 months, 1 week ago
On 09/05/2025 21:50, Pankaj Raghav (Samsung) wrote:
>>>>  
>>>
>>> So we always had a fallback to do_page_cache_ra() if the size of the
>>> readahead is less than 4 pages (16k). I think this was there because we
>>> were adding `2` to the new_order:
>>
>> If this is the reason for the magic number 4, then it's a bug in itself IMHO. 4
>> pages is only 16K when the page size is 4K; arm64 supports other page sizes. But
>> additionally, it's not just ra->size that dictates the final order of the folio;
>> it also depends on alignment in the file, EOF, etc.
>>
> 
> IIRC, initially we were not able to use order-1 folios[1], so we always
> did a fallback for any order < 2 using do_page_cache_ra(). I think that
> is where the magic order 2 (4 pages) is coming. Please someone can
> correct me if I am wrong.

Ahh, I see. That might have been where it came from, but IMHO, it still didn't
really belong there; just because the size is bigger than 4 pages, it doesn't
mean you would never want to use order-1 folios - there are alignment
considerations that can cause that. The logic in page_cache_ra_order() used to
know to avoid order-1.

> 
> But we don't have that limitation for file-backed folios anymore, so the
> fallback for ra->size < 4 is probably not needed. So the only time we do
> a fallback is if we don't support large folios.
> 
>> If we remove the fallback condition completely, things will still work out. So
>> unless someone can explain the reason for that condition (Matthew?), my vote
>> would be to remove it entirely.
> 
> I am actually fine with removing the first part of this fallback condition.
> But as I said, we still need to do a fallback if we don't support large folios.

Yep agreed. I'll make this change in the next version.

> 
> --
> Pankaj
> 
> [1] https://lore.kernel.org/all/ZH0GvxAdw1RO2Shr@casper.infradead.org/
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Anshuman Khandual 7 months, 2 weeks ago

On 4/30/25 20:29, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
> 
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
> 
> Let's fix this behaviour by always passing in the *new* order.

Makes sense.

> 
> Worked example:
> 
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
> 
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> ...
> 
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
> 
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5

I guess performance wise this will be worse than earlier ? Although it
does fix the semantics for page_cache_ra_order() with respect to the
parameter 'new_order'.

> ...
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/readahead.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  
>  	limit = min(limit, index + ra->size - 1);
>  
> -	if (new_order < mapping_max_folio_order(mapping))
> -		new_order += 2;
> -
>  	new_order = min(mapping_max_folio_order(mapping), new_order);
>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  	new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>  	ra->size = get_next_ra_size(ra, max_pages);
>  	ra->async_size = ra->size;
>  readit:

Should not the earlier conditional check also be brought here before
incrementing the order ? Just curious.

if (new_order < mapping_max_folio_order(mapping))

> +	order += 2;
>  	ractl->_index = ra->start;
>  	page_cache_ra_order(ractl, ra, order);
>  }
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Ryan Roberts 7 months, 2 weeks ago
On 05/05/2025 11:09, Anshuman Khandual wrote:
> 
> 
> On 4/30/25 20:29, Ryan Roberts wrote:
>> page_cache_ra_order() takes a parameter called new_order, which is
>> intended to express the preferred order of the folios that will be
>> allocated for the readahead operation. Most callers indeed call this
>> with their preferred new order. But page_cache_async_ra() calls it with
>> the preferred order of the previous readahead request (actually the
>> order of the folio that had the readahead marker, which may be smaller
>> when alignment comes into play).
>>
>> And despite the parameter name, page_cache_ra_order() always treats it
>> at the old order, adding 2 to it on entry. As a result, a cold readahead
>> always starts with order-2 folios.
>>
>> Let's fix this behaviour by always passing in the *new* order.
> 
> Makes sense.
> 
>>
>> Worked example:
>>
>> Prior to the change, mmaping an 8MB file and touching each page
>> sequentially, resulted in the following, where we start with order-2
>> folios for the first 128K then ramp up to order-4 for the next 128K,
>> then get clamped to order-5 for the rest of the file because pa_pages is
>> limited to 128K:
>>
>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
>> -----  ----------  ----------  ---------  -------  -------  -----  -----
>> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
>> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
>> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
>> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
>> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
>> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
>> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
>> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
>> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
>> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
>> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
>> ...
>>
>> After the change, the same operation results in the first 128K being
>> order-0, then we start ramping up to order-2, -4, and finally get
>> clamped at order-5:
>>
>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
>> -----  ----------  ----------  ---------  -------  -------  -----  -----
>> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
>> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
>> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
>> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
>> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
>> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
>> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
>> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
>> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
>> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
>> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
>> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
>> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
>> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
>> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
>> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
>> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
>> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
>> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
>> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
>> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
>> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
>> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
>> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
>> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
>> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
>> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
>> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
>> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
>> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
>> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
>> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
>> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
>> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
>> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
>> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
>> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
>> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
>> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
>> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
>> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
>> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
>> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
> 
> I guess performance wise this will be worse than earlier ? 

Maybe, maybe not. If higer order always gave better performance then we would
always surely use the highest order? Order-0 is a bit easier to allocate than
order-2. So if the file actually isn't being accessed sequentially, allocating
order-0 for the cold cache case might actually be better over all?

> Although it
> does fix the semantics for page_cache_ra_order() with respect to the
> parameter 'new_order'.

Yes that's the piece I was keen to sort out; Once you get to patch 5 it's
important that new_order really does mean new_order otherwise we would end up
allocating higher order than the arch intedended.

If we think we really *should* be starting at order-2 instead of order-0, we
should pass 2 as new_order instead of 0.

> 
>> ...
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/readahead.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index 6a4e96b69702..8bb316f5a842 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>>  
>>  	limit = min(limit, index + ra->size - 1);
>>  
>> -	if (new_order < mapping_max_folio_order(mapping))
>> -		new_order += 2;
>> -
>>  	new_order = min(mapping_max_folio_order(mapping), new_order);
>>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>>  	new_order = max(new_order, min_order);
>> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>  	ra->size = get_next_ra_size(ra, max_pages);
>>  	ra->async_size = ra->size;
>>  readit:
> 
> Should not the earlier conditional check also be brought here before
> incrementing the order ? Just curious.
> 
> if (new_order < mapping_max_folio_order(mapping))

No that's not needed. page_cache_ra_order() will clamp new_order appropriately.
The conditional that I removed was unneeded becaude the following lines are
clamping the new value explicitly anyway.

Thanks,
Ryan

> 
>> +	order += 2;
>>  	ractl->_index = ra->start;
>>  	page_cache_ra_order(ractl, ra, order);
>>  }
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by David Hildenbrand 7 months, 2 weeks ago
On 30.04.25 16:59, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
> 
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
> 
> Let's fix this behaviour by always passing in the *new* order.
> 
> Worked example:
> 
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
> 
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5

Interesting, I would have thought we'd ramp up earlier.

> ...
> 
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
> 
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5

Similar here, do you know why we don't ramp up earlier. Allocating that 
many order-0 + order-2 pages looks a bit suboptimal to me for a 
sequential read.

I wonder if you're change will have a measurable downside on sequential 
read. Anyhow, I think it was already not behaving how I would have 
expected it ... :)

Acked-by: David Hildenbrand <david@redhat.com>

> ...
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   mm/readahead.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>   
>   	limit = min(limit, index + ra->size - 1);
>   
> -	if (new_order < mapping_max_folio_order(mapping))
> -		new_order += 2;
> -
>   	new_order = min(mapping_max_folio_order(mapping), new_order);
>   	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>   	new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>   	ra->size = get_next_ra_size(ra, max_pages);
>   	ra->async_size = ra->size;
>   readit:
> +	order += 2;
>   	ractl->_index = ra->start;
>   	page_cache_ra_order(ractl, ra, order);
>   }


-- 
Cheers,

David / dhildenb
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Jan Kara 7 months, 2 weeks ago
On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
> On 30.04.25 16:59, Ryan Roberts wrote:
> > page_cache_ra_order() takes a parameter called new_order, which is
> > intended to express the preferred order of the folios that will be
> > allocated for the readahead operation. Most callers indeed call this
> > with their preferred new order. But page_cache_async_ra() calls it with
> > the preferred order of the previous readahead request (actually the
> > order of the folio that had the readahead marker, which may be smaller
> > when alignment comes into play).
> > 
> > And despite the parameter name, page_cache_ra_order() always treats it
> > at the old order, adding 2 to it on entry. As a result, a cold readahead
> > always starts with order-2 folios.
> > 
> > Let's fix this behaviour by always passing in the *new* order.
> > 
> > Worked example:
> > 
> > Prior to the change, mmaping an 8MB file and touching each page
> > sequentially, resulted in the following, where we start with order-2
> > folios for the first 128K then ramp up to order-4 for the next 128K,
> > then get clamped to order-5 for the rest of the file because pa_pages is
> > limited to 128K:
> > 
> > TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> > -----  ----------  ----------  ---------  -------  -------  -----  -----
> > FOLIO  0x00000000  0x00004000      16384        0        4      4      2
> > FOLIO  0x00004000  0x00008000      16384        4        8      4      2
> > FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
> > FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
> > FOLIO  0x00010000  0x00014000      16384       16       20      4      2
> > FOLIO  0x00014000  0x00018000      16384       20       24      4      2
> > FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
> > FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
> > FOLIO  0x00020000  0x00030000      65536       32       48     16      4
> > FOLIO  0x00030000  0x00040000      65536       48       64     16      4
> > FOLIO  0x00040000  0x00060000     131072       64       96     32      5
> > FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> > FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> > FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> 
> Interesting, I would have thought we'd ramp up earlier.
> 
> > ...
> > 
> > After the change, the same operation results in the first 128K being
> > order-0, then we start ramping up to order-2, -4, and finally get
> > clamped at order-5:
> > 
> > TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> > -----  ----------  ----------  ---------  -------  -------  -----  -----
> > FOLIO  0x00000000  0x00001000       4096        0        1      1      0
> > FOLIO  0x00001000  0x00002000       4096        1        2      1      0
> > FOLIO  0x00002000  0x00003000       4096        2        3      1      0
> > FOLIO  0x00003000  0x00004000       4096        3        4      1      0
> > FOLIO  0x00004000  0x00005000       4096        4        5      1      0
> > FOLIO  0x00005000  0x00006000       4096        5        6      1      0
> > FOLIO  0x00006000  0x00007000       4096        6        7      1      0
> > FOLIO  0x00007000  0x00008000       4096        7        8      1      0
> > FOLIO  0x00008000  0x00009000       4096        8        9      1      0
> > FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
> > FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
> > FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
> > FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
> > FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
> > FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
> > FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
> > FOLIO  0x00010000  0x00011000       4096       16       17      1      0
> > FOLIO  0x00011000  0x00012000       4096       17       18      1      0
> > FOLIO  0x00012000  0x00013000       4096       18       19      1      0
> > FOLIO  0x00013000  0x00014000       4096       19       20      1      0
> > FOLIO  0x00014000  0x00015000       4096       20       21      1      0
> > FOLIO  0x00015000  0x00016000       4096       21       22      1      0
> > FOLIO  0x00016000  0x00017000       4096       22       23      1      0
> > FOLIO  0x00017000  0x00018000       4096       23       24      1      0
> > FOLIO  0x00018000  0x00019000       4096       24       25      1      0
> > FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
> > FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> > FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> > FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> > FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> > FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> > FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> > FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> > FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> > FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> > FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> > FOLIO  0x00030000  0x00034000      16384       48       52      4      2
> > FOLIO  0x00034000  0x00038000      16384       52       56      4      2
> > FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
> > FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
> > FOLIO  0x00040000  0x00050000      65536       64       80     16      4
> > FOLIO  0x00050000  0x00060000      65536       80       96     16      4
> > FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> > FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> > FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> > FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
> 
> Similar here, do you know why we don't ramp up earlier. Allocating that many
> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.

Note that this is reading through mmap using the mmap readahead code. If
you use standard read(2), the readahead window starts small as well and
ramps us along with the desired order so we don't allocate that many small
order pages in that case.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by David Hildenbrand 7 months, 2 weeks ago
On 05.05.25 12:09, Jan Kara wrote:
> On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
>> On 30.04.25 16:59, Ryan Roberts wrote:
>>> page_cache_ra_order() takes a parameter called new_order, which is
>>> intended to express the preferred order of the folios that will be
>>> allocated for the readahead operation. Most callers indeed call this
>>> with their preferred new order. But page_cache_async_ra() calls it with
>>> the preferred order of the previous readahead request (actually the
>>> order of the folio that had the readahead marker, which may be smaller
>>> when alignment comes into play).
>>>
>>> And despite the parameter name, page_cache_ra_order() always treats it
>>> at the old order, adding 2 to it on entry. As a result, a cold readahead
>>> always starts with order-2 folios.
>>>
>>> Let's fix this behaviour by always passing in the *new* order.
>>>
>>> Worked example:
>>>
>>> Prior to the change, mmaping an 8MB file and touching each page
>>> sequentially, resulted in the following, where we start with order-2
>>> folios for the first 128K then ramp up to order-4 for the next 128K,
>>> then get clamped to order-5 for the rest of the file because pa_pages is
>>> limited to 128K:
>>>
>>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
>>> -----  ----------  ----------  ---------  -------  -------  -----  -----
>>> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
>>> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
>>> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
>>> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
>>> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
>>> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
>>> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
>>> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
>>> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
>>> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
>>> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
>>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
>>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
>>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
>>
>> Interesting, I would have thought we'd ramp up earlier.
>>
>>> ...
>>>
>>> After the change, the same operation results in the first 128K being
>>> order-0, then we start ramping up to order-2, -4, and finally get
>>> clamped at order-5:
>>>
>>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
>>> -----  ----------  ----------  ---------  -------  -------  -----  -----
>>> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
>>> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
>>> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
>>> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
>>> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
>>> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
>>> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
>>> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
>>> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
>>> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
>>> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
>>> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
>>> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
>>> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
>>> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
>>> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
>>> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
>>> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
>>> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
>>> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
>>> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
>>> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
>>> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
>>> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
>>> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
>>> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
>>> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
>>> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
>>> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
>>> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
>>> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
>>> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
>>> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
>>> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
>>> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
>>> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
>>> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
>>> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
>>> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
>>> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
>>> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
>>> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
>>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
>>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
>>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
>>> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
>>
>> Similar here, do you know why we don't ramp up earlier. Allocating that many
>> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
> 
> Note that this is reading through mmap using the mmap readahead code. If
> you use standard read(2), the readahead window starts small as well and
> ramps us along with the desired order so we don't allocate that many small
> order pages in that case.

Ah, thanks for that information! :)

-- 
Cheers,

David / dhildenb
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Ryan Roberts 7 months, 2 weeks ago
On 05/05/2025 11:25, David Hildenbrand wrote:
> On 05.05.25 12:09, Jan Kara wrote:
>> On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
>>> On 30.04.25 16:59, Ryan Roberts wrote:
>>>> page_cache_ra_order() takes a parameter called new_order, which is
>>>> intended to express the preferred order of the folios that will be
>>>> allocated for the readahead operation. Most callers indeed call this
>>>> with their preferred new order. But page_cache_async_ra() calls it with
>>>> the preferred order of the previous readahead request (actually the
>>>> order of the folio that had the readahead marker, which may be smaller
>>>> when alignment comes into play).
>>>>
>>>> And despite the parameter name, page_cache_ra_order() always treats it
>>>> at the old order, adding 2 to it on entry. As a result, a cold readahead
>>>> always starts with order-2 folios.
>>>>
>>>> Let's fix this behaviour by always passing in the *new* order.
>>>>
>>>> Worked example:
>>>>
>>>> Prior to the change, mmaping an 8MB file and touching each page
>>>> sequentially, resulted in the following, where we start with order-2
>>>> folios for the first 128K then ramp up to order-4 for the next 128K,
>>>> then get clamped to order-5 for the rest of the file because pa_pages is
>>>> limited to 128K:
>>>>
>>>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
>>>> -----  ----------  ----------  ---------  -------  -------  -----  -----
>>>> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
>>>> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
>>>> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
>>>> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
>>>> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
>>>> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
>>>> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
>>>> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
>>>> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
>>>> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
>>>> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
>>>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
>>>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
>>>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
>>>
>>> Interesting, I would have thought we'd ramp up earlier.
>>>
>>>> ...
>>>>
>>>> After the change, the same operation results in the first 128K being
>>>> order-0, then we start ramping up to order-2, -4, and finally get
>>>> clamped at order-5:
>>>>
>>>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
>>>> -----  ----------  ----------  ---------  -------  -------  -----  -----
>>>> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
>>>> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
>>>> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
>>>> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
>>>> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
>>>> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
>>>> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
>>>> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
>>>> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
>>>> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
>>>> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
>>>> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
>>>> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
>>>> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
>>>> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
>>>> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
>>>> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
>>>> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
>>>> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
>>>> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
>>>> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
>>>> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
>>>> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
>>>> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
>>>> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
>>>> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
>>>> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
>>>> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
>>>> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
>>>> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
>>>> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
>>>> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
>>>> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
>>>> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
>>>> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
>>>> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
>>>> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
>>>> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
>>>> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
>>>> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
>>>> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
>>>> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
>>>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
>>>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
>>>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
>>>> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
>>>
>>> Similar here, do you know why we don't ramp up earlier. Allocating that many
>>> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
>>
>> Note that this is reading through mmap using the mmap readahead code. If
>> you use standard read(2), the readahead window starts small as well and
>> ramps us along with the desired order so we don't allocate that many small
>> order pages in that case.

That does raise an interesting question though; why do we use a fixed size
window for mmap? It feels like we could start with a smaller window and ramp it
up as order ramps up too, capped to the end of the vma.

Although perhaps that is an investigation for another day... My main motivation
here was to be consistent about what page_cache_ra_order()'s new_order means,
and to actually implement algorithm that was originally intended - start from 0
and ramp up +2 on each readahead marker.

> 
> Ah, thanks for that information! :)
> 

Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Jan Kara 7 months, 2 weeks ago
On Mon 05-05-25 13:51:48, Ryan Roberts wrote:
> On 05/05/2025 11:25, David Hildenbrand wrote:
> > On 05.05.25 12:09, Jan Kara wrote:
> >> On Mon 05-05-25 11:51:43, David Hildenbrand wrote:
> >>> On 30.04.25 16:59, Ryan Roberts wrote:
> >>>> page_cache_ra_order() takes a parameter called new_order, which is
> >>>> intended to express the preferred order of the folios that will be
> >>>> allocated for the readahead operation. Most callers indeed call this
> >>>> with their preferred new order. But page_cache_async_ra() calls it with
> >>>> the preferred order of the previous readahead request (actually the
> >>>> order of the folio that had the readahead marker, which may be smaller
> >>>> when alignment comes into play).
> >>>>
> >>>> And despite the parameter name, page_cache_ra_order() always treats it
> >>>> at the old order, adding 2 to it on entry. As a result, a cold readahead
> >>>> always starts with order-2 folios.
> >>>>
> >>>> Let's fix this behaviour by always passing in the *new* order.
> >>>>
> >>>> Worked example:
> >>>>
> >>>> Prior to the change, mmaping an 8MB file and touching each page
> >>>> sequentially, resulted in the following, where we start with order-2
> >>>> folios for the first 128K then ramp up to order-4 for the next 128K,
> >>>> then get clamped to order-5 for the rest of the file because pa_pages is
> >>>> limited to 128K:
> >>>>
> >>>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> >>>> -----  ----------  ----------  ---------  -------  -------  -----  -----
> >>>> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
> >>>> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
> >>>> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
> >>>> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
> >>>> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
> >>>> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
> >>>> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
> >>>> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
> >>>> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
> >>>> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
> >>>> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
> >>>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> >>>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> >>>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> >>>
> >>> Interesting, I would have thought we'd ramp up earlier.
> >>>
> >>>> ...
> >>>>
> >>>> After the change, the same operation results in the first 128K being
> >>>> order-0, then we start ramping up to order-2, -4, and finally get
> >>>> clamped at order-5:
> >>>>
> >>>> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> >>>> -----  ----------  ----------  ---------  -------  -------  -----  -----
> >>>> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
> >>>> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
> >>>> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
> >>>> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
> >>>> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
> >>>> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
> >>>> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
> >>>> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
> >>>> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
> >>>> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
> >>>> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
> >>>> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
> >>>> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
> >>>> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
> >>>> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
> >>>> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
> >>>> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
> >>>> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
> >>>> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
> >>>> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
> >>>> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
> >>>> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
> >>>> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
> >>>> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
> >>>> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
> >>>> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
> >>>> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> >>>> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> >>>> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> >>>> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> >>>> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> >>>> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> >>>> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> >>>> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> >>>> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> >>>> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> >>>> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
> >>>> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
> >>>> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
> >>>> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
> >>>> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
> >>>> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
> >>>> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> >>>> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> >>>> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> >>>> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
> >>>
> >>> Similar here, do you know why we don't ramp up earlier. Allocating that many
> >>> order-0 + order-2 pages looks a bit suboptimal to me for a sequential read.
> >>
> >> Note that this is reading through mmap using the mmap readahead code. If
> >> you use standard read(2), the readahead window starts small as well and
> >> ramps us along with the desired order so we don't allocate that many small
> >> order pages in that case.
> 
> That does raise an interesting question though; why do we use a fixed size
> window for mmap? It feels like we could start with a smaller window and ramp it
> up as order ramps up too, capped to the end of the vma.
> 
> Although perhaps that is an investigation for another day... My main motivation
> here was to be consistent about what page_cache_ra_order()'s new_order means,
> and to actually implement algorithm that was originally intended - start from 0
> and ramp up +2 on each readahead marker.

Well, in my opinion the whole mmap readahead logic would deserve some
remodelling :) because a lot of decisions there are quite disputable for
contemporary systems. But that's definitely for some other patchset...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [RFC PATCH v4 1/5] mm/readahead: Honour new_order in page_cache_ra_order()
Posted by Jan Kara 7 months, 2 weeks ago
On Wed 30-04-25 15:59:14, Ryan Roberts wrote:
> page_cache_ra_order() takes a parameter called new_order, which is
> intended to express the preferred order of the folios that will be
> allocated for the readahead operation. Most callers indeed call this
> with their preferred new order. But page_cache_async_ra() calls it with
> the preferred order of the previous readahead request (actually the
> order of the folio that had the readahead marker, which may be smaller
> when alignment comes into play).
> 
> And despite the parameter name, page_cache_ra_order() always treats it
> at the old order, adding 2 to it on entry. As a result, a cold readahead
> always starts with order-2 folios.
> 
> Let's fix this behaviour by always passing in the *new* order.
> 
> Worked example:
> 
> Prior to the change, mmaping an 8MB file and touching each page
> sequentially, resulted in the following, where we start with order-2
> folios for the first 128K then ramp up to order-4 for the next 128K,
> then get clamped to order-5 for the rest of the file because pa_pages is
> limited to 128K:
> 
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00004000      16384        0        4      4      2
> FOLIO  0x00004000  0x00008000      16384        4        8      4      2
> FOLIO  0x00008000  0x0000c000      16384        8       12      4      2
> FOLIO  0x0000c000  0x00010000      16384       12       16      4      2
> FOLIO  0x00010000  0x00014000      16384       16       20      4      2
> FOLIO  0x00014000  0x00018000      16384       20       24      4      2
> FOLIO  0x00018000  0x0001c000      16384       24       28      4      2
> FOLIO  0x0001c000  0x00020000      16384       28       32      4      2
> FOLIO  0x00020000  0x00030000      65536       32       48     16      4
> FOLIO  0x00030000  0x00040000      65536       48       64     16      4
> FOLIO  0x00040000  0x00060000     131072       64       96     32      5
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> ...
> 
> After the change, the same operation results in the first 128K being
> order-0, then we start ramping up to order-2, -4, and finally get
> clamped at order-5:
> 
> TYPE    STARTOFFS     ENDOFFS       SIZE  STARTPG    ENDPG   NRPG  ORDER
> -----  ----------  ----------  ---------  -------  -------  -----  -----
> FOLIO  0x00000000  0x00001000       4096        0        1      1      0
> FOLIO  0x00001000  0x00002000       4096        1        2      1      0
> FOLIO  0x00002000  0x00003000       4096        2        3      1      0
> FOLIO  0x00003000  0x00004000       4096        3        4      1      0
> FOLIO  0x00004000  0x00005000       4096        4        5      1      0
> FOLIO  0x00005000  0x00006000       4096        5        6      1      0
> FOLIO  0x00006000  0x00007000       4096        6        7      1      0
> FOLIO  0x00007000  0x00008000       4096        7        8      1      0
> FOLIO  0x00008000  0x00009000       4096        8        9      1      0
> FOLIO  0x00009000  0x0000a000       4096        9       10      1      0
> FOLIO  0x0000a000  0x0000b000       4096       10       11      1      0
> FOLIO  0x0000b000  0x0000c000       4096       11       12      1      0
> FOLIO  0x0000c000  0x0000d000       4096       12       13      1      0
> FOLIO  0x0000d000  0x0000e000       4096       13       14      1      0
> FOLIO  0x0000e000  0x0000f000       4096       14       15      1      0
> FOLIO  0x0000f000  0x00010000       4096       15       16      1      0
> FOLIO  0x00010000  0x00011000       4096       16       17      1      0
> FOLIO  0x00011000  0x00012000       4096       17       18      1      0
> FOLIO  0x00012000  0x00013000       4096       18       19      1      0
> FOLIO  0x00013000  0x00014000       4096       19       20      1      0
> FOLIO  0x00014000  0x00015000       4096       20       21      1      0
> FOLIO  0x00015000  0x00016000       4096       21       22      1      0
> FOLIO  0x00016000  0x00017000       4096       22       23      1      0
> FOLIO  0x00017000  0x00018000       4096       23       24      1      0
> FOLIO  0x00018000  0x00019000       4096       24       25      1      0
> FOLIO  0x00019000  0x0001a000       4096       25       26      1      0
> FOLIO  0x0001a000  0x0001b000       4096       26       27      1      0
> FOLIO  0x0001b000  0x0001c000       4096       27       28      1      0
> FOLIO  0x0001c000  0x0001d000       4096       28       29      1      0
> FOLIO  0x0001d000  0x0001e000       4096       29       30      1      0
> FOLIO  0x0001e000  0x0001f000       4096       30       31      1      0
> FOLIO  0x0001f000  0x00020000       4096       31       32      1      0
> FOLIO  0x00020000  0x00024000      16384       32       36      4      2
> FOLIO  0x00024000  0x00028000      16384       36       40      4      2
> FOLIO  0x00028000  0x0002c000      16384       40       44      4      2
> FOLIO  0x0002c000  0x00030000      16384       44       48      4      2
> FOLIO  0x00030000  0x00034000      16384       48       52      4      2
> FOLIO  0x00034000  0x00038000      16384       52       56      4      2
> FOLIO  0x00038000  0x0003c000      16384       56       60      4      2
> FOLIO  0x0003c000  0x00040000      16384       60       64      4      2
> FOLIO  0x00040000  0x00050000      65536       64       80     16      4
> FOLIO  0x00050000  0x00060000      65536       80       96     16      4
> FOLIO  0x00060000  0x00080000     131072       96      128     32      5
> FOLIO  0x00080000  0x000a0000     131072      128      160     32      5
> FOLIO  0x000a0000  0x000c0000     131072      160      192     32      5
> FOLIO  0x000c0000  0x000e0000     131072      192      224     32      5
> ...
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

Makes sense and looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/readahead.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 6a4e96b69702..8bb316f5a842 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -479,9 +479,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  
>  	limit = min(limit, index + ra->size - 1);
>  
> -	if (new_order < mapping_max_folio_order(mapping))
> -		new_order += 2;
> -
>  	new_order = min(mapping_max_folio_order(mapping), new_order);
>  	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  	new_order = max(new_order, min_order);
> @@ -683,6 +680,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>  	ra->size = get_next_ra_size(ra, max_pages);
>  	ra->async_size = ra->size;
>  readit:
> +	order += 2;
>  	ractl->_index = ra->start;
>  	page_cache_ra_order(ractl, ra, order);
>  }
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR