Previously asynchronous readahead would read ra_pages (usually 128K)
directly after the end of the synchronous readahead, and since the
synchronous readahead portion had no alignment guarantees (beyond page
boundaries) it was possible (and likely) that the end of the initial
128K region would not fall on a natural boundary for the folio size
being used. Therefore smaller folios were used to align down to the
required boundary, both at the end of the previous readahead block and
at the start of the new one.
In the worst cases, this can result in never properly ramping up the
folio size, and instead getting stuck oscillating between order-0, -1
and -2 folios. The next readahead will try to use folios whose order is
2 bigger than the order of the folio that had the readahead marker. But
because of the alignment requirements, that folio (the first one in the
readahead block) can end up being order-0 in some cases.
There will be 2 modifications to solve this issue:
1) Calculate the readahead size so the end is aligned to a folio
boundary. This prevents needing to allocate small folios to align
down at the end of the window and fixes the oscillation problem.
2) Remember the "preferred folio order" in the ra state instead of
inferring it from the folio with the readahead marker. This solves
the slow ramp up problem (discussed in a subsequent patch).
This patch addresses (1) only. A subsequent patch will address (2).
Worked example:
The following shows the previous pathological behaviour when the initial
synchronous readahead is unaligned. We start reading at page 17 in the
file and read sequentially from there. I'm showing a dump of the pages
in the page cache just after we read the first page of the folio with
the readahead marker.
Initially there are no pages in the page cache:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00800000 8388608 0 2048 2048
Then we access page 17, causing synchronous read-around of 128K with a
readahead marker set up at page 25. So far, all as expected:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
HOLE 0x00021000 0x00800000 8253440 33 2048 2015
Now access pages 18-25 inclusive. This causes an asynchronous 128K
readahead starting at page 33. But since we are unaligned, even though
the preferred folio order is 2, the first folio in this batch (the one
with the new readahead marker) is order-0:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
HOLE 0x00041000 0x00800000 8122368 65 2048 1983
Which means that when we now read pages 26-33 and readahead is kicked
off again, the new preferred order is 2 (0 + 2), not 4 as we intended:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y
FOLIO 0x00042000 0x00044000 8192 66 68 2 1
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00061000 4096 96 97 1 0
HOLE 0x00061000 0x00800000 7991296 97 2048 1951
This cycle of ramping up from order-0, with smaller orders at the edges
for alignment, continues all the way to the end of the file (not shown).
After the change, we round down the end boundary to the order boundary
so we no longer get stuck in the cycle and can ramp up the order over
time. Note that the rate of the ramp up is still not what we would
expect; we will fix that next. Here we are touching pages 17-256
sequentially:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00044000 16384 64 68 4 2
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00070000 65536 96 112 16 4
FOLIO 0x00070000 0x00080000 65536 112 128 16 4
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 8bb316f5a842..82f9f623f2d7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
unsigned long max_pages;
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
- pgoff_t expected, start;
+ pgoff_t expected, start, end, aligned_end;
unsigned int order = folio_order(folio);
/* no readahead */
@@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
* the readahead window.
*/
ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
- ra->async_size = ra->size;
goto readit;
}
@@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = start - index; /* old async_size */
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
- ra->async_size = ra->size;
readit:
order += 2;
+ end = ra->start + ra->size;
+ aligned_end = round_down(end, 1UL << order);
+ if (aligned_end > ra->start)
+ ra->size -= end - aligned_end;
+ ra->async_size = ra->size;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> Previously asynchronous readahead would read ra_pages (usually 128K)
> directly after the end of the synchronous readahead, and since the
> synchronous readahead portion had no alignment guarantees (beyond page
> boundaries) it was possible (and likely) that the end of the initial
> 128K region would not fall on a natural boundary for the folio size
> being used.

<snip>

Looks good. When I was reading this code some time ago, I also felt we
should rather do some rounding instead of creating small folios, so
thanks for working on this. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Mon 05-05-25 11:13:26, Jan Kara wrote:
> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> > Previously asynchronous readahead would read ra_pages (usually 128K)
> > directly after the end of the synchronous readahead, and since the
> > synchronous readahead portion had no alignment guarantees (beyond
> > page boundaries) it was possible (and likely) that the end of the
> > initial 128K region would not fall on a natural boundary for the
> > folio size being used.

<snip>

> Looks good. When I was reading this code some time ago, I also felt we
> should rather do some rounding instead of creating small folios, so
> thanks for working on this. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>

But now I've also remembered why what you do here isn't an obvious win.
There are storage devices (mostly RAID arrays) where the optimum read
size isn't a power of 2. Think for example of a RAID-0 device composed
of three disks. It will have max_pages set to something like 384
(512k * 3). Suppose we are on x86 and max_order is 9. Then previously
(if we were lucky with alignment) we were alternating between order-7
and order-8 folios in the page cache and doing optimally sized IOs of
1536k. Now you will allocate all folios of order 8 (nice) but reads
will be just 1024k and you'll see a noticeable drop in read throughput
(not nice). Note that this is not just a theoretical example but a real
case we have hit when doing performance testing of servers, and for
which I was tweaking readahead code in the past.

So I think we need to tweak this logic a bit. Perhaps we should
round_down end to the minimum alignment dictated by 'order' and
max_pages? Like:

	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)

If you set a badly aligned readahead size manually, you will get small
folios in the page cache, but that's just you being stupid. In practice,
hardware-induced readahead sizes need not be powers of 2, but they are
*sane* :).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 05/05/2025 10:37, Jan Kara wrote: > On Mon 05-05-25 11:13:26, Jan Kara wrote: >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote: >>> Previously asynchonous readahead would read ra_pages (usually 128K) >>> directly after the end of the synchonous readahead and given the >>> synchronous readahead portion had no alignment guarantees (beyond page >>> boundaries) it is possible (and likely) that the end of the initial 128K >>> region would not fall on a natural boundary for the folio size being >>> used. Therefore smaller folios were used to align down to the required >>> boundary, both at the end of the previous readahead block and at the >>> start of the new one. >>> >>> In the worst cases, this can result in never properly ramping up the >>> folio size, and instead getting stuck oscillating between order-0, -1 >>> and -2 folios. The next readahead will try to use folios whose order is >>> +2 bigger than the folio that had the readahead marker. But because of >>> the alignment requirements, that folio (the first one in the readahead >>> block) can end up being order-0 in some cases. >>> >>> There will be 2 modifications to solve this issue: >>> >>> 1) Calculate the readahead size so the end is aligned to a folio >>> boundary. This prevents needing to allocate small folios to align >>> down at the end of the window and fixes the oscillation problem. >>> >>> 2) Remember the "preferred folio order" in the ra state instead of >>> inferring it from the folio with the readahead marker. This solves >>> the slow ramp up problem (discussed in a subsequent patch). >>> >>> This patch addresses (1) only. A subsequent patch will address (2). >>> >>> Worked example: >>> >>> The following shows the previous pathalogical behaviour when the initial >>> synchronous readahead is unaligned. We start reading at page 17 in the >>> file and read sequentially from there. 
>>> I'm showing a dump of the pages
>>> in the page cache just after we read the first page of the folio with
>>> the readahead marker.
> 
> <snip>
> 
>> Looks good. When I was reading this code some time ago, I also felt we
>> should rather do some rounding instead of creating small folios so thanks
>> for working on this. Feel free to add:
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> But now I've also remembered why what you do here isn't an obvious win.
> There are storage devices (mostly RAID arrays) where optimum read size
> isn't a power of 2. Think for example a RAID-0 device composed from three
> disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> on x86 and max_order is 9. Then previously (if we were lucky with
> alignment) we were alternating between order 7 and order 8 pages in the
> page cache and do optimally sized IOs of 1536k.

Sorry, I'm struggling to follow some of this; perhaps my superficial
understanding of all the readahead subtleties is starting to show...

How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
bdi->ra_pages would remain the usual 32 (128K)? In which case, for mmap, won't
we continue to be limited by ra_pages and never get beyond order-5? (For
mmap, req_size is always set to ra_pages IIRC, so ractl_max_pages() always
just returns ra_pages.) Or perhaps ra_pages is set to 384 somewhere, but I'm
not spotting it in the code...

I guess you are also implicitly teaching me something about how the block
layer works here too... if there are 2 read requests for an order-7 and an
order-8, then the block layer will merge those to a single read (up to the
384-page optimal size?), but if there are 2 reads of order-8 then it won't
merge them, because that would be bigger than the optimal size, and it won't
split the second one at the optimal size either? Have I inferred that
correctly?
> Now you will allocate all
> folios of order 8 (nice) but reads will be just 1024k and you'll see
> noticeable drop in read throughput (not nice). Note that this is not just a
> theoretical example but a real case we have hit when doing performance
> testing of servers and for which I was tweaking readahead code in the past.
> 
> So I think we need to tweak this logic a bit. Perhaps we should round_down
> end to the minimum alignment dictated by 'order' and max_pages? Like:
> 
> 	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)

Sorry, I'm staring at this and struggling to understand the "PAGE_SHIFT -
1" part.

I think what you are suggesting is that the patch becomes something like
this:

---8<---
+	end = ra->start + ra->size;
+	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
+	if (aligned_end > ra->start)
+		ra->size -= end - aligned_end;
+	ra->async_size = ra->size;
---8<---

So if max_pages=384, then aligned_end will be aligned down to a maximum
of the previous 1MB boundary?

Thanks,
Ryan

> 
> If you set a badly aligned readahead size manually, you will get small pages
> in the page cache, but that's just you being stupid. In practice,
> hardware-induced readahead sizes need not be powers of 2, but they are
> *sane* :).
> 
> 								Honza
> 
>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>> index 8bb316f5a842..82f9f623f2d7 100644
>>> --- a/mm/readahead.c
>>> +++ b/mm/readahead.c
>>> @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  	unsigned long max_pages;
>>>  	struct file_ra_state *ra = ractl->ra;
>>>  	pgoff_t index = readahead_index(ractl);
>>> -	pgoff_t expected, start;
>>> +	pgoff_t expected, start, end, aligned_end;
>>>  	unsigned int order = folio_order(folio);
>>>  
>>>  	/* no readahead */
>>> @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  		 * the readahead window.
>>>  		 */
>>>  		ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
>>> -		ra->async_size = ra->size;
>>>  		goto readit;
>>>  	}
>>>  
>>> @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  	ra->size = start - index;	/* old async_size */
>>>  	ra->size += req_count;
>>>  	ra->size = get_next_ra_size(ra, max_pages);
>>> -	ra->async_size = ra->size;
>>>  readit:
>>>  	order += 2;
>>> +	end = ra->start + ra->size;
>>> +	aligned_end = round_down(end, 1UL << order);
>>> +	if (aligned_end > ra->start)
>>> +		ra->size -= end - aligned_end;
>>> +	ra->async_size = ra->size;
>>>  	ractl->_index = ra->start;
>>>  	page_cache_ra_order(ractl, ra, order);
>>>  }
>>> -- 
>>> 2.43.0
>>> 
>> -- 
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
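Jan's RAID-0 concern can be made concrete with quick arithmetic. Below is a hedged Python model using the numbers from his example (max_pages = 384, preferred folio order 8); it illustrates the interaction being discussed and is not kernel code.

```python
def round_down(x, align):
    # Mirror of the kernel round_down(): clear the low-order bits.
    return x & ~(align - 1)

# Numbers from Jan's example: a 3-disk RAID-0 with 512K chunks has an
# optimal IO size of 1536K, i.e. max_pages = 384 (4K pages), while the
# preferred folio order has ramped up to 8 (1MB) on x86.
max_pages = 384
order = 8

# Patch as posted: align the window end to the folio size alone.
ra_start, ra_size = 0, max_pages
end = ra_start + ra_size
aligned_end = round_down(end, 1 << order)
if aligned_end > ra_start:
    ra_size -= end - aligned_end

print(ra_size)  # 256: the 384-page window shrank to a single 1024K IO
```

That is the regression Jan describes: the device wants 1536K reads, but order-8 alignment trims every window back to 1024K.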
On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously asynchronous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchronous readahead and given the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries) it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>>    boundary. This prevents needing to allocate small folios to align
> >>>    down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>>    inferring it from the folio with the readahead marker. This solves
> >>>    the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathological behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there.
> >>> I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> > 
> > <snip>
> > 
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where optimum read size
> > isn't a power of 2. Think for example a RAID-0 device composed from three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and do optimally sized IOs of 1536k.
> 
> Sorry, I'm struggling to follow some of this; perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
> 
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?

Sorry, I have been probably too brief in my previous message :)
bdi->ra_pages is actually set based on the optimal IO size reported by the
hardware (see blk_apply_bdi_limits() and how its callers are filling in
lim->io_opt). The 128K you speak about is just a last-resort value used if
the hardware doesn't provide one. And some storage devices do report an
optimal IO size that is not a power of two.

Also note that bdi->ra_pages can be tuned in sysfs, and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set, but you definitely cannot assume
bdi->ra_pages is 128K :).

> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5?
> (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always
> just returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm
> not spotting it in the code...
> 
> I guess you are also implicitly teaching me something about how the block
> layer works here too... if there are 2 read requests for an order-7 and
> order-8, then the block layer will merge those to a single read (up to the
> 384 optimal size?)

Correct. In fact the readahead code will already perform this merging when
submitting the IO.

> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the
> optimal size either? Have I inferred that correctly?

With the code as you modify it, you would round down ra->size from 384 to
256 and submit only one 1MB-sized IO (with one order-8 page). And this will
cause a regression in read throughput for such devices, because they now
don't get a buffer large enough to run at full speed.

> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see
> > noticeable drop in read throughput (not nice). Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> >
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and max_pages? Like:
> >
> > 	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
> 
> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?

My bad. It should have been:

	1 << min(order, ffs(max_pages) - 1)

> I think what you are suggesting is that the patch becomes something like
> this:
> 
> ---8<---
> +	end = ra->start + ra->size;
> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));

Not quite.
ilog2() returns the most significant bit set, but we really want
to align to the least significant bit set. So when max_pages is 384, we
want to align to at most order-7 (aligning the end more does not make sense
when you want to do IO 384 pages large). That's why I'm using ffs() and not
ilog2().

> +	if (aligned_end > ra->start)
> +		ra->size -= end - aligned_end;
> +	ra->async_size = ra->size;
> ---8<---
> 
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?

No, it needs to be aligned only to the previous 512K boundary, because we
want to do IOs 3*512K large.

Hope things are a bit clearer now :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
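The ffs()/ilog2() distinction above is easy to check numerically. This Python sketch models the semantics of the two kernel helpers (1-based ffs(), as in the kernel) to show why ffs() gives the right alignment cap for max_pages = 384:

```python
def ffs(x):
    # Kernel-style ffs(): 1-based index of the least significant set bit.
    return (x & -x).bit_length()

def ilog2(x):
    # Kernel-style ilog2(): index of the most significant set bit.
    return x.bit_length() - 1

max_pages = 384  # binary 1_1000_0000: bits 7 and 8 set (384 = 256 + 128)

# ffs() - 1 caps the alignment at order-7 (512K), matching the 3*512K
# optimal IO; ilog2() would over-align to order-8 (1MB) and shrink the
# window back to 256 pages.
print(ffs(max_pages) - 1)  # 7
print(ilog2(max_pages))    # 8
```

For a power-of-two max_pages the two agree (e.g. both give 8 for 256), which is why Ryan's ilog2() reading only diverges on "odd" RAID-style sizes.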
On 06/05/2025 12:29, Jan Kara wrote:
> On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
>> On 05/05/2025 10:37, Jan Kara wrote:
>>> On Mon 05-05-25 11:13:26, Jan Kara wrote:
>>>> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
>>>>> Previously asynchronous readahead would read ra_pages (usually 128K)
>>>>> directly after the end of the synchronous readahead and given the
>>>>> synchronous readahead portion had no alignment guarantees (beyond page
>>>>> boundaries) it is possible (and likely) that the end of the initial 128K
>>>>> region would not fall on a natural boundary for the folio size being
>>>>> used. Therefore smaller folios were used to align down to the required
>>>>> boundary, both at the end of the previous readahead block and at the
>>>>> start of the new one.
>>>>>
>>>>> In the worst cases, this can result in never properly ramping up the
>>>>> folio size, and instead getting stuck oscillating between order-0, -1
>>>>> and -2 folios. The next readahead will try to use folios whose order is
>>>>> +2 bigger than the folio that had the readahead marker. But because of
>>>>> the alignment requirements, that folio (the first one in the readahead
>>>>> block) can end up being order-0 in some cases.
>>>>>
>>>>> There will be 2 modifications to solve this issue:
>>>>>
>>>>> 1) Calculate the readahead size so the end is aligned to a folio
>>>>>    boundary. This prevents needing to allocate small folios to align
>>>>>    down at the end of the window and fixes the oscillation problem.
>>>>>
>>>>> 2) Remember the "preferred folio order" in the ra state instead of
>>>>>    inferring it from the folio with the readahead marker. This solves
>>>>>    the slow ramp up problem (discussed in a subsequent patch).
>>>>>
>>>>> This patch addresses (1) only. A subsequent patch will address (2).
>>>>>
>>>>> Worked example:
>>>>>
>>>>> The following shows the previous pathological behaviour when the initial
>>>>> synchronous readahead is unaligned.
>>>>> We start reading at page 17 in the
>>>>> file and read sequentially from there. I'm showing a dump of the pages
>>>>> in the page cache just after we read the first page of the folio with
>>>>> the readahead marker.
>>>
>>> <snip>
>>>
>>>> Looks good. When I was reading this code some time ago, I also felt we
>>>> should rather do some rounding instead of creating small folios so thanks
>>>> for working on this. Feel free to add:
>>>>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>
>>> But now I've also remembered why what you do here isn't an obvious win.
>>> There are storage devices (mostly RAID arrays) where optimum read size
>>> isn't a power of 2. Think for example a RAID-0 device composed from three
>>> disks. It will have max_pages something like 384 (512k * 3). Suppose we are
>>> on x86 and max_order is 9. Then previously (if we were lucky with
>>> alignment) we were alternating between order 7 and order 8 pages in the
>>> page cache and do optimally sized IOs of 1536k.
>>
>> Sorry, I'm struggling to follow some of this; perhaps my superficial
>> understanding of all the readahead subtleties is starting to show...
>>
>> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
>> bdi->ra_pages would remain the usual 32 (128K)?
> 
> Sorry, I have been probably too brief in my previous message :)
> bdi->ra_pages is actually set based on the optimal IO size reported by the
> hardware (see blk_apply_bdi_limits() and how its callers are filling in
> lim->io_opt). The 128K you speak about is just a last-resort value used if
> the hardware doesn't provide one. And some storage devices do report an
> optimal IO size that is not a power of two.

Ahh, got it - thanks for the education!

> 
> Also note that bdi->ra_pages can be tuned in sysfs, and a lot of users
> actually do this (usually from their udev rules). We don't have to perform
> well when some odd value gets set, but you definitely cannot assume
> bdi->ra_pages is 128K :).
> 
>> In which case, for mmap, won't
>> we continue to be limited by ra_pages and will never get beyond order-5?
>> (for mmap req_size is always set to ra_pages IIRC, so ractl_max_pages()
>> always just returns ra_pages). Or perhaps ra_pages is set to 384 somewhere,
>> but I'm not spotting it in the code...
>>
>> I guess you are also implicitly teaching me something about how the block
>> layer works here too... if there are 2 read requests for an order-7 and
>> order-8, then the block layer will merge those to a single read (up to the
>> 384 optimal size?)
> 
> Correct. In fact the readahead code will already perform this merging when
> submitting the IO.
> 
>> but if there are 2 reads of order-8 then it won't merge because it would be
>> bigger than the optimal size and it won't split the second one at the
>> optimal size either? Have I inferred that correctly?
> 
> With the code as you modify it, you would round down ra->size from 384 to
> 256 and submit only one 1MB-sized IO (with one order-8 page). And this will
> cause a regression in read throughput for such devices, because they now
> don't get a buffer large enough to run at full speed.

Ahha, yes, thanks - now it's clicking.

> 
>>> Now you will allocate all
>>> folios of order 8 (nice) but reads will be just 1024k and you'll see
>>> noticeable drop in read throughput (not nice). Note that this is not just a
>>> theoretical example but a real case we have hit when doing performance
>>> testing of servers and for which I was tweaking readahead code in the past.
>>>
>>> So I think we need to tweak this logic a bit. Perhaps we should round_down
>>> end to the minimum alignment dictated by 'order' and max_pages? Like:
>>>
>>> 	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>>
>> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
>> 1" part?
> 
> My bad.
> It should have been:
> 
> 	1 << min(order, ffs(max_pages) - 1)
> 
>> I think what you are suggesting is that the patch becomes something like
>> this:
>>
>> ---8<---
>> +	end = ra->start + ra->size;
>> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
> 
> Not quite. ilog2() returns the most significant bit set, but we really want
> to align to the least significant bit set. So when max_pages is 384, we
> want to align to at most order-7 (aligning the end more does not make sense
> when you want to do IO 384 pages large). That's why I'm using ffs() and not
> ilog2().

Yep, got it now.

> 
>> +	if (aligned_end > ra->start)
>> +		ra->size -= end - aligned_end;
>> +	ra->async_size = ra->size;
>> ---8<---
>>
>> So if max_pages=384, then aligned_end will be aligned down to a maximum
>> of the previous 1MB boundary?
> 
> No, it needs to be aligned only to the previous 512K boundary, because we
> want to do IOs 3*512K large.
> 
> Hope things are a bit clearer now :)

Yes, much!

Thanks,
Ryan

> 
> 								Honza
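Putting the thread's conclusion together: with the corrected `1 << min(order, ffs(max_pages) - 1)` alignment, the 384-page window survives intact while power-of-two windows still get folio-aligned ends. A hedged Python model of the combined logic (again illustrative, not the kernel code):

```python
def ffs(x):
    # Kernel-style ffs(): 1-based index of the least significant set bit.
    return (x & -x).bit_length()

def round_down(x, align):
    # Mirror of the kernel round_down().
    return x & ~(align - 1)

def trim(ra_start, ra_size, order, max_pages):
    # Align the window end to min(folio order, lowest set bit of max_pages),
    # per the corrected suggestion: 1 << min(order, ffs(max_pages) - 1).
    align_order = min(order, ffs(max_pages) - 1)
    end = ra_start + ra_size
    aligned_end = round_down(end, 1 << align_order)
    if aligned_end > ra_start:
        ra_size -= end - aligned_end
    return ra_size

# RAID-0 case: max_pages=384, order-8 folios. Alignment caps at order-7,
# so a 384-page window keeps its full 1536K size.
print(trim(0, 384, 8, 384))    # 384

# Unaligned sync-readahead case: window of 32 pages at page 17 with
# order-4 folios and a power-of-two max_pages still trims to 31.
print(trim(17, 32, 4, 1024))   # 31
```

Under this scheme the 1536K IO from Jan's example is preserved, while the original patch's oscillation fix still applies whenever the alignment is not capped.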