Previously asynchronous readahead would read ra_pages (usually 128K)
directly after the end of the synchronous readahead, and since the
synchronous readahead portion had no alignment guarantees (beyond page
boundaries) it was possible (and likely) that the end of the initial
128K region would not fall on a natural boundary for the folio size
being used. Therefore smaller folios were used to align down to the
required boundary, both at the end of the previous readahead block and
at the start of the new one.
In the worst cases, this can result in never properly ramping up the
folio size, and instead getting stuck oscillating between order-0, -1
and -2 folios. The next readahead will try to use folios whose order is
2 bigger than the order of the folio that had the readahead marker. But
because of the alignment requirements, that folio (the first one in the
readahead block) can end up being order-0 in some cases.
There will be 2 modifications to solve this issue:
1) Calculate the readahead size so the end is aligned to a folio
boundary. This prevents needing to allocate small folios to align
down at the end of the window and fixes the oscillation problem.
2) Remember the "preferred folio order" in the ra state instead of
inferring it from the folio with the readahead marker. This solves
the slow ramp up problem (discussed in a subsequent patch).
This patch addresses (1) only. A subsequent patch will address (2).
Worked example:
The following shows the previous pathological behaviour when the initial
synchronous readahead is unaligned. We start reading at page 17 in the
file and read sequentially from there. I'm showing a dump of the pages
in the page cache just after we read the first page of the folio with
the readahead marker.
Initially there are no pages in the page cache:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00800000 8388608 0 2048 2048
Then we access page 17, causing synchronous read-around of 128K with a
readahead marker set up at page 25. So far, all as expected:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
HOLE 0x00021000 0x00800000 8253440 33 2048 2015
Now access pages 18-25 inclusive. This causes an asynchronous 128K
readahead starting at page 33. But since we are unaligned, even though
the preferred folio order is 2, the first folio in this batch (the one
with the new readahead marker) is order-0:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
HOLE 0x00041000 0x00800000 8122368 65 2048 1983
Which means that when we now read pages 26-33 and readahead is kicked
off again, the new preferred order is 2 (0 + 2), not 4 as we intended:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00041000 4096 64 65 1 0
FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y
FOLIO 0x00042000 0x00044000 8192 66 68 2 1
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00061000 4096 96 97 1 0
HOLE 0x00061000 0x00800000 7991296 97 2048 1951
This cycle of ramping up from order-0, with smaller orders at the edges
for alignment, continues all the way to the end of the file (not shown).
After the change, we round down the end boundary to the order boundary
so we no longer get stuck in the cycle and can ramp up the order over
time. Note that the rate of the ramp up is still not what we would
expect; we will fix that next. Here we are touching pages 17-256
sequentially:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00044000 16384 64 68 4 2
FOLIO 0x00044000 0x00048000 16384 68 72 4 2
FOLIO 0x00048000 0x0004c000 16384 72 76 4 2
FOLIO 0x0004c000 0x00050000 16384 76 80 4 2
FOLIO 0x00050000 0x00054000 16384 80 84 4 2
FOLIO 0x00054000 0x00058000 16384 84 88 4 2
FOLIO 0x00058000 0x0005c000 16384 88 92 4 2
FOLIO 0x0005c000 0x00060000 16384 92 96 4 2
FOLIO 0x00060000 0x00070000 65536 96 112 16 4
FOLIO 0x00070000 0x00080000 65536 112 128 16 4
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/readahead.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index 8bb316f5a842..82f9f623f2d7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
unsigned long max_pages;
struct file_ra_state *ra = ractl->ra;
pgoff_t index = readahead_index(ractl);
- pgoff_t expected, start;
+ pgoff_t expected, start, end, aligned_end;
unsigned int order = folio_order(folio);
/* no readahead */
@@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
* the readahead window.
*/
ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
- ra->async_size = ra->size;
goto readit;
}
@@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size = start - index; /* old async_size */
ra->size += req_count;
ra->size = get_next_ra_size(ra, max_pages);
- ra->async_size = ra->size;
readit:
order += 2;
+ end = ra->start + ra->size;
+ aligned_end = round_down(end, 1UL << order);
+ if (aligned_end > ra->start)
+ ra->size -= end - aligned_end;
+ ra->async_size = ra->size;
ractl->_index = ra->start;
page_cache_ra_order(ractl, ra, order);
}
--
2.43.0
On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> Previously asynchronous readahead would read ra_pages (usually 128K)
> directly after the end of the synchronous readahead, and since the
> synchronous readahead portion had no alignment guarantees (beyond page
> boundaries) it was possible (and likely) that the end of the initial
> 128K region would not fall on a natural boundary for the folio size
> being used.

<snip>

Looks good. When I was reading this code some time ago, I also felt we
should rather do some rounding instead of creating small folios, so
thanks for working on this. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Mon 05-05-25 11:13:26, Jan Kara wrote:
> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> > Previously asynchronous readahead would read ra_pages (usually 128K)
> > directly after the end of the synchronous readahead, and since the
> > synchronous readahead portion had no alignment guarantees (beyond
> > page boundaries) it was possible (and likely) that the end of the
> > initial 128K region would not fall on a natural boundary for the
> > folio size being used.

<snip>

> Looks good. When I was reading this code some time ago, I also felt we
> should rather do some rounding instead of creating small folios, so
> thanks for working on this. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>

But now I've also remembered why what you do here isn't an obvious win.
There are storage devices (mostly RAID arrays) where the optimum read
size isn't a power of 2. Think for example of a RAID-0 device composed
of three disks. It will have max_pages set to something like 384
(512k * 3). Suppose we are on x86 and max_order is 9. Then previously
(if we were lucky with alignment) we were alternating between order-7
and order-8 folios in the page cache and doing optimally sized IOs of
1536k. Now you will allocate all folios of order 8 (nice) but reads
will be just 1024k and you'll see a noticeable drop in read throughput
(not nice). Note that this is not just a theoretical example but a real
case we have hit when doing performance testing of servers, and for
which I was tweaking readahead code in the past.

So I think we need to tweak this logic a bit. Perhaps we should
round_down end to the minimum alignment dictated by 'order' and
max_pages? Like:

	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)

If you set a badly aligned readahead size manually, you will get small
folios in the page cache, but that's just you being stupid. In practice,
hardware-induced readahead sizes need not be powers of 2, but they are
*sane* :).

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 05/05/2025 10:37, Jan Kara wrote: > On Mon 05-05-25 11:13:26, Jan Kara wrote: >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote: >>> Previously asynchonous readahead would read ra_pages (usually 128K) >>> directly after the end of the synchonous readahead and given the >>> synchronous readahead portion had no alignment guarantees (beyond page >>> boundaries) it is possible (and likely) that the end of the initial 128K >>> region would not fall on a natural boundary for the folio size being >>> used. Therefore smaller folios were used to align down to the required >>> boundary, both at the end of the previous readahead block and at the >>> start of the new one. >>> >>> In the worst cases, this can result in never properly ramping up the >>> folio size, and instead getting stuck oscillating between order-0, -1 >>> and -2 folios. The next readahead will try to use folios whose order is >>> +2 bigger than the folio that had the readahead marker. But because of >>> the alignment requirements, that folio (the first one in the readahead >>> block) can end up being order-0 in some cases. >>> >>> There will be 2 modifications to solve this issue: >>> >>> 1) Calculate the readahead size so the end is aligned to a folio >>> boundary. This prevents needing to allocate small folios to align >>> down at the end of the window and fixes the oscillation problem. >>> >>> 2) Remember the "preferred folio order" in the ra state instead of >>> inferring it from the folio with the readahead marker. This solves >>> the slow ramp up problem (discussed in a subsequent patch). >>> >>> This patch addresses (1) only. A subsequent patch will address (2). >>> >>> Worked example: >>> >>> The following shows the previous pathalogical behaviour when the initial >>> synchronous readahead is unaligned. We start reading at page 17 in the >>> file and read sequentially from there. 
>>> I'm showing a dump of the pages
>>> in the page cache just after we read the first page of the folio with
>>> the readahead marker.
> 
> <snip>
> 
>> Looks good. When I was reading this code some time ago, I also felt we
>> should rather do some rounding instead of creating small folios so thanks
>> for working on this. Feel free to add:
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> But now I've also remembered why what you do here isn't an obvious win.
> There are storage devices (mostly RAID arrays) where optimum read size
> isn't a power of 2. Think for example a RAID-0 device composed from three
> disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> on x86 and max_order is 9. Then previously (if we were lucky with
> alignment) we were alternating between order 7 and order 8 pages in the
> page cache and do optimally sized IOs of 1536k.

Sorry, I'm struggling to follow some of this; perhaps my superficial
understanding of all the readahead subtleties is starting to show...

How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
bdi->ra_pages would remain the usual 32 (128K)? In which case, for mmap, won't
we continue to be limited by ra_pages and never get beyond order-5? (For
mmap, req_size is always set to ra_pages IIRC, so ractl_max_pages() always
just returns ra_pages.) Or perhaps ra_pages is set to 384 somewhere, but I'm
not spotting it in the code...

I guess you are also implicitly teaching me something about how the block
layer works here too... if there are 2 read requests for an order-7 and an
order-8, then the block layer will merge those to a single read (up to the
384-page optimal size?), but if there are 2 reads of order-8 then it won't
merge them, because that would be bigger than the optimal size, and it won't
split the second one at the optimal size either? Have I inferred that
correctly?
> Now you will allocate all
> folios of order 8 (nice) but reads will be just 1024k and you'll see
> noticeable drop in read throughput (not nice). Note that this is not just a
> theoretical example but a real case we have hit when doing performance
> testing of servers and for which I was tweaking readahead code in the past.
> 
> So I think we need to tweak this logic a bit. Perhaps we should round_down
> end to the minimum alignment dictated by 'order' and max_pages? Like:
> 
> 	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)

Sorry, I'm staring at this and struggling to understand the "PAGE_SHIFT -
1" part.

I think what you are suggesting is that the patch becomes something like
this:

---8<---
+	end = ra->start + ra->size;
+	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
+	if (aligned_end > ra->start)
+		ra->size -= end - aligned_end;
+	ra->async_size = ra->size;
---8<---

So if max_pages=384, then aligned_end will be aligned down to a maximum
of the previous 1MB boundary?

Thanks,
Ryan

> 
> If you set a badly aligned readahead size manually, you will get small pages
> in the page cache, but that's just you being stupid. In practice,
> hardware-induced readahead sizes need not be powers of 2, but they are
> *sane* :).
> 
> 								Honza
> 
>>> diff --git a/mm/readahead.c b/mm/readahead.c
>>> index 8bb316f5a842..82f9f623f2d7 100644
>>> --- a/mm/readahead.c
>>> +++ b/mm/readahead.c
>>> @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  	unsigned long max_pages;
>>>  	struct file_ra_state *ra = ractl->ra;
>>>  	pgoff_t index = readahead_index(ractl);
>>> -	pgoff_t expected, start;
>>> +	pgoff_t expected, start, end, aligned_end;
>>>  	unsigned int order = folio_order(folio);
>>>  
>>>  	/* no readahead */
>>> @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  		 * the readahead window.
>>>  		 */
>>>  		ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
>>> -		ra->async_size = ra->size;
>>>  		goto readit;
>>>  	}
>>>  
>>> @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
>>>  	ra->size = start - index;	/* old async_size */
>>>  	ra->size += req_count;
>>>  	ra->size = get_next_ra_size(ra, max_pages);
>>> -	ra->async_size = ra->size;
>>>  readit:
>>>  	order += 2;
>>> +	end = ra->start + ra->size;
>>> +	aligned_end = round_down(end, 1UL << order);
>>> +	if (aligned_end > ra->start)
>>> +		ra->size -= end - aligned_end;
>>> +	ra->async_size = ra->size;
>>>  	ractl->_index = ra->start;
>>>  	page_cache_ra_order(ractl, ra, order);
>>>  }
>>> -- 
>>> 2.43.0
>>> 
>> -- 
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
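Jan's RAID-0 concern can be made concrete with quick arithmetic. Below is a hedged Python model using the numbers from his example (max_pages = 384, preferred folio order 8); it illustrates the interaction being discussed and is not kernel code.

```python
def round_down(x, align):
    # Mirror of the kernel round_down(): clear the low-order bits.
    return x & ~(align - 1)

# Numbers from Jan's example: a 3-disk RAID-0 with 512K chunks has an
# optimal IO size of 1536K, i.e. max_pages = 384 (4K pages), while the
# preferred folio order has ramped up to 8 (1MB) on x86.
max_pages = 384
order = 8

# Patch as posted: align the window end to the folio size alone.
ra_start, ra_size = 0, max_pages
end = ra_start + ra_size
aligned_end = round_down(end, 1 << order)
if aligned_end > ra_start:
    ra_size -= end - aligned_end

print(ra_size)  # 256: the 384-page window shrank to a single 1024K IO
```

That is the regression Jan describes: the device wants 1536K reads, but order-8 alignment trims every window back to 1024K.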
On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
> On 05/05/2025 10:37, Jan Kara wrote:
> > On Mon 05-05-25 11:13:26, Jan Kara wrote:
> >> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> >>> Previously asynchronous readahead would read ra_pages (usually 128K)
> >>> directly after the end of the synchronous readahead and given the
> >>> synchronous readahead portion had no alignment guarantees (beyond page
> >>> boundaries) it is possible (and likely) that the end of the initial 128K
> >>> region would not fall on a natural boundary for the folio size being
> >>> used. Therefore smaller folios were used to align down to the required
> >>> boundary, both at the end of the previous readahead block and at the
> >>> start of the new one.
> >>>
> >>> In the worst cases, this can result in never properly ramping up the
> >>> folio size, and instead getting stuck oscillating between order-0, -1
> >>> and -2 folios. The next readahead will try to use folios whose order is
> >>> +2 bigger than the folio that had the readahead marker. But because of
> >>> the alignment requirements, that folio (the first one in the readahead
> >>> block) can end up being order-0 in some cases.
> >>>
> >>> There will be 2 modifications to solve this issue:
> >>>
> >>> 1) Calculate the readahead size so the end is aligned to a folio
> >>>    boundary. This prevents needing to allocate small folios to align
> >>>    down at the end of the window and fixes the oscillation problem.
> >>>
> >>> 2) Remember the "preferred folio order" in the ra state instead of
> >>>    inferring it from the folio with the readahead marker. This solves
> >>>    the slow ramp up problem (discussed in a subsequent patch).
> >>>
> >>> This patch addresses (1) only. A subsequent patch will address (2).
> >>>
> >>> Worked example:
> >>>
> >>> The following shows the previous pathological behaviour when the initial
> >>> synchronous readahead is unaligned. We start reading at page 17 in the
> >>> file and read sequentially from there.
> >>> I'm showing a dump of the pages
> >>> in the page cache just after we read the first page of the folio with
> >>> the readahead marker.
> > 
> > <snip>
> > 
> >> Looks good. When I was reading this code some time ago, I also felt we
> >> should rather do some rounding instead of creating small folios so thanks
> >> for working on this. Feel free to add:
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > But now I've also remembered why what you do here isn't an obvious win.
> > There are storage devices (mostly RAID arrays) where optimum read size
> > isn't a power of 2. Think for example a RAID-0 device composed from three
> > disks. It will have max_pages something like 384 (512k * 3). Suppose we are
> > on x86 and max_order is 9. Then previously (if we were lucky with
> > alignment) we were alternating between order 7 and order 8 pages in the
> > page cache and do optimally sized IOs of 1536k.
> 
> Sorry, I'm struggling to follow some of this; perhaps my superficial
> understanding of all the readahead subtleties is starting to show...
> 
> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
> bdi->ra_pages would remain the usual 32 (128K)?

Sorry, I have been probably too brief in my previous message :)
bdi->ra_pages is actually set based on the optimal IO size reported by the
hardware (see blk_apply_bdi_limits() and how its callers are filling in
lim->io_opt). The 128K you speak about is just a last-resort value used if
the hardware doesn't provide one. And some storage devices do report an
optimal IO size that is not a power of two.

Also note that bdi->ra_pages can be tuned in sysfs, and a lot of users
actually do this (usually from their udev rules). We don't have to perform
well when some odd value gets set, but you definitely cannot assume
bdi->ra_pages is 128K :).

> In which case, for mmap, won't
> we continue to be limited by ra_pages and will never get beyond order-5?
> (for
> mmap req_size is always set to ra_pages IIRC, so ractl_max_pages() always
> just returns ra_pages). Or perhaps ra_pages is set to 384 somewhere, but I'm
> not spotting it in the code...
> 
> I guess you are also implicitly teaching me something about how the block
> layer works here too... if there are 2 read requests for an order-7 and
> order-8, then the block layer will merge those to a single read (up to the
> 384 optimal size?)

Correct. In fact the readahead code will already perform this merging when
submitting the IO.

> but if there are 2 reads of order-8 then it won't merge because it would be
> bigger than the optimal size and it won't split the second one at the
> optimal size either? Have I inferred that correctly?

With the code as you modify it, you would round down ra->size from 384 to
256 and submit only one 1MB-sized IO (with one order-8 page). And this will
cause a regression in read throughput for such devices, because they now
don't get a buffer large enough to run at full speed.

> > Now you will allocate all
> > folios of order 8 (nice) but reads will be just 1024k and you'll see
> > noticeable drop in read throughput (not nice). Note that this is not just a
> > theoretical example but a real case we have hit when doing performance
> > testing of servers and for which I was tweaking readahead code in the past.
> >
> > So I think we need to tweak this logic a bit. Perhaps we should round_down
> > end to the minimum alignment dictated by 'order' and max_pages? Like:
> >
> > 	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
> 
> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
> 1" part?

My bad. It should have been:

	1 << min(order, ffs(max_pages) - 1)

> I think what you are suggesting is that the patch becomes something like
> this:
> 
> ---8<---
> +	end = ra->start + ra->size;
> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));

Not quite.
ilog2() returns the most significant bit set, but we really want
to align to the least significant bit set. So when max_pages is 384, we
want to align to at most order-7 (aligning the end more does not make sense
when you want to do IO 384 pages large). That's why I'm using ffs() and not
ilog2().

> +	if (aligned_end > ra->start)
> +		ra->size -= end - aligned_end;
> +	ra->async_size = ra->size;
> ---8<---
> 
> So if max_pages=384, then aligned_end will be aligned down to a maximum
> of the previous 1MB boundary?

No, it needs to be aligned only to the previous 512K boundary, because we
want to do IOs 3*512K large.

Hope things are a bit clearer now :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
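The ffs()/ilog2() distinction above is easy to check numerically. This Python sketch models the semantics of the two kernel helpers (1-based ffs(), as in the kernel) to show why ffs() gives the right alignment cap for max_pages = 384:

```python
def ffs(x):
    # Kernel-style ffs(): 1-based index of the least significant set bit.
    return (x & -x).bit_length()

def ilog2(x):
    # Kernel-style ilog2(): index of the most significant set bit.
    return x.bit_length() - 1

max_pages = 384  # binary 1_1000_0000: bits 7 and 8 set (384 = 256 + 128)

# ffs() - 1 caps the alignment at order-7 (512K), matching the 3*512K
# optimal IO; ilog2() would over-align to order-8 (1MB) and shrink the
# window back to 256 pages.
print(ffs(max_pages) - 1)  # 7
print(ilog2(max_pages))    # 8
```

For a power-of-two max_pages the two agree (e.g. both give 8 for 256), which is why Ryan's ilog2() reading only diverges on "odd" RAID-style sizes.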
On 06/05/2025 12:29, Jan Kara wrote:
> On Tue 06-05-25 10:28:11, Ryan Roberts wrote:
>> On 05/05/2025 10:37, Jan Kara wrote:
>>> On Mon 05-05-25 11:13:26, Jan Kara wrote:
>>>> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
>>>>> Previously asynchronous readahead would read ra_pages (usually 128K)
>>>>> directly after the end of the synchronous readahead and given the
>>>>> synchronous readahead portion had no alignment guarantees (beyond page
>>>>> boundaries) it is possible (and likely) that the end of the initial 128K
>>>>> region would not fall on a natural boundary for the folio size being
>>>>> used. Therefore smaller folios were used to align down to the required
>>>>> boundary, both at the end of the previous readahead block and at the
>>>>> start of the new one.
>>>>>
>>>>> In the worst cases, this can result in never properly ramping up the
>>>>> folio size, and instead getting stuck oscillating between order-0, -1
>>>>> and -2 folios. The next readahead will try to use folios whose order is
>>>>> +2 bigger than the folio that had the readahead marker. But because of
>>>>> the alignment requirements, that folio (the first one in the readahead
>>>>> block) can end up being order-0 in some cases.
>>>>>
>>>>> There will be 2 modifications to solve this issue:
>>>>>
>>>>> 1) Calculate the readahead size so the end is aligned to a folio
>>>>>    boundary. This prevents needing to allocate small folios to align
>>>>>    down at the end of the window and fixes the oscillation problem.
>>>>>
>>>>> 2) Remember the "preferred folio order" in the ra state instead of
>>>>>    inferring it from the folio with the readahead marker. This solves
>>>>>    the slow ramp up problem (discussed in a subsequent patch).
>>>>>
>>>>> This patch addresses (1) only. A subsequent patch will address (2).
>>>>>
>>>>> Worked example:
>>>>>
>>>>> The following shows the previous pathological behaviour when the initial
>>>>> synchronous readahead is unaligned.
>>>>> We start reading at page 17 in the
>>>>> file and read sequentially from there. I'm showing a dump of the pages
>>>>> in the page cache just after we read the first page of the folio with
>>>>> the readahead marker.
>>>
>>> <snip>
>>>
>>>> Looks good. When I was reading this code some time ago, I also felt we
>>>> should rather do some rounding instead of creating small folios so thanks
>>>> for working on this. Feel free to add:
>>>>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>
>>> But now I've also remembered why what you do here isn't an obvious win.
>>> There are storage devices (mostly RAID arrays) where optimum read size
>>> isn't a power of 2. Think for example a RAID-0 device composed from three
>>> disks. It will have max_pages something like 384 (512k * 3). Suppose we are
>>> on x86 and max_order is 9. Then previously (if we were lucky with
>>> alignment) we were alternating between order 7 and order 8 pages in the
>>> page cache and do optimally sized IOs of 1536k.
>>
>> Sorry, I'm struggling to follow some of this; perhaps my superficial
>> understanding of all the readahead subtleties is starting to show...
>>
>> How is the 384 figure provided? I'd guess that comes from bdi->io_pages, and
>> bdi->ra_pages would remain the usual 32 (128K)?
> 
> Sorry, I have been probably too brief in my previous message :)
> bdi->ra_pages is actually set based on the optimal IO size reported by the
> hardware (see blk_apply_bdi_limits() and how its callers are filling in
> lim->io_opt). The 128K you speak about is just a last-resort value used if
> the hardware doesn't provide one. And some storage devices do report an
> optimal IO size that is not a power of two.

Ahh, got it - thanks for the education!

> 
> Also note that bdi->ra_pages can be tuned in sysfs, and a lot of users
> actually do this (usually from their udev rules). We don't have to perform
> well when some odd value gets set, but you definitely cannot assume
> bdi->ra_pages is 128K :).
> 
>> In which case, for mmap, won't
>> we continue to be limited by ra_pages and will never get beyond order-5?
>> (for mmap req_size is always set to ra_pages IIRC, so ractl_max_pages()
>> always just returns ra_pages). Or perhaps ra_pages is set to 384 somewhere,
>> but I'm not spotting it in the code...
>>
>> I guess you are also implicitly teaching me something about how the block
>> layer works here too... if there are 2 read requests for an order-7 and
>> order-8, then the block layer will merge those to a single read (up to the
>> 384 optimal size?)
> 
> Correct. In fact the readahead code will already perform this merging when
> submitting the IO.
> 
>> but if there are 2 reads of order-8 then it won't merge because it would be
>> bigger than the optimal size and it won't split the second one at the
>> optimal size either? Have I inferred that correctly?
> 
> With the code as you modify it, you would round down ra->size from 384 to
> 256 and submit only one 1MB-sized IO (with one order-8 page). And this will
> cause a regression in read throughput for such devices, because they now
> don't get a buffer large enough to run at full speed.

Ahha, yes, thanks - now it's clicking.

> 
>>> Now you will allocate all
>>> folios of order 8 (nice) but reads will be just 1024k and you'll see
>>> noticeable drop in read throughput (not nice). Note that this is not just a
>>> theoretical example but a real case we have hit when doing performance
>>> testing of servers and for which I was tweaking readahead code in the past.
>>>
>>> So I think we need to tweak this logic a bit. Perhaps we should round_down
>>> end to the minimum alignment dictated by 'order' and max_pages? Like:
>>>
>>> 	1 << min(order, ffs(max_pages) + PAGE_SHIFT - 1)
>>
>> Sorry I'm staring at this and struggling to understand the "PAGE_SHIFT -
>> 1" part?
> 
> My bad.
> It should have been:
> 
> 	1 << min(order, ffs(max_pages) - 1)
> 
>> I think what you are suggesting is that the patch becomes something like
>> this:
>>
>> ---8<---
>> +	end = ra->start + ra->size;
>> +	aligned_end = round_down(end, 1UL << min(order, ilog2(max_pages)));
> 
> Not quite. ilog2() returns the most significant bit set, but we really want
> to align to the least significant bit set. So when max_pages is 384, we
> want to align to at most order-7 (aligning the end more does not make sense
> when you want to do IO 384 pages large). That's why I'm using ffs() and not
> ilog2().

Yep, got it now.

> 
>> +	if (aligned_end > ra->start)
>> +		ra->size -= end - aligned_end;
>> +	ra->async_size = ra->size;
>> ---8<---
>>
>> So if max_pages=384, then aligned_end will be aligned down to a maximum
>> of the previous 1MB boundary?
> 
> No, it needs to be aligned only to the previous 512K boundary, because we
> want to do IOs 3*512K large.
> 
> Hope things are a bit clearer now :)

Yes, much!

Thanks,
Ryan

> 
> 								Honza
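Putting the thread's conclusion together: with the corrected `1 << min(order, ffs(max_pages) - 1)` alignment, the 384-page window survives intact while power-of-two windows still get folio-aligned ends. A hedged Python model of the combined logic (again illustrative, not the kernel code):

```python
def ffs(x):
    # Kernel-style ffs(): 1-based index of the least significant set bit.
    return (x & -x).bit_length()

def round_down(x, align):
    # Mirror of the kernel round_down().
    return x & ~(align - 1)

def trim(ra_start, ra_size, order, max_pages):
    # Align the window end to min(folio order, lowest set bit of max_pages),
    # per the corrected suggestion: 1 << min(order, ffs(max_pages) - 1).
    align_order = min(order, ffs(max_pages) - 1)
    end = ra_start + ra_size
    aligned_end = round_down(end, 1 << align_order)
    if aligned_end > ra_start:
        ra_size -= end - aligned_end
    return ra_size

# RAID-0 case: max_pages=384, order-8 folios. Alignment caps at order-7,
# so a 384-page window keeps its full 1536K size.
print(trim(0, 384, 8, 384))    # 384

# Unaligned sync-readahead case: window of 32 pages at page 17 with
# order-4 folios and a power-of-two max_pages still trims to 31.
print(trim(17, 32, 4, 1024))   # 31
```

Under this scheme the 1536K IO from Jan's example is preserved, while the original patch's oscillation fix still applies whenever the alignment is not capped.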