From: Barry Song <v-songbaohua@oppo.com>

Within try_to_unmap_one(), page_vma_mapped_walk() races with other
PTE modifications that clear the PTE before writing a new value
(break-before-make). While iterating over the PTEs of a large folio,
it only starts acquiring PTL from the first valid (present) PTE, and
such modifications can temporarily set PTEs to pte_none. Consequently,
the initial PTEs of a large folio might be skipped in
try_to_unmap_one().
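To illustrate, here is a condensed sketch of the walk's locking
behaviour (hypothetical pseudo-kernel code, not the literal
mm/page_vma_mapped.c logic):

	spinlock_t *ptl = NULL;

	while (address < end) {
		if (!ptl && !pte_present(ptep_get(pte))) {
			/* transiently pte_none: skipped without locking */
			pte++;
			address += PAGE_SIZE;
			continue;
		}
		if (!ptl) {
			/* PTL is taken only from the first present PTE */
			ptl = pte_lockptr(mm, pmd);
			spin_lock(ptl);
		}
		/* ... unmap this PTE ... */
		pte++;
		address += PAGE_SIZE;
	}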
For example, for an anon folio, if PTE0 is skipped, we may end up
with PTE0 still present while PTE1 ~ PTE(nr_pages - 1) have become
swap entries after try_to_unmap_one().
The folio thus remains mapped, fails to be reclaimed, and is put
back on the LRU in this round.
This also breaks up PTE optimizations such as CONT-PTE on the large
folio and may lead to an accidental folio_split() afterwards. And
since part of the PTEs are now swap entries, accessing those parts
will incur the overhead of do_swap_page.
Although the kernel can withstand all of the above issues, the
situation still seems awkward and worth improving.
The same race also occurs with small folios, but since they have
only one PTE, they cannot be partially unmapped.
24 | skips one or two PTEs because intermediate break-before-makes | 24 | This patch holds PTL from PTE0, allowing us to avoid reading PTE |
25 | are short, according to test. Of course, even w/o this patch, | 25 | values that are in the process of being transformed. With stable |
26 | the vast majority of try_to_unmap_one still can get PTL from | 26 | PTE values, we can ensure that this large folio is either |
27 | PTE0. This patch makes the number 100%. | 27 | completely reclaimed or that all PTEs remain untouched in this |
28 | The other option is that we can give up in try_to_unmap_one | 28 | round. |
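For reference, the PVMW_SYNC path that TTU_SYNC selects takes PTL
before reading the first PTE, roughly like this (simplified from
map_pte() in mm/page_vma_mapped.c; details may vary across kernel
versions):

	if (pvmw->flags & PVMW_SYNC) {
		/* Use the stricter lookup: lock before reading PTE0 */
		pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
						pvmw->address, &pvmw->ptl);
		return !!pvmw->pte;
	}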
A corner case is that if we hold PTL from PTE0 while most of the
initial PTEs have already been unmapped, we may increase the duration
for which PTL is held. Thus we only apply this optimization to folios
which are still entirely mapped (i.e. not on the deferred_split
list).

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
v2:
 * Refine commit message and code comment after reading all comments
   from Ryan and David, thanks!
 * Avoid increasing the duration of PTL by applying the optimization
   only to folios not on the deferred_split list, with respect to
   Ying's comment, thanks!

 mm/vmscan.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index XXXXXXX..XXXXXXX 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -XXX,XX +XXX,XX @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 		if (folio_test_pmd_mappable(folio))
 			flags |= TTU_SPLIT_HUGE_PMD;
+		/*
+		 * Without TTU_SYNC, try_to_unmap will only begin to hold PTL
+		 * from the first present PTE within a large folio. Some initial
+		 * PTEs might be skipped due to races with parallel PTE writes
+		 * in which PTEs can be cleared temporarily before new present
+		 * values are written. This can leave a large folio still mapped
+		 * while some subpages have already been unmapped after
+		 * try_to_unmap; TTU_SYNC makes try_to_unmap acquire PTL from
+		 * PTE0, eliminating the influence of temporary PTE values.
+		 */
+		if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
+			flags |= TTU_SYNC;
 
 		try_to_unmap(folio, flags);
 		if (folio_mapped(folio)) {
--
2.34.1