From: Barry Song <v-songbaohua@oppo.com>

Within try_to_unmap_one(), page_vma_mapped_walk() races with other
PTE modifications preceded by pte clear. While iterating over PTEs
of a large folio, it only starts acquiring PTL from the first valid
(present) PTE. PTE modifications can temporarily set PTEs to
pte_none. Consequently, the initial PTEs of a large folio might
be skipped in try_to_unmap_one().
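
The race can be sketched as the following interleaving (an
illustrative pseudocode timeline; the concurrent writer could be
any path that clears a PTE before writing a new present value):

  CPU 0 (try_to_unmap_one)             CPU 1 (parallel PTE write)
  ------------------------             --------------------------
  page_vma_mapped_walk():
    read PTE0                          clear PTE0 (now pte_none)
    PTE0 is none -> skip, no PTL yet
    read PTE1 (present) -> take PTL    write new present value to PTE0
    unmap PTE1 ~ PTE(nr_pages - 1)

  Result: PTE0 is still present, so the folio stays mapped even
  though the rest of its PTEs have been unmapped.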

For example, for an anon folio, if we skip PTE0, we may end up with
PTE0 still present while PTE1 ~ PTE(nr_pages - 1) are swap entries
after try_to_unmap_one().

So the folio remains mapped; it fails to be reclaimed and is put
back on the LRU in this round.

This also breaks up PTE optimizations such as CONT-PTE on this
large folio and may lead to an unintended folio_split() afterwards.
And since some of the PTEs are now swap entries, accessing those
parts will incur extra overhead - do_swap_page.

Although the kernel can withstand all of the above issues, the
situation still seems awkward and is worth improving.

The same race also occurs with small folios, but since they have
only one PTE, they cannot be partially unmapped.

This patch holds PTL from PTE0, allowing us to avoid reading PTE
values that are in the process of being modified. With stable PTE
values, we can ensure that this large folio is either completely
reclaimed or that all of its PTEs remain untouched in this round.

A corner case: if we hold PTL from PTE0 while most of the initial
PTEs have already been genuinely unmapped, we may lengthen the time
PTL is held. Thus we only apply this optimization to folios which
are still entirely mapped (not on the deferred_split list).

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
v2:
 * Refine commit message and code comment after reading all comments
   from Ryan and David, thanks!
 * Avoid increasing the duration of PTL by applying the optimization
   only to folios not on the deferred_split list, per Ying's comment,
   thanks!

 mm/vmscan.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index XXXXXXX..XXXXXXX 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -XXX,XX +XXX,XX @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
             if (folio_test_pmd_mappable(folio))
                 flags |= TTU_SPLIT_HUGE_PMD;
+            /*
+             * Without TTU_SYNC, try_to_unmap only begins to hold PTL from
+             * the first present PTE within a large folio. Some initial
+             * PTEs might be skipped due to races with parallel PTE writes
+             * in which PTEs can be cleared temporarily before new present
+             * values are written. This can leave a large folio still mapped
+             * while some of its subpages have been unmapped after
+             * try_to_unmap; TTU_SYNC makes try_to_unmap acquire PTL from
+             * the first PTE, eliminating the influence of temporary PTE values.
+             */
+            if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
+                flags |= TTU_SYNC;
 
             try_to_unmap(folio, flags);
             if (folio_mapped(folio)) {
-- 
2.34.1