Hi all,

Matthew and I have discussed a different way of splitting large folios.
Instead of splitting one folio uniformly into smaller folios of the same
order, a buddy-allocator-like split can reduce the total number of
resulting folios, reduce the amount of memory needed for the multi-index
xarray split, and keep more large folios after a split. In addition, both
Hugh[1] and Ryan[2] made similar suggestions before.

The patch is an initial implementation. It passes simple order-9 to
lower-order split tests for anonymous folios and pagecache folios. There
are still a lot of TODOs before it can go upstream, but I would like to
gather feedback first.

Design
===

folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9
folio as:

O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache

since anon folios do not support order-1 yet.

The split process is similar to the existing approach:

1. Unmap all page mappings (split PMD mappings if they exist);
2. Split metadata like memcg, page owner, and page alloc tag;
3. Copy struct folio metadata to the subpages, but instead of splitting
   the whole folio into multiple smaller ones of the same order in one
   shot, this approach splits the folio iteratively. Taking the example
   above, this approach first splits the original order-9 into two
   order-8s, then splits the left order-8 into two order-7s, and so on;
4. Post-process the split folios, e.g. write mapping->i_pages for
   pagecache, adjust folio refcounts, add split folios to the
   corresponding lists;
5. Remap the split folios;
6. Unlock the split folios.

TODOs
===

1. For anon folios, the code needs to check the enabled mTHP orders and
   only split a large folio to enabled orders. But this might be up for
   debate, since if no mTHP order is enabled, the folio will be split to
   order-0 ones.
2. Use xas_nomem() instead of xas_split_alloc() as Matthew suggested. The
   issue I am having is that when I use the xas_set_order(), xas_store(),
   xas_nomem(), xas_split() pattern, xas->xa_alloc is NULL instead of some
   allocated memory. I must be getting something wrong and need some help
   from Matthew on it.
3. Use folio_split() in pagecache truncate and do more testing.
4. Add shmem support if it is needed.
5. Currently, the inputs of folio_split() are the original folio, the new
   order, and a page pointer that tells where to split to the new order.
   For truncate, better inputs might be two page pointers marking the
   start and end of the split, with folio_split() figuring out the new
   orders.

Any comments and/or suggestions are welcome. Thanks.

[1] https://lore.kernel.org/linux-mm/9dd96da-efa2-5123-20d4-4992136ef3ad@google.com/
[2] https://lore.kernel.org/linux-mm/cbb1d6a0-66dd-47d0-8733-f836fe050374@arm.com/

Zi Yan (1):
  mm/huge_memory: buddy allocator like folio_split()

 mm/huge_memory.c | 648 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 647 insertions(+), 1 deletion(-)

--
2.45.2
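[Editorial illustration: the walk described in the Design section above can
be sketched in a few lines of userspace C. This is only an illustration of
the resulting-order lists for the order-9 / 3rd-subpage example, not kernel
code and not part of the patch; the function names emit() and buddy_split()
exist only in this sketch, and the "anon" flag merely models the "no
order-1 anon folios" restriction.]

/*
 * Minimal userspace sketch (illustration only, not kernel code): walk the
 * buddy-allocator-like split for an order-9 folio when isolating its 3rd
 * subpage (index 2) down to order-0. With "anon" set, every kept order-1
 * buddy is emitted as two order-0s, since anon folios cannot be order-1.
 */
#include <stdbool.h>
#include <stdio.h>

static void emit(unsigned long start, int order)
{
	printf("  pages [%3lu..%3lu]  O-%d\n",
	       start, start + (1UL << order) - 1, order);
}

static void buddy_split(unsigned long start, int order, unsigned long split_at,
			int new_order, bool anon)
{
	while (order > new_order) {
		unsigned long mid = start + (1UL << (order - 1));
		int keep = order - 1;		/* order of the kept buddy */
		unsigned long keep_start = split_at < mid ? mid : start;

		/* emit the buddy that does not contain split_at ... */
		if (anon && keep == 1) {
			emit(keep_start, 0);
			emit(keep_start + 1, 0);
		} else {
			emit(keep_start, keep);
		}
		/* ... and keep splitting the half that does contain it */
		if (split_at >= mid)
			start = mid;
		order--;
	}
	emit(start, new_order);		/* the piece containing split_at */
}

int main(void)
{
	printf("pagecache:\n");
	buddy_split(0, 9, 2, 0, false);
	printf("anon:\n");
	buddy_split(0, 9, 2, 0, true);
	return 0;
}

[For the order-9 / index-2 example this prints exactly the pieces listed
in the cover letter, from the largest kept buddy (O-8) down to the order-0
page being isolated.]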
On 09.10.24 00:37, Zi Yan wrote:
> Hi all,

Hi!

>
> Matthew and I have discussed a different way of splitting large folios.
> Instead of splitting one folio uniformly into smaller folios of the same
> order, a buddy-allocator-like split can reduce the total number of
> resulting folios, reduce the amount of memory needed for the multi-index
> xarray split, and keep more large folios after a split. In addition,
> both Hugh[1] and Ryan[2] made similar suggestions before.
>
> The patch is an initial implementation. It passes simple order-9 to
> lower-order split tests for anonymous folios and pagecache folios. There
> are still a lot of TODOs before it can go upstream, but I would like to
> gather feedback first.

Interesting, but I don't see any actual users besides the debug/test
interface wired up.

I assume ftruncate() / fallocate(PUNCH_HOLE) might be good use cases? For
example, when punching 1M of a 2M folio, we can just leave a 1M folio in
the pagecache.

Any other obvious users you have in mind?

--
Cheers,

David / dhildenb
On 18 Oct 2024, at 14:42, David Hildenbrand wrote:

> On 09.10.24 00:37, Zi Yan wrote:
>> Hi all,
>
> Hi!
>
>> Matthew and I have discussed a different way of splitting large folios.
>> Instead of splitting one folio uniformly into smaller folios of the
>> same order, a buddy-allocator-like split can reduce the total number of
>> resulting folios, reduce the amount of memory needed for the
>> multi-index xarray split, and keep more large folios after a split. In
>> addition, both Hugh[1] and Ryan[2] made similar suggestions before.
>>
>> The patch is an initial implementation. It passes simple order-9 to
>> lower-order split tests for anonymous folios and pagecache folios.
>> There are still a lot of TODOs before it can go upstream, but I would
>> like to gather feedback first.
>
> Interesting, but I don't see any actual users besides the debug/test
> interface wired up.

Right. I am working on it now, since two potential users, anon large
folios and truncate, might need a more sophisticated implementation to
fully take advantage of this new split.

For anon large folios, this might be open to debate: if only a subset of
orders is enabled, I assume folio_split() can only split to smaller folios
of the enabled orders. For example, to get one order-0 out of an order-9
when only order-4 (64KB on x86) is enabled, folio_split() can only split
the order-9 into 16 order-0s and 31 order-4s, unless we are OK with anon
large folios of non-enabled orders appearing in the system.

For truncate, the example you give below is an easy one. For cases like
punching the 3rd to 5th order-0 pages of an order-3 folio,
[O0, O0, __, __, __, O0, O0, O0], I am thinking about which approach is
better:

1. two folio_split()s: 1) split the second order-1 off the order-3,
   2) split an order-0 off the second order-2;

2. one folio_split(), by making folio_split() support arbitrary range
   splits, so the two steps in 1 can be done in one shot, which saves
   unmapping and remapping cost.

Maybe I should go for 1 first as the easy route, but I still need an
algorithm in truncate to figure out how to call folio_split()s.

> I assume ftruncate() / fallocate(PUNCH_HOLE) might be good use cases?
> For example, when punching 1M of a 2M folio, we can just leave a 1M
> folio in the pagecache.

Yes, I am trying to make this work.

> Any other obvious users you have in mind?

Presumably, folio_split() should replace all split_huge*() to reduce the
total number of folios after a split. But for swapcache folios, I need to
figure out whether the swap system works well with buddy allocator like
splits.

--
Best Regards,
Yan, Zi
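[Editorial illustration: the end state such a truncate algorithm has to
produce can be sketched in userspace C. This is an assumption for
illustration only, not part of the patch and not kernel code; the helper
names block_order() and decompose() exist only in this sketch. For the
[O0, O0, __, __, __, O0, O0, O0] example, decomposing the kept and punched
ranges into maximal buddy-aligned blocks gives the layout that either two
folio_split() calls (approach 1) or one range-capable folio_split()
(approach 2) would have to end up with.]

/*
 * Illustrative userspace sketch only: decompose an order-3 folio into
 * buddy-aligned pieces when punching pages [2..4], i.e. the target layout
 * for the split(s) discussed above.
 */
#include <stdio.h>

/* Order of the largest buddy-aligned block starting at pos that fits in len pages. */
static int block_order(unsigned long pos, unsigned long len)
{
	int o = 0;

	while (((pos >> (o + 1)) << (o + 1)) == pos && (2UL << o) <= len)
		o++;
	return o;
}

static void decompose(unsigned long start, unsigned long len, const char *tag)
{
	while (len) {
		int o = block_order(start, len);

		printf("%s pages [%lu..%lu]  O-%d\n",
		       tag, start, start + (1UL << o) - 1, o);
		start += 1UL << o;
		len -= 1UL << o;
	}
}

int main(void)
{
	unsigned long nr = 1UL << 3;	/* order-3 folio: pages 0..7 */
	unsigned long lo = 2, hi = 4;	/* punch pages 2..4 */

	decompose(0, lo, "keep");
	decompose(lo, hi - lo + 1, "free");
	decompose(hi + 1, nr - hi - 1, "keep");
	return 0;
}

[This prints keep [0..1] O-1, free [2..3] O-1, free [4] O-0, keep [5] O-0,
keep [6..7] O-1, which matches approach 1: first split the [2..3] order-1
off the order-3, then split page 4 as an order-0 off the remaining
order-2.]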
On Fri, Oct 18, 2024 at 12:11 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 18 Oct 2024, at 14:42, David Hildenbrand wrote:
>
> > On 09.10.24 00:37, Zi Yan wrote:
> >> Hi all,
> >
> > Hi!
> >
> >> Matthew and I have discussed a different way of splitting large
> >> folios. Instead of splitting one folio uniformly into smaller folios
> >> of the same order, a buddy-allocator-like split can reduce the total
> >> number of resulting folios, reduce the amount of memory needed for
> >> the multi-index xarray split, and keep more large folios after a
> >> split. In addition, both Hugh[1] and Ryan[2] made similar suggestions
> >> before.
> >>
> >> The patch is an initial implementation. It passes simple order-9 to
> >> lower-order split tests for anonymous folios and pagecache folios.
> >> There are still a lot of TODOs before it can go upstream, but I would
> >> like to gather feedback first.
> >
> > Interesting, but I don't see any actual users besides the debug/test
> > interface wired up.
>
> Right. I am working on it now, since two potential users, anon large
> folios and truncate, might need a more sophisticated implementation to
> fully take advantage of this new split.
>
> For anon large folios, this might be open to debate: if only a subset of
> orders is enabled, I assume folio_split() can only split to smaller
> folios of the enabled orders. For example, to get one order-0 out of an
> order-9 when only order-4 (64KB on x86) is enabled, folio_split() can
> only split the order-9 into 16 order-0s and 31 order-4s, unless we are
> OK with anon large folios of non-enabled orders appearing in the system.

For anon large folios, deferred split may be a problem too. Deferred split
is typically used to free the subpages unmapped by, for example,
MADV_DONTNEED. But we don't know which subpages are unmapped without
reading their _mapcount by iterating over every subpage.

>
> For truncate, the example you give below is an easy one. For cases like
> punching the 3rd to 5th order-0 pages of an order-3 folio,
> [O0, O0, __, __, __, O0, O0, O0], I am thinking about which approach is
> better:
>
> 1. two folio_split()s: 1) split the second order-1 off the order-3,
>    2) split an order-0 off the second order-2;
>
> 2. one folio_split(), by making folio_split() support arbitrary range
>    splits, so the two steps in 1 can be done in one shot, which saves
>    unmapping and remapping cost.
>
> Maybe I should go for 1 first as the easy route, but I still need an
> algorithm in truncate to figure out how to call folio_split()s.
>
> > I assume ftruncate() / fallocate(PUNCH_HOLE) might be good use cases?
> > For example, when punching 1M of a 2M folio, we can just leave a 1M
> > folio in the pagecache.
>
> Yes, I am trying to make this work.
>
> > Any other obvious users you have in mind?
>
> Presumably, folio_split() should replace all split_huge*() to reduce the
> total number of folios after a split. But for swapcache folios, I need
> to figure out whether the swap system works well with buddy allocator
> like splits.
>
> Best Regards,
> Yan, Zi
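[Editorial illustration: a hedged, kernel-style sketch of the scan Yang
Shi describes, written for illustration only and not a tested patch. The
helper name count_unmapped_subpages() is hypothetical; folio_nr_pages(),
folio_page(), atomic_read() and the page _mapcount field are existing
kernel interfaces. A buddy-like deferred split would additionally have to
group the unmapped subpages into aligned ranges.]

#include <linux/mm.h>

/*
 * Sketch only: finding which subpages of a partially-unmapped anon folio
 * are no longer mapped requires reading every subpage's _mapcount.
 */
static long count_unmapped_subpages(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);
	long unmapped = 0;

	for (i = 0; i < nr; i++) {
		struct page *page = folio_page(folio, i);

		/* _mapcount is stored off by one; -1 means "not mapped" */
		if (atomic_read(&page->_mapcount) + 1 == 0)
			unmapped++;
	}
	return unmapped;
}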
On 18.10.24 21:44, Yang Shi wrote:
> On Fri, Oct 18, 2024 at 12:11 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 18 Oct 2024, at 14:42, David Hildenbrand wrote:
>>
>>> On 09.10.24 00:37, Zi Yan wrote:
>>>> Hi all,
>>>
>>> Hi!
>>>
>>>> Matthew and I have discussed a different way of splitting large
>>>> folios. Instead of splitting one folio uniformly into smaller folios
>>>> of the same order, a buddy-allocator-like split can reduce the total
>>>> number of resulting folios, reduce the amount of memory needed for
>>>> the multi-index xarray split, and keep more large folios after a
>>>> split. In addition, both Hugh[1] and Ryan[2] made similar suggestions
>>>> before.
>>>>
>>>> The patch is an initial implementation. It passes simple order-9 to
>>>> lower-order split tests for anonymous folios and pagecache folios.
>>>> There are still a lot of TODOs before it can go upstream, but I would
>>>> like to gather feedback first.
>>>
>>> Interesting, but I don't see any actual users besides the debug/test
>>> interface wired up.
>>
>> Right. I am working on it now, since two potential users, anon large
>> folios and truncate, might need a more sophisticated implementation to
>> fully take advantage of this new split.
>>
>> For anon large folios, this might be open to debate: if only a subset
>> of orders is enabled, I assume folio_split() can only split to smaller
>> folios of the enabled orders. For example, to get one order-0 out of an
>> order-9 when only order-4 (64KB on x86) is enabled, folio_split() can
>> only split the order-9 into 16 order-0s and 31 order-4s, unless we are
>> OK with anon large folios of non-enabled orders appearing in the
>> system.
>
> For anon large folios, deferred split may be a problem too. Deferred
> split is typically used to free the subpages unmapped by, for example,
> MADV_DONTNEED. But we don't know which subpages are unmapped without
> reading their _mapcount by iterating over every subpage.

Yeah, and I am still hoping we can get rid of the _mapcounts.

If you know a folio is exclusive, at least synchronously during
MADV_DONTNEED you know what you can try to split. Deferred would require
an rmap walk -- if it's really worth it.

--
Cheers,

David / dhildenb
On Tue, Oct 08, 2024 at 06:37:47PM -0400, Zi Yan wrote:
> mm/huge_memory.c | 648 ++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 647 insertions(+), 1 deletion(-)

The idea is sane, but I think it would require a lot more groundwork
before getting it upstream. I don't think we can afford two parallel split
implementations. folio_split() and split_huge_page*() should share the
same implementation internally. Otherwise it is going to be a pain to keep
them in sync.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 9 Oct 2024, at 5:54, Kirill A. Shutemov wrote:

> On Tue, Oct 08, 2024 at 06:37:47PM -0400, Zi Yan wrote:
>> mm/huge_memory.c | 648 ++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 647 insertions(+), 1 deletion(-)
>
> The idea is sane, but I think it would require a lot more groundwork
> before getting it upstream. I don't think we can afford two parallel
> split implementations. folio_split() and split_huge_page*() should share
> the same implementation internally. Otherwise it is going to be a pain
> to keep them in sync.

The goal is to replace split_huge_page*() with folio_split(). But for now,
split_huge_page*() is still needed for swapcache anon folios and shmem,
and that might remain the case for quite a while, until we have a better
swap system.

I think it is possible to use the same internal implementation for both
folio_split() and split_huge_page*(). I can give it a try in the next
version.

--
Best Regards,
Yan, Zi