The upcoming changes in compound_head() require memmap to be naturally
aligned to the maximum folio size.

Add a warning if it is not.

A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
kernel is still likely to be functional if this strict check fails.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 include/linux/mmzone.h | 1 +
 mm/sparse.c            | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6cfede39570a..9f44dc760cdc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -91,6 +91,7 @@
 #endif
 
 #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
+#define MAX_FOLIO_SIZE (PAGE_SIZE << MAX_FOLIO_ORDER)
 
 enum migratetype {
 	MIGRATE_UNMOVABLE,
diff --git a/mm/sparse.c b/mm/sparse.c
index 17c50a6415c2..c5810ff7c6f7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -600,6 +600,9 @@ void __init sparse_init(void)
 	BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
 	memblocks_present();
 
+	WARN_ON(!IS_ALIGNED((unsigned long)pfn_to_page(0),
+			    MAX_FOLIO_SIZE / sizeof(struct page)));
+
 	pnum_begin = first_present_section_nr();
 	nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
--
2.51.2

On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> The upcoming changes in compound_head() require memmap to be naturally
> aligned to the maximum folio size.
>
> Add a warning if it is not.
>
> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
> kernel is still likely to be functional if this strict check fails.
Different architectures default to 2 MB alignment (mainly to
enable huge mappings), which only accommodates folios up to
128 MB. Yet 1 GB huge pages are still fairly common, so
validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
miss the most frequent case.
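
The figures in the paragraph above can be reproduced with a few lines of arithmetic. This is only a sketch under assumed values that the thread does not state explicitly: 4 KiB base pages and a 64-byte struct page. The EX_* constants and both helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only (not taken from the thread). */
#define EX_PAGE_SIZE        UINT64_C(4096) /* 4 KiB base page */
#define EX_STRUCT_PAGE_SIZE UINT64_C(64)   /* assumed sizeof(struct page) */

/*
 * Largest folio size (in bytes) whose struct pages fit into one
 * naturally aligned memmap block of memmap_align bytes.
 */
static uint64_t max_covered_folio_bytes(uint64_t memmap_align)
{
	uint64_t nr_pages = memmap_align / EX_STRUCT_PAGE_SIZE;

	return nr_pages * EX_PAGE_SIZE;
}

/*
 * memmap alignment (in bytes) needed so that the struct pages of a
 * folio of folio_bytes are naturally aligned as a block.
 */
static uint64_t required_memmap_align(uint64_t folio_bytes)
{
	uint64_t nr_pages = folio_bytes / EX_PAGE_SIZE;

	return nr_pages * EX_STRUCT_PAGE_SIZE;
}

/* Hypothetical self-check of the numbers discussed in the thread. */
static void example_self_check(void)
{
	/* 2 MiB of memmap describes 32768 pages, i.e. a 128 MiB folio. */
	assert(max_covered_folio_bytes(UINT64_C(2) << 20) == UINT64_C(128) << 20);
	/* A 1 GiB folio needs its memmap aligned to 16 MiB. */
	assert(required_memmap_align(UINT64_C(1) << 30) == UINT64_C(16) << 20);
	/* A 16 GiB folio needs its memmap aligned to 256 MiB. */
	assert(required_memmap_align(UINT64_C(16) << 30) == UINT64_C(256) << 20);
}
```

With these assumed sizes, MAX_FOLIO_SIZE / sizeof(struct page) and MAX_FOLIO_NR_PAGES * sizeof(struct page) both evaluate to 256 MiB; other page sizes would need to be checked separately.
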
I’m concerned that this might plant a hidden time bomb: it
could detonate at any moment in later code, silently triggering
memory corruption or similar failures. Therefore, I don’t
think a WARNING is a good choice.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
> include/linux/mmzone.h | 1 +
> mm/sparse.c | 3 +++
> 2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6cfede39570a..9f44dc760cdc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -91,6 +91,7 @@
> #endif
>
> #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
> +#define MAX_FOLIO_SIZE (PAGE_SIZE << MAX_FOLIO_ORDER)
>
> enum migratetype {
> MIGRATE_UNMOVABLE,
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 17c50a6415c2..c5810ff7c6f7 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -600,6 +600,9 @@ void __init sparse_init(void)
> BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section)));
> memblocks_present();
>
> + WARN_ON(!IS_ALIGNED((unsigned long)pfn_to_page(0),
> + MAX_FOLIO_SIZE / sizeof(struct page)));
> +
> pnum_begin = first_present_section_nr();
> nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
>
On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
> > The upcoming changes in compound_head() require memmap to be naturally
> > aligned to the maximum folio size.
> >
> > Add a warning if it is not.
> >
> > A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
> > kernel is still likely to be functional if this strict check fails.
>
> Different architectures default to 2 MB alignment (mainly to
> enable huge mappings), which only accommodates folios up to
> 128 MB. Yet 1 GB huge pages are still fairly common, so
> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
> miss the most frequent case.

I don't follow. A 16 GB check is stricter than anything smaller.
How can it miss the most frequent case?

> I’m concerned that this might plant a hidden time bomb: it
> could detonate at any moment in later code, silently triggering
> memory corruption or similar failures. Therefore, I don’t
> think a WARNING is a good choice.

We can upgrade it to BUG_ON(), but I want to understand your logic here
first.

--
Kiryl Shutsemau / Kirill A. Shutemov

On 12/22/25 15:02, Kiryl Shutsemau wrote:
> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>> I’m concerned that this might plant a hidden time bomb: it
>> could detonate at any moment in later code, silently triggering
>> memory corruption or similar failures. Therefore, I don’t
>> think a WARNING is a good choice.
>
> We can upgrade it to BUG_ON(), but I want to understand your logic here
> first.

Definitely no BUG_ON(). I would assume this is something we would find
early during testing, so even a VM_WARN_ON_ONCE() should be good enough?

This smells like a possible problem, though, as soon as some architecture
wants to increase the folio size. What would be the expected step to
ensure the alignment is done properly?

But OTOH, as I raised, Willy's work will make all of that here obsolete
either way, so maybe not worth worrying about that case too much.

--
Cheers

David

On Mon, Dec 22, 2025 at 03:18:29PM +0100, David Hildenbrand (Red Hat) wrote:
> Definitely no BUG_ON(). I would assume this is something we would find early
> during testing, so even a VM_WARN_ON_ONCE() should be good enough?
>
> This smells like a possible problem, though, as soon as some architecture
> wants to increase the folio size. What would be the expected step to ensure
> the alignment is done properly?

It depends on the memory model and whether the arch has KASLR for memmap.

> But OTOH, as I raised, Willy's work will make all of that here obsolete
> either way, so maybe not worth worrying about that case too much.

Willy, what is the timeline here?

--
Kiryl Shutsemau / Kirill A. Shutemov

> On Dec 22, 2025, at 22:52, Kiryl Shutsemau <kas@kernel.org> wrote:
>
> On Mon, Dec 22, 2025 at 03:18:29PM +0100, David Hildenbrand (Red Hat) wrote:
>> This smells like a possible problem, though, as soon as some architecture
>> wants to increase the folio size. What would be the expected step to ensure
>> the alignment is done properly?
>
> It depends on memory model and whether the arch has KASLR for memmap.

Yes. Theoretically, the most correct approach is to ensure that the
randomly chosen offset at the KASLR relocation site meets the alignment
requirements. That likely needs to be adapted for each architecture,
which sounds rather tedious.

>> But OTOH, as I raised, Willy's work will make all of that here obsolete
>> either way, so maybe not worth worrying about that case too much.
>
> Willy, what is timeline here?
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov

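
The idea of constraining the randomized offset can be sketched in plain C. This is a hypothetical illustration and not how any architecture implements KASLR for memmap: pick_aligned_base() and its parameters are invented for this example, the random value is supplied by the caller, and the sketch ignores the size of the memmap itself.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of align (align: power of two). */
static uint64_t align_up(uint64_t value, uint64_t align)
{
	return (value + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: choose a base address in [start, end) that is a
 * multiple of align, using a caller-supplied random number.
 * Returns 0 if the range contains no suitably aligned address.
 */
static uint64_t pick_aligned_base(uint64_t start, uint64_t end,
				  uint64_t random, uint64_t align)
{
	uint64_t first = align_up(start, align);
	uint64_t slots;

	if (first >= end)
		return 0;

	/* Number of aligned addresses in [first, end). */
	slots = (end - first + align - 1) / align;

	/* Randomize among aligned slots instead of among raw offsets. */
	return first + (random % slots) * align;
}

/* Hypothetical self-check: every result stays aligned and in range. */
static int aligned_for_all_randoms(void)
{
	const uint64_t align = UINT64_C(256) << 20; /* assumed 256 MiB */
	const uint64_t start = 0x1000, end = UINT64_C(1) << 30;

	for (uint64_t r = 0; r < 64; r++) {
		uint64_t base = pick_aligned_base(start, end, r, align);

		if (base % align != 0 || base < start || base >= end)
			return 0;
	}
	return 1;
}
```

Randomizing over aligned slots reduces the available entropy compared with a finer granularity, which is one reason such a change would need per-architecture review.
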
> On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>
> But OTOH, as I raised, Willy's work will make all of that here obsolete
> either way, so maybe not worth worrying about that case too much.

Hi David,

I hope you're doing well. I must admit I have limited knowledge of Willy's
work, and I was wondering if you might be kind enough to share any publicly
available links where I could learn more about the future direction of this
project. I would be truly grateful for your guidance.

Thank you very much in advance.

Best regards,

> --
> Cheers
>
> David

On 12/22/25 15:55, Muchun Song wrote:
>> On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>>
>> But OTOH, as I raised, Willy's work will make all of that here obsolete
>> either way, so maybe not worth worrying about that case too much.
>
> Hi David,

Hi! :)

> I hope you're doing well. I must admit I have limited knowledge of Willy's
> work, and I was wondering if you might be kind enough to share any publicly
> available links where I could learn more about the future direction of this
> project. I would be truly grateful for your guidance.
>
> Thank you very much in advance.

There is some information to be had at [1], but more at [2]. Take a look at
[2] in "After those projects are complete - Then we can shrink struct page
to 32 bytes:"

In essence, all pages (belonging to a memdesc) will have a "memdesc" pointer
(that replaces the compound_head pointer).

"Then we make page->compound_head point to the dynamically allocated memdesc
rather than the first page. Then we can transition to the above layout. "

The "memdesc" could be a pointer to a "struct folio" that is allocated from
the slab.

So in the new memdesc world, all pages that are part of a folio will point
at the allocated "struct folio", not the head page where "struct folio"
currently overlays "struct page".

That would mean that the proposal in this patch set will have to be reverted
again.

At LPC, Willy said that he wants to have something out there in the first
half of 2026.

[1] https://kernelnewbies.org/MatthewWilcox/Memdescs
[2] https://kernelnewbies.org/MatthewWilcox/Memdescs/Path

--
Cheers

David

On Tue, Dec 23, 2025 at 10:38:26AM +0100, David Hildenbrand (Red Hat) wrote:
> There is some information to be had at [1], but more at [2]. Take a look at
> [2] in "After those projects are complete - Then we can shrink struct page
> to 32 bytes:"
>
> In essence, all pages (belonging to a memdesc) will have a "memdesc" pointer
> (that replaces the compound_head pointer).
>
> "Then we make page->compound_head point to the dynamically allocated memdesc
> rather than the first page. Then we can transition to the above layout. "

I am not sure I understand how it is going to work.

32-byte layout indicates that flags will stay in the statically
allocated part, but most (all?) flags are in the head page and we would
need a way to redirect from tail to head in the statically allocated
pages.

> The "memdesc" could be a pointer to a "struct folio" that is allocated from
> the slab.
>
> So in the new memdesc world, all pages part of a folio will point at the
> allocated "struct folio", not the head page where "struct folio" currently
> overlays "struct page".
>
> That would mean that the proposal in this patch set will have to be reverted
> again.
>
> At LPC, Willy said that he wants to have something out there in the first
> half of 2026.

Okay, seems ambitious to me.

Last time I asked, we had no idea how much performance would additional
indirection cost us. Do we have a clue?

I like the memdesc idea, but the indirection cost has always bothered me.

--
Kiryl Shutsemau / Kirill A. Shutemov

>> "Then we make page->compound_head point to the dynamically allocated memdesc
>> rather than the first page. Then we can transition to the above layout. "
>
Sorry for the late reply, it's been a bit crazy over here.
> I am not sure I understand how it is going to work.
>
I don't recall all the details that Willy shared over the last years
while working on folios, but I will try to answer as best as I can from
the top of my head. (there are plenty of resources on the list, on the
web, in his presentations etc.).
> 32-byte layout indicates that flags will stay in the statically
> allocated part, but most (all?) flags are in the head page and we would
> need a way to redirect from tail to head in the statically allocated
> pages.
When working with folios we will never go through the head page flags.
That's why Willy has incrementally converted most folio code that worked
on pages to work on folios.
For example, PageUptodate() does a
folio_test_uptodate(page_folio(page));
The flags in the 32-byte layout will be used by some non-folio things
for which we won't allocate memdescs (just yet) (e.g., free pages in the
buddy and other things that do not require a lot of metadata). Some of
these flags will be moved into the memdesc pointer in the future as the
conversion proceeds.
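
The redirection described above, where a flag test on any page of a folio is answered from the folio rather than from the page itself, can be modelled in a few lines. This is a toy userspace sketch and not kernel code: every toy_* name and the TOY_PG_UPTODATE bit are hypothetical, and the only thing borrowed from the current layout is the idea that a tail page stores the head page address with bit 0 set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PG_UPTODATE (1UL << 0) /* hypothetical flag bit */

struct toy_page {
	unsigned long flags;
	uintptr_t compound_head; /* tail pages: head address | 1 */
};

/* Return the head page for a tail page, or the page itself otherwise. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct toy_page *)(head - 1);
	return page;
}

/* In this toy layout the "folio" is simply the head page. */
static struct toy_page *toy_page_folio(struct toy_page *page)
{
	return toy_compound_head(page);
}

static bool toy_folio_test_uptodate(const struct toy_page *folio)
{
	return (folio->flags & TOY_PG_UPTODATE) != 0;
}

/* Page-level test that never reads the tail page's own flags. */
static bool toy_page_uptodate(struct toy_page *page)
{
	return toy_folio_test_uptodate(toy_page_folio(page));
}

/* Hypothetical demo: a tail page reports the flag set on its head. */
static bool toy_tail_sees_head_flag(void)
{
	struct toy_page pages[4] = { { 0, 0 } };

	for (int i = 1; i < 4; i++)
		pages[i].compound_head = (uintptr_t)&pages[0] | 1;
	pages[0].flags |= TOY_PG_UPTODATE;

	/* The tail's own flags word is untouched, yet the test succeeds. */
	return toy_page_uptodate(&pages[3]) &&
	       !(pages[3].flags & TOY_PG_UPTODATE);
}
```
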
>
>> The "memdesc" could be a pointer to a "struct folio" that is allocated from
>> the slab.
>>
>> So in the new memdesc world, all pages part of a folio will point at the
>> allocated "struct folio", not the head page where "struct folio" currently
>> overlays "struct page".
>>
>> That would mean that the proposal in this patch set will have to be reverted
>> again.
>>
>>
>> At LPC, Willy said that he wants to have something out there in the first
>> half of 2026.
>
> Okay, seems ambitious to me.
When the program was called "2025" I considered it very ambitious :) Now
I consider it ambitious. I think Willy already shared early versions of
the "struct slab" split and the "struct ptdesc" split recently on the list.
>
> Last time I asked, we had no idea how much performance would additional
> indirection cost us. Do we have a clue?
I raised that in the past, and I think the answer I got was that
(a) We always had this indirection cost when going from tail page to
head page / folio.
(b) We must convert the code to do as little page_folio() as possible.
That's why we saw so much code conversion to stop working on pages
and only work on folios.
There are certainly cases where we cannot currently avoid the
indirection, like when we traverse a page table and go
pfn -> page -> folio
and cannot simply go
pfn -> folio
On the bright side, we'll lose the head-page checks and can simply
dereference the pointer.
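
A minimal model of that difference, assuming a layout in which every page of a folio stores a pointer to one separately allocated descriptor. The toy_* types and helpers are hypothetical, the sketch ignores any type information the real design may encode in the pointer, and returning NULL for a page outside any folio is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Separately allocated descriptor shared by all pages of a folio. */
struct toy_folio {
	unsigned long flags;
	unsigned int order;
};

/* Per-page entry: only a pointer to the descriptor. */
struct toy_page {
	struct toy_folio *memdesc; /* NULL: page is not part of a folio */
};

/* No head/tail distinction and no bit test: a plain pointer load. */
static struct toy_folio *toy_page_folio(const struct toy_page *page)
{
	return page->memdesc;
}

/* Hypothetical demo: an order-2 folio spans four pages, a fifth is free. */
static bool toy_all_pages_share_folio(void)
{
	struct toy_folio folio = { .flags = 0, .order = 2 };
	struct toy_page pages[5] = { { NULL } };
	size_t nr = (size_t)1 << folio.order;

	for (size_t i = 0; i < nr; i++)
		pages[i].memdesc = &folio;

	for (size_t i = 0; i < nr; i++) {
		if (toy_page_folio(&pages[i]) != &folio)
			return false;
	}
	/* The page outside the folio yields NULL. */
	return toy_page_folio(&pages[4]) == NULL;
}
```

The lookup costs one dependent load for every page, first or not, which is the indirection cost the thread is asking about.
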
I don't know whether Willy has more information yet, but I would assume
that in most cases this will be similar to the performance summary in
your cover letter: "... has shown either no change or only a slight
improvement within the noise.", just that it will be "only a slight
degradation within the noise". :)
We'll learn I guess, in particular which other page -> folio conversions
cannot be optimized out by caching the folio.
For quite some time there will be a magical config option that will
switch between both layouts. I'd assume that things will get more
complicated if we suddenly have a "compound_head/folio" pointer and a
"compound_info" pointer at the same time.
But it's really Willy who has the concept in mind as he is very likely
right now busy writing some of that code.
I'm just the messenger.
:)
[I would hope that Willy could share his thoughts]
--
Cheers
David
On Thu, Jan 08, 2026 at 12:08:35AM +0100, David Hildenbrand (Red Hat) wrote:
> > > "Then we make page->compound_head point to the dynamically allocated memdesc
> > > rather than the first page. Then we can transition to the above layout. "
> >
>
> Sorry for the late reply, it's been a bit crazy over here.
>
> > I am not sure I understand how it is going to work.
> >
>
> I don't recall all the details that Willy shared over the last years while
> working on folios, but I will try to answer as best as I can from the top of
> my head. (there are plenty of resources on the list, on the web, in his
> presentations etc.).
>
> > 32-byte layout indicates that flags will stay in the statically
> > allocated part, but most (all?) flags are in the head page and we would
> > need a way to redirect from tail to head in the statically allocated
> > pages.
>
> When working with folios we will never go through the head page flags.
> That's why Willy has incrementally converted most folio code that worked on
> pages to work on folios.

A little more detail here:

 - Zone/Node/Section stay in page->flags and are replicated to folio->flags
 - HWPoison stays in page->flags
 - Reserved stays in page->flags
 - AnonExclusive stays in page->flags
 - Writeback/Referenced/Uptodate/Dirty/LRU/Active/Workingset/Owner1/
   Owner2/Reclaim/Swapbacked/Unevictable/Dropbehind/MLocked/Young/Idle
   all exist only in folio->flags
 - Head/Private/Private2 all go away
 - Locked & Waiters are ... complicated. I'll elaborate if there's demand.
 - I haven't put any effort into analyzing the Xen flags.
 - HasHWPoisoned/LargeRmappable/PartiallyMapped all move to folio->flags

> When the program was called "2025" I considered it very ambitious :) Now I
> consider it ambitious. I think Willy already shared early versions of the
> "struct slab" split and the "struct ptdesc" split recently on the list.

ptdesc, yes. Slab is still in progress.

> For quite some time there will be a magical config option that will switch
> between both layouts. I'd assume that things will get more complicated if we
> suddenly have a "compound_head/folio" pointer and a "compound_info" pointer
> at the same time.

What I'm hoping to get to is a point where calling compound_head() on
a page which is part of a folio is a BUG. You should only be calling
page_folio() on a page which is part of a folio -- because there's nothing
useful to find in the head page. So compound_head (or compound_info) can
share space with page->memdesc. For now I've actually put page->memdesc
adjacent to page->compound_head, for no reason that I can recall.

I had thought that calling page_folio() on a page that's not part of
a folio would also be a BUG(), but now I think it's better to quietly
return NULL. That's based on my experience working with slab and ptdesc.

>> When the program was called "2025" I considered it very ambitious :) Now I
>> consider it ambitious. I think Willy already shared early versions of the
>> "struct slab" split and the "struct ptdesc" split recently on the list.
>
> ptdesc, yes. Slab is still in progress.

Ah, I could have sworn you sent something out, but maybe these were
preparations only. :)

>> For quite some time there will be a magical config option that will switch
>> between both layouts. I'd assume that things will get more complicated if we
>> suddenly have a "compound_head/folio" pointer and a "compound_info" pointer
>> at the same time.
>
> What I'm hoping to get to is a point where calling compound_head() on
> a page which is part of a folio is a BUG. You should only be calling
> page_folio() on a page which is part of a folio -- because there's nothing
> useful to find in the head page. So compound_head (or compound_info) can
> share space with page->memdesc. For now I've actually put page->memdesc
> adjacent to page->compound_head, for no reason that I can recall.
>
> I had thought that calling page_folio() on a page that's not part of
> a folio would also be a BUG(), but now I think it's better to quietly
> return NULL. That's based on my experience working with slab and ptdesc.

So once that is in, even if we only allocate "struct folio" separately,
the whole fake-head stuff can go away either way, as it is hugetlb->folio
material only.

Which leaves the question whether we should consider Kiryl's patch set
in the meantime here as something to merge.

Willy, what is the rough timeline until we can expect to see at least
"struct folio" get allocated separately, and would this patch set here
get in the way of doing so, or doesn't it really matter?

--
Cheers

David

On Thu, Jan 08, 2026 at 12:08:35AM +0100, David Hildenbrand (Red Hat) wrote:
> When working with folios we will never go through the head page flags.
> That's why Willy has incrementally converted most folio code that worked on
> pages to work on folios.
>
> For example, PageUptodate() does a
>
> 	folio_test_uptodate(page_folio(page));
>
> The flags in the 32-byte layout will be used by some non-folio things for
> which we won't allocate memdescs (just yet) (e.g., free pages in the buddy
> and other things that do not require a lot of metadata). Some of these
> flags will be moved into the memdesc pointer in the future as the conversion
> proceeds.

Okay, makes sense.

> For quite some time there will be a magical config option that will switch
> between both layouts. I'd assume that things will get more complicated if we
> suddenly have a "compound_head/folio" pointer and a "compound_info" pointer
> at the same time.
>
> But it's really Willy who has the concept in mind as he is very likely right
> now busy writing some of that code.
>
> I'm just the messenger.
>
> :)
>
> [I would hope that Willy could share his thoughts]

If you or Willy think that this patch will impede memdesc progress, I am
okay with not pushing this patchset upstream.

I was really excited when I found this trick to get rid of fake heads.
But ultimately, it is a cleanup. I failed to find the performance win I
hoped for.

Also, I am trying to understand what the 32-byte layout means for fake
heads. _refcount in struct page is going to 0 and refcounting happens on
folios. So I wonder if we can make all pages identical (no tail pages per
se) and avoid fake heads this way?

--
Kiryl Shutsemau / Kirill A. Shutemov

>> For quite some time there will be a magical config option that will switch
>> between both layouts. I'd assume that things will get more complicated if we
>> suddenly have a "compound_head/folio" pointer and a "compound_info" pointer
>> at the same time.
>>
>> But it's really Willy who has the concept in mind as he is very likely right
>> now busy writing some of that code.
>>
>> I'm just the messenger.
>>
>> :)
>>
>> [I would hope that Willy could share his thoughts]
>
> If you or Willy think that this patch will impede memdesc progress, I am
> okay not pushing this patchset upstream.

I pinged Willy.

>
> I was really excited when I found this trick to get rid of fake heads.
> But ultimately, it is a clean up. I failed to find a performance win I
> hoped for.

I think it's quite nice as a cleanup, and if we didn't have memdescs on
the horizon that essentially change the code completely in another
direction (having all pages point to a struct folio, not just the tail
pages), I wouldn't be bringing this up :)

>
> Also, I try to understand what 32-byte layout means for fake heads.
> _refcount in struct page is going to 0 and refcounting happens on folios.

Yes, for folios.

> So I wonder if we can make all pages identical (no tail pages per se) and
> avoid fake heads this way?

That's the ultimate goal, yes. Essentially, all pages will point to the
memdesc, and there will not be a reason to check for head/fake-head etc.

I think initially, the compound-page concept might still co-exist for
some memdescs that we won't initially allocate separately. But I don't
know the details of that. I know that the transition phase is tricky :)

Regarding references and folios: yes, exactly. When trying to get a
reference, we'll spot in the memdesc field that this is a folio and try
on the folio instead.

In the future, most pages will either be permanently frozen and not have
a refcount (e.g., struct ptdesc), or have a refcount in their memdesc.
In the transition, the location of the refcount depends on memdesc type
(in memdesc vs. in page).

--
Cheers

David
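[Editorial sketch, not part of the original message.] The refcount lookup
described above can be modelled in a few lines of userspace C. Everything
here is hypothetical: the toy_* types and toy_try_get() are invented for
illustration and do not correspond to any existing or planned kernel API,
and real code would use atomic operations rather than plain increments.
The sketch assumes a page either points at a separately allocated
descriptor (folio-like with a refcount, or ptdesc-like and frozen), or has
no descriptor yet and keeps the refcount in the page itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor kinds; these are not kernel definitions. */
enum toy_memdesc_type { TOY_MEMDESC_FOLIO, TOY_MEMDESC_PTDESC };

struct toy_memdesc {
	enum toy_memdesc_type type;
	int refcount;                /* meaningful for folio-like descriptors only */
};

struct toy_page {
	struct toy_memdesc *memdesc; /* NULL: no separately allocated memdesc yet */
	int refcount;                /* transition case: refcount still in the page */
};

/*
 * Try to take a reference through any page. Returns false when the
 * object is frozen (refcount of zero, or a type without a refcount).
 * Not atomic: a real implementation would need an add-unless-zero.
 */
static bool toy_try_get(struct toy_page *page)
{
	struct toy_memdesc *md = page->memdesc;

	if (md == NULL) {
		/* Transition case: the count still lives in the page. */
		if (page->refcount == 0)
			return false;
		page->refcount++;
		return true;
	}

	switch (md->type) {
	case TOY_MEMDESC_FOLIO:
		/* Every page of the folio shares the descriptor's count. */
		if (md->refcount == 0)
			return false;
		md->refcount++;
		return true;
	case TOY_MEMDESC_PTDESC:
		/* Permanently frozen: there is no refcount to take. */
		return false;
	}
	return false;
}
```

Note that no head or tail check appears anywhere: every page of a folio
reaches the same descriptor with a single pointer dereference.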
On Thu, Jan 08, 2026 at 12:32:47PM +0000, Kiryl Shutsemau wrote: > On Thu, Jan 08, 2026 at 12:08:35AM +0100, David Hildenbrand (Red Hat) wrote: > > > > "Then we make page->compound_head point to the dynamically allocated memdesc > > > > rather than the first page. Then we can transition to the above layout. " > > > > > > > Sorry for the late reply, it's been a bit crazy over here. > > > > > I am not sure I understand how it is going to work. > > > > > > > I don't recall all the details that Willy shared over the last years while > > working on folios, but I will try to answer as best as I can from the top of > > my head. (there are plenty of resources on the list, on the web, in his > > presentations etc.). > > > > > 32-byte layout indicates that flags will stay in the statically > > > allocated part, but most (all?) flags are in the head page and we would > > > need a way to redirect from tail to head in the statically allocated > > > pages. > > > > When working with folios we will never go through the head page flags. > > That's why Willy has incrementally converted most folio code that worked on > > pages to work on folios. > > > > For example, PageUptodate() does a > > > > folio_test_uptodate(page_folio(page)); > > > > The flags in the 32-byte layout will be used by some non-folio things for > > which we won't allocate memdescs (just yet) (e.g., free pages in the buddy > > and other things that does not require a lot of metadata). Some of these > > flags will be moved into the memdesc pointer in the future as the conversion > > proceeeds. > > Okay, makes sense. > > > > > The "memdesc" could be a pointer to a "struct folio" that is allocated from > > > > the slab. > > > > > > > > So in the new memdesc world, all pages part of a folio will point at the > > > > allocated "struct folio", not the head page where "struct folio" currently > > > > overlays "struct page". 
> > > > > > > > That would mean that the proposal in this patch set will have to be reverted > > > > again. > > > > > > > > > > > > At LPC, Willy said that he wants to have something out there in the first > > > > half of 2026. > > > > > > Okay, seems ambitious to me. > > > > When the program was called "2025" I considered it very ambitious :) Now I > > consider it ambitious. I think Willy already shared early versions of the > > "struct slab" split and the "struct ptdesc" split recently on the list. > > > > > > > > Last time I asked, we had no idea how much performance would additional > > > indirection cost us. Do we have a clue? > > > > I raised that in the past, and I think the answer I got was that > > > > (a) We always had these indirection cost when going from tail page to > > head page / folio. > > (b) We must convert the code to do as little page_folio() as possible. > > That's why we saw so much code conversion to stop working on pages > > and only work on folios. > > > > There are certainly cases where we cannot currently avoid the indirection, > > like when we traverse a page table and go > > > > pfn -> page -> folio > > > > and cannot simply go > > > > pfn -> folio > > > > On the bright side, we'll lose the head-page checks and can simply > > dereference the pointer. > > > > I don't know whether Willy has more information yet, but I would assume that > > in most cases this will be similar to the performance summary in your cover > > letter: "... has shown either no change or only a slight improvement within > > the noise.", just that it will be "only a slight degradation within the > > noise". :) > > > > We'll learn I guess, in particular which other page -> folio conversions > > cannot be optimized out by caching the folio. > > > > > > For quite some time there will be a magical config option that will switch > > between both layouts. 
> > I'd assume that things will get more complicated if we
> > suddenly have a "compound_head/folio" pointer and a "compound_info" pointer
> > at the same time.
> >
> > But it's really Willy who has the concept in mind as he is very likely right
> > now busy writing some of that code.
> >
> > I'm just the messenger.
> >
> > :)
> >
> > [I would hope that Willy could share his thoughts]
>
> If you or Willy think that this patch will impede memdesc progress, I am
> okay not pushing this patchset upstream.

The other option is to get this patchset upstream (I still need to
fix/test a few things) and revert it later when (if?) memdesc lands.

What do you think?

--
Kiryl Shutsemau / Kirill A. Shutemov
> On Jan 8, 2026, at 21:30, Kiryl Shutsemau <kas@kernel.org> wrote: > > On Thu, Jan 08, 2026 at 12:32:47PM +0000, Kiryl Shutsemau wrote: >> On Thu, Jan 08, 2026 at 12:08:35AM +0100, David Hildenbrand (Red Hat) wrote: >>>>> "Then we make page->compound_head point to the dynamically allocated memdesc >>>>> rather than the first page. Then we can transition to the above layout. " >>>> >>> >>> Sorry for the late reply, it's been a bit crazy over here. >>> >>>> I am not sure I understand how it is going to work. >>>> >>> >>> I don't recall all the details that Willy shared over the last years while >>> working on folios, but I will try to answer as best as I can from the top of >>> my head. (there are plenty of resources on the list, on the web, in his >>> presentations etc.). >>> >>>> 32-byte layout indicates that flags will stay in the statically >>>> allocated part, but most (all?) flags are in the head page and we would >>>> need a way to redirect from tail to head in the statically allocated >>>> pages. >>> >>> When working with folios we will never go through the head page flags. >>> That's why Willy has incrementally converted most folio code that worked on >>> pages to work on folios. >>> >>> For example, PageUptodate() does a >>> >>> folio_test_uptodate(page_folio(page)); >>> >>> The flags in the 32-byte layout will be used by some non-folio things for >>> which we won't allocate memdescs (just yet) (e.g., free pages in the buddy >>> and other things that does not require a lot of metadata). Some of these >>> flags will be moved into the memdesc pointer in the future as the conversion >>> proceeeds. >> >> Okay, makes sense. >> >>>>> The "memdesc" could be a pointer to a "struct folio" that is allocated from >>>>> the slab. >>>>> >>>>> So in the new memdesc world, all pages part of a folio will point at the >>>>> allocated "struct folio", not the head page where "struct folio" currently >>>>> overlays "struct page". 
>>>>> >>>>> That would mean that the proposal in this patch set will have to be reverted >>>>> again. >>>>> >>>>> >>>>> At LPC, Willy said that he wants to have something out there in the first >>>>> half of 2026. >>>> >>>> Okay, seems ambitious to me. >>> >>> When the program was called "2025" I considered it very ambitious :) Now I >>> consider it ambitious. I think Willy already shared early versions of the >>> "struct slab" split and the "struct ptdesc" split recently on the list. >>> >>>> >>>> Last time I asked, we had no idea how much performance would additional >>>> indirection cost us. Do we have a clue? >>> >>> I raised that in the past, and I think the answer I got was that >>> >>> (a) We always had these indirection cost when going from tail page to >>> head page / folio. >>> (b) We must convert the code to do as little page_folio() as possible. >>> That's why we saw so much code conversion to stop working on pages >>> and only work on folios. >>> >>> There are certainly cases where we cannot currently avoid the indirection, >>> like when we traverse a page table and go >>> >>> pfn -> page -> folio >>> >>> and cannot simply go >>> >>> pfn -> folio >>> >>> On the bright side, we'll lose the head-page checks and can simply >>> dereference the pointer. >>> >>> I don't know whether Willy has more information yet, but I would assume that >>> in most cases this will be similar to the performance summary in your cover >>> letter: "... has shown either no change or only a slight improvement within >>> the noise.", just that it will be "only a slight degradation within the >>> noise". :) >>> >>> We'll learn I guess, in particular which other page -> folio conversions >>> cannot be optimized out by caching the folio. >>> >>> >>> For quite some time there will be a magical config option that will switch >>> between both layouts. 
>>> I'd assume that things will get more complicated if we
>>> suddenly have a "compound_head/folio" pointer and a "compound_info" pointer
>>> at the same time.
>>>
>>> But it's really Willy who has the concept in mind as he is very likely right
>>> now busy writing some of that code.
>>>
>>> I'm just the messenger.
>>>
>>> :)
>>>
>>> [I would hope that Willy could share his thoughts]
>>
>> If you or Willy think that this patch will impede memdesc progress, I am
>> okay not pushing this patchset upstream.
>
> The other option is to get this patchset upstream (I still need to
> fix/test a few things) and revert it later when (if?) memdesc lands.
>
> What do you think?

It seems memdesc is still some time away from being merged? If it’s
going to take a while, my personal preference is to merge this first and
then decide whether to revert the changes based on actual needs.

Thanks.

>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
> On Dec 23, 2025, at 17:38, David Hildenbrand (Red Hat) <david@kernel.org> wrote: > > On 12/22/25 15:55, Muchun Song wrote: >>> On Dec 22, 2025, at 22:18, David Hildenbrand (Red Hat) <david@kernel.org> wrote: >>> >>> On 12/22/25 15:02, Kiryl Shutsemau wrote: >>>>> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote: >>>>> >>>>> >>>>> On 2025/12/18 23:09, Kiryl Shutsemau wrote: >>>>>> The upcoming changes in compound_head() require memmap to be naturally >>>>>> aligned to the maximum folio size. >>>>>> >>>>>> Add a warning if it is not. >>>>>> >>>>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the >>>>>> kernel is still likely to be functional if this strict check fails. >>>>> >>>>> Different architectures default to 2 MB alignment (mainly to >>>>> enable huge mappings), which only accommodates folios up to >>>>> 128 MB. Yet 1 GB huge pages are still fairly common, so >>>>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to >>>>> miss the most frequent case. >>>> I don't follow. 16 GB check is more strict that anything smaller. >>>> How can it miss the most frequent case? >>>>> I’m concerned that this might plant a hidden time bomb: it >>>>> could detonate at any moment in later code, silently triggering >>>>> memory corruption or similar failures. Therefore, I don’t >>>>> think a WARNING is a good choice. >>>> We can upgrade it BUG_ON(), but I want to understand your logic here >>>> first. >>> >>> Definitely no BUG_ON(). I would assume this is something we would find early during testing, so even a VM_WARN_ON_ONCE() should be good enough? >>> >>> This smells like a possible problem, though, as soon as some architecture wants to increase the folio size. What would be the expected step to ensure the alignment is done properly? >>> >>> But OTOH, as I raised Willy's work will make all of that here obsolete either way, so maybe not worth worrying about that case too much, >> Hi David, > > Hi! :) > >> I hope you're doing well. 
>> I must admit I have limited knowledge of Willy's work, and I was wondering if you might be kind enough to share any publicly available links where I could learn more about the future direction of this project. I would be truly grateful for your guidance.
>> Thank you very much in advance.
>
> There is some information to be had at [1], but more at [2]. Take a look at [2] in "After those projects are complete - Then we can shrink struct page to 32 bytes:"
>
> In essence, all pages (belonging to a memdesc) will have a "memdesc" pointer (that replaces the compound_head pointer).
>
> "Then we make page->compound_head point to the dynamically allocated memdesc rather than the first page. Then we can transition to the above layout. "
>
> The "memdesc" could be a pointer to a "struct folio" that is allocated from the slab.
>
> So in the new memdesc world, all pages part of a folio will point at the allocated "struct folio", not the head page where "struct folio" currently overlays "struct page".
>
> That would mean that the proposal in this patch set will have to be reverted again.
>
>
> At LPC, Willy said that he wants to have something out there in the first half of 2026.
>
> [1] https://kernelnewbies.org/MatthewWilcox/Memdescs
> [2] https://kernelnewbies.org/MatthewWilcox/Memdescs/Path

Many thanks for taking the time to explain everything in detail and for
providing such valuable information. I plan to invest additional time to
fully understand the details you’ve shared.

Muchun,
Thanks.

>
> --
> Cheers
>
> David
> On Dec 22, 2025, at 22:03, Kiryl Shutsemau <kas@kernel.org> wrote:
> On Mon, Dec 22, 2025 at 04:34:40PM +0800, Muchun Song wrote:
>>
>>
>> On 2025/12/18 23:09, Kiryl Shutsemau wrote:
>>> The upcoming changes in compound_head() require memmap to be naturally
>>> aligned to the maximum folio size.
>>> Add a warning if it is not.
>>> A warning is sufficient as MAX_FOLIO_ORDER is very rarely used, so the
>>> kernel is still likely to be functional if this strict check fails.
>>
>> Different architectures default to 2 MB alignment (mainly to
>> enable huge mappings), which only accommodates folios up to
>> 128 MB. Yet 1 GB huge pages are still fairly common, so
>> validating 16 GB (MAX_FOLIO_SIZE) alignment seems likely to
>> miss the most frequent case.
>
> I don't follow. 16 GB check is more strict that anything smaller.
> How can it miss the most frequent case?

Sorry, I didn’t make myself clear. What I meant is that if this warning
triggers, it implies the largest-sized folio isn’t properly aligned, and
the 1 GB folios are probably misaligned too. Your commit message says
“MAX_FOLIO_ORDER is very rarely used,” but I want to stress that 1 GB
folios are actually common. If they’re also misaligned, we’re quietly
planting a land mine. That’s why I’m worried a mere warning isn’t enough:
it leaves a latent bug in the system.

If there’s a problem, we should stop right here, because this is the
earliest place where it will surface. As David assumed, if we expect to
catch the problem during testing, then I think VM_BUG_ON would be more
appropriate.

Thanks.

>
>> I’m concerned that this might plant a hidden time bomb: it
>> could detonate at any moment in later code, silently triggering
>> memory corruption or similar failures. Therefore, I don’t
>> think a WARNING is a good choice.
>
> We can upgrade it BUG_ON(), but I want to understand your logic here
> first.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
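[Editorial sketch, not part of the original message.] The sizes quoted in
this exchange (2 MB memmap alignment, 128 MB folios, 1 GB and 16 GB
folios) are linked by simple arithmetic, which the helpers below spell
out. The sketch assumes 4 KB base pages and a 64-byte struct page; both
values are assumptions typical of 64-bit configurations rather than
anything stated in the patch, and the toy_* names are invented for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed values for illustration only: 4 KB base pages and a 64-byte
 * struct page. Neither is taken from the patch itself.
 */
#define TOY_PAGE_SIZE        4096ULL
#define TOY_STRUCT_PAGE_SIZE 64ULL

/*
 * Largest folio, in bytes, whose struct pages fit inside one naturally
 * aligned memmap block of memmap_align bytes.
 */
static uint64_t toy_max_folio_bytes(uint64_t memmap_align)
{
	assert(memmap_align % TOY_STRUCT_PAGE_SIZE == 0);
	return memmap_align / TOY_STRUCT_PAGE_SIZE * TOY_PAGE_SIZE;
}

/*
 * memmap alignment, in bytes, needed so that the struct pages backing a
 * folio of folio_bytes are naturally aligned as a group.
 */
static uint64_t toy_required_memmap_align(uint64_t folio_bytes)
{
	assert(folio_bytes % TOY_PAGE_SIZE == 0);
	return folio_bytes / TOY_PAGE_SIZE * TOY_STRUCT_PAGE_SIZE;
}

/* Does a memmap starting at memmap_base meet that requirement? */
static bool toy_memmap_aligned_for(uint64_t memmap_base, uint64_t folio_bytes)
{
	return memmap_base % toy_required_memmap_align(folio_bytes) == 0;
}
```

Under these assumptions, toy_max_folio_bytes(2ULL << 20) gives 128 MB, a
1 GB folio needs 16 MB of memmap alignment, and a 16 GB folio needs
256 MB. A memmap base that is only 2 MB aligned therefore fails the check
for 1 GB folios as well as for the 16 GB maximum, which is the concern
raised above.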