[PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios

Francois Dugast posted 5 patches 3 weeks, 3 days ago
[PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Francois Dugast 3 weeks, 3 days ago
From: Matthew Brost <matthew.brost@intel.com>

Reinitialize metadata for large zone device private folios in
zone_device_page_init prior to creating a higher-order zone device
private folio. This step is necessary when the folio’s order changes
dynamically between zone_device_page_init calls to avoid building a
corrupt folio. As part of the metadata reinitialization, the dev_pagemap
must be passed in from the caller because the pgmap stored in the folio
page may have been overwritten with a compound head.

Without this fix, individual pages could have invalid pgmap fields and
flags (with PG_locked being notably problematic) due to prior different
order allocations, which can, and will, result in kernel crashes.

Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>

---

The latest revision updates the commit message to explain what is broken
prior to this patch and restructures the patch so it applies, and works,
on both the 6.19 branches and drm-tip, the latter of which includes
patches for the next kernel release PR. Intel CI passes on both the 6.19
branches and drm-tip at the point of the first patch in this series, and
at the last patch on drm-tip only, since subsequent patches in the series
depend on patches present in drm-tip but not in 6.19.
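
For illustration only (not part of the patch): a minimal sketch of the
reuse pattern the reinitialization protects against, assuming a
hypothetical driver that first builds an order-9 device-private folio
from a pfn range and later hands the same pages out again at order 0.
The example_* functions and the drvdata argument are made up; the
zone_device_folio_init()/zone_device_page_init() signatures are the ones
introduced below.

	/* First lifetime: one order-9 device-private folio (512 pages). */
	static void example_first_use(struct page *base,
				      struct dev_pagemap *pgmap, void *drvdata)
	{
		base->zone_device_data = drvdata;
		zone_device_folio_init(page_folio(base), pgmap, 9);
	}

	/*
	 * Second lifetime: a page from the same range handed out again at
	 * order 0. Without the scrubbing loop added below it could still
	 * carry the order byte, compound head and PG_locked left over from
	 * the first lifetime.
	 */
	static void example_second_use(struct page *page,
				       struct dev_pagemap *pgmap, void *drvdata)
	{
		page->zone_device_data = drvdata;
		zone_device_page_init(page, pgmap, 0);
	}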
---
 arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
 drivers/gpu/drm/drm_pagemap.c            |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 include/linux/memremap.h                 |  9 ++++--
 lib/test_hmm.c                           |  4 ++-
 mm/memremap.c                            | 35 +++++++++++++++++++++++-
 7 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e5000bef90f2..7cf9310de0ec 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	zone_device_page_init(dpage, 0);
+	zone_device_page_init(dpage, &kvmppc_uvmem_pgmap, 0);
 	return dpage;
 out_clear:
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index af53e796ea1b..6ada7b4af7c6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
 	page = pfn_to_page(pfn);
 	svm_range_bo_ref(prange->svm_bo);
 	page->zone_device_data = prange->svm_bo;
-	zone_device_page_init(page, 0);
+	zone_device_page_init(page, page_pgmap(page), 0);
 }
 
 static void
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 03ee39a761a4..38eca94f01a1 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -201,7 +201,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
 					struct drm_pagemap_zdd *zdd)
 {
 	page->zone_device_data = drm_pagemap_zdd_get(zdd);
-	zone_device_page_init(page, 0);
+	zone_device_page_init(page, page_pgmap(page), 0);
 }
 
 /**
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 58071652679d..3d8031296eed 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -425,7 +425,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
 			order = ilog2(DMEM_CHUNK_NPAGES);
 	}
 
-	zone_device_folio_init(folio, order);
+	zone_device_folio_init(folio, page_pgmap(folio_page(folio, 0)), order);
 	return page;
 }
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 713ec0435b48..e3c2ccf872a8 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -224,7 +224,8 @@ static inline bool is_fsdax_page(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page, unsigned int order);
+void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
+			   unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -234,9 +235,11 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long memremap_compat_align(void);
 
-static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
+static inline void zone_device_folio_init(struct folio *folio,
+					  struct dev_pagemap *pgmap,
+					  unsigned int order)
 {
-	zone_device_page_init(&folio->page, order);
+	zone_device_page_init(&folio->page, pgmap, order);
 	if (order)
 		folio_set_large_rmappable(folio);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 8af169d3873a..455a6862ae50 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -662,7 +662,9 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
 			goto error;
 	}
 
-	zone_device_folio_init(page_folio(dpage), order);
+	zone_device_folio_init(page_folio(dpage),
+			       page_pgmap(folio_page(page_folio(dpage), 0)),
+			       order);
 	dpage->zone_device_data = rpage;
 	return dpage;
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 63c6ab4fdf08..ac7be07e3361 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page, unsigned int order)
+void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
+			   unsigned int order)
 {
+	struct page *new_page = page;
+	unsigned int i;
+
 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
 
+	for (i = 0; i < (1UL << order); ++i, ++new_page) {
+		struct folio *new_folio = (struct folio *)new_page;
+
+		/*
+		 * new_page could have been part of previous higher order folio
+		 * which encodes the order, in page + 1, in the flags bits. We
+		 * blindly clear bits which could have set my order field here,
+		 * including page head.
+		 */
+		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
+
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+		/*
+		 * This pointer math looks odd, but new_page could have been
+		 * part of a previous higher order folio, which sets _nr_pages
+		 * in page + 1 (new_page). Therefore, we use pointer casting to
+		 * correctly locate the _nr_pages bits within new_page which
+		 * could have modified by previous higher order folio.
+		 */
+		((struct folio *)(new_page - 1))->_nr_pages = 0;
+#endif
+
+		new_folio->mapping = NULL;
+		new_folio->pgmap = pgmap;	/* Also clear compound head */
+		new_folio->share = 0;   /* fsdax only, unused for device private */
+		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
+		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
+	}
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
-- 
2.43.0

Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Andrew Morton 3 weeks, 2 days ago
On Fri, 16 Jan 2026 12:10:16 +0100 Francois Dugast <francois.dugast@intel.com> wrote:

> Reinitialize metadata for large zone device private folios in
> zone_device_page_init prior to creating a higher-order zone device
> private folio. This step is necessary when the folio’s order changes
> dynamically between zone_device_page_init calls to avoid building a
> corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> must be passed in from the caller because the pgmap stored in the folio
> page may have been overwritten with a compound head.
> 
> Without this fix, individual pages could have invalid pgmap fields and
> flags (with PG_locked being notably problematic) due to prior different
> order allocations, which can, and will, result in kernel crashes.

Is it OK to leave 6.18.x without this fixed?
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Matthew Brost 3 weeks, 2 days ago
On Fri, Jan 16, 2026 at 02:34:32PM -0800, Andrew Morton wrote:
> On Fri, 16 Jan 2026 12:10:16 +0100 Francois Dugast <francois.dugast@intel.com> wrote:
> 
> > Reinitialize metadata for large zone device private folios in
> > zone_device_page_init prior to creating a higher-order zone device
> > private folio. This step is necessary when the folio’s order changes
> > dynamically between zone_device_page_init calls to avoid building a
> > corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> > must be passed in from the caller because the pgmap stored in the folio
> > page may have been overwritten with a compound head.
> > 
> > Without this fix, individual pages could have invalid pgmap fields and
> > flags (with PG_locked being notably problematic) due to prior different
> > order allocations, which can, and will, result in kernel crashes.
> 
> Is it OK to leave 6.18.x without this fixed?

I think 6.18.x is fine, the offending patch + large device pages is
going into 6.19, right?

Matt
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 3 weeks, 3 days ago
On Fri, Jan 16, 2026 at 12:10:16PM +0100, Francois Dugast wrote:
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
> +	struct page *new_page = page;
> +	unsigned int i;
> +
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>  
> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> +		struct folio *new_folio = (struct folio *)new_page;
> +
> +		/*
> +		 * new_page could have been part of previous higher order folio
> +		 * which encodes the order, in page + 1, in the flags bits. We
> +		 * blindly clear bits which could have set my order field here,
> +		 * including page head.
> +		 */
> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		/*
> +		 * This pointer math looks odd, but new_page could have been
> +		 * part of a previous higher order folio, which sets _nr_pages
> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> +		 * correctly locate the _nr_pages bits within new_page which
> +		 * could have modified by previous higher order folio.
> +		 */
> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> +#endif

This seems too weird, why is it in the loop?  There is only one
_nr_pages per folio.

This is mostly zeroing some memory in the tail pages? Why?

Why can't this use the normal helpers, like memmap_init_compound()?

 struct folio *new_folio = page

 /* First 4 tail pages are part of struct folio */
 for (i = 4; i < (1UL << order); i++) {
     prep_compound_tail(..)
 }

 prep_comound_head(page, order)
 new_folio->_nr_pages = 0

??

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Vlastimil Babka 3 weeks, 2 days ago
On 1/16/26 18:49, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 12:10:16PM +0100, Francois Dugast wrote:
>> -void zone_device_page_init(struct page *page, unsigned int order)
>> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>> +			   unsigned int order)
>>  {
>> +	struct page *new_page = page;
>> +	unsigned int i;
>> +
>>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>  
>> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
>> +		struct folio *new_folio = (struct folio *)new_page;
>> +
>> +		/*
>> +		 * new_page could have been part of previous higher order folio
>> +		 * which encodes the order, in page + 1, in the flags bits. We
>> +		 * blindly clear bits which could have set my order field here,
>> +		 * including page head.
>> +		 */
>> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>> +
>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> +		/*
>> +		 * This pointer math looks odd, but new_page could have been
>> +		 * part of a previous higher order folio, which sets _nr_pages
>> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
>> +		 * correctly locate the _nr_pages bits within new_page which
>> +		 * could have modified by previous higher order folio.
>> +		 */
>> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
>> +#endif
> 
> This seems too weird, why is it in the loop?  There is only one
> _nr_pages per folio.

I suppose we could be getting say an order-9 folio that was previously used
as two order-8 folios? And each of them had their _nr_pages in their head
and we can't know that at this point so we have to reset everything?

AFAIU this would not be a problem if the clearing of the previous state was
done upon freeing, as e.g. v4 did, but I think you also argued it meant
processing the pages when freeing and then again at reallocation, so it's
now like this instead?

Or maybe you mean that stray _nr_pages in some tail page from previous
lifetimes can't affect the current lifetime in a wrong way for something
looking at said page? I don't know immediately.
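
To make that concrete (made-up offsets; per the patch comment, the order
byte and _nr_pages of a large folio live in its first tail page): say
pages 0..511 of the pagemap were last used as two order-8 folios at
offsets 0 and 256. Then the leftovers are spread like this:

	page[0]    old head of the first order-8 folio
	page[1]    old first tail: order byte in flags, _nr_pages == 256
	page[256]  old head of the second order-8 folio
	page[257]  old first tail of the second folio, same stale fields

A later order-9 init starting at page[0], or an order-0 init of page[257]
alone, has no record of that earlier split, which is why every page in the
range gets scrubbed.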

> This is mostly zeroing some memory in the tail pages? Why?
> 
> Why can't this use the normal helpers, like memmap_init_compound()?
> 
>  struct folio *new_folio = page
> 
>  /* First 4 tail pages are part of struct folio */
>  for (i = 4; i < (1UL << order); i++) {
>      prep_compound_tail(..)
>  }
> 
>  prep_comound_head(page, order)
>  new_folio->_nr_pages = 0
> 
> ??
> 
> Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 3 weeks, 2 days ago
On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> >> +		/*
> >> +		 * This pointer math looks odd, but new_page could have been
> >> +		 * part of a previous higher order folio, which sets _nr_pages
> >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> >> +		 * correctly locate the _nr_pages bits within new_page which
> >> +		 * could have modified by previous higher order folio.
> >> +		 */
> >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> >> +#endif
> > 
> > This seems too weird, why is it in the loop?  There is only one
> > _nr_pages per folio.
> 
> I suppose we could be getting say an order-9 folio that was previously used
> as two order-8 folios? And each of them had their _nr_pages in their head
> and we can't know that at this point so we have to reset everything?

Er, did I miss something - who reads _nr_pages from a random tail
page? Doesn't everything working with random tail pages read order,
compute the head page, cast to folio and then access _nr_pages?

> Or maybe you mean that stray _nr_pages in some tail page from previous
> lifetimes can't affect the current lifetime in a wrong way for something
> looking at said page? I don't know immediately.

Yes, exactly.

Basically, what bytes exactly need to be set to what in tail pages for
the system to work? Those should be set.

And if we want to have things set on free that's fine too, but there
should be reasons for doing stuff, and this weird thing above makes
zero sense.

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Alistair Popple 3 weeks ago
On 2026-01-17 at 11:19 +1100, Jason Gunthorpe <jgg@nvidia.com> wrote...
> On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> > >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > >> +		/*
> > >> +		 * This pointer math looks odd, but new_page could have been
> > >> +		 * part of a previous higher order folio, which sets _nr_pages
> > >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> > >> +		 * correctly locate the _nr_pages bits within new_page which
> > >> +		 * could have modified by previous higher order folio.
> > >> +		 */
> > >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >> +#endif
> > > 
> > > This seems too weird, why is it in the loop?  There is only one
> > > _nr_pages per folio.

Yeah, I don't really know what the motivation is for going via the folio
field, which needs the odd pointer math, versus just setting page->memcg_data
= 0 directly, which would work equally well and would have avoided a lot of
confusion.

> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> > and we can't know that at this point so we have to reset everything?
> 
> Er, did I miss something - who reads _nr_pages from a random tail
> page? Doesn't everything working with random tail pages read order,
> compute the head page, cast to folio and then access _nr_pages?
> 
> > Or maybe you mean that stray _nr_pages in some tail page from previous
> > lifetimes can't affect the current lifetime in a wrong way for something
> > looking at said page? I don't know immediately.
> 
> Yes, exactly.
> 
> Basically, what bytes exactly need to be set to what in tail pages for
> the system to work? Those should be set.
> 
> And if we want to have things set on free that's fine too, but there
> should be reasons for doing stuff, and this weird thing above makes
> zero sense.

You can't think of these as tail pages or head pages. They are just random
struct pages, possibly order-0 or PageHead or PageTail, with fields in a
"random" state based on what they were last used for.

All this function should be trying to do is initialising this random state to
something sane as defined by the core-mm for it to consume. Yes, some might
later end up being tail (or head) pages if order > 0 and prep_compound_page()
is called. But the point of this function and the loop is to initialise the
struct page as an order-0 page with "sane" fields to pass to core-mm or call
prep_compound_page() on.

This could for example just use memset(new_page, 0, sizeof(struct page)) and
then refill all the fields correctly (although Vlastimil pointed out some page
flags need preservation). But a big part of the problem is there is no single
definition (AFAIK) of what state a struct page should be in before handing it to
the core-mm via either vm_insert_page()/pages()/etc. or migrate_vma_*() nor what
state the kernel leaves it in once freed.

I would like to see this addressed because it leads to all sorts of weirdness -
for example vm_insert_page() and migrate_vma_*() both require the page refcount
to be 1 for no good reason (drivers usually have to drop it immediately after
the call and they implicitly own the ZONE_DEVICE page lifetimes anyway so why make them
hold a reference just to map the page). Yet only migrate_vma_*() requires the
page to be locked (so other ZONE_DEVICE users just have to immediately unlock).

And I presume page->memcg_data must be set to zero, or Matthew wouldn't have
run into problems prompting him to reinit it. But I don't really know what other
requirements there are for setting page fields; they all sort of come implicitly
from the vm_insert_page/migrate_vma APIs.

 - Alistair

> Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 3 weeks ago
On Mon, Jan 19, 2026 at 04:41:42PM +1100, Alistair Popple wrote:
> > And if we want to have things set on free that's fine too, but there
> > should be reasons for doing stuff, and this weird thing above makes
> > zero sense.
> 
> You can't think of these as tail pages or head pages. They are just random
> struct pages, possibly order-0 or PageHead or PageTail, with fields in a
> "random" state based on what they were last used for.

Agree on random state.
 
> All this function should be trying to do is initialising this random state to
> something sane as defined by the core-mm for it to consume. Yes, some might
> later end up being tail (or head) pages if order > 0 and prep_compound_page()
> is called. 

Not "later" during this function. The end result of this entire
function is to setup folio starting at page and at order. Meaning we
are deliberately *creating* head and tail pages out of random junk
left over in the struct page.

> But the point of this function and the loop is to initialise the
> struct page as an order-0 page with "sane" fields to pass to core-mm or call
> prep_compound_page() on.

Which is what seems nonsensical to me. prep_compound_page() does
another loop over all these pages and *re-sets* many of the same
fields. You are arguing we should clean things up and then call
prep_compound_page(); I'm arguing prep_compound_page() should just accept
the junk and fix it.

> This could for example just use memset(new_page, 0, sizeof(struct page)) and
> then refill all the fields correctly (although Vlastimil pointed out some page
> flags need preservation). But a big part of the problem is there is no single
> definition (AFAIK) of what state a struct page should be in before handing it to
> the core-mm via either vm_insert_page()/pages()/etc. or migrate_vma_*() nor what
> state the kernel leaves it in once freed.

I agree with this, but I argue that prep_compound_page() should codify
whatever that requirement is so we can trend toward such an agreed
definition. See Matthew's first missive on this about doing things
properly in the core code instead of hacking in drivers.

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Matthew Brost 3 weeks, 2 days ago
On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> On 1/16/26 18:49, Jason Gunthorpe wrote:
> > On Fri, Jan 16, 2026 at 12:10:16PM +0100, Francois Dugast wrote:
> >> -void zone_device_page_init(struct page *page, unsigned int order)
> >> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> >> +			   unsigned int order)
> >>  {
> >> +	struct page *new_page = page;
> >> +	unsigned int i;
> >> +
> >>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> >>  
> >> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> >> +		struct folio *new_folio = (struct folio *)new_page;
> >> +
> >> +		/*
> >> +		 * new_page could have been part of previous higher order folio
> >> +		 * which encodes the order, in page + 1, in the flags bits. We
> >> +		 * blindly clear bits which could have set my order field here,
> >> +		 * including page head.
> >> +		 */
> >> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> >> +
> >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> >> +		/*
> >> +		 * This pointer math looks odd, but new_page could have been
> >> +		 * part of a previous higher order folio, which sets _nr_pages
> >> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> >> +		 * correctly locate the _nr_pages bits within new_page which
> >> +		 * could have modified by previous higher order folio.
> >> +		 */
> >> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> >> +#endif
> > 
> > This seems too weird, why is it in the loop?  There is only one
> > _nr_pages per folio.
> 
> I suppose we could be getting say an order-9 folio that was previously used
> as two order-8 folios? And each of them had their _nr_pages in their head

Yes, this is a good example. At this point we have no idea what the previous
allocation(s) order(s) were - we could have multiple places in the loop
where _nr_pages is populated, thus we have to clear this everywhere.

> and we can't know that at this point so we have to reset everything?
> 

Yes, see above, correct. We have no visibility into the previous state of the
pages, so the only option is to reset everything.

> AFAIU this would not be a problem if the clearing of the previous state was
> done upon freeing, as e.g. v4 did, but I think you also argued it meant
> processing the pages when freeing and then again at reallocation, so it's
> now like this instead?

Yes, if we clean up the previous folio state upon freeing, then this
problem goes away, but then we are back to passing in the order as an
argument to ->folio_free().

> 
> Or maybe you mean that stray _nr_pages in some tail page from previous
> lifetimes can't affect the current lifetime in a wrong way for something
> looking at said page? I don't know immediately.
> 
> > This is mostly zeroing some memory in the tail pages? Why?
> > 
> > Why can't this use the normal helpers, like memmap_init_compound()?
> > 
> >  struct folio *new_folio = page
> > 
> >  /* First 4 tail pages are part of struct folio */
> >  for (i = 4; i < (1UL << order); i++) {
> >      prep_compound_tail(..)
> >  }
> > 
> >  prep_comound_head(page, order)
> >  new_folio->_nr_pages = 0
> > 
> > ??

I've beat this to death with Alistair, normal helpers do not work here.

An order zero allocation could have _nr_pages set in its page,
new_folio->_nr_pages is page + 1 memory.

Matt

> > 
> > Jason
>
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 3 weeks, 2 days ago
On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> 
> Yes, this is a good example. At this point we have idea what previous
> allocation(s) order(s) were - we could have multiple places in the loop
> where _nr_pages is populated, thus we have to clear this everywhere. 

Why? The fact you have to use such a crazy expression to even access
_nr_pages strongly says nothing will read it as _nr_pages.

Explain each thing:

		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */

OK, the tail page flags need to be set right, and prep_compound_page()
called later depends on them being zero.

		((struct folio *)(new_page - 1))->_nr_pages = 0;

Can't see a reason, nothing reads _nr_pages from a random tail
page. _nr_pages is the last 8 bytes of struct page so it overlaps
memcg_data, which is also not supposed to be read from a tail page?

		new_folio->mapping = NULL;

Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;

		new_folio->pgmap = pgmap;	/* Also clear compound head */

Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);

		new_folio->share = 0;   /* fsdax only, unused for device private */

Not sure, certainly share isn't read from a tail page..

> > > Why can't this use the normal helpers, like memmap_init_compound()?
> > > 
> > >  struct folio *new_folio = page
> > > 
> > >  /* First 4 tail pages are part of struct folio */
> > >  for (i = 4; i < (1UL << order); i++) {
> > >      prep_compound_tail(..)
> > >  }
> > > 
> > >  prep_comound_head(page, order)
> > >  new_folio->_nr_pages = 0
> > > 
> > > ??
> 
> I've beat this to death with Alistair, normal helpers do not work here.

What do you mean? It already calls prep_compound_page()! The issue
seems to be that prep_compound_page() makes assumptions about what
values are in flags already?

So how about move that page flags mask logic into
prep_compound_tail()? I think that would help Vlastimil's
concern. That function is already touching most of the cache line so
an extra word shouldn't make a performance difference.

> An order zero allocation could have _nr_pages set in its page,
> new_folio->_nr_pages is page + 1 memory.

An order zero allocation does not have _nr_pages because it is in page
+1 memory that doesn't exist.

An order zero allocation might have memcg_data in the same slot, does
it need zeroing? If so why not add that to prep_compound_head() ?

Also, prep_compound_head() handles order 0 too:

	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
		atomic_set(&folio->_pincount, 0);
		atomic_set(&folio->_entire_mapcount, -1);
	}
	if (order > 1)
		INIT_LIST_HEAD(&folio->_deferred_list);

So some of the problem here looks to be not calling it:

	if (order)
		prep_compound_page(page, order);

So, remove that if ? Also shouldn't it be moved above the
set_page_count/lock_page ?

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Matthew Brost 3 weeks, 2 days ago
On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > > I suppose we could be getting say an order-9 folio that was previously used
> > > as two order-8 folios? And each of them had their _nr_pages in their head
> > 
> > Yes, this is a good example. At this point we have idea what previous
> > allocation(s) order(s) were - we could have multiple places in the loop
> > where _nr_pages is populated, thus we have to clear this everywhere. 
> 
> Why? The fact you have to use such a crazy expression to even access
> _nr_pages strongly says nothing will read it as _nr_pages.
> 
> Explain each thing:
> 
> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> 
> OK, the tail page flags need to be set right, and prep_compound_page()
> called later depends on them being zero.
> 
> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> 
> Can't see a reason, nothing reads _nr_pages from a random tail
> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> memcg_data, which is also not supposed to be read from a tail page?
> 
> 		new_folio->mapping = NULL;
> 
> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
> 
> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> 
> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> 
> 		new_folio->share = 0;   /* fsdax only, unused for device private */
> 
> Not sure, certainly share isn't read from a tail page..
> 
> > > > Why can't this use the normal helpers, like memmap_init_compound()?
> > > > 
> > > >  struct folio *new_folio = page
> > > > 
> > > >  /* First 4 tail pages are part of struct folio */
> > > >  for (i = 4; i < (1UL << order); i++) {
> > > >      prep_compound_tail(..)
> > > >  }
> > > > 
> > > >  prep_comound_head(page, order)
> > > >  new_folio->_nr_pages = 0
> > > > 
> > > > ??
> > 
> > I've beat this to death with Alistair, normal helpers do not work here.
> 
> What do you mean? It already calls prep_compound_page()! The issue
> seems to be that prep_compound_page() makes assumptions about what
> values are in flags already?
> 
> So how about move that page flags mask logic into
> prep_compound_tail()? I think that would help Vlastimil's
> concern. That function is already touching most of the cache line so
> an extra word shouldn't make a performance difference.
> 
> > An order zero allocation could have _nr_pages set in its page,
> > new_folio->_nr_pages is page + 1 memory.
> 
> An order zero allocation does not have _nr_pages because it is in page
> +1 memory that doesn't exist.
> 
> An order zero allocation might have memcg_data in the same slot, does
> it need zeroing? If so why not add that to prep_compound_head() ?
> 
> Also, prep_compound_head() handles order 0 too:
> 
> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> 		atomic_set(&folio->_pincount, 0);
> 		atomic_set(&folio->_entire_mapcount, -1);
> 	}
> 	if (order > 1)
> 		INIT_LIST_HEAD(&folio->_deferred_list);
> 
> So some of the problem here looks to be not calling it:
> 
> 	if (order)
> 		prep_compound_page(page, order);
> 
> So, remove that if ? Also shouldn't it be moved above the
> set_page_count/lock_page ?
> 

I'm not addressing each comment, some might be valid, others are not.

Ok, can I rework this in a follow-up - I will commit to that? Anything
we touch here is extremely sensitive to failures - Intel is the primary
test vector for any modification to device pages, from what I can tell.

The fact is that large device pages do not really work without this
patch, or prior revs. I’ve spent a lot of time getting large device
pages stable — both here and in the initial series, committing to help in
follow-on series touching SVM-related things.

I’m going to miss my merge window with this (RB’d) patch blocked for
large device pages. Expect my commitment to helping other vendors to
drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
try again.

Or perhaps we just revert large device pages in 6.19 if we can't get a
consensus here as we shouldn't ship a non-functional kernel.

Matt

> Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Balbir Singh 3 weeks, 2 days ago
On 1/17/26 14:55, Matthew Brost wrote:
> On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
>> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
>>>> I suppose we could be getting say an order-9 folio that was previously used
>>>> as two order-8 folios? And each of them had their _nr_pages in their head
>>>
>>> Yes, this is a good example. At this point we have idea what previous
>>> allocation(s) order(s) were - we could have multiple places in the loop
>>> where _nr_pages is populated, thus we have to clear this everywhere. 
>>
>> Why? The fact you have to use such a crazy expression to even access
>> _nr_pages strongly says nothing will read it as _nr_pages.
>>
>> Explain each thing:
>>
>> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>
>> OK, the tail page flags need to be set right, and prep_compound_page()
>> called later depends on them being zero.
>>
>> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
>>
>> Can't see a reason, nothing reads _nr_pages from a random tail
>> page. _nr_pages is the last 8 bytes of struct page so it overlaps
>> memcg_data, which is also not supposed to be read from a tail page?
>>
>> 		new_folio->mapping = NULL;
>>
>> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>>
>> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
>>
>> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
>>
>> 		new_folio->share = 0;   /* fsdax only, unused for device private */
>>
>> Not sure, certainly share isn't read from a tail page..
>>
>>>>> Why can't this use the normal helpers, like memmap_init_compound()?
>>>>>
>>>>>  struct folio *new_folio = page
>>>>>
>>>>>  /* First 4 tail pages are part of struct folio */
>>>>>  for (i = 4; i < (1UL << order); i++) {
>>>>>      prep_compound_tail(..)
>>>>>  }
>>>>>
>>>>>  prep_comound_head(page, order)
>>>>>  new_folio->_nr_pages = 0
>>>>>
>>>>> ??
>>>
>>> I've beat this to death with Alistair, normal helpers do not work here.
>>
>> What do you mean? It already calls prep_compound_page()! The issue
>> seems to be that prep_compound_page() makes assumptions about what
>> values are in flags already?
>>
>> So how about move that page flags mask logic into
>> prep_compound_tail()? I think that would help Vlastimil's
>> concern. That function is already touching most of the cache line so
>> an extra word shouldn't make a performance difference.
>>
>>> An order zero allocation could have _nr_pages set in its page,
>>> new_folio->_nr_pages is page + 1 memory.
>>
>> An order zero allocation does not have _nr_pages because it is in page
>> +1 memory that doesn't exist.
>>
>> An order zero allocation might have memcg_data in the same slot, does
>> it need zeroing? If so why not add that to prep_compound_head() ?
>>
>> Also, prep_compound_head() handles order 0 too:
>>
>> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
>> 		atomic_set(&folio->_pincount, 0);
>> 		atomic_set(&folio->_entire_mapcount, -1);
>> 	}
>> 	if (order > 1)
>> 		INIT_LIST_HEAD(&folio->_deferred_list);
>>
>> So some of the problem here looks to be not calling it:
>>
>> 	if (order)
>> 		prep_compound_page(page, order);
>>
>> So, remove that if ? Also shouldn't it be moved above the
>> set_page_count/lock_page ?
>>
> 
> I'm not addressing each comment, some might be valid, others are not.
> 
> Ok, can I rework this in a follow-up - I will commit to that? Anything
> we touch here is extremely sensitive to failures - Intel is the primary
> test vector for any modification to device pages for what I can tell.
> 
> The fact is that large device pages do not really work without this
> patch, or prior revs. I’ve spent a lot of time getting large device
> pages stable — both here and in the initial series, commiting to help in
> follow on series touch SVM related things.
> 

Matthew, I feel your frustration and appreciate your help.
For the current state of 6.19, your changes work for me, and I added a
Reviewed-by to the patch. It affects a small number of drivers and makes
them work for zone device folios. I am happy to maintain the changes
sent out as part of zone_device_page_init().

We can rework the details in a follow up series, there are many ideas
and ways of doing this (Jason, Alistair, Zi have good ideas as well).

> I’m going to miss my merge window with this (RB’d) patch blocked for
> large device pages. Expect my commitment to helping other vendors to
> drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> try again.
> 
> Or perhaps we just revert large device pages in 6.19 if we can't get a
> consensus here as we shouldn't ship a non-functional kernel.
> 
> Matt
> 
>> Jason

Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Matthew Brost 3 weeks, 2 days ago
On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> On 1/17/26 14:55, Matthew Brost wrote:
> > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> >>>> I suppose we could be getting say an order-9 folio that was previously used
> >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> >>>
> >>> Yes, this is a good example. At this point we have idea what previous
> >>> allocation(s) order(s) were - we could have multiple places in the loop
> >>> where _nr_pages is populated, thus we have to clear this everywhere. 
> >>
> >> Why? The fact you have to use such a crazy expression to even access
> >> _nr_pages strongly says nothing will read it as _nr_pages.
> >>
> >> Explain each thing:
> >>
> >> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> >>
> >> OK, the tail page flags need to be set right, and prep_compound_page()
> >> called later depends on them being zero.
> >>
> >> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> >>
> >> Can't see a reason, nothing reads _nr_pages from a random tail
> >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> >> memcg_data, which is also not supposed to be read from a tail page?
> >>
> >> 		new_folio->mapping = NULL;
> >>
> >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
> >>
> >> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> >>
> >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> >>
> >> 		new_folio->share = 0;   /* fsdax only, unused for device private */
> >>
> >> Not sure, certainly share isn't read from a tail page..
> >>
> >>>>> Why can't this use the normal helpers, like memmap_init_compound()?
> >>>>>
> >>>>>  struct folio *new_folio = page
> >>>>>
> >>>>>  /* First 4 tail pages are part of struct folio */
> >>>>>  for (i = 4; i < (1UL << order); i++) {
> >>>>>      prep_compound_tail(..)
> >>>>>  }
> >>>>>
> >>>>>  prep_comound_head(page, order)
> >>>>>  new_folio->_nr_pages = 0
> >>>>>
> >>>>> ??
> >>>
> >>> I've beat this to death with Alistair, normal helpers do not work here.
> >>
> >> What do you mean? It already calls prep_compound_page()! The issue
> >> seems to be that prep_compound_page() makes assumptions about what
> >> values are in flags already?
> >>
> >> So how about move that page flags mask logic into
> >> prep_compound_tail()? I think that would help Vlastimil's
> >> concern. That function is already touching most of the cache line so
> >> an extra word shouldn't make a performance difference.
> >>
> >>> An order zero allocation could have _nr_pages set in its page,
> >>> new_folio->_nr_pages is page + 1 memory.
> >>
> >> An order zero allocation does not have _nr_pages because it is in page
> >> +1 memory that doesn't exist.
> >>
> >> An order zero allocation might have memcg_data in the same slot, does
> >> it need zeroing? If so why not add that to prep_compound_head() ?
> >>
> >> Also, prep_compound_head() handles order 0 too:
> >>
> >> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> >> 		atomic_set(&folio->_pincount, 0);
> >> 		atomic_set(&folio->_entire_mapcount, -1);
> >> 	}
> >> 	if (order > 1)
> >> 		INIT_LIST_HEAD(&folio->_deferred_list);
> >>
> >> So some of the problem here looks to be not calling it:
> >>
> >> 	if (order)
> >> 		prep_compound_page(page, order);
> >>
> >> So, remove that if ? Also shouldn't it be moved above the
> >> set_page_count/lock_page ?
> >>
> > 
> > I'm not addressing each comment, some might be valid, others are not.
> > 
> > Ok, can I rework this in a follow-up - I will commit to that? Anything
> > we touch here is extremely sensitive to failures - Intel is the primary
> > test vector for any modification to device pages for what I can tell.
> > 
> > The fact is that large device pages do not really work without this
> > patch, or prior revs. I’ve spent a lot of time getting large device
> > pages stable — both here and in the initial series, commiting to help in
> > follow on series touch SVM related things.
> > 
> 
> Matthew, I feel your frustration and appreciate your help.
> For the current state of 6.19, your changes work for me, I added a
> Reviewed-by to the patch. It affects a small number of drivers and makes
> them work for zone device folios. I am happy to maintain the changes
> sent out as a part of zone_device_page_init()
> 

+1

> We can rework the details in a follow up series, there are many ideas
> and ways of doing this (Jason, Alistair, Zi have good ideas as well).
> 

I agree we can rework this in a follow-up — the core MM is hard, and for
valid reasons, but we can all work together on cleaning it up.

Matt

> > I’m going to miss my merge window with this (RB’d) patch blocked for
> > large device pages. Expect my commitment to helping other vendors to
> > drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> > try again.
> > 
> > Or perhaps we just revert large device pages in 6.19 if we can't get a
> > consensus here as we shouldn't ship a non-functional kernel.
> > 
> > Matt
> > 
> >> Jason
> 
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Alistair Popple 3 weeks ago
On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> > On 1/17/26 14:55, Matthew Brost wrote:
> > > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> > >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > >>>> I suppose we could be getting say an order-9 folio that was previously used
> > >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> > >>>
> > >>> Yes, this is a good example. At this point we have idea what previous
> > >>> allocation(s) order(s) were - we could have multiple places in the loop
> > >>> where _nr_pages is populated, thus we have to clear this everywhere. 
> > >>
> > >> Why? The fact you have to use such a crazy expression to even access
> > >> _nr_pages strongly says nothing will read it as _nr_pages.
> > >>
> > >> Explain each thing:
> > >>
> > >> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > >>
> > >> OK, the tail page flags need to be set right, and prep_compound_page()
> > >> called later depends on them being zero.
> > >>
> > >> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >>
> > >> Can't see a reason, nothing reads _nr_pages from a random tail
> > >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> > >> memcg_data, which is also not supposed to be read from a tail page?

This is (or was) either an order-0 page, a head page or a tail page, who
knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
supposed to be read from a tail page or not. What really matters is does any of
vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
particular value when called on this page?

AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
on an order-0 page, which means it has to be cleared.

Although I think it would be far less confusing if it were just written like
that rather than with the folio math, it achieves the same thing and is
technically correct.
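
I.e. (a sketch only, assuming CONFIG_MEMCG, NR_PAGES_IN_LARGE_FOLIO and
the layout discussed above, where _nr_pages occupies the last word of the
first tail struct page and so overlays that page's memcg_data), the two
spellings store to the same bytes:

	/* what the patch writes */
	((struct folio *)(new_page - 1))->_nr_pages = 0;

	/* arguably clearer spelling of the same store */
	new_page->memcg_data = 0;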

> > >> 		new_folio->mapping = NULL;
> > >>
> > >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;

Not pointless - vm_insert_page() for example expects folio_test_anon(), which
won't be the case if p->mapping was previously set to TAIL_MAPPING, so it
needs to be cleared. migrate_vma_setup() has a similar issue.

> > >>
> > >> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > >>
> > >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);

No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
struct pages ready for use by the core-mm which means the pgmap needs to be
correct.

> > >> 		new_folio->share = 0;   /* fsdax only, unused for device private */
> > >>
> > >> Not sure, certainly share isn't read from a tail page..

Yeah, not useful for now because FS DAX isn't using this function. Arguably it
should though.

> > >>>>> Why can't this use the normal helpers, like memmap_init_compound()?

Because that's not what this function is trying to do - eg. we might not be
trying to create a compound page. Although something like
memmap_init_zone_device() looks like it would be a good starting point, with the
page order being a parameter instead of read from the pgmap.

> > >>>>>
> > >>>>>  struct folio *new_folio = page
> > >>>>>
> > >>>>>  /* First 4 tail pages are part of struct folio */
> > >>>>>  for (i = 4; i < (1UL << order); i++) {
> > >>>>>      prep_compound_tail(..)
> > >>>>>  }
> > >>>>>
> > >>>>>  prep_comound_head(page, order)
> > >>>>>  new_folio->_nr_pages = 0
> > >>>>>
> > >>>>> ??
> > >>>
> > >>> I've beat this to death with Alistair, normal helpers do not work here.
> > >> What do you mean? It already calls prep_compound_page()! The issue
> > >> seems to be that prep_compound_page() makes assumptions about what
> > >> values are in flags already?
> > >>
> > >> So how about move that page flags mask logic into
> > >> prep_compound_tail()? I think that would help Vlastimil's
> > >> concern. That function is already touching most of the cache line so
> > >> an extra word shouldn't make a performance difference.
> > >>
> > >>> An order zero allocation could have _nr_pages set in its page,
> > >>> new_folio->_nr_pages is page + 1 memory.
> > >>
> > >> An order zero allocation does not have _nr_pages because it is in page
> > >> +1 memory that doesn't exist.
> > >>
> > >> An order zero allocation might have memcg_data in the same slot, does
> > >> it need zeroing? If so why not add that to prep_compound_head() ?
> > >>
> > >> Also, prep_compound_head() handles order 0 too:
> > >>
> > >> 	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> > >> 		atomic_set(&folio->_pincount, 0);
> > >> 		atomic_set(&folio->_entire_mapcount, -1);
> > >> 	}
> > >> 	if (order > 1)
> > >> 		INIT_LIST_HEAD(&folio->_deferred_list);
> > >>
> > >> So some of the problem here looks to be not calling it:
> > >>
> > >> 	if (order)
> > >> 		prep_compound_page(page, order);
> > >>
> > >> So, remove that if ? Also shouldn't it be moved above the
> > >> set_page_count/lock_page ?
> > >>
> > > 
> > > I'm not addressing each comment, some might be valid, others are not.

Hopefully some of my explainations above help.

> > > 
> > > Ok, can I rework this in a follow-up - I will commit to that? Anything
> > > we touch here is extremely sensitive to failures - Intel is the primary
> > > test vector for any modification to device pages for what I can tell.
> > > 
> > > The fact is that large device pages do not really work without this
> > > patch, or prior revs. I’ve spent a lot of time getting large device
> > > pages stable — both here and in the initial series, commiting to help in
> > > follow on series touch SVM related things.
> > > 
> > 
> > Matthew, I feel your frustration and appreciate your help.
> > For the current state of 6.19, your changes work for me, I added a
> > Reviewed-by to the patch. It affects a small number of drivers and makes
> > them work for zone device folios. I am happy to maintain the changes
> > sent out as a part of zone_device_page_init()

No problem with the above, and FWIW it seems correct. Although I suspect just
setting page->memcg_data = 0 would have been far less controversial ;)

> +1
> 
> > We can rework the details in a follow up series, there are many ideas
> > and ways of doing this (Jason, Alistair, Zi have good ideas as well).
> > 
> 
> I agree we can rework this in a follow-up — the core MM is hard, and for
> valid reasons, but we can all work together on cleaning it up.
> 
> Matt
> 
> > > I’m going to miss my merge window with this (RB’d) patch blocked for
> > > large device pages. Expect my commitment to helping other vendors to
> > > drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> > > try again.
> > > 
> > > Or perhaps we just revert large device pages in 6.19 if we can't get a
> > > consensus here as we shouldn't ship a non-functional kernel.
> > > 
> > > Matt
> > > 
> > >> Jason
> > 
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 3 weeks ago
On Mon, Jan 19, 2026 at 04:59:56PM +1100, Alistair Popple wrote:
> On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> > > On 1/17/26 14:55, Matthew Brost wrote:
> > > > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> > > >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > > >>>> I suppose we could be getting say an order-9 folio that was previously used
> > > >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> > > >>>
> > > >>> Yes, this is a good example. At this point we have idea what previous
> > > >>> allocation(s) order(s) were - we could have multiple places in the loop
> > > >>> where _nr_pages is populated, thus we have to clear this everywhere. 
> > > >>
> > > >> Why? The fact you have to use such a crazy expression to even access
> > > >> _nr_pages strongly says nothing will read it as _nr_pages.
> > > >>
> > > >> Explain each thing:
> > > >>
> > > >> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > > >>
> > > >> OK, the tail page flags need to be set right, and prep_compound_page()
> > > >> called later depends on them being zero.
> > > >>
> > > >> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > > >>
> > > >> Can't see a reason, nothing reads _nr_pages from a random tail
> > > >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> > > >> memcg_data, which is also not supposed to be read from a tail page?
> 
> This is (or was) either a order-0 page, a head page or a tail page, who
> knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
> supposed to be read from a tail page or not. What really matters is does any of
> vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
> particular value when called on this page?

This weird expression is doing three things,
1) it is zeroing memcg on the head page
2) it is zeroing _nr_pages on the head folio
3) it is zeroing memcg on all the tail pages.

Are you arguing for 1, 2 or 3?

#1 is missing today
#2 is handled directly by the prep_compound_page() -> prep_compound_head() -> folio_set_order()
#3 I argue isn't necessary.

> AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
> on an order-0 page, which means it has to be cleared.

Great, so lets write that in prep_compound_head()!

> Although I think it would be far less confusing if it was just written like that
> rather than the folio math but it achieves the same thing and is technically
> correct.

I have yet to hear a reason to do #3.

> > > >> 		new_folio->mapping = NULL;
> > > >>
> > > >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>
> Not pointless - vm_insert_page() for example expects folio_test_anon() which
> which won't be the case if p->mapping was previously set to TAIL_MAPPING so it
> needs to be cleared. migrate_vma_setup() has a similar issue.

It is pointless to put it in the loop! Sure set the head page.

> > > >> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > > >>
> > > >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
> 
> No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
> struct pages ready for use by the core-mm which means the pgmap needs to be
> correct.

See above, same issue. The tail pages have pgmap set to NULL because
prep_compound_tail() does it. So why do we set it to pgmap here and
then clear it a few lines below?

Set it once in the head folio outside this loop.

> No problem with the above, and FWIW it seems correct. Although I suspect just
> setting page->memcg_data = 0 would have been far less controversial ;)

It is "correct" but horrible.

What is wrong with this? Isn't it so much better and more efficient??

diff --git a/mm/internal.h b/mm/internal.h
index e430da900430a1..a7d3f5e4b85e49 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 		atomic_set(&folio->_pincount, 0);
 		atomic_set(&folio->_entire_mapcount, -1);
 	}
-	if (order > 1)
+	if (order > 1) {
 		INIT_LIST_HEAD(&folio->_deferred_list);
+	} else {
+		folio->mapping = NULL;
+#ifdef CONFIG_MEMCG
+		folio->memcg_data = 0;
+#endif
+	}
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
 {
 	struct page *p = head + tail_idx;
 
+	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
 	p->mapping = TAIL_MAPPING;
 	set_compound_head(p, head);
 	set_page_private(p, 0);
diff --git a/mm/memremap.c b/mm/memremap.c
index 4c2e0d68eb2798..7ec034c11068e1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -479,19 +479,23 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page, unsigned int order)
+void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
+			   unsigned int order)
 {
 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+	struct folio *folio;
 
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
 	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
-	set_page_count(page, 1);
-	lock_page(page);
 
-	if (order)
-		prep_compound_page(page, order);
+	prep_compound_page(page, order);
+
+	folio = page_folio(page);
+	folio->pgmap = pgmap;
+	folio_lock(folio);
+	folio_set_count(folio, 1);
 }
 EXPORT_SYMBOL_GPL(zone_device_page_init);

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Zi Yan 2 weeks, 6 days ago
On 19 Jan 2026, at 9:20, Jason Gunthorpe wrote:

> On Mon, Jan 19, 2026 at 04:59:56PM +1100, Alistair Popple wrote:
>> On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
>>> On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
>>>> On 1/17/26 14:55, Matthew Brost wrote:
>>>>> On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
>>>>>> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
>>>>>>>> I suppose we could be getting say an order-9 folio that was previously used
>>>>>>>> as two order-8 folios? And each of them had their _nr_pages in their head
>>>>>>>
>>>>>>> Yes, this is a good example. At this point we have no idea what previous
>>>>>>> allocation(s) order(s) were - we could have multiple places in the loop
>>>>>>> where _nr_pages is populated, thus we have to clear this everywhere.
>>>>>>
>>>>>> Why? The fact you have to use such a crazy expression to even access
>>>>>> _nr_pages strongly says nothing will read it as _nr_pages.
>>>>>>
>>>>>> Explain each thing:
>>>>>>
>>>>>> 		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>>>>>
>>>>>> OK, the tail page flags need to be set right, and prep_compound_page()
>>>>>> called later depends on them being zero.
>>>>>>
>>>>>> 		((struct folio *)(new_page - 1))->_nr_pages = 0;
>>>>>>
>>>>>> Can't see a reason, nothing reads _nr_pages from a random tail
>>>>>> page. _nr_pages is the last 8 bytes of struct page so it overlaps
>>>>>> memcg_data, which is also not supposed to be read from a tail page?
>>
>> This is (or was) either a order-0 page, a head page or a tail page, who
>> knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
>> supposed to be read from a tail page or not. What really matters is does any of
>> vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
>> particular value when called on this page?
>
> This weird expression is doing three things,
> 1) it is zeroing memcg on the head page
> 2) it is zeroing _nr_pages on the head folio
> 3) it is zeroing memcg on all the tail pages.
>
> Are you arguing for 1, 2 or 3?
>
> #1 is missing today
> #2 is handled directly by the prep_compound_page() -> prep_compound_head() -> folio_set_order()
> #3 I argue isn't necessary.
>
>> AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
>> on an order-0 page, which means it has to be cleared.
>
> Great, so lets write that in prep_compound_head()!
>
>> Although I think it would be far less confusing if it was just written like that
>> rather than the folio math but it achieves the same thing and is technically
>> correct.
>
> I have yet to hear a reason to do #3.
>
>>>>>> 		new_folio->mapping = NULL;
>>>>>>
>>>>>> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>>
>> Not pointless - vm_insert_page() for example expects folio_test_anon(),
>> which won't be the case if p->mapping was previously set to TAIL_MAPPING,
>> so it needs to be cleared. migrate_vma_setup() has a similar issue.
>
> It is pointless to put it in the loop! Sure set the head page.
>
>>>>>> 		new_folio->pgmap = pgmap;	/* Also clear compound head */
>>>>>>
>>>>>> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
>>
>> No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
>> struct pages ready for use by the core-mm which means the pgmap needs to be
>> correct.
>
> See above, same issue. The tail pages have pgmap set to NULL because
> prep_compound_tail() does it. So why do we set it to pgmap here and
> then clear it a few lines below?
>
> Set it once in the head folio outside this loop.
>
>> No problem with the above, and FWIW it seems correct. Although I suspect just
>> setting page->memcg_data = 0 would have been far less controversial ;)
>
> It is "correct" but horrible.
>
> What is wrong with this? Isn't it so much better and more efficient??
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e430da900430a1..a7d3f5e4b85e49 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>  		atomic_set(&folio->_pincount, 0);
>  		atomic_set(&folio->_entire_mapcount, -1);
>  	}
> -	if (order > 1)
> +	if (order > 1) {
>  		INIT_LIST_HEAD(&folio->_deferred_list);
> +	} else {
> +		folio->mapping = NULL;
> +#ifdef CONFIG_MEMCG
> +		folio->memcg_data = 0;
> +#endif
> +	}

prep_compound_head() is only called on >0 order pages. The above
code means when order == 1, folio->mapping and folio->memcg_data are
assigned NULL.

>  }
>
>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>  {
>  	struct page *p = head + tail_idx;
>
> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */

No one cares about tail page flags if it is not checked in check_new_page()
from mm/page_alloc.c.

>  	p->mapping = TAIL_MAPPING;
>  	set_compound_head(p, head);
>  	set_page_private(p, 0);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 4c2e0d68eb2798..7ec034c11068e1 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -479,19 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +	struct folio *folio;
>
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().
>  	 */
>  	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> -	set_page_count(page, 1);
> -	lock_page(page);
>
> -	if (order)
> -		prep_compound_page(page, order);
> +	prep_compound_page(page, order);

prep_compound_page() should only be called for >0 order pages. This creates
another weirdness in device pages by assuming all pages are compound.

> +
> +	folio = page_folio(page);
> +	folio->pgmap = pgmap;
> +	folio_lock(folio);
> +	folio_set_count(folio, 1);

/* clear possible previous page->mapping */
folio->mapping = NULL;

/* clear possible previous page->_nr_pages */
#ifdef CONFIG_MEMCG
	folio->memcg_data = 0;
#endif

With the two above, and still calling prep_compound_page() only when
order > 0, the code should work. There is no need to change the
prep_compound_*() functions.
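
(Spelled out, that combination would look roughly like the sketch below:
an untested merge of the diff quoted above with the two clears, purely
for illustration, not a final patch.)

void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
			   unsigned int order)
{
	struct folio *folio;

	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);

	/*
	 * Drivers shouldn't be allocating pages after calling
	 * memunmap_pages().
	 */
	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));

	if (order)
		prep_compound_page(page, order);

	folio = page_folio(page);
	folio->pgmap = pgmap;
	/* clear possible previous page->mapping */
	folio->mapping = NULL;
#ifdef CONFIG_MEMCG
	/* clear possible previous memcg_data (aliases _nr_pages) */
	folio->memcg_data = 0;
#endif
	folio_lock(folio);
	folio_set_count(folio, 1);
}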

>  }
>  EXPORT_SYMBOL_GPL(zone_device_page_init);


This patch mixed the concepts of page and folio together, thus
causing confusion. Core MM sees page and folio as two separate things:
1. page is the smallest internal physical memory management unit,
2. folio is an abstraction on top of pages, and other abstractions can be
   slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).

A compound page is a high-order page whose subpages are all managed as a
whole, but it is only converted to a folio by page_rmappable_folio() (see
__folio_alloc_noprof()). And a slab page can be a compound page too (see
how page_slab() performs a compound_head()-like operation). So a compound
page is not the same as a folio.

I can see folio is used in prep_compound_head()
and think it is confusing, since these pages should not be regarded as
a folio yet. I probably blame willy (cc'd), since he started it from commit
94688e8eb453 ("mm: remove folio_pincount_ptr() and head_compound_pincount()")
and before that prep_compound_head() was all about pages. folio_set_order()
was set_compound_order() before commit 1e3be4856f49d ("mm/folio: replace
set_compound_order with folio_set_order").

If device pages have to be initialized on top of pages with obsolete
states, they should at least be initialized first as pages, then as
folios, to avoid confusion.


--
Best Regards,
Yan, Zi
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 2 weeks, 6 days ago
On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
> > diff --git a/mm/internal.h b/mm/internal.h
> > index e430da900430a1..a7d3f5e4b85e49 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
> >  		atomic_set(&folio->_pincount, 0);
> >  		atomic_set(&folio->_entire_mapcount, -1);
> >  	}
> > -	if (order > 1)
> > +	if (order > 1) {
> >  		INIT_LIST_HEAD(&folio->_deferred_list);
> > +	} else {
> > +		folio->mapping = NULL;
> > +#ifdef CONFIG_MEMCG
> > +		folio->memcg_data = 0;
> > +#endif
> > +	}
> 
> prep_compound_head() is only called on >0 order pages. The above
> code means when order == 1, folio->mapping and folio->memcg_data are
> assigned NULL.

OK, fair enough, the conditionals would have to change and maybe it
shouldn't be called "compound_head" if it also cleans up normal pages.

> >  static inline void prep_compound_tail(struct page *head, int tail_idx)
> >  {
> >  	struct page *p = head + tail_idx;
> >
> > +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> 
> No one cares about tail page flags if it is not checked in check_new_page()
> from mm/page_alloc.c.

At least page_fixed_fake_head() does check PG_head in some
configurations. It does seem safer to clear it. Possibly order is
never used, but it is free to clear it.

> > -	if (order)
> > -		prep_compound_page(page, order);
> > +	prep_compound_page(page, order);
> 
> prep_compound_page() should only be called for >0 order pages. This creates
> another weirdness in device pages by assuming all pages are
> compound.

OK

> > +	folio = page_folio(page);
> > +	folio->pgmap = pgmap;
> > +	folio_lock(folio);
> > +	folio_set_count(folio, 1);
> 
> /* clear possible previous page->mapping */
> folio->mapping = NULL;
> 
> /* clear possible previous page->_nr_pages */
> #ifdef CONFIG_MEMCG
> 	folio->memcg_data = 0;
> #endif

This is reasonable too, but prep_compound_head() was doing more than
that, it is also clearing the order, and this needs to clear the head
bit.  That's why it was appealing to reuse those functions, but you
are right they are not ideal.

I suppose we want some prep_single_page(page) and some reorg to share
code with the other prep function.

> This patch mixed the concept of page and folio together, thus
> causing confusion. Core MM sees page and folio two separate things:
> 1. page is the smallest internal physical memory management unit,
> 2. folio is an abstraction on top of pages, and other abstractions can be
>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).

I think the users of zone_device_page_init() are principally trying to
create something that can be installed in a non-special PTE. Meaning
the output is always a folio because it is going to be read as a folio
in the page walkers.

Thus, the job of this function is to take the memory range starting at
page for 2^order and turn it into a single valid folio with refcount
of 1.

> If device pages have to initialize on top of pages with obsolete states,
> at least it should be first initialized as pages, then as folios to avoid
> confusion.

I don't think so. It should do the above job efficiently and iterate
over the page list exactly once.

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Balbir Singh 2 weeks, 6 days ago
On 1/20/26 07:35, Jason Gunthorpe wrote:
> On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index e430da900430a1..a7d3f5e4b85e49 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>>  		atomic_set(&folio->_pincount, 0);
>>>  		atomic_set(&folio->_entire_mapcount, -1);
>>>  	}
>>> -	if (order > 1)
>>> +	if (order > 1) {
>>>  		INIT_LIST_HEAD(&folio->_deferred_list);
>>> +	} else {
>>> +		folio->mapping = NULL;
>>> +#ifdef CONFIG_MEMCG
>>> +		folio->memcg_data = 0;
>>> +#endif
>>> +	}
>>
>> prep_compound_head() is only called on >0 order pages. The above
>> code means when order == 1, folio->mapping and folio->memcg_data are
>> assigned NULL.
> 
> OK, fair enough, the conditionals would have to change and maybe it
> shouldn't be called "compound_head" if it also cleans up normal pages.
> 
>>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>>>  {
>>>  	struct page *p = head + tail_idx;
>>>
>>> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>
>> No one cares about tail page flags if it is not checked in check_new_page()
>> from mm/page_alloc.c.
> 
> At least page_fixed_fake_head() does check PG_head in some
> configurations. It does seem safer to clear it. Possibly order is
> never used, but it is free to clear it.
> 
>>> -	if (order)
>>> -		prep_compound_page(page, order);
>>> +	prep_compound_page(page, order);
>>
>> prep_compound_page() should only be called for >0 order pages. This creates
>> another weirdness in device pages by assuming all pages are
>> compound.
> 
> OK
> 
>>> +	folio = page_folio(page);
>>> +	folio->pgmap = pgmap;
>>> +	folio_lock(folio);
>>> +	folio_set_count(folio, 1);
>>
>> /* clear possible previous page->mapping */
>> folio->mapping = NULL;
>>
>> /* clear possible previous page->_nr_pages */
>> #ifdef CONFIG_MEMCG
>> 	folio->memcg_data = 0;
>> #endif
> 
> This is reasonable too, but prep_compound_head() was doing more than
> that, it is also clearing the order, and this needs to clear the head
> bit.  That's why it was appealing to reuse those functions, but you
> are right they are not ideal.
> 
> I suppose we want some prep_single_page(page) and some reorg to share
> code with the other prep function.
> 

There are __init_zone_device_page() and __init_single_page();
they zero out the page and set the zone, pfn, and nid among other things.
I propose we use the current version with zone_device_free_folio() as is.

We can figure out whether __init_zone_device_page() can be reused or
refactored for the purpose of doing this with core MM APIs.


>> This patch mixed the concept of page and folio together, thus
>> causing confusion. Core MM sees page and folio two separate things:
>> 1. page is the smallest internal physical memory management unit,
>> 2. folio is an abstraction on top of pages, and other abstractions can be
>>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).
> 
> I think the users of zone_device_page_init() are principally trying to
> create something that can be installed in a non-special PTE. Meaning
> the output is always a folio because it is going to be read as a folio
> in the page walkers.
> 
> Thus, the job of this function is to take the memory range starting at
> page for 2^order and turn it into a single valid folio with refcount
> of 1.
> 
>> If device pages have to initialize on top of pages with obsolete states,
>> at least it should be first initialized as pages, then as folios to avoid
>> confusion.
> 
> I don't think so. It should do the above job efficiently and iterate
> over the page list exactly once.
> 
> Jason

Agreed

Balbir
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Zi Yan 2 weeks, 6 days ago
On 19 Jan 2026, at 17:15, Balbir Singh wrote:

> On 1/20/26 07:35, Jason Gunthorpe wrote:
>> On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index e430da900430a1..a7d3f5e4b85e49 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>>>  		atomic_set(&folio->_pincount, 0);
>>>>  		atomic_set(&folio->_entire_mapcount, -1);
>>>>  	}
>>>> -	if (order > 1)
>>>> +	if (order > 1) {
>>>>  		INIT_LIST_HEAD(&folio->_deferred_list);
>>>> +	} else {
>>>> +		folio->mapping = NULL;
>>>> +#ifdef CONFIG_MEMCG
>>>> +		folio->memcg_data = 0;
>>>> +#endif
>>>> +	}
>>>
>>> prep_compound_head() is only called on >0 order pages. The above
>>> code means when order == 1, folio->mapping and folio->memcg_data are
>>> assigned NULL.
>>
>> OK, fair enough, the conditionals would have to change and maybe it
>> shouldn't be called "compound_head" if it also cleans up normal pages.
>>
>>>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>>>>  {
>>>>  	struct page *p = head + tail_idx;
>>>>
>>>> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>>
>>> No one cares about tail page flags if it is not checked in check_new_page()
>>> from mm/page_alloc.c.
>>
>> At least page_fixed_fake_head() does check PG_head in some
>> configurations. It does seem safer to clear it. Possibly order is
>> never used, but it is free to clear it.
>>
>>>> -	if (order)
>>>> -		prep_compound_page(page, order);
>>>> +	prep_compound_page(page, order);
>>>
>>> prep_compound_page() should only be called for >0 order pages. This creates
>>> another weirdness in device pages by assuming all pages are
>>> compound.
>>
>> OK
>>
>>>> +	folio = page_folio(page);
>>>> +	folio->pgmap = pgmap;
>>>> +	folio_lock(folio);
>>>> +	folio_set_count(folio, 1);
>>>
>>> /* clear possible previous page->mapping */
>>> folio->mapping = NULL;
>>>
>>> /* clear possible previous page->_nr_pages */
>>> #ifdef CONFIG_MEMCG
>>> 	folio->memcg_data = 0;
>>> #endif
>>
>> This is reasonable too, but prep_compound_head() was doing more than
>> that, it is also clearing the order, and this needs to clear the head
>> bit.  That's why it was apppealing to reuse those functions, but you
>> are right they are not ideal.

PG_head is and must be bit 6, which means the stored order would need to
be at least 2^6 = 64 for it to get set. Who allocates a folio with that
large an order? This p->flags.f &= ~0xffUL thing is unnecessary. What
really needs to be done is folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP to
make sure the new folio flags are the same as those of newly allocated
folios from the core MM page allocator.

>>
>> I suppose we want some prep_single_page(page) and some reorg to share
>> code with the other prep function.

This is just an unnecessary need due to lack of knowledge of/do not want
to investigate core MM page and folio initialization code.

>>
>
> There is __init_zone_device_page() and __init_single_page(),
> it does zero out the page and sets the zone, pfn, nid among other things.
> I propose we use the current version with zone_device_free_folio() as is.
>
> We can figure out if __init_zone_device_page() can be reused or refactored
> for the purposes to doing this with core MM API's
>
>
>>> This patch mixed the concept of page and folio together, thus
>>> causing confusion. Core MM sees page and folio two separate things:
>>> 1. page is the smallest internal physical memory management unit,
>>> 2. folio is an abstraction on top of pages, and other abstractions can be
>>>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).
>>
>> I think the users of zone_device_page_init() are principally trying to
>> create something that can be installed in a non-special PTE. Meaning
>> the output is always a folio because it is going to be read as a folio
>> in the page walkers.
>>
>> Thus, the job of this function is to take the memory range starting at
>> page for 2^order and turn it into a single valid folio with refcount
>> of 1.
>>
>>> If device pages have to initialize on top of pages with obsolete states,
>>> at least it should be first initialized as pages, then as folios to avoid
>>> confusion.
>>
>> I don't think so. It should do the above job efficiently and iterate
>> over the page list exactly once.

folio initialization should not iterate over any page list, since folio is
supposed to be treated as a whole instead of individual pages.

Based on my understanding,

folio->mapping = NULL;
folio->memcg_data = 0;
folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

should be enough.

if (order)
	folio_set_large_rmappable(folio);

is done at zone_device_folio_init().

Best Regards,
Yan, Zi
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Balbir Singh 2 weeks, 5 days ago
On 1/20/26 13:50, Zi Yan wrote:
> On 19 Jan 2026, at 17:15, Balbir Singh wrote:
> 
>> On 1/20/26 07:35, Jason Gunthorpe wrote:
>>> On Mon, Jan 19, 2026 at 03:09:00PM -0500, Zi Yan wrote:
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index e430da900430a1..a7d3f5e4b85e49 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>>>>>  		atomic_set(&folio->_pincount, 0);
>>>>>  		atomic_set(&folio->_entire_mapcount, -1);
>>>>>  	}
>>>>> -	if (order > 1)
>>>>> +	if (order > 1) {
>>>>>  		INIT_LIST_HEAD(&folio->_deferred_list);
>>>>> +	} else {
>>>>> +		folio->mapping = NULL;
>>>>> +#ifdef CONFIG_MEMCG
>>>>> +		folio->memcg_data = 0;
>>>>> +#endif
>>>>> +	}
>>>>
>>>> prep_compound_head() is only called on >0 order pages. The above
>>>> code means when order == 1, folio->mapping and folio->memcg_data are
>>>> assigned NULL.
>>>
>>> OK, fair enough, the conditionals would have to change and maybe it
>>> shouldn't be called "compound_head" if it also cleans up normal pages.
>>>
>>>>>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>>>>>  {
>>>>>  	struct page *p = head + tail_idx;
>>>>>
>>>>> +	p->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>>>>
>>>> No one cares about tail page flags if it is not checked in check_new_page()
>>>> from mm/page_alloc.c.
>>>
>>> At least page_fixed_fake_head() does check PG_head in some
>>> configurations. It does seem safer to clear it. Possibly order is
>>> never used, but it is free to clear it.
>>>
>>>>> -	if (order)
>>>>> -		prep_compound_page(page, order);
>>>>> +	prep_compound_page(page, order);
>>>>
>>>> prep_compound_page() should only be called for >0 order pages. This creates
>>>> another weirdness in device pages by assuming all pages are
>>>> compound.
>>>
>>> OK
>>>
>>>>> +	folio = page_folio(page);
>>>>> +	folio->pgmap = pgmap;
>>>>> +	folio_lock(folio);
>>>>> +	folio_set_count(folio, 1);
>>>>
>>>> /* clear possible previous page->mapping */
>>>> folio->mapping = NULL;
>>>>
>>>> /* clear possible previous page->_nr_pages */
>>>> #ifdef CONFIG_MEMCG
>>>> 	folio->memcg_data = 0;
>>>> #endif
>>>
>>> This is reasonable too, but prep_compound_head() was doing more than
>>> that, it is also clearing the order, and this needs to clear the head
>>> bit.  That's why it was appealing to reuse those functions, but you
>>> are right they are not ideal.
> 
> PG_head is and must be bit 6, that means the stored order needs to be
> at least 2^6=64 to get it set. Who allocates a folio with that large order?
> This p->flags.f &= ~0xffUL thing is unnecessary. What really needs
> to be done is folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP to make
> sure the new folio flags are the same as newly allocated folios
> from core MM page allocator.
> 
>>>
>>> I suppose we want some prep_single_page(page) and some reorg to share
>>> code with the other prep function.
> 
> This is just an unnecessary need due to lack of knowledge of/do not want
> to investigate core MM page and folio initialization code.
> 
>>>
>>
>> There is __init_zone_device_page() and __init_single_page(),
>> it does zero out the page and sets the zone, pfn, nid among other things.
>> I propose we use the current version with zone_device_free_folio() as is.
>>
>> We can figure out if __init_zone_device_page() can be reused or refactored
>> for the purposes to doing this with core MM API's
>>
>>
>>>> This patch mixed the concept of page and folio together, thus
>>>> causing confusion. Core MM sees page and folio two separate things:
>>>> 1. page is the smallest internal physical memory management unit,
>>>> 2. folio is an abstraction on top of pages, and other abstractions can be
>>>>    slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).
>>>
>>> I think the users of zone_device_page_init() are principally trying to
>>> create something that can be installed in a non-special PTE. Meaning
>>> the output is always a folio because it is going to be read as a folio
>>> in the page walkers.
>>>
>>> Thus, the job of this function is to take the memory range starting at
>>> page for 2^order and turn it into a single valid folio with refcount
>>> of 1.
>>>
>>>> If device pages have to initialize on top of pages with obsolete states,
>>>> at least it should be first initialized as pages, then as folios to avoid
>>>> confusion.
>>>
>>> I don't think so. It should do the above job efficiently and iterate
>>> over the page list exactly once.
> 
> folio initialization should not iterate over any page list, since folio is
> supposed to be treated as a whole instead of individual pages.
> 
> Based on my understanding,
> 
> folio->mapping = NULL;
> folio->memcg_data = 0;
> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> should be enough.
> 

I think it should be enough as well; worst case, memcg_data is aliased
with slab_obj_exts, but we don't expect zone device folios to have
slab_obj_exts set.

folio->memcg_data needs to be under an #ifdef CONFIG_MEMCG and folio->mapping
was set to NULL during previous free (one could assume it's unchanged)


> if (order)
> 	folio_set_large_rmappable(folio);
> 
> is done at zone_device_folio_init().
> 



Balbir
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 2 weeks, 6 days ago
On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
> >> I suppose we want some prep_single_page(page) and some reorg to share
> >> code with the other prep function.
> 
> This is just an unnecessary need due to lack of knowledge of/do not want
> to investigate core MM page and folio initialization code.

It will be better to keep this related code together, not spread all
around.

> >> I don't think so. It should do the above job efficiently and iterate
> >> over the page list exactly once.
> 
> folio initialization should not iterate over any page list, since folio is
> supposed to be treated as a whole instead of individual pages.

The tail pages need to have the right data in them or compound_head
won't work.
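
(For context, the head pointer is encoded in every tail page roughly as
in the simplified sketch below; this only approximates the real
page-flags helpers and is for illustration only.)

	/* set_compound_head(tail, head): store the head pointer, tagged with bit 0 */
	WRITE_ONCE(tail->compound_head, (unsigned long)head + 1);

	/* compound_head(page): follow the tagged pointer when present */
	unsigned long h = READ_ONCE(page->compound_head);
	if (h & 1)
		return (struct page *)(h - 1);
	return page;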

> folio->mapping = NULL;
> folio->memcg_data = 0;
> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> 
> should be enough.

This seems believable to me for setting up an order 0 page.

> if (order)
> 	folio_set_large_rmappable(folio);

That one is in zone_device_folio_init()

And maybe the naming has got really confused if we have both functions
now :\

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Zi Yan 2 weeks, 5 days ago
On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:

> On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
>>>> I suppose we want some prep_single_page(page) and some reorg to share
>>>> code with the other prep function.
>>
>> This is just an unnecessary need due to lack of knowledge of/do not want
>> to investigate core MM page and folio initialization code.
>
> It will be better to keep this related code together, not spread all
> around.

Or clarify what code is for preparing pages, which would go away at memdesc
time, and what code is for preparing folios, which would stay.

>
>>>> I don't think so. It should do the above job efficiently and iterate
>>>> over the page list exactly once.
>>
>> folio initialization should not iterate over any page list, since folio is
>> supposed to be treated as a whole instead of individual pages.
>
> The tail pages need to have the right data in them or compound_head
> won't work.

That is done by set_compound_head() in prep_compound_tail().
prep_compound_page() takes care of it. As long as it is called, even if
the pages in that compound page have random states before, the compound
page should function correctly afterwards.

>
>> folio->mapping = NULL;
>> folio->memcg_data = 0;
>> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>
>> should be enough.
>
> This seems believable to me for setting up an order 0 page.

It works for any folio, regardless of its order. Fields used in the
second or third subpages are all taken care of by prep_compound_page().

>
>> if (order)
>> 	folio_set_large_rmappable(folio);
>
> That one is in zone_device_folio_init()

Yes. And the code location looks right to me.

>
> And maybe the naming has got really confused if we have both functions
> now :\

Yes. One of the issues is that device private code used to only handle
order-0 pages and was converted to use high order folios directly without
using high order pages (namely compound pages) as an intermediate step.
This two-step-in-one caused confusion. But the key thing to avoid the
confusion is that to form a high order folio, a list of contiguous pages
would become a compound page by calling prep_compound_page(), then
the compound page becomes a folio by calling folio_set_large_rmappable().
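
(In code form, the two steps read roughly as in the sketch below; this is
only the shape being described, not the exact upstream
zone_device_folio_init().)

	/* step 1: a run of contiguous pages becomes a compound page */
	if (order)
		prep_compound_page(page, order);

	/* step 2: the compound page becomes a (large rmappable) folio */
	folio = page_folio(page);
	if (order)
		folio_set_large_rmappable(folio);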

BTW, the code in prep_compound_head() after folio_set_order(folio, order)
should belong to folio_set_large_rmappable() and they are causing confusion,
since they are only applicable to rmappable large folios. I am going to
send a patch to fix it.


Best Regards,
Yan, Zi
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 2 weeks, 4 days ago
On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
> 
> > On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
> >>>> I suppose we want some prep_single_page(page) and some reorg to share
> >>>> code with the other prep function.
> >>
> >> This is just an unnecessary need due to lack of knowledge of/do not want
> >> to investigate core MM page and folio initialization code.
> >
> > It will be better to keep this related code together, not spread all
> > around.
> 
> Or clarify what code is for preparing pages, which would go away at memdesc
> time, and what code is for preparing folios, which would stay.

That comes back to the question of 'what are the rules for frozen
pages'

Now that we have frozen pages, where the frozen owner can use some of
the struct page memory however it likes, that memory needs to be reset
before the page is thawed and converted back to a folio.

memdesc time is only useful for memory that is not writable by frozen
owners - basically must be constant forever.

> >
> >>>> I don't think so. It should do the above job efficiently and iterate
> >>>> over the page list exactly once.
> >>
> >> folio initialization should not iterate over any page list, since folio is
> >> supposed to be treated as a whole instead of individual pages.
> >
> > The tail pages need to have the right data in them or compound_head
> > won't work.
> 
> That is done by set_compound_head() in prep_compound_tail().

Inside a page loop :)

	__SetPageHead(page);
	for (i = 1; i < nr_pages; i++)
		prep_compound_tail(page, i);

> Yes. One of the issues is that device private code used to only handles
> order-0 pages and was converted to use high order folio directly without
> using high order page (namely compound page) as an intermediate step.
> This two-step-in-one caused confusion. But the key thing to avoid the
> confusion is that to form a high order folio, a list of contiguous pages
> would become a compound page by calling prep_compound_page(), then
> the compound page becomes a folio by calling folio_set_large_rmappable().

That seems logical to me.

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Zi Yan 2 weeks, 3 days ago
On 22 Jan 2026, at 10:46, Jason Gunthorpe wrote:

> On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
>> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
>>
>>> On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
>>>>>> I suppose we want some prep_single_page(page) and some reorg to share
>>>>>> code with the other prep function.
>>>>
>>>> This is just an unnecessary need due to lack of knowledge of/do not want
>>>> to investigate core MM page and folio initialization code.
>>>
>>> It will be better to keep this related code together, not spread all
>>> around.
>>
>> Or clarify what code is for preparing pages, which would go away at memdesc
>> time, and what code is for preparing folios, which would stay.
>
> That comes back to the question of 'what are the rules for frozen
> pages'
>
> Now that we have frozen pages where the frozen owner can use some of
> the struct page memory however it likes that memory needs to be reset
> before the page is thawed and converted back to a folio.

Based on my understanding, a frozen folio cannot be changed however the
owner wants, since the modifications need to prevent parallel scanners
from misusing the folio. For example, PFN scanners like memory compaction
need to know this is a frozen folio with a certain order, so that they
will skip it as a whole. But if you change the frozen folio in a way
that a parallel scanner cannot recognize the right order (e.g., the frozen
folio order becomes lower) and finds some of the subpages have non-zero
refcounts, it can cause issues.

But I assume device private pages do not have such a parallel scanner
looking at each struct page one by one and examining their state.

>
> memdesc time is only useful for memory that is not writable by frozen
> owners - basically must be constant forever.

Bits 0-3 of memdesc are a type field, so the owner should be able to
set it, so that others will stay away.

BTW, it seems that you treat frozen folio and free folio interchangeably
in this device private folio discussion. To me, they are different,
since a frozen folio is transient to prevent others from touching the folio,
e.g., a free page is taken from buddy and allocator is setting up its
state, or a folio is split. You do not want memory compaction code
to touch these transient folios/pages. In terms of free folio, they
are stable before next allocation and others can recognize it and perform
reasonable operations. For example, memory compaction code can take
a free page out of buddy and use it as a migration destination.
That is why I want to remove all device private folio states when it
is freed. But memory compaction code never scans device private folios
and there is no other similar scanners, so that requirement might not
be needed.

>
>>>
>>>>>> I don't think so. It should do the above job efficiently and iterate
>>>>>> over the page list exactly once.
>>>>
>>>> folio initialization should not iterate over any page list, since folio is
>>>> supposed to be treated as a whole instead of individual pages.
>>>
>>> The tail pages need to have the right data in them or compound_head
>>> won't work.
>>
>> That is done by set_compound_head() in prep_compound_tail().
>
> Inside a page loop :)
>
> 	__SetPageHead(page);
> 	for (i = 1; i < nr_pages; i++)
> 		prep_compound_tail(page, i);

Yes, but in a folio, the fields of tail pages 1 and 2 are used because
we do not want to inflate struct folio for high order folios. In this
loop, all tail pages are processed in the same way. To follow your method,
there will be some ifs for tail page 1 to clear _nr_pages and tail page 2
to clear other fields. It feels to me that we are clearly mixing
struct page and struct folio.

>
>> Yes. One of the issues is that device private code used to only handles
>> order-0 pages and was converted to use high order folio directly without
>> using high order page (namely compound page) as an intermediate step.
>> This two-step-in-one caused confusion. But the key thing to avoid the
>> confusion is that to form a high order folio, a list of contiguous pages
>> would become a compound page by calling prep_compound_page(), then
>> the compound page becomes a folio by calling folio_set_large_rmappable().
>
> That seems logical to me.
>
> Jason


Best Regards,
Yan, Zi
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 2 weeks, 3 days ago
On Thu, Jan 22, 2026 at 09:41:03PM -0500, Zi Yan wrote:
> > Now that we have frozen pages where the frozen owner can use some of
> > the struct page memory however it likes that memory needs to be reset
> > before the page is thawed and converted back to a folio.
> 
> Based on my understanding, a frozen folio cannot be changed however the
> owner wants, since the modification needs to prevent parallel scanner
> from misusing the folio. For example, PFN scanners like memory compaction
> needs to know this is a frozen folio with a certain order, so that it
> will skip it as a whole. But if you change the frozen folio in a way
> that a parallel scanner cannot recognize the right order (e.g., the frozen
> folio order becomes lower) and finds some of the subpages have non-zero
> refcount, it can cause issues.

Yes, and this is part of the rules of what bits in the struct page
memory you can use for frozen pages. I've never seen it clearly
written unfortunately.

> But I assume device private pages do not have such a parallel scanner
> looking at each struct page one by one and examining their state.

I hope not!
 
> BTW, it seems that you treat frozen folio and free folio interchangeable
> in this device private folio discussion. To me, they are different,
> since frozen folio is transient to prevent others from touching the folio,

Yes, but really it is the same issue. Once the folio is frozen either
for free or any other use case it must follow a certain set of rules
to be compatible with the parallel scanners and things that may still
inspect the page without taking any refcounts.

The scanner can't tell if the refcount is 0 because it is frozen or
because it is free.

> >>>>>> I don't think so. It should do the above job efficiently and iterate
> >>>>>> over the page list exactly once.
> >>>>
> >>>> folio initialization should not iterate over any page list, since folio is
> >>>> supposed to be treated as a whole instead of individual pages.
> >>>
> >>> The tail pages need to have the right data in them or compound_head
> >>> won't work.
> >>
> >> That is done by set_compound_head() in prep_compound_tail().
> >
> > Inside a page loop :)
> >
> > 	__SetPageHead(page);
> > 	for (i = 1; i < nr_pages; i++)
> > 		prep_compound_tail(page, i);
> 
> Yes, but to a folio, the fields of tail page 1 and 2 are used because
> we do not want to inflate struct folio for high order folios. In this
> loop, all tail pages are processed in the same way. To follow your method,
> there will be some ifs for tail page 1 to clear _nr_pages and tail page 2
> to clear other fields. It feels to me that we are clearly mixing
> struct page and struct folio.

I'm not saying mixing, I'm just pointing out we have to clean up the
tail pages by writing to every one. This can also adjust the flags and
order if required on tail pages. Initializing the folio is a separate step,
and you already showed the right way to code that.

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Matthew Brost 2 weeks, 4 days ago
On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
> 

This whole thread makes my head hurt, as does core MM.

IMO the TL;DR is:

- Why is Intel the only one proving this stuff works? We can debate all
  day about what should or should not work — but someone else needs to
  actually prove it, rather than type hypotheticals.

- Intel has demonstrated that this works and is still getting blocked.

- This entire thread is about a fixes patch for large device pages.
  Changing prep_compound_page is completely out of scope for a fixes
  patch, and honestly so is most of the rest of what’s being proposed.

- At a minimum, you must clear every page’s flags in the loop. So why not
  conservatively clear anything else a folio might have set before calling
  an existing core-MM function, ensuring the pages are in a known state?
  This is a fixes patch.

- Given the current state of the discussion, I don’t think large device
  pages should be in 6.19. And if so, why didn’t the entire device pages
  series receive this level of scrutiny earlier? It’s my mistake for not
  saying “no” until the reallocation at different sizes issue was resolved.

@Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
we seemingly cannot close on this.

Matt

> > On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
> >>>> I suppose we want some prep_single_page(page) and some reorg to share
> >>>> code with the other prep function.
> >>
> >> This is just an unnecessary need due to lack of knowledge of/do not want
> >> to investigate core MM page and folio initialization code.
> >
> > It will be better to keep this related code together, not spread all
> > around.
> 
> Or clarify what code is for preparing pages, which would go away at memdesc
> time, and what code is for preparing folios, which would stay.
> 
> >
> >>>> I don't think so. It should do the above job efficiently and iterate
> >>>> over the page list exactly once.
> >>
> >> folio initialization should not iterate over any page list, since folio is
> >> supposed to be treated as a whole instead of individual pages.
> >
> > The tail pages need to have the right data in them or compound_head
> > won't work.
> 
> That is done by set_compound_head() in prep_compound_tail().
> prep_compound_page() take cares of it. As long as it is called, even if
> the pages in that compound page have random states before, the compound
> page should function correctly afterwards.
> 
> >
> >> folio->mapping = NULL;
> >> folio->memcg_data = 0;
> >> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>
> >> should be enough.
> >
> > This seems believable to me for setting up an order 0 page.
> 
> It works for any folio, regardless of its order. fields used in second
> or third subpages are all taken care of by prep_compound_page().
> 
> >
> >> if (order)
> >> 	folio_set_large_rmappable(folio);
> >
> > That one is in zone_device_folio_init()
> 
> Yes. And the code location looks right to me.
> 
> >
> > And maybe the naming has got really confused if we have both functions
> > now :\
> 
> Yes. One of the issues is that device private code used to only handles
> order-0 pages and was converted to use high order folio directly without
> using high order page (namely compound page) as an intermediate step.
> This two-step-in-one caused confusion. But the key thing to avoid the
> confusion is that to form a high order folio, a list of contiguous pages
> would become a compound page by calling prep_compound_page(), then
> the compound page becomes a folio by calling folio_set_large_rmappable().
> 
> BTW, the code in prep_compound_head() after folio_set_order(folio, order)
> should belong to folio_set_large_rmappable() and they are causing confusion,
> since they are only applicable to rmappable large folios. I am going to
> send a patch to fix it.
> 
> 
> Best Regards,
> Yan, Zi
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 2 weeks, 4 days ago
On Wed, Jan 21, 2026 at 11:19:45PM -0800, Matthew Brost wrote:
> - Why is Intel the only one proving this stuff works? We can debate all
>   day about what should or should not work — but someone else needs to
>   actually prove it, rather than type hypotheticals.

Oh come on, NVIDIA has done an *enormous* amount of work to get these
things to this point where they are actually getting close to
functional and usable.

Don't "Oh poor intel" me :P

> - Intel has demonstrated that this works and is still getting blocked.

We generally don't merge patches because they "works for me". The
issue is the thing you presented is very ugly and inefficient, and
when we start talking about the right way to do it you get all
defensive and disappear.

> - Given the current state of the discussion, I don’t think large device
>   pages should be in 6.19. And if so, why didn’t the entire device pages
>   series receive this level of scrutiny earlier? It’s my mistake for not
>   saying “no” until the reallocation at different sizes issue was resolved.

It did, nobody noticed this bug or posted something so obviously ugly :P

> @Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
> we seemingly cannot close on this.

What's the issue here? You said you were going to go ahead with the
ugly thing, go do it and come back with something better. That's what
you wanted, right?

Jason
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Vlastimil Babka 2 weeks, 4 days ago
On 1/22/26 08:19, Matthew Brost wrote:
> On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
>> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
>> 
> 
> This whole thread makes my head hurt, as does core MM.
> 
> IMO the TL;DR is:
> 
> - Why is Intel the only one proving this stuff works? We can debate all
>   day about what should or should not work — but someone else needs to
>   actually prove it, rather than type hypotheticals.
> 
> - Intel has demonstrated that this works and is still getting blocked.
> 
> - This entire thread is about a fixes patch for large device pages.
>   Changing prep_compound_page is completely out of scope for a fixes
>   patch, and honestly so is most of the rest of what’s being proposed.

FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
about how to refactor things more properly afterwards, going forward.

> - At a minimum, you must clear every page’s flags in the loop. So why not
>   conservatively clear anything else a folio might have set before calling
>   an existing core-MM function, ensuring the pages are in a known state?
>   This is a fixes patch.
> 
> - Given the current state of the discussion, I don’t think large device
>   pages should be in 6.19. And if so, why didn’t the entire device pages
>   series receive this level of scrutiny earlier? It’s my mistake for not
>   saying “no” until the reallocation at different sizes issue was resolved.
> 
> @Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
> we seemingly cannot close on this.
> 
> Matt
> 
>> > On Mon, Jan 19, 2026 at 09:50:16PM -0500, Zi Yan wrote:
>> >>>> I suppose we want some prep_single_page(page) and some reorg to share
>> >>>> code with the other prep function.
>> >>
>> >> This is just an unnecessary need due to lack of knowledge of/do not want
>> >> to investigate core MM page and folio initialization code.
>> >
>> > It will be better to keep this related code together, not spread all
>> > around.
>> 
>> Or clarify what code is for preparing pages, which would go away at memdesc
>> time, and what code is for preparing folios, which would stay.
>> 
>> >
>> >>>> I don't think so. It should do the above job efficiently and iterate
>> >>>> over the page list exactly once.
>> >>
>> >> folio initialization should not iterate over any page list, since folio is
>> >> supposed to be treated as a whole instead of individual pages.
>> >
>> > The tail pages need to have the right data in them or compound_head
>> > won't work.
>> 
>> That is done by set_compound_head() in prep_compound_tail().
>> prep_compound_page() take cares of it. As long as it is called, even if
>> the pages in that compound page have random states before, the compound
>> page should function correctly afterwards.
>> 
>> >
>> >> folio->mapping = NULL;
>> >> folio->memcg_data = 0;
>> >> folio->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>> >>
>> >> should be enough.
>> >
>> > This seems believable to me for setting up an order 0 page.
>> 
>> It works for any folio, regardless of its order. fields used in second
>> or third subpages are all taken care of by prep_compound_page().
>> 
>> >
>> >> if (order)
>> >> 	folio_set_large_rmappable(folio);
>> >
>> > That one is in zone_device_folio_init()
>> 
>> Yes. And the code location looks right to me.
>> 
>> >
>> > And maybe the naming has got really confused if we have both functions
>> > now :\
>> 
>> Yes. One of the issues is that device private code used to only handles
>> order-0 pages and was converted to use high order folio directly without
>> using high order page (namely compound page) as an intermediate step.
>> This two-step-in-one caused confusion. But the key thing to avoid the
>> confusion is that to form a high order folio, a list of contiguous pages
>> would become a compound page by calling prep_compound_page(), then
>> the compound page becomes a folio by calling folio_set_large_rmappable().
>> 
>> BTW, the code in prep_compound_head() after folio_set_order(folio, order)
>> should belong to folio_set_large_rmappable() and they are causing confusion,
>> since they are only applicable to rmappable large folios. I am going to
>> send a patch to fix it.
>> 
>> 
>> Best Regards,
>> Yan, Zi

Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Balbir Singh 2 weeks, 4 days ago
On 1/22/26 19:00, Vlastimil Babka wrote:
> On 1/22/26 08:19, Matthew Brost wrote:
>> On Tue, Jan 20, 2026 at 10:01:18PM -0500, Zi Yan wrote:
>>> On 20 Jan 2026, at 8:53, Jason Gunthorpe wrote:
>>>
>>
>> This whole thread makes my head hurt, as does core MM.
>>
>> IMO the TL;DR is:
>>
>> - Why is Intel the only one proving this stuff works? We can debate all
>>   day about what should or should not work — but someone else needs to
>>   actually prove it, rather than type hypotheticals.
>>
>> - Intel has demonstrated that this works and is still getting blocked.
>>
>> - This entire thread is about a fixes patch for large device pages.
>>   Changing prep_compound_page is completely out of scope for a fixes
>>   patch, and honestly so is most of the rest of what’s being proposed.
> 
> FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
> about how refactor things more properly afterwards, going forward.
> 

I've said the same thing and I concur, we can use the patch as-is and
change this to set the relevant identified fields after 6.19

Balbir

>> - At a minimum, you must clear every page’s flags in the loop. So why not
>>   conservatively clear anything else a folio might have set before calling
>>   an existing core-MM function, ensuring the pages are in a known state?
>>   This is a fixes patch.
>>
>> - Given the current state of the discussion, I don’t think large device
>>   pages should be in 6.19. And if so, why didn’t the entire device pages
>>   series receive this level of scrutiny earlier? It’s my mistake for not
>>   saying “no” until the reallocation at different sizes issue was resolved.
>>
>> @Andrew. - I'd revert large device pages in 6.19 as it doesn't work and
>> we seemingly cannot close on this.
>>
>> Matt


<snip>
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Andrew Morton 2 weeks, 3 days ago
On Thu, 22 Jan 2026 20:10:44 +1100 Balbir Singh <balbirs@nvidia.com> wrote:

> >> - Intel has demonstrated that this works and is still getting blocked.
> >>
> >> - This entire thread is about a fixes patch for large device pages.
> >>   Changing prep_compound_page is completely out of scope for a fixes
> >>   patch, and honestly so is most of the rest of what’s being proposed.
> > 
> > FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
> > about how refactor things more properly afterwards, going forward.
> > 
> 
> I've said the same thing and I concur, we can use the patch as-is and
> change this to set the relevant identified fields after 6.19

So the plan is to add this patch to 6.19-rc and take another look at
patches [2-5] during next -rc cycle?

I think the plan is to take Matthew's work via the DRM tree?  But if people
want me to patchbunny this fix then please lmk.

I presently have

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Balbir Singh <balbirs@nvidia.com>

If people wish to add to this then please do so.

I'll restore this patch into mm.git's hotfix branch (and hence
linux-next) because testing.
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Vlastimil Babka 2 weeks, 3 days ago
On 1/22/26 22:41, Andrew Morton wrote:
> On Thu, 22 Jan 2026 20:10:44 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
> 
>> >> - Intel has demonstrated that this works and is still getting blocked.
>> >>
>> >> - This entire thread is about a fixes patch for large device pages.
>> >>   Changing prep_compound_page is completely out of scope for a fixes
>> >>   patch, and honestly so is most of the rest of what’s being proposed.
>> > 
>> > FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
>> > about how refactor things more properly afterwards, going forward.
>> > 
>> 
>> I've said the same thing and I concur, we can use the patch as-is and
>> change this to set the relevant identified fields after 6.19
> 
> So the plan is to add this patch to 6.19-rc and take another look at
> patches [2-5] during next -rc cycle?
> 
> I think the plan is to take Matthew's work via the DRM tree?  But if people
> want me to patchbunny this fix then please lmk.
> 
> I presently have
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Acked-by: Felix Kuehling <felix.kuehling@amd.com>
> Reviewed-by: Balbir Singh <balbirs@nvidia.com>
> 
> If people wish to add to this then please do so.

I did too.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> I'll restore this patch into mm.git's hotfix branch (and hence
> linux-next) because testing.

Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Alistair Popple 2 weeks, 3 days ago
On 2026-01-23 at 08:41 +1100, Andrew Morton <akpm@linux-foundation.org> wrote...
> On Thu, 22 Jan 2026 20:10:44 +1100 Balbir Singh <balbirs@nvidia.com> wrote:
> 
> > >> - Intel has demonstrated that this works and is still getting blocked.
> > >>
> > >> - This entire thread is about a fixes patch for large device pages.
> > >>   Changing prep_compound_page is completely out of scope for a fixes
> > >>   patch, and honestly so is most of the rest of what’s being proposed.
> > > 
> > > FWIW I'm ok if this lands as a fix patch, and perceived the discussion to be
> > > about how refactor things more properly afterwards, going forward.
> > > 
> > 
> > I've said the same thing and I concur, we can use the patch as-is and
> > change this to set the relevant identified fields after 6.19
> 
> So the plan is to add this patch to 6.19-rc and take another look at
> patches [2-5] during next -rc cycle?

I'm ok with this as a quick fix, and happy to take a look at cleaning this up
in the next cycle or two as it's been on my TODO list for a while anyway.

> I think the plan is to take Matthew's work via the DRM tree?  But if people
> want me to patchbunny this fix then please lmk.
> 
> I presently have
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Acked-by: Felix Kuehling <felix.kuehling@amd.com>
> Reviewed-by: Balbir Singh <balbirs@nvidia.com>
> 
> If people wish to add to this then please do so.
> 
> I'll restore this patch into mm.git's hotfix branch (and hence
> linux-next) because testing.
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Vlastimil Babka 3 weeks, 3 days ago
On 1/16/26 12:10, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 63c6ab4fdf08..ac7be07e3361 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>  
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
> +	struct page *new_page = page;
> +	unsigned int i;
> +
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>  
> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> +		struct folio *new_folio = (struct folio *)new_page;
> +
> +		/*
> +		 * new_page could have been part of previous higher order folio
> +		 * which encodes the order, in page + 1, in the flags bits. We
> +		 * blindly clear bits which could have set my order field here,
> +		 * including page head.
> +		 */
> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		/*
> +		 * This pointer math looks odd, but new_page could have been
> +		 * part of a previous higher order folio, which sets _nr_pages
> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> +		 * correctly locate the _nr_pages bits within new_page which
> +		 * could have modified by previous higher order folio.
> +		 */
> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> +#endif
> +
> +		new_folio->mapping = NULL;
> +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> +		new_folio->share = 0;   /* fsdax only, unused for device private */
> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> +	}
> +
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().

Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
seems rather fragile. It seems likely to me somebody will try to change some
implementation detail in the page allocator and not notice it breaks this,
for example. I hope we can eventually get to something more robust.
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Vlastimil Babka 2 weeks, 4 days ago
On 1/16/26 17:07, Vlastimil Babka wrote:
> On 1/16/26 12:10, Francois Dugast wrote:
>> From: Matthew Brost <matthew.brost@intel.com>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 63c6ab4fdf08..ac7be07e3361 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>>  	}
>>  }
>>  
>> -void zone_device_page_init(struct page *page, unsigned int order)
>> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>> +			   unsigned int order)
>>  {
>> +	struct page *new_page = page;
>> +	unsigned int i;
>> +
>>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>  
>> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
>> +		struct folio *new_folio = (struct folio *)new_page;
>> +
>> +		/*
>> +		 * new_page could have been part of previous higher order folio
>> +		 * which encodes the order, in page + 1, in the flags bits. We
>> +		 * blindly clear bits which could have set my order field here,
>> +		 * including page head.
>> +		 */
>> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>> +
>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> +		/*
>> +		 * This pointer math looks odd, but new_page could have been
>> +		 * part of a previous higher order folio, which sets _nr_pages
>> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
>> +		 * correctly locate the _nr_pages bits within new_page which
>> +		 * could have modified by previous higher order folio.
>> +		 */
>> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
>> +#endif
>> +
>> +		new_folio->mapping = NULL;
>> +		new_folio->pgmap = pgmap;	/* Also clear compound head */
>> +		new_folio->share = 0;   /* fsdax only, unused for device private */
>> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
>> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>> +	}
>> +
>>  	/*
>>  	 * Drivers shouldn't be allocating pages after calling
>>  	 * memunmap_pages().
> 
> Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
> seems rather fragile. It seems likely to me somebody will try to change some
> implementation detail in the page allocator and not notice it breaks this,
> for example. I hope we can eventually get to something more robust.

For doing this as a hotfix for 6.19, assuming we'll refactor later:

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Jason Gunthorpe 3 weeks, 3 days ago
On Fri, Jan 16, 2026 at 05:07:09PM +0100, Vlastimil Babka wrote:
> On 1/16/26 12:10, Francois Dugast wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 63c6ab4fdf08..ac7be07e3361 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
> >  	}
> >  }
> >  
> > -void zone_device_page_init(struct page *page, unsigned int order)
> > +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> > +			   unsigned int order)
> >  {
> > +	struct page *new_page = page;
> > +	unsigned int i;
> > +
> >  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> >  
> > +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> > +		struct folio *new_folio = (struct folio *)new_page;
> > +
> > +		/*
> > +		 * new_page could have been part of previous higher order folio
> > +		 * which encodes the order, in page + 1, in the flags bits. We
> > +		 * blindly clear bits which could have set my order field here,
> > +		 * including page head.
> > +		 */
> > +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> > +
> > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > +		/*
> > +		 * This pointer math looks odd, but new_page could have been
> > +		 * part of a previous higher order folio, which sets _nr_pages
> > +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> > +		 * correctly locate the _nr_pages bits within new_page which
> > +		 * could have modified by previous higher order folio.
> > +		 */
> > +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> > +#endif
> > +
> > +		new_folio->mapping = NULL;
> > +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> > +		new_folio->share = 0;   /* fsdax only, unused for device private */
> > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > +	}
> > +
> >  	/*
> >  	 * Drivers shouldn't be allocating pages after calling
> >  	 * memunmap_pages().
> 
> Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
> seems rather fragile. It seems likely to me somebody will try to change some
> implementation detail in the page allocator and not notice it breaks this,
> for example. I hope we can eventually get to something more robust.

These pages shouldn't be in the buddy allocator at all? The driver
using the ZONE_DEVICE pages is responsible to provide its own
allocator.

Did you mean something else?

Jason
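
For context, the allocation pattern Jason is describing is roughly the sketch
below; my_devmem_alloc_page(), my_pgmap, my_free_bitmap, first_pfn and
MY_NPAGES are made-up names and locking/error handling is omitted. It only
illustrates that device-private pfns never come from the buddy allocator: the
driver tracks which of its own pfns are free and reruns
zone_device_page_init() (with the pgmap argument this patch adds) on every
handout, which is exactly where the stale-metadata problem shows up.

static struct page *my_devmem_alloc_page(struct dev_pagemap *my_pgmap,
					 unsigned long *my_free_bitmap,
					 unsigned long first_pfn)
{
	unsigned long idx;
	struct page *page;

	/* Driver-owned bookkeeping, not the buddy allocator. */
	idx = find_first_zero_bit(my_free_bitmap, MY_NPAGES);
	if (idx >= MY_NPAGES)
		return NULL;
	set_bit(idx, my_free_bitmap);

	page = pfn_to_page(first_pfn + idx);
	/* Order 0 here; a large allocation would pass order > 0 instead. */
	zone_device_page_init(page, my_pgmap, 0);
	return page;
}

The kvmppc_uvmem_get_page() and dmirror_devmem_alloc_page() callers touched by
the patch follow the same shape.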
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Vlastimil Babka 3 weeks, 3 days ago
On 1/16/26 18:20, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 05:07:09PM +0100, Vlastimil Babka wrote:
>> On 1/16/26 12:10, Francois Dugast wrote:
>> > From: Matthew Brost <matthew.brost@intel.com>
>> > diff --git a/mm/memremap.c b/mm/memremap.c
>> > index 63c6ab4fdf08..ac7be07e3361 100644
>> > --- a/mm/memremap.c
>> > +++ b/mm/memremap.c
>> > @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>> >  	}
>> >  }
>> >  
>> > -void zone_device_page_init(struct page *page, unsigned int order)
>> > +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
>> > +			   unsigned int order)
>> >  {
>> > +	struct page *new_page = page;
>> > +	unsigned int i;
>> > +
>> >  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>> >  
>> > +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
>> > +		struct folio *new_folio = (struct folio *)new_page;
>> > +
>> > +		/*
>> > +		 * new_page could have been part of previous higher order folio
>> > +		 * which encodes the order, in page + 1, in the flags bits. We
>> > +		 * blindly clear bits which could have set my order field here,
>> > +		 * including page head.
>> > +		 */
>> > +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
>> > +
>> > +#ifdef NR_PAGES_IN_LARGE_FOLIO
>> > +		/*
>> > +		 * This pointer math looks odd, but new_page could have been
>> > +		 * part of a previous higher order folio, which sets _nr_pages
>> > +		 * in page + 1 (new_page). Therefore, we use pointer casting to
>> > +		 * correctly locate the _nr_pages bits within new_page which
>> > +		 * could have modified by previous higher order folio.
>> > +		 */
>> > +		((struct folio *)(new_page - 1))->_nr_pages = 0;
>> > +#endif
>> > +
>> > +		new_folio->mapping = NULL;
>> > +		new_folio->pgmap = pgmap;	/* Also clear compound head */
>> > +		new_folio->share = 0;   /* fsdax only, unused for device private */
>> > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
>> > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>> > +	}
>> > +
>> >  	/*
>> >  	 * Drivers shouldn't be allocating pages after calling
>> >  	 * memunmap_pages().
>> 
>> Can't say I'm a fan of this. It probably works now (so I'm not nacking) but
>> seems rather fragile. It seems likely to me somebody will try to change some
>> implementation detail in the page allocator and not notice it breaks this,
>> for example. I hope we can eventually get to something more robust.
> 
> These pages shouldn't be in the buddy allocator at all? The driver
> using the ZONE_DEVICE pages is responsible to provide its own
> allocator.
> 
> Did you mean something else?

Yeah sorry that was imprecise. I meant the struct page/folio layout
implementation details (which may or may not be related to the page allocator).

> Jason
>
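
Spelling out the layout detail being referred to (this only restates the
patch's own comments; the exact field placement is a struct page / struct
folio implementation detail and may change), the stale state left behind by a
previous higher-order folio at the same pfns looks roughly like:

/*
 * page[0]       ex-head page: stale low flag bits (head marker, possibly
 *               PG_locked); ->pgmap still valid but rewritten anyway
 * page[1]       ex-tail page whose storage also overlays the folio
 *               metadata: the low byte of its flags encodes the old folio
 *               order, and it holds _nr_pages when NR_PAGES_IN_LARGE_FOLIO
 *               is defined
 * page[2..N-1]  ex-tail pages: the word that normally holds ->pgmap was
 *               overwritten with the compound_head pointer to page[0]
 */

Hence the per-page clearing of the low flag byte, the (new_page - 1) cast used
to reach _nr_pages, and the unconditional rewrite of ->pgmap from the
caller-supplied pgmap; all of it depends on those placements staying as they
are today.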
Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Posted by Balbir Singh 3 weeks, 3 days ago
On 1/16/26 22:10, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> Reinitialize metadata for large zone device private folios in
> zone_device_page_init prior to creating a higher-order zone device
> private folio. This step is necessary when the folio’s order changes
> dynamically between zone_device_page_init calls to avoid building a
> corrupt folio. As part of the metadata reinitialization, the dev_pagemap
> must be passed in from the caller because the pgmap stored in the folio
> page may have been overwritten with a compound head.
> 
> Without this fix, individual pages could have invalid pgmap fields and
> flags (with PG_locked being notably problematic) due to prior different
> order allocations, which can, and will, result in kernel crashes.
> 
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: adhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> Cc: linux-mm@kvack.org
> Cc: linux-cxl@vger.kernel.org
> Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> 
> ---
> 
> The latest revision updates the commit message to explain what is broken
> prior to this patch and also restructures the patch so it applies, and
> works, on both the 6.19 branches and drm-tip, the latter in which includes
> patches for the next kernel release PR. Intel CI passes on both the 6.19
> branches and drm-tip at point of the first patch in this series and the
> last (drm-tip only given subsequent patches in the series require in
> patches drm-tip but not present 6.19).
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
>  drivers/gpu/drm/drm_pagemap.c            |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
>  include/linux/memremap.h                 |  9 ++++--
>  lib/test_hmm.c                           |  4 ++-
>  mm/memremap.c                            | 35 +++++++++++++++++++++++-
>  7 files changed, 47 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index e5000bef90f2..7cf9310de0ec 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -723,7 +723,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>  
>  	dpage = pfn_to_page(uvmem_pfn);
>  	dpage->zone_device_data = pvt;
> -	zone_device_page_init(dpage, 0);
> +	zone_device_page_init(dpage, &kvmppc_uvmem_pgmap, 0);
>  	return dpage;
>  out_clear:
>  	spin_lock(&kvmppc_uvmem_bitmap_lock);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> index af53e796ea1b..6ada7b4af7c6 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> @@ -217,7 +217,7 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>  	page = pfn_to_page(pfn);
>  	svm_range_bo_ref(prange->svm_bo);
>  	page->zone_device_data = prange->svm_bo;
> -	zone_device_page_init(page, 0);
> +	zone_device_page_init(page, page_pgmap(page), 0);
>  }
>  
>  static void
> diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
> index 03ee39a761a4..38eca94f01a1 100644
> --- a/drivers/gpu/drm/drm_pagemap.c
> +++ b/drivers/gpu/drm/drm_pagemap.c
> @@ -201,7 +201,7 @@ static void drm_pagemap_get_devmem_page(struct page *page,
>  					struct drm_pagemap_zdd *zdd)
>  {
>  	page->zone_device_data = drm_pagemap_zdd_get(zdd);
> -	zone_device_page_init(page, 0);
> +	zone_device_page_init(page, page_pgmap(page), 0);
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 58071652679d..3d8031296eed 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -425,7 +425,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_large)
>  			order = ilog2(DMEM_CHUNK_NPAGES);
>  	}
>  
> -	zone_device_folio_init(folio, order);
> +	zone_device_folio_init(folio, page_pgmap(folio_page(folio, 0)), order);
>  	return page;
>  }
>  
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 713ec0435b48..e3c2ccf872a8 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -224,7 +224,8 @@ static inline bool is_fsdax_page(const struct page *page)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page, unsigned int order);
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -234,9 +235,11 @@ bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>  
>  unsigned long memremap_compat_align(void);
>  
> -static inline void zone_device_folio_init(struct folio *folio, unsigned int order)
> +static inline void zone_device_folio_init(struct folio *folio,
> +					  struct dev_pagemap *pgmap,
> +					  unsigned int order)
>  {
> -	zone_device_page_init(&folio->page, order);
> +	zone_device_page_init(&folio->page, pgmap, order);
>  	if (order)
>  		folio_set_large_rmappable(folio);
>  }
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 8af169d3873a..455a6862ae50 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -662,7 +662,9 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror *dmirror,
>  			goto error;
>  	}
>  
> -	zone_device_folio_init(page_folio(dpage), order);
> +	zone_device_folio_init(page_folio(dpage),
> +			       page_pgmap(folio_page(page_folio(dpage), 0)),
> +			       order);
>  	dpage->zone_device_data = rpage;
>  	return dpage;
>  
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 63c6ab4fdf08..ac7be07e3361 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -477,10 +477,43 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>  
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +			   unsigned int order)
>  {
> +	struct page *new_page = page;
> +	unsigned int i;
> +
>  	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>  
> +	for (i = 0; i < (1UL << order); ++i, ++new_page) {
> +		struct folio *new_folio = (struct folio *)new_page;
> +
> +		/*
> +		 * new_page could have been part of previous higher order folio
> +		 * which encodes the order, in page + 1, in the flags bits. We
> +		 * blindly clear bits which could have set my order field here,
> +		 * including page head.
> +		 */
> +		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */
> +
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		/*
> +		 * This pointer math looks odd, but new_page could have been
> +		 * part of a previous higher order folio, which sets _nr_pages
> +		 * in page + 1 (new_page). Therefore, we use pointer casting to
> +		 * correctly locate the _nr_pages bits within new_page which
> +		 * could have modified by previous higher order folio.
> +		 */
> +		((struct folio *)(new_page - 1))->_nr_pages = 0;
> +#endif
> +
> +		new_folio->mapping = NULL;
> +		new_folio->pgmap = pgmap;	/* Also clear compound head */
> +		new_folio->share = 0;   /* fsdax only, unused for device private */

It could use a

BUILD_BUG_ON(offsetof(struct folio, pgmap) > sizeof(struct page));
BUILD_BUG_ON(offsetof(struct folio, share) > sizeof(struct page));

> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> +	}
> +
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().


Very subtle. I wonder if, from a new page perspective, a memset of new_page to 0
would be cleaner, but I guess it does touch more bytes.

Reviewed-by: Balbir Singh <balbirs@nvidia.com>
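
A minimal sketch of where such checks could sit, against the new signature
from the patch (this is not from any posted version; the exact bound, and
whether the field size should be added to the offset, would want
double-checking against mm_types.h):

void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
			   unsigned int order)
{
	/*
	 * The loop below writes ->pgmap and ->share through a folio cast of
	 * each individual struct page, so both fields must lie entirely
	 * within the first struct page of struct folio.  Turn a future
	 * layout change into a build failure instead of silent corruption.
	 */
	BUILD_BUG_ON(offsetof(struct folio, pgmap) +
		     sizeof_field(struct folio, pgmap) > sizeof(struct page));
	BUILD_BUG_ON(offsetof(struct folio, share) +
		     sizeof_field(struct folio, share) > sizeof(struct page));

	/* ... rest of the function as in the patch ... */
}

That would at least catch the kind of struct folio layout change Vlastimil is
worried about at build time rather than at runtime.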