[v4] Enable THP support in drm_pagemap

[PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Francois Dugast 4 weeks ago

From: Matthew Brost <matthew.brost@intel.com>

The core MM splits the folio before calling folio_free, restoring the
zone pages associated with the folio to an initialized state (e.g.,
non-compound, pgmap valid, etc...). The order argument represents the
folio’s order prior to the split which can be used driver side to know
how many pages are being freed.

Fixes: 3a5a06554566 ("mm/zone_device: rename page_free callback to folio_free")
Cc: Zi Yan <ziy@nvidia.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-pci@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c       | 2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
 drivers/gpu/drm/drm_pagemap.c            | 3 ++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 4 ++--
 drivers/pci/p2pdma.c                     | 2 +-
 include/linux/memremap.h                 | 7 ++++++-
 lib/test_hmm.c                           | 4 +---
 mm/memremap.c                            | 5 +++--
 8 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e5000bef90f2..b58f34eec6e5 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1014,7 +1014,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
  * to a normal PFN during H_SVM_PAGE_OUT.
  * Gets called with kvm->arch.uvmem_lock held.
  */
-static void kvmppc_uvmem_folio_free(struct folio *folio)
+static void kvmppc_uvmem_folio_free(struct folio *folio, unsigned int order)
 {
 	struct page *page = &folio->page;
 	unsigned long pfn = page_to_pfn(page) -
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index af53e796ea1b..a26e3c448e47 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -567,7 +567,7 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 	return r < 0 ? r : 0;
 }
 
-static void svm_migrate_folio_free(struct folio *folio)
+static void svm_migrate_folio_free(struct folio *folio, unsigned int order)
 {
 	struct page *page = &folio->page;
 	struct svm_range_bo *svm_bo = page->zone_device_data;
diff --git a/drivers/gpu/drm/drm_pagemap.c b/drivers/gpu/drm/drm_pagemap.c
index 03ee39a761a4..df253b13cf85 100644
--- a/drivers/gpu/drm/drm_pagemap.c
+++ b/drivers/gpu/drm/drm_pagemap.c
@@ -1144,11 +1144,12 @@ static int __drm_pagemap_migrate_to_ram(struct vm_area_struct *vas,
 /**
  * drm_pagemap_folio_free() - Put GPU SVM zone device data associated with a folio
  * @folio: Pointer to the folio
+ * @order: Order of the folio prior to being split by core MM
  *
  * This function is a callback used to put the GPU SVM zone device data
  * associated with a page when it is being released.
  */
-static void drm_pagemap_folio_free(struct folio *folio)
+static void drm_pagemap_folio_free(struct folio *folio, unsigned int order)
 {
 	drm_pagemap_zdd_put(folio->page.zone_device_data);
 }
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 58071652679d..545f316fca14 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -115,14 +115,14 @@ unsigned long nouveau_dmem_page_addr(struct page *page)
 	return chunk->bo->offset + off;
 }
 
-static void nouveau_dmem_folio_free(struct folio *folio)
+static void nouveau_dmem_folio_free(struct folio *folio, unsigned int order)
 {
 	struct page *page = &folio->page;
 	struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
 	struct nouveau_dmem *dmem = chunk->drm->dmem;
 
 	spin_lock(&dmem->lock);
-	if (folio_order(folio)) {
+	if (order) {
 		page->zone_device_data = dmem->free_folios;
 		dmem->free_folios = folio;
 	} else {
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4a2fc7ab42c3..a6fa7610f8a8 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -200,7 +200,7 @@ static const struct attribute_group p2pmem_group = {
 	.name = "p2pmem",
 };
 
-static void p2pdma_folio_free(struct folio *folio)
+static void p2pdma_folio_free(struct folio *folio, unsigned int order)
 {
 	struct page *page = &folio->page;
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 713ec0435b48..97fcffeb1c1e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,8 +79,13 @@ struct dev_pagemap_ops {
 	 * Called once the folio refcount reaches 0.  The reference count will be
 	 * reset to one by the core code after the method is called to prepare
 	 * for handing out the folio again.
+	 *
+	 * The core MM splits the folio before calling folio_free, restoring the
+	 * zone pages associated with the folio to an initialized state (e.g.,
+	 * non-compound, pgmap valid, etc...). The order argument represents the
+	 * folio’s order prior to the split.
 	 */
-	void (*folio_free)(struct folio *folio);
+	void (*folio_free)(struct folio *folio, unsigned int order);
 
 	/*
 	 * Used for private (un-addressable) device memory only.  Must migrate
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 8af169d3873a..e17c71d02a3a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1580,13 +1580,11 @@ static const struct file_operations dmirror_fops = {
 	.owner		= THIS_MODULE,
 };
 
-static void dmirror_devmem_free(struct folio *folio)
+static void dmirror_devmem_free(struct folio *folio, unsigned int order)
 {
 	struct page *page = &folio->page;
 	struct page *rpage = BACKING_PAGE(page);
 	struct dmirror_device *mdevice;
-	struct folio *rfolio = page_folio(rpage);
-	unsigned int order = folio_order(rfolio);
 
 	if (rpage != page) {
 		if (order)
diff --git a/mm/memremap.c b/mm/memremap.c
index 63c6ab4fdf08..39dc4bd190d0 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -417,6 +417,7 @@ void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
 	unsigned long nr = folio_nr_pages(folio);
+	unsigned int order = folio_order(folio);
 	int i;
 
 	if (WARN_ON_ONCE(!pgmap))
@@ -453,7 +454,7 @@ void free_zone_device_folio(struct folio *folio)
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
 			break;
-		pgmap->ops->folio_free(folio);
+		pgmap->ops->folio_free(folio, order);
 		percpu_ref_put_many(&folio->pgmap->ref, nr);
 		break;
 
@@ -472,7 +473,7 @@ void free_zone_device_folio(struct folio *folio)
 	case MEMORY_DEVICE_PCI_P2PDMA:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
 			break;
-		pgmap->ops->folio_free(folio);
+		pgmap->ops->folio_free(folio, order);
 		break;
 	}
 }
-- 
2.43.0

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Matthew Wilcox 4 weeks ago

On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> The core MM splits the folio before calling folio_free, restoring the
> zone pages associated with the folio to an initialized state (e.g.,
> non-compound, pgmap valid, etc...). The order argument represents the
> folio’s order prior to the split which can be used driver side to know
> how many pages are being freed.

This really feels like the wrong way to fix this problem.

I think someone from the graphics side really needs to take the lead on
understanding what the MM is doing (both currently and in the future).
I'm happy to work with you, but it feels like there's a lot of churn right
now because there's a lot of people working on this without understanding
the MM side of things (and conversely, I don't think (m)any people on the
MM side really understand what graphics cards are trying to accomplish).

Who is that going to be?  I'm happy to get on the phone with someone.

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Balbir Singh 4 weeks ago

On 1/12/26 08:35, Matthew Wilcox wrote:
> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>> The core MM splits the folio before calling folio_free, restoring the
>> zone pages associated with the folio to an initialized state (e.g.,
>> non-compound, pgmap valid, etc...). The order argument represents the
>> folio’s order prior to the split which can be used driver side to know
>> how many pages are being freed.
> 
> This really feels like the wrong way to fix this problem.
> 

This stems from a special requirement, freeing is done in two phases

1. Free the folio -> inform the driver (which implies freeing the backing device memory)
2. Return the folio back, split it back to single order folios

The current code does not do 2. 1 followed by 2 does not work for
Francois since the backing memory can get reused before we reach step 2.
The proposed patch does 2 followed 1, but doing 2 means we've lost the 
folio order and thus the old order is passed in. Although, I wonder if the
backing folio's zone_device_data can be used to encode any order information
about the device side allocation.

@Francois, I hope I did not miss anything in the explanation above.

> I think someone from the graphics side really needs to take the lead on
> understanding what the MM is doing (both currently and in the future).
> I'm happy to work with you, but it feels like there's a lot of churn right
> now because there's a lot of people working on this without understanding
> the MM side of things (and conversely, I don't think (m)any people on the
> MM side really understand what graphics cards are trying to accomplish).
> 

I suspect you are referring to folio specialization and/or downsizing?

> Who is that going to be?  I'm happy to get on the phone with someone.

Happy to work with you, but I am not the authority on graphics, I can speak
to zone device folios. I suspect we'd need to speak to more than one person.

Balbir

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 4 weeks ago

On 11 Jan 2026, at 19:19, Balbir Singh wrote:

> On 1/12/26 08:35, Matthew Wilcox wrote:
>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>> The core MM splits the folio before calling folio_free, restoring the
>>> zone pages associated with the folio to an initialized state (e.g.,
>>> non-compound, pgmap valid, etc...). The order argument represents the
>>> folio’s order prior to the split which can be used driver side to know
>>> how many pages are being freed.
>>
>> This really feels like the wrong way to fix this problem.
>>

Hi Matthew,

I think the wording is confusing, since the actual issue is that:

1. zone_device_page_init() calls prep_compound_page() to form a large folio,
2. but free_zone_device_folio() never reverse the course,
3. the undo of prep_compound_page() in free_zone_device_folio() needs to
   be done before driver callback ->folio_free(), since once ->folio_free()
   is called, the folio can be reallocated immediately,
4. after the undo of prep_compound_page(), folio_order() can no longer provide
   the original order information, thus, folio_free() needs that for proper
   device side ref manipulation.

So this is not used for "split" but undo of prep_compound_page(). It might
look like a split to non core MM people, since it changes a large folio
to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
open codes it in free_pages_prepare() by resetting page flags, page->mapping,
and so on. So it might be why the undo prep_compound_page() is missed
by non core MM people.

>
> This stems from a special requirement, freeing is done in two phases
>
> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> 2. Return the folio back, split it back to single order folios

Hi Balbir,

Please refrain from using "split" here, since it confuses MM people. A folio
is split when it is still in use, but in this case, the folio has been freed
and needs to be restored to "free page" state.

>
> The current code does not do 2. 1 followed by 2 does not work for
> Francois since the backing memory can get reused before we reach step 2.
> The proposed patch does 2 followed 1, but doing 2 means we've lost the
> folio order and thus the old order is passed in. Although, I wonder if the
> backing folio's zone_device_data can be used to encode any order information
> about the device side allocation.
>
> @Francois, I hope I did not miss anything in the explanation above.
>
>> I think someone from the graphics side really needs to take the lead on
>> understanding what the MM is doing (both currently and in the future).
>> I'm happy to work with you, but it feels like there's a lot of churn right
>> now because there's a lot of people working on this without understanding
>> the MM side of things (and conversely, I don't think (m)any people on the
>> MM side really understand what graphics cards are trying to accomplish).
>>
>
> I suspect you are referring to folio specialization and/or downsizing?
>
>> Who is that going to be?  I'm happy to get on the phone with someone.
>
> Happy to work with you, but I am not the authority on graphics, I can speak
> to zone device folios. I suspect we'd need to speak to more than one person.
>

--
Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 4 weeks ago

On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> 
> > On 1/12/26 08:35, Matthew Wilcox wrote:
> >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>> The core MM splits the folio before calling folio_free, restoring the
> >>> zone pages associated with the folio to an initialized state (e.g.,
> >>> non-compound, pgmap valid, etc...). The order argument represents the
> >>> folio’s order prior to the split which can be used driver side to know
> >>> how many pages are being freed.
> >>
> >> This really feels like the wrong way to fix this problem.
> >>
> 
> Hi Matthew,
> 
> I think the wording is confusing, since the actual issue is that:
> 
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>    be done before driver callback ->folio_free(), since once ->folio_free()
>    is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>    the original order information, thus, folio_free() needs that for proper
>    device side ref manipulation.

There is something wrong with the driver if the "folio can be
reallocated immediately".

The flow generally expects there to be a driver allocator linked to
folio_free()

1) Allocator finds free memory
2) zone_device_page_init() allocates the memory and makes refcount=1
3) __folio_put() knows the recount 0.
4) free_zone_device_folio() calls folio_free(), but it doesn't
   actually need to undo prep_compound_page() because *NOTHING* can
   use the page pointer at this point.
5) Driver puts the memory back into the allocator and now #1 can
   happen. It knows how much memory to put back because folio->order
   is valid from #2
6) #1 happens again, then #2 happens again and the folio is in the
   right state for use. The successor #2 fully undoes the work of the
   predecessor #2.

If you have races where #1 can happen immediately after #3 then the
driver design is fundamentally broken and passing around order isn't
going to help anything.

If the allocator is using the struct page memory then step #5 should
also clean up the struct page with the allocator data before returning
it to the allocator.

I vaugely remember talking about this before in the context of the Xe
driver.. You can't just take an existing VRAM allocator and layer it
on top of the folios and have it broadly ignore the folio_free
callback.

Jsaon

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Matthew Brost 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:

Hi, catching up here.

> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> > On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> > 
> > > On 1/12/26 08:35, Matthew Wilcox wrote:
> > >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> > >>> The core MM splits the folio before calling folio_free, restoring the
> > >>> zone pages associated with the folio to an initialized state (e.g.,
> > >>> non-compound, pgmap valid, etc...). The order argument represents the
> > >>> folio’s order prior to the split which can be used driver side to know
> > >>> how many pages are being freed.
> > >>
> > >> This really feels like the wrong way to fix this problem.
> > >>
> > 
> > Hi Matthew,
> > 
> > I think the wording is confusing, since the actual issue is that:
> > 
> > 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> > 2. but free_zone_device_folio() never reverse the course,
> > 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> >    be done before driver callback ->folio_free(), since once ->folio_free()
> >    is called, the folio can be reallocated immediately,
> > 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> >    the original order information, thus, folio_free() needs that for proper
> >    device side ref manipulation.
> 
> There is something wrong with the driver if the "folio can be
> reallocated immediately".
> 
> The flow generally expects there to be a driver allocator linked to
> folio_free()
> 
> 1) Allocator finds free memory
> 2) zone_device_page_init() allocates the memory and makes refcount=1
> 3) __folio_put() knows the recount 0.
> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>    actually need to undo prep_compound_page() because *NOTHING* can
>    use the page pointer at this point.

Correct—nothing can use the folio prior to calling folio_free(). Once
folio_free() returns, the driver side is free to immediately reallocate
the folio (or a subset of its pages).

> 5) Driver puts the memory back into the allocator and now #1 can
>    happen. It knows how much memory to put back because folio->order
>    is valid from #2
> 6) #1 happens again, then #2 happens again and the folio is in the
>    right state for use. The successor #2 fully undoes the work of the
>    predecessor #2.
> 
> If you have races where #1 can happen immediately after #3 then the
> driver design is fundamentally broken and passing around order isn't
> going to help anything.
>

The above race does not exist; if it did, I agree we’d be solving
nothing here.
 
> If the allocator is using the struct page memory then step #5 should
> also clean up the struct page with the allocator data before returning
> it to the allocator.
> 

We could move the call to free_zone_device_folio_prepare() [1] into the
driver-side implementation of ->folio_free() and drop the order argument
here. Zi didn’t particularly like that; he preferred calling
free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
which is why this patch exists.

FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
actually need the order regardless of where
free_zone_device_folio_prepare() is called, but Nouveau does need the
order if free_zone_device_folio_prepare() is called before
->folio_free().

[1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
[2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405

> I vaugely remember talking about this before in the context of the Xe
> driver.. You can't just take an existing VRAM allocator and layer it
> on top of the folios and have it broadly ignore the folio_free
> callback.
> 

We are definitely not ignoring the ->folio_free callback—that is the
point at which we tell our VRAM allocator (DRM buddy) it is okay to
release the allocation and make it available for reuse.

Matt

> Jsaon

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 3 weeks, 6 days ago

On 12 Jan 2026, at 16:49, Matthew Brost wrote:

> On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
>
> Hi, catching up here.
>
>> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
>>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>>>
>>>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>>>> The core MM splits the folio before calling folio_free, restoring the
>>>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>>>> folio’s order prior to the split which can be used driver side to know
>>>>>> how many pages are being freed.
>>>>>
>>>>> This really feels like the wrong way to fix this problem.
>>>>>
>>>
>>> Hi Matthew,
>>>
>>> I think the wording is confusing, since the actual issue is that:
>>>
>>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
>>> 2. but free_zone_device_folio() never reverse the course,
>>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>>>    be done before driver callback ->folio_free(), since once ->folio_free()
>>>    is called, the folio can be reallocated immediately,
>>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>>>    the original order information, thus, folio_free() needs that for proper
>>>    device side ref manipulation.
>>
>> There is something wrong with the driver if the "folio can be
>> reallocated immediately".
>>
>> The flow generally expects there to be a driver allocator linked to
>> folio_free()
>>
>> 1) Allocator finds free memory
>> 2) zone_device_page_init() allocates the memory and makes refcount=1
>> 3) __folio_put() knows the recount 0.
>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>>    actually need to undo prep_compound_page() because *NOTHING* can
>>    use the page pointer at this point.
>
> Correct—nothing can use the folio prior to calling folio_free(). Once
> folio_free() returns, the driver side is free to immediately reallocate
> the folio (or a subset of its pages).
>
>> 5) Driver puts the memory back into the allocator and now #1 can
>>    happen. It knows how much memory to put back because folio->order
>>    is valid from #2
>> 6) #1 happens again, then #2 happens again and the folio is in the
>>    right state for use. The successor #2 fully undoes the work of the
>>    predecessor #2.
>>
>> If you have races where #1 can happen immediately after #3 then the
>> driver design is fundamentally broken and passing around order isn't
>> going to help anything.
>>
>
> The above race does not exist; if it did, I agree we’d be solving
> nothing here.
>
>> If the allocator is using the struct page memory then step #5 should
>> also clean up the struct page with the allocator data before returning
>> it to the allocator.
>>
>
> We could move the call to free_zone_device_folio_prepare() [1] into the
> driver-side implementation of ->folio_free() and drop the order argument
> here. Zi didn’t particularly like that; he preferred calling
> free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> which is why this patch exists.

On a second thought, if calling free_zone_device_folio_prepare() in
->folio_free() works, feel free to do so.

>
> FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
> actually need the order regardless of where
> free_zone_device_folio_prepare() is called, but Nouveau does need the
> order if free_zone_device_folio_prepare() is called before
> ->folio_free().
>
> [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
>
>> I vaugely remember talking about this before in the context of the Xe
>> driver.. You can't just take an existing VRAM allocator and layer it
>> on top of the folios and have it broadly ignore the folio_free
>> callback.
>>
>
> We are definitely not ignoring the ->folio_free callback—that is the
> point at which we tell our VRAM allocator (DRM buddy) it is okay to
> release the allocation and make it available for reuse.
>
> Matt
>
>> Jsaon


Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote:
> > We could move the call to free_zone_device_folio_prepare() [1] into the
> > driver-side implementation of ->folio_free() and drop the order argument
> > here. Zi didn’t particularly like that; he preferred calling
> > free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> > which is why this patch exists.
> 
> On a second thought, if calling free_zone_device_folio_prepare() in
> ->folio_free() works, feel free to do so.

I don't think there is anything "prepare" about
free_zone_device_folio_prepare() it effectively zeros the struct page
memory - ie undoes some amount of zone_device_page_init() and AFAIK
there are only two reasons to do this:

 1) It helps catch bugs where things are UAF'ing the folio, now they
    read back zeros (it also creates bugs where zero might be OK, so
    you might be better to poison it under a debug flag)

 2) It avoids the allocate side having to zero the page memory - and
    perhaps the allocate side is not doing a good job of this right now
    but I think you should state a position why it makes more sense for
    the free side to do this instead of the allocate side.

    IOW why should it be mandatory to call
    free_zone_device_folio_prepare() prior to zone_device_page_init()
    ?

Certainly if the only reason you are passing the order is because the
core code zero'd the order too early, that doesn't make alot of sense.

I think calling the deinit function paired with
zone_device_page_init() within the driver does make alot of sense and
I see no issue with that. But please name it more sensibly and
describe concretely why it should be split up like this.

Because what I see is you write to all the folios on free and then
write to them all again on allocation - which is 2x the cost that is
probably really needed...

Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Matthew Brost 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote:
> On 12 Jan 2026, at 16:49, Matthew Brost wrote:
> 
> > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
> >
> > Hi, catching up here.
> >
> >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> >>>
> >>>> On 1/12/26 08:35, Matthew Wilcox wrote:
> >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>>>>> The core MM splits the folio before calling folio_free, restoring the
> >>>>>> zone pages associated with the folio to an initialized state (e.g.,
> >>>>>> non-compound, pgmap valid, etc...). The order argument represents the
> >>>>>> folio’s order prior to the split which can be used driver side to know
> >>>>>> how many pages are being freed.
> >>>>>
> >>>>> This really feels like the wrong way to fix this problem.
> >>>>>
> >>>
> >>> Hi Matthew,
> >>>
> >>> I think the wording is confusing, since the actual issue is that:
> >>>
> >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> >>> 2. but free_zone_device_folio() never reverse the course,
> >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> >>>    be done before driver callback ->folio_free(), since once ->folio_free()
> >>>    is called, the folio can be reallocated immediately,
> >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> >>>    the original order information, thus, folio_free() needs that for proper
> >>>    device side ref manipulation.
> >>
> >> There is something wrong with the driver if the "folio can be
> >> reallocated immediately".
> >>
> >> The flow generally expects there to be a driver allocator linked to
> >> folio_free()
> >>
> >> 1) Allocator finds free memory
> >> 2) zone_device_page_init() allocates the memory and makes refcount=1
> >> 3) __folio_put() knows the recount 0.
> >> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> >>    actually need to undo prep_compound_page() because *NOTHING* can
> >>    use the page pointer at this point.
> >
> > Correct—nothing can use the folio prior to calling folio_free(). Once
> > folio_free() returns, the driver side is free to immediately reallocate
> > the folio (or a subset of its pages).
> >
> >> 5) Driver puts the memory back into the allocator and now #1 can
> >>    happen. It knows how much memory to put back because folio->order
> >>    is valid from #2
> >> 6) #1 happens again, then #2 happens again and the folio is in the
> >>    right state for use. The successor #2 fully undoes the work of the
> >>    predecessor #2.
> >>
> >> If you have races where #1 can happen immediately after #3 then the
> >> driver design is fundamentally broken and passing around order isn't
> >> going to help anything.
> >>
> >
> > The above race does not exist; if it did, I agree we’d be solving
> > nothing here.
> >
> >> If the allocator is using the struct page memory then step #5 should
> >> also clean up the struct page with the allocator data before returning
> >> it to the allocator.
> >>
> >
> > We could move the call to free_zone_device_folio_prepare() [1] into the
> > driver-side implementation of ->folio_free() and drop the order argument
> > here. Zi didn’t particularly like that; he preferred calling
> > free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> > which is why this patch exists.
> 
> On a second thought, if calling free_zone_device_folio_prepare() in
> ->folio_free() works, feel free to do so.
> 

+1, testing this change right now and it does indeed work.

Matt

> >
> > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
> > actually need the order regardless of where
> > free_zone_device_folio_prepare() is called, but Nouveau does need the
> > order if free_zone_device_folio_prepare() is called before
> > ->folio_free().
> >
> > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
> >
> >> I vaugely remember talking about this before in the context of the Xe
> >> driver.. You can't just take an existing VRAM allocator and layer it
> >> on top of the folios and have it broadly ignore the folio_free
> >> callback.
> >>
> >
> > We are definitely not ignoring the ->folio_free callback—that is the
> > point at which we tell our VRAM allocator (DRM buddy) it is okay to
> > release the allocation and make it available for reuse.
> >
> > Matt
> >
> >> Jsaon
> 
> 
> Best Regards,
> Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Alistair Popple 3 weeks, 6 days ago

On 2026-01-13 at 10:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote:
> > On 12 Jan 2026, at 16:49, Matthew Brost wrote:
> > 
> > > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
> > >
> > > Hi, catching up here.
> > >
> > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> > >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> > >>>
> > >>>> On 1/12/26 08:35, Matthew Wilcox wrote:
> > >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> > >>>>>> The core MM splits the folio before calling folio_free, restoring the
> > >>>>>> zone pages associated with the folio to an initialized state (e.g.,
> > >>>>>> non-compound, pgmap valid, etc...). The order argument represents the
> > >>>>>> folio’s order prior to the split which can be used driver side to know
> > >>>>>> how many pages are being freed.
> > >>>>>
> > >>>>> This really feels like the wrong way to fix this problem.
> > >>>>>
> > >>>
> > >>> Hi Matthew,
> > >>>
> > >>> I think the wording is confusing, since the actual issue is that:
> > >>>
> > >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> > >>> 2. but free_zone_device_folio() never reverse the course,
> > >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> > >>>    be done before driver callback ->folio_free(), since once ->folio_free()
> > >>>    is called, the folio can be reallocated immediately,
> > >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> > >>>    the original order information, thus, folio_free() needs that for proper
> > >>>    device side ref manipulation.
> > >>
> > >> There is something wrong with the driver if the "folio can be
> > >> reallocated immediately".
> > >>
> > >> The flow generally expects there to be a driver allocator linked to
> > >> folio_free()
> > >>
> > >> 1) Allocator finds free memory
> > >> 2) zone_device_page_init() allocates the memory and makes refcount=1
> > >> 3) __folio_put() knows the recount 0.
> > >> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> > >>    actually need to undo prep_compound_page() because *NOTHING* can
> > >>    use the page pointer at this point.
> > >
> > > Correct—nothing can use the folio prior to calling folio_free(). Once
> > > folio_free() returns, the driver side is free to immediately reallocate
> > > the folio (or a subset of its pages).
> > >
> > >> 5) Driver puts the memory back into the allocator and now #1 can
> > >>    happen. It knows how much memory to put back because folio->order
> > >>    is valid from #2
> > >> 6) #1 happens again, then #2 happens again and the folio is in the
> > >>    right state for use. The successor #2 fully undoes the work of the
> > >>    predecessor #2.
> > >>
> > >> If you have races where #1 can happen immediately after #3 then the
> > >> driver design is fundamentally broken and passing around order isn't
> > >> going to help anything.
> > >>
> > >
> > > The above race does not exist; if it did, I agree we’d be solving
> > > nothing here.
> > >
> > >> If the allocator is using the struct page memory then step #5 should
> > >> also clean up the struct page with the allocator data before returning
> > >> it to the allocator.
> > >>
> > >
> > > We could move the call to free_zone_device_folio_prepare() [1] into the
> > > driver-side implementation of ->folio_free() and drop the order argument
> > > here. Zi didn’t particularly like that; he preferred calling
> > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> > > which is why this patch exists.
> > 
> > On a second thought, if calling free_zone_device_folio_prepare() in
> > ->folio_free() works, feel free to do so.

I think making drivers do this is the correct approach and is consistent with
what P2PDMA and DAX does. All the interfaces for mapping a ZONE_DEVICE folio
currently rely on the driver correctly initialising the folio, so this special
case for ZONE_DEVICE_PRIVATE/COHERENT seemed weird to me - they shouldn't rely
on the core-mm to do some of the re-initialisation in the free paths.

Also drivers may have different strategies than just resetting everything back
to small pages. For example the may choose to only ever allocate large folios
making the whole clearing/resetting of folio fields superfluous.

 - Alistair

> +1, testing this change right now and it does indeed work.
> 
> Matt
> 
> > >
> > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
> > > actually need the order regardless of where
> > > free_zone_device_folio_prepare() is called, but Nouveau does need the
> > > order if free_zone_device_folio_prepare() is called before
> > > ->folio_free().
> > >
> > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> > > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
> > >
> > >> I vaugely remember talking about this before in the context of the Xe
> > >> driver.. You can't just take an existing VRAM allocator and layer it
> > >> on top of the folios and have it broadly ignore the folio_free
> > >> callback.
> > >>
> > >
> > > We are definitely not ignoring the ->folio_free callback—that is the
> > > point at which we tell our VRAM allocator (DRM buddy) it is okay to
> > > release the allocation and make it available for reuse.
> > >
> > > Matt
> > >
> > >> Jsaon
> > 
> > 
> > Best Regards,
> > Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 3 weeks, 6 days ago

On Tue, Jan 13, 2026 at 10:44:27AM +1100, Alistair Popple wrote:

> Also drivers may have different strategies than just resetting everything back
> to small pages. For example the may choose to only ever allocate large folios
> making the whole clearing/resetting of folio fields superfluous.

+1

Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 3 weeks, 6 days ago

On 12 Jan 2026, at 8:45, Jason Gunthorpe wrote:

> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>>
>>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>>> The core MM splits the folio before calling folio_free, restoring the
>>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>>> folio’s order prior to the split which can be used driver side to know
>>>>> how many pages are being freed.
>>>>
>>>> This really feels like the wrong way to fix this problem.
>>>>
>>
>> Hi Matthew,
>>
>> I think the wording is confusing, since the actual issue is that:
>>
>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
>> 2. but free_zone_device_folio() never reverse the course,
>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>>    be done before driver callback ->folio_free(), since once ->folio_free()
>>    is called, the folio can be reallocated immediately,
>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>>    the original order information, thus, folio_free() needs that for proper
>>    device side ref manipulation.
>
> There is something wrong with the driver if the "folio can be
> reallocated immediately".
>
> The flow generally expects there to be a driver allocator linked to
> folio_free()
>
> 1) Allocator finds free memory
> 2) zone_device_page_init() allocates the memory and makes refcount=1
> 3) __folio_put() knows the recount 0.
> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>    actually need to undo prep_compound_page() because *NOTHING* can
>    use the page pointer at this point.
> 5) Driver puts the memory back into the allocator and now #1 can
>    happen. It knows how much memory to put back because folio->order
>    is valid from #2
> 6) #1 happens again, then #2 happens again and the folio is in the
>    right state for use. The successor #2 fully undoes the work of the
>    predecessor #2.

But how can a successor #2 undo the work if the second #1 only allocates
half of the original folio? For example, an order-9 at PFN 0 is
allocated and freed, then an order-8 at PFN 0 is allocated and another
order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
without corrupting each other’s data?


>
> If you have races where #1 can happen immediately after #3 then the
> driver design is fundamentally broken and passing around order isn't
> going to help anything.
>
> If the allocator is using the struct page memory then step #5 should
> also clean up the struct page with the allocator data before returning
> it to the allocator.

Do you mean ->folio_free() callback should undo prep_compound_page()
instead?

>
> I vaugely remember talking about this before in the context of the Xe
> driver.. You can't just take an existing VRAM allocator and layer it
> on top of the folios and have it broadly ignore the folio_free
> callback.


Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
> > folio_free()
> >
> > 1) Allocator finds free memory
> > 2) zone_device_page_init() allocates the memory and makes refcount=1
> > 3) __folio_put() knows the recount 0.
> > 4) free_zone_device_folio() calls folio_free(), but it doesn't
> >    actually need to undo prep_compound_page() because *NOTHING* can
> >    use the page pointer at this point.
> > 5) Driver puts the memory back into the allocator and now #1 can
> >    happen. It knows how much memory to put back because folio->order
> >    is valid from #2
> > 6) #1 happens again, then #2 happens again and the folio is in the
> >    right state for use. The successor #2 fully undoes the work of the
> >    predecessor #2.
> 
> But how can a successor #2 undo the work if the second #1 only allocates
> half of the original folio? For example, an order-9 at PFN 0 is
> allocated and freed, then an order-8 at PFN 0 is allocated and another
> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
> without corrupting each other’s data?

What do you mean? The fundamental rule is you can't read the folio or
the order outside folio_free once it's refcount reaches 0.

So the successor #2 will write updated heads and order to the order 8
pages at PFN 0 and the ones starting at PFN 256 will remain with
garbage.

This is OK because nothing is allowed to read them as their refcount
is 0.

If later PFN256 is allocated then it will get updated head and order
at the same time it's refcount becomes 1.

There is corruption and they don't corrupt each other's data.

> > If the allocator is using the struct page memory then step #5 should
> > also clean up the struct page with the allocator data before returning
> > it to the allocator.
> 
> Do you mean ->folio_free() callback should undo prep_compound_page()
> instead?

I wouldn't say undo, I was very careful to say it needs to get the
struct page memory into a state that the allocator algorithm expects,
whatever that means.

Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Matthew Brost 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 12:50:01PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
> > > folio_free()
> > >
> > > 1) Allocator finds free memory
> > > 2) zone_device_page_init() allocates the memory and makes refcount=1
> > > 3) __folio_put() knows the recount 0.
> > > 4) free_zone_device_folio() calls folio_free(), but it doesn't
> > >    actually need to undo prep_compound_page() because *NOTHING* can
> > >    use the page pointer at this point.
> > > 5) Driver puts the memory back into the allocator and now #1 can
> > >    happen. It knows how much memory to put back because folio->order
> > >    is valid from #2
> > > 6) #1 happens again, then #2 happens again and the folio is in the
> > >    right state for use. The successor #2 fully undoes the work of the
> > >    predecessor #2.
> > 
> > But how can a successor #2 undo the work if the second #1 only allocates
> > half of the original folio? For example, an order-9 at PFN 0 is
> > allocated and freed, then an order-8 at PFN 0 is allocated and another
> > order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
> > without corrupting each other’s data?
> 
> What do you mean? The fundamental rule is you can't read the folio or
> the order outside folio_free once it's refcount reaches 0.
> 
> So the successor #2 will write updated heads and order to the order 8
> pages at PFN 0 and the ones starting at PFN 256 will remain with
> garbage.
> 
> This is OK because nothing is allowed to read them as their refcount
> is 0.
> 
> If later PFN256 is allocated then it will get updated head and order
> at the same time it's refcount becomes 1.
> 
> There is corruption and they don't corrupt each other's data.
> 
> > > If the allocator is using the struct page memory then step #5 should
> > > also clean up the struct page with the allocator data before returning
> > > it to the allocator.
> > 
> > Do you mean ->folio_free() callback should undo prep_compound_page()
> > instead?
> 
> I wouldn't say undo, I was very careful to say it needs to get the
> struct page memory into a state that the allocator algorithm expects,
> whatever that means.
> 

Hi Jason,

A lot of back and forth with Zi — if I’m understanding correctly, your
suggestion is to just call free_zone_device_folio_prepare() [1] in
->folio_free() if required by the driver. This is the function that puts
struct page into a state my allocator expects. That works just fine for
me.

Matt

[1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4

> Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 3 weeks, 6 days ago

On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote:

> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
>>> folio_free()
>>>
>>> 1) Allocator finds free memory
>>> 2) zone_device_page_init() allocates the memory and makes refcount=1
>>> 3) __folio_put() knows the recount 0.
>>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>>>    actually need to undo prep_compound_page() because *NOTHING* can
>>>    use the page pointer at this point.
>>> 5) Driver puts the memory back into the allocator and now #1 can
>>>    happen. It knows how much memory to put back because folio->order
>>>    is valid from #2
>>> 6) #1 happens again, then #2 happens again and the folio is in the
>>>    right state for use. The successor #2 fully undoes the work of the
>>>    predecessor #2.
>>
>> But how can a successor #2 undo the work if the second #1 only allocates
>> half of the original folio? For example, an order-9 at PFN 0 is
>> allocated and freed, then an order-8 at PFN 0 is allocated and another
>> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
>> without corrupting each other’s data?
>
> What do you mean? The fundamental rule is you can't read the folio or
> the order outside folio_free once it's refcount reaches 0.

There is no such a rule. In core MM, folio_split(), which splits a high
order folio to low order ones, freezes the folio (turning refcount to 0)
and manipulates the folio order and all tail pages compound_head to
restructure the folio. Your fundamental rule breaks this.
Allowing compound information to stay after a folio is freed means
you cannot tell whether a folio is under split or freed.

>
> So the successor #2 will write updated heads and order to the order 8
> pages at PFN 0 and the ones starting at PFN 256 will remain with
> garbage.
>
> This is OK because nothing is allowed to read them as their refcount
> is 0.
>
> If later PFN256 is allocated then it will get updated head and order
> at the same time it's refcount becomes 1.
>
> There is corruption and they don't corrupt each other's data.
>
>>> If the allocator is using the struct page memory then step #5 should
>>> also clean up the struct page with the allocator data before returning
>>> it to the allocator.
>>
>> Do you mean ->folio_free() callback should undo prep_compound_page()
>> instead?
>
> I wouldn't say undo, I was very careful to say it needs to get the
> struct page memory into a state that the allocator algorithm expects,
> whatever that means.
>
> Jason


Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote:
> On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote:
> 
> > On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
> >>> folio_free()
> >>>
> >>> 1) Allocator finds free memory
> >>> 2) zone_device_page_init() allocates the memory and makes refcount=1
> >>> 3) __folio_put() knows the recount 0.
> >>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> >>>    actually need to undo prep_compound_page() because *NOTHING* can
> >>>    use the page pointer at this point.
> >>> 5) Driver puts the memory back into the allocator and now #1 can
> >>>    happen. It knows how much memory to put back because folio->order
> >>>    is valid from #2
> >>> 6) #1 happens again, then #2 happens again and the folio is in the
> >>>    right state for use. The successor #2 fully undoes the work of the
> >>>    predecessor #2.
> >>
> >> But how can a successor #2 undo the work if the second #1 only allocates
> >> half of the original folio? For example, an order-9 at PFN 0 is
> >> allocated and freed, then an order-8 at PFN 0 is allocated and another
> >> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
> >> without corrupting each other’s data?
> >
> > What do you mean? The fundamental rule is you can't read the folio or
> > the order outside folio_free once it's refcount reaches 0.
> 
> There is no such a rule. In core MM, folio_split(), which splits a high
> order folio to low order ones, freezes the folio (turning refcount to 0)
> and manipulates the folio order and all tail pages compound_head to
> restructure the folio.

That's different, I am talking about reaching 0 because it has been
freed, meaning there are no external pointers to it.

Further, when a page is frozen page_ref_freeze() takes in the number
of references the caller has ownership over and it doesn't succeed if
there are stray references elsewhere.

This is very important because the entire operating model of split
only works if it has exclusive locks over all the valid pointers into
that page.

Spurious refcount failures concurrent with split cannot be allowed.

I don't see how pointing at __folio_freeze_and_split_unmapped() can
justify this series.

> Your fundamental rule breaks this.  Allowing compound information
> to stay after a folio is freed means you cannot tell whether a folio
> is under split or freed.

You can't refcount a folio out of nothing. It has to come from a
memory location that already is holding a refcount, and then you can
incr it.

For example lockless GUP fast will read the PTE, adjust to the head
page, attempt to incr it, then recheck the PTE.

If there are races then sure maybe the PTE will point to a stray tail
page that refers to an already allocated head page, but the re-check
of the PTE wille exclude this. The refcount system already has to
tolerate spurious refcount incrs because of GUP fast.

Nothing should be looking at order and refcount to try to guess if
concurrent split is happening!!

Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 3 weeks, 6 days ago

On 12 Jan 2026, at 13:25, Jason Gunthorpe wrote:

> On Mon, Jan 12, 2026 at 12:46:57PM -0500, Zi Yan wrote:
>> On 12 Jan 2026, at 11:50, Jason Gunthorpe wrote:
>>
>>> On Mon, Jan 12, 2026 at 11:31:04AM -0500, Zi Yan wrote:
>>>>> folio_free()
>>>>>
>>>>> 1) Allocator finds free memory
>>>>> 2) zone_device_page_init() allocates the memory and makes refcount=1
>>>>> 3) __folio_put() knows the recount 0.
>>>>> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>>>>>    actually need to undo prep_compound_page() because *NOTHING* can
>>>>>    use the page pointer at this point.
>>>>> 5) Driver puts the memory back into the allocator and now #1 can
>>>>>    happen. It knows how much memory to put back because folio->order
>>>>>    is valid from #2
>>>>> 6) #1 happens again, then #2 happens again and the folio is in the
>>>>>    right state for use. The successor #2 fully undoes the work of the
>>>>>    predecessor #2.
>>>>
>>>> But how can a successor #2 undo the work if the second #1 only allocates
>>>> half of the original folio? For example, an order-9 at PFN 0 is
>>>> allocated and freed, then an order-8 at PFN 0 is allocated and another
>>>> order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
>>>> without corrupting each other’s data?
>>>
>>> What do you mean? The fundamental rule is you can't read the folio or
>>> the order outside folio_free once it's refcount reaches 0.
>>
>> There is no such a rule. In core MM, folio_split(), which splits a high
>> order folio to low order ones, freezes the folio (turning refcount to 0)
>> and manipulates the folio order and all tail pages compound_head to
>> restructure the folio.
>
> That's different, I am talking about reaching 0 because it has been
> freed, meaning there are no external pointers to it.
>
> Further, when a page is frozen page_ref_freeze() takes in the number
> of references the caller has ownership over and it doesn't succeed if
> there are stray references elsewhere.
>
> This is very important because the entire operating model of split
> only works if it has exclusive locks over all the valid pointers into
> that page.
>
> Spurious refcount failures concurrent with split cannot be allowed.
>
> I don't see how pointing at __folio_freeze_and_split_unmapped() can
> justify this series.
>

But from anyone looking at the folio state, refcount == 0, compound_head
is set, they cannot tell the difference.

If what you said is true, why is free_pages_prepare() needed? No one
should touch these free pages. Why bother resetting these states.

>> Your fundamental rule breaks this.  Allowing compound information
>> to stay after a folio is freed means you cannot tell whether a folio
>> is under split or freed.
>
> You can't refcount a folio out of nothing. It has to come from a
> memory location that already is holding a refcount, and then you can
> incr it.

Right. There is also no guarantee that all code is correct and follows
this.

My point here is that calling prep_compound_page() on a compound page
does not follow core MM’s conventions.

Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote:
> > That's different, I am talking about reaching 0 because it has been
> > freed, meaning there are no external pointers to it.
> >
> > Further, when a page is frozen page_ref_freeze() takes in the number
> > of references the caller has ownership over and it doesn't succeed if
> > there are stray references elsewhere.
> >
> > This is very important because the entire operating model of split
> > only works if it has exclusive locks over all the valid pointers into
> > that page.
> >
> > Spurious refcount failures concurrent with split cannot be allowed.
> >
> > I don't see how pointing at __folio_freeze_and_split_unmapped() can
> > justify this series.
> >
> 
> But from anyone looking at the folio state, refcount == 0, compound_head
> is set, they cannot tell the difference.

This isn't reliable, nothing correct can be doing it :\

> If what you said is true, why is free_pages_prepare() needed? No one
> should touch these free pages. Why bother resetting these states.

? that function does alot of stuff, thinks like uncharging the cgroup
should obviously happen at free time.

What part of it are you looking at?

> > You can't refcount a folio out of nothing. It has to come from a
> > memory location that already is holding a refcount, and then you can
> > incr it.
> 
> Right. There is also no guarantee that all code is correct and follows
> this.

Let's concretely point at things that have a problem please.

> My point here is that calling prep_compound_page() on a compound page
> does not follow core MM’s conventions.

Maybe, but that doesn't mean it isn't the right solution..

Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 3 weeks, 6 days ago

On 12 Jan 2026, at 14:28, Jason Gunthorpe wrote:

> On Mon, Jan 12, 2026 at 01:55:18PM -0500, Zi Yan wrote:
>>> That's different, I am talking about reaching 0 because it has been
>>> freed, meaning there are no external pointers to it.
>>>
>>> Further, when a page is frozen page_ref_freeze() takes in the number
>>> of references the caller has ownership over and it doesn't succeed if
>>> there are stray references elsewhere.
>>>
>>> This is very important because the entire operating model of split
>>> only works if it has exclusive locks over all the valid pointers into
>>> that page.
>>>
>>> Spurious refcount failures concurrent with split cannot be allowed.
>>>
>>> I don't see how pointing at __folio_freeze_and_split_unmapped() can
>>> justify this series.
>>>
>>
>> But from anyone looking at the folio state, refcount == 0, compound_head
>> is set, they cannot tell the difference.
>
> This isn't reliable, nothing correct can be doing it :\
>
>> If what you said is true, why is free_pages_prepare() needed? No one
>> should touch these free pages. Why bother resetting these states.
>
> ? that function does alot of stuff, thinks like uncharging the cgroup
> should obviously happen at free time.
>
> What part of it are you looking at?

page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order.

free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and
compound_head at the end.

page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound
pages.

These three parts undo prep_compound_page().

>
>>> You can't refcount a folio out of nothing. It has to come from a
>>> memory location that already is holding a refcount, and then you can
>>> incr it.
>>
>> Right. There is also no guarantee that all code is correct and follows
>> this.
>
> Let's concretely point at things that have a problem please.
>
>> My point here is that calling prep_compound_page() on a compound page
>> does not follow core MM’s conventions.
>
> Maybe, but that doesn't mean it isn't the right solution..

In current nouveau code, ->free_folios is used holding the freed folio.
In nouveau_dmem_page_alloc_locked(), the freed folio is passed to
zone_device_folio_init(). If the allocated folio order is different
from the freed folio order, I do not know how you are going to keep
track of the rest of the freed folio. Of course you can implement a
buddy allocator there.

If this still does not convince you that overwriting an existing compound
page with a different order configuration is a bad idea, feel free to
do whatever you think it is right.

Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Jason Gunthorpe 3 weeks, 6 days ago

On Mon, Jan 12, 2026 at 06:34:06PM -0500, Zi Yan wrote:
> page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order.
> 
> free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and
> compound_head at the end.
> 
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound
> pages.
> 
> These three parts undo prep_compound_page().

Well, mm doesn't clear all things on alloc..

> In current nouveau code, ->free_folios is used holding the freed folio.
> In nouveau_dmem_page_alloc_locked(), the freed folio is passed to
> zone_device_folio_init(). If the allocated folio order is different
> from the freed folio order, I do not know how you are going to keep
> track of the rest of the freed folio. Of course you can implement a
> buddy allocator there.

nouveau doesn't support high order folios.

A simple linked list is not really a suitable data structure to ever
support high order folios with.. If it were to use such a thing, and
did want to take a high order folio off the list, and reduce its
order, then it would have to put the remainder back on the list with a
revised order value. That's all, nothing hard.

Again if the driver needs to store information in the struct page to
manage its free list mechanism (ie linked pointers, order, whatever)
then it should be doing that directly.

When it takes the memory range off the free list it should call
zone_device_page_init() to make it ready to be used again. I think it
is a poor argument to say that zone_device_page_init() should rely on
values already in the struct page to work properly :\

The usable space within the struct page, and what values must be fixed
for correct system function, should exactly mirror what frozen pages
require. After free it is effectively now a frozen page owned by the
device driver.

I haven't seen any documentation on that, but I suspect Matthew and
David have some ideas..

If there is a reason for order, flags and mapping to be something
particular then it should flow from the definition of frozen pages,
and be documented, IMHO.

Jason

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Zi Yan 3 weeks, 6 days ago

On 12 Jan 2026, at 18:53, Jason Gunthorpe wrote:

> On Mon, Jan 12, 2026 at 06:34:06PM -0500, Zi Yan wrote:
>> page[1].flags.f &= ~PAGE_FLAGS_SECOND. It clears folio->order.
>>
>> free_tail_page_prepare() clears ->mapping, which is TAIL_MAPPING, and
>> compound_head at the end.
>>
>> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP. It clears PG_head for compound
>> pages.
>>
>> These three parts undo prep_compound_page().
>
> Well, mm doesn't clear all things on alloc..
>
>> In current nouveau code, ->free_folios is used holding the freed folio.
>> In nouveau_dmem_page_alloc_locked(), the freed folio is passed to
>> zone_device_folio_init(). If the allocated folio order is different
>> from the freed folio order, I do not know how you are going to keep
>> track of the rest of the freed folio. Of course you can implement a
>> buddy allocator there.
>
> nouveau doesn't support high order folios.
>
> A simple linked list is not really a suitable data structure to ever
> support high order folios with.. If it were to use such a thing, and
> did want to take a high order folio off the list, and reduce its
> order, then it would have to put the remainder back on the list with a
> revised order value. That's all, nothing hard.
>
> Again if the driver needs to store information in the struct page to
> manage its free list mechanism (ie linked pointers, order, whatever)
> then it should be doing that directly.
>
> When it takes the memory range off the free list it should call
> zone_device_page_init() to make it ready to be used again. I think it
> is a poor argument to say that zone_device_page_init() should rely on
> values already in the struct page to work properly :\
>
> The usable space within the struct page, and what values must be fixed
> for correct system function, should exactly mirror what frozen pages
> require. After free it is effectively now a frozen page owned by the
> device driver.
>
> I haven't seen any documentation on that, but I suspect Matthew and
> David have some ideas..
>
> If there is a reason for order, flags and mapping to be something
> particular then it should flow from the definition of frozen pages,
> and be documented, IMHO.

Thank you for the explanation.

It seems that I do not have enough knowledge to comment on device private
pages. I will refrain myself from doing so from now on

Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Balbir Singh 4 weeks ago

On 1/12/26 10:51, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> 
>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>> The core MM splits the folio before calling folio_free, restoring the
>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>> folio’s order prior to the split which can be used driver side to know
>>>> how many pages are being freed.
>>>
>>> This really feels like the wrong way to fix this problem.
>>>
> 
> Hi Matthew,
> 
> I think the wording is confusing, since the actual issue is that:
> 
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>    be done before driver callback ->folio_free(), since once ->folio_free()
>    is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>    the original order information, thus, folio_free() needs that for proper
>    device side ref manipulation.
> 
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
> 
>>
>> This stems from a special requirement, freeing is done in two phases
>>
>> 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
>> 2. Return the folio back, split it back to single order folios
> 
> Hi Balbir,
> 
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
> 

Yeah, the word split came from the initial version that called it folio_split_unref()
and I was also thinking of the split callback for zone device folios, but I agree
(re)initialization is a better term.

>>
>> The current code does not do 2. 1 followed by 2 does not work for
>> Francois since the backing memory can get reused before we reach step 2.
>> The proposed patch does 2 followed 1, but doing 2 means we've lost the
>> folio order and thus the old order is passed in. Although, I wonder if the
>> backing folio's zone_device_data can be used to encode any order information
>> about the device side allocation.
>>
>> @Francois, I hope I did not miss anything in the explanation above.
>>
>>> I think someone from the graphics side really needs to take the lead on
>>> understanding what the MM is doing (both currently and in the future).
>>> I'm happy to work with you, but it feels like there's a lot of churn right
>>> now because there's a lot of people working on this without understanding
>>> the MM side of things (and conversely, I don't think (m)any people on the
>>> MM side really understand what graphics cards are trying to accomplish).
>>>
>>
>> I suspect you are referring to folio specialization and/or downsizing?
>>
>>> Who is that going to be?  I'm happy to get on the phone with someone.
>>
>> Happy to work with you, but I am not the authority on graphics, I can speak
>> to zone device folios. I suspect we'd need to speak to more than one person.
>>
> 
> --
> Best Regards,
> Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback

Posted by Matthew Brost 4 weeks ago

On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> 
> > On 1/12/26 08:35, Matthew Wilcox wrote:
> >> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> >>> The core MM splits the folio before calling folio_free, restoring the
> >>> zone pages associated with the folio to an initialized state (e.g.,
> >>> non-compound, pgmap valid, etc...). The order argument represents the
> >>> folio’s order prior to the split which can be used driver side to know
> >>> how many pages are being freed.
> >>
> >> This really feels like the wrong way to fix this problem.
> >>
> 
> Hi Matthew,
> 
> I think the wording is confusing, since the actual issue is that:
> 
> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> 2. but free_zone_device_folio() never reverse the course,
> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>    be done before driver callback ->folio_free(), since once ->folio_free()
>    is called, the folio can be reallocated immediately,
> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
>    the original order information, thus, folio_free() needs that for proper
>    device side ref manipulation.
> 
> So this is not used for "split" but undo of prep_compound_page(). It might
> look like a split to non core MM people, since it changes a large folio
> to a bunch of base pages. BTW, core MM has no compound_page_dctor() but
> open codes it in free_pages_prepare() by resetting page flags, page->mapping,
> and so on. So it might be why the undo prep_compound_page() is missed
> by non core MM people.
> 

Let me try to reword this while avoiding the term “split” and properly
explaining the problem.

> >
> > This stems from a special requirement, freeing is done in two phases
> >
> > 1. Free the folio -> inform the driver (which implies freeing the backing device memory)
> > 2. Return the folio back, split it back to single order folios
> 
> Hi Balbir,
> 
> Please refrain from using "split" here, since it confuses MM people. A folio
> is split when it is still in use, but in this case, the folio has been freed
> and needs to be restored to "free page" state.
> 

Yeah, “split” is a bad term. We are reinitializing all zone pages in a
folio upon free.

> >
> > The current code does not do 2. 1 followed by 2 does not work for
> > Francois since the backing memory can get reused before we reach step 2.
> > The proposed patch does 2 followed 1, but doing 2 means we've lost the
> > folio order and thus the old order is passed in. Although, I wonder if the
> > backing folio's zone_device_data can be used to encode any order information
> > about the device side allocation.
> >
> > @Francois, I hope I did not miss anything in the explanation above.

Yes, correct. The pages in the folio must be reinitialized before
calling into the driver to free them, because once that happens, the
pages can be immediately reallocated.

> >
> >> I think someone from the graphics side really needs to take the lead on
> >> understanding what the MM is doing (both currently and in the future).
> >> I'm happy to work with you, but it feels like there's a lot of churn right
> >> now because there's a lot of people working on this without understanding
> >> the MM side of things (and conversely, I don't think (m)any people on the
> >> MM side really understand what graphics cards are trying to accomplish).

I can’t disagree with anything you’re saying. The core MM is about as
complex as it gets, and my understanding of what’s going on isn’t
great—it’s basically just reverse engineering until I reach a point
where I can fix a problem, think it’s correct, and hope I don’t get
shredded.

Graphics/DRM is also quite complex, but that’s where I work...
> >>
> >
> > I suspect you are referring to folio specialization and/or downsizing?
> >
> >> Who is that going to be?  I'm happy to get on the phone with someone.
> >
> > Happy to work with you, but I am not the authority on graphics, I can speak
> > to zone device folios. I suspect we'd need to speak to more than one person.
> >

Also happy to work with you, but I agree with Zi—graphics isn’t
something one company can speak as an authority on, much less one
person.

Matt

> 
> --
> Best Regards,
> Yan, Zi