[PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper

Posted by Francois Dugast 4 weeks ago
From: Matthew Brost <matthew.brost@intel.com>

Add free_zone_device_folio_prepare(), a helper that restores large
ZONE_DEVICE folios to a sane, initial state before freeing them.

Creating a compound ZONE_DEVICE folio overwrites per-page state (e.g.
pgmap and compound metadata). Before such pages are returned to the
device pgmap allocator, each constituent page must be reset to a
standalone ZONE_DEVICE folio with a valid pgmap and no compound state.

Use this helper prior to the pgmap's ->folio_free() callback for
device-private and device-coherent folios to ensure consistent device
page state for subsequent allocations.

Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-cxl@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 include/linux/memremap.h |  1 +
 mm/memremap.c            | 55 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 97fcffeb1c1e..88e1d4707296 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
 
 #ifdef CONFIG_ZONE_DEVICE
 void zone_device_page_init(struct page *page, unsigned int order);
+void free_zone_device_folio_prepare(struct folio *folio);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
diff --git a/mm/memremap.c b/mm/memremap.c
index 39dc4bd190d0..375a61e18858 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
 }
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
+/**
+ * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
+ * @folio: ZONE_DEVICE folio to prepare for release.
+ *
+ * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
+ * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
+ * must be restored to a sane ZONE_DEVICE state before they are released.
+ *
+ * This helper:
+ *   - Clears @folio->mapping and, for compound folios, clears each page's
+ *     compound-head state (ClearPageHead()/clear_compound_head()).
+ *   - Resets the compound order metadata (folio_reset_order()) and then
+ *     initializes each constituent page as a standalone ZONE_DEVICE folio:
+ *       * clears ->mapping
+ *       * restores ->pgmap (prep_compound_page() overwrites it)
+ *       * clears ->share (only relevant for fsdax; unused for device-private)
+ *
+ * If @folio is order-0, only the mapping is cleared and no further work is
+ * required.
+ */
+void free_zone_device_folio_prepare(struct folio *folio)
+{
+	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
+	int order, i;
+
+	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
+
+	folio->mapping = NULL;
+	order = folio_order(folio);
+	if (!order)
+		return;
+
+	folio_reset_order(folio);
+
+	for (i = 0; i < (1UL << order); i++) {
+		struct page *page = folio_page(folio, i);
+		struct folio *new_folio = (struct folio *)page;
+
+		ClearPageHead(page);
+		clear_compound_head(page);
+
+		new_folio->mapping = NULL;
+		/*
+		 * Reset pgmap which was over-written by
+		 * prep_compound_page().
+		 */
+		new_folio->pgmap = pgmap;
+		new_folio->share = 0;	/* fsdax only, unused for device private */
+		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
+		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
+	}
+}
+EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
+
 void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
@@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
 			break;
+		free_zone_device_folio_prepare(folio);
 		pgmap->ops->folio_free(folio, order);
 		percpu_ref_put_many(&folio->pgmap->ref, nr);
 		break;
-- 
2.43.0
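For illustration, a minimal sketch of the driver side this enables
(my_folio_free() and my_vram_free_range() are hypothetical names, not part
of this series). Because free_zone_device_folio_prepare() runs before the
pgmap's ->folio_free() callback, every constituent page reaches the driver
already reset to a standalone order-0 ZONE_DEVICE folio, so the driver only
needs to return the physical range to its own allocator:

	/* Hypothetical driver callback; names are illustrative only. */
	static void my_folio_free(struct folio *folio, unsigned int order)
	{
		/*
		 * free_zone_device_folio_prepare() already ran in
		 * free_zone_device_folio(): all 1 << order pages are now
		 * independent order-0 ZONE_DEVICE folios with a valid
		 * ->pgmap, whatever order they were allocated at.
		 */
		my_vram_free_range(folio_pfn(folio), 1UL << order);
	}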
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Balbir Singh 4 weeks ago
On 1/12/26 06:55, Francois Dugast wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> Add free_zone_device_folio_prepare(), a helper that restores large
> ZONE_DEVICE folios to a sane, initial state before freeing them.
> 
> [snip]
> 
> +		/*
> +		 * Reset pgmap which was over-written by
> +		 * prep_compound_page().
> +		 */
> +		new_folio->pgmap = pgmap;
> +		new_folio->share = 0;	/* fsdax only, unused for device private */
> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);

Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
that PMD_ORDER more frees than we'd like?

> [snip]

Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 4 weeks ago
On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> On 1/12/26 06:55, Francois Dugast wrote:
> > [snip]
> > 
> > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> 
> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> that PMD_ORDER more frees than we'd like?
> 

No, calling free_folio() more often doesn’t solve anything—in fact, that
would make my implementation explode. I explained this in detail here [1]
to Zi.

To recap [1], my memory allocator has no visibility into individual
pages or folios; it is DRM Buddy layered on top of TTM BO. This design
allows VRAM to be allocated or evicted for both traditional GPU
allocations (GEMs) and SVM allocations.

Now, to recap the actual issue: if device folios are not split upon free
and are later reallocated with a different order in
zone_device_page_init, the implementation breaks. This problem is not
specific to Xe—Nouveau happens to always allocate at the same order, so
it works by coincidence. Reallocating at a different order is valid
behavior and must be supported.

Matt

[1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
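To make the failure mode concrete, here is a sketch of the different-order
reallocation described above (the pfn and orders are arbitrary; this is not
code from the series):

	struct page *head = pfn_to_page(pfn);		/* old order-9 head */
	struct page *tail = pfn_to_page(pfn + 16);	/* one of its old tails */

	zone_device_page_init(head, 9);	/* first allocation, order 9 */
	/* ... folio freed without its compound state being torn down ... */
	zone_device_page_init(tail, 0);	/* reallocation at order 0 */
	/*
	 * Without the reset on free, 'tail' still carries compound_head
	 * pointing at 'head' and a ->pgmap overwritten by
	 * prep_compound_page(), so page_folio(tail) resolves to the stale
	 * head instead of the new standalone order-0 folio.
	 */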

> [snip]
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Alistair Popple 3 weeks, 6 days ago
On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > [snip]
> > > 
> > > + */
> > > +void free_zone_device_folio_prepare(struct folio *folio)

I don't really like the naming here - we're not preparing a folio to be
freed, from the core-mm perspective the folio is already free. This is just
reinitialising the folio metadata ready for the driver to reuse it, which may
actually involve just recreating a compound folio.

So maybe zone_device_folio_reinitialise()? Or would it be possible to
roll this into a zone_device_folio_init() type function (similar to
zone_device_page_init()) that just deals with everything at allocation time?

> > > [snip]
> 
> To recap [1], my memory allocator has no visibility into individual
> pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> allows VRAM to be allocated or evicted for both traditional GPU
> allocations (GEMs) and SVM allocations.
> 
> Now, to recap the actual issue: if device folios are not split upon free
> and are later reallocated with a different order in
> zone_device_page_init, the implementation breaks. This problem is not
> specific to Xe—Nouveau happens to always allocate at the same order, so
> it works by coincidence. Reallocating at a different order is valid
> behavior and must be supported.

I agree it's probably by coincidence but it is a perfectly valid design to
always just (re)allocate at the same order and not worry about having to
reinitialise things to different orders.

 - Alistair

> [snip]
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 3 weeks, 6 days ago
On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > [snip]
> > > > 
> > > > + */
> > > > +void free_zone_device_folio_prepare(struct folio *folio)
> 
> I don't really like the naming here - we're not preparing a folio to be
> freed, from the core-mm perspective the folio is already free. This is just
> reinitialising the folio metadata ready for the driver to reuse it, which may
> actually involve just recreating a compound folio.
> 
> So maybe zone_device_folio_reinitialise()? Or would it be possible to

zone_device_folio_reinitialise - that works for me... but seems like
everyone has an opinion.

> roll this into a zone_device_folio_init() type function (similar to
> zone_device_page_init()) that just deals with everything at allocation time?
> 

I don’t think doing this at allocation actually works without a big lock
per pgmap. Consider the case where a VRAM allocator allocates two
distinct subsets of a large folio and you have a multi-threaded GPU page
fault handler (Xe does). It’s possible two threads could call
zone_device_folio_reinitialise at the same time, racing and causing all
sorts of issues. My plan is to just call this function in the driver’s
->folio_free() prior to returning the VRAM allocation to my driver pool.
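A sketch of the interleaving described above (the page counts are
illustrative, following the 16-pages-of-an-order-9-folio example given
later in the thread):

	/*
	 *   Thread A (GPU fault)           Thread B (GPU fault)
	 *   --------------------           --------------------
	 *   allocates pages 0..15 of a     allocates pages 16..31 of the
	 *   freed order-9 folio            same freed order-9 folio
	 *   reinitialises the folio,       reinitialises the same folio,
	 *   rewriting the shared head      concurrently rewriting the same
	 *   page and per-page state        head page and per-page state
	 *
	 * Each thread rewrites metadata for pages the other owns, so doing
	 * the reset at free time, while one thread still holds the whole
	 * folio, avoids the race without a per-pgmap lock.
	 */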

> > [snip]
> 
> I agree it's probably by coincidence but it is a perfectly valid design to
> always just (re)allocate at the same order and not worry about having to
> reinitialise things to different orders.
> 

I would agree with this statement too — it’s perfectly valid if a driver
always wants to (re)allocate at the same order.

Matt

> [snip]
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Alistair Popple 3 weeks, 6 days ago
On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> [snip]
> > 
> > I don't really like the naming here - we're not preparing a folio to be
> > freed, from the core-mm perspective the folio is already free. This is just
> > reinitialising the folio metadata ready for the driver to reuse it, which may
> > actually involve just recreating a compound folio.
> > 
> > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> 
> zone_device_folio_reinitialise - that works for me... but seems like
> everyone has an opinion.

Well of course :) There are only two hard problems in programming and
I forget the other one. But I didn't want to just say I don't like
free_zone_device_folio_prepare() without offering an alternative, I'd be open
to others.

> 
> > roll this into a zone_device_folio_init() type function (similar to
> > zone_device_page_init()) that just deals with everything at allocation time?
> > 
> 
> I don’t think doing this at allocation actually works without a big lock
> per pgmap. Consider the case where a VRAM allocator allocates two
> distinct subsets of a large folio and you have a multi-threaded GPU page
> fault handler (Xe does). It’s possible two threads could call
> zone_device_folio_reinitialise at the same time, racing and causing all
> sorts of issues. My plan is to just call this function in the driver’s
> ->folio_free() prior to returning the VRAM allocation to my driver pool.

This doesn't make sense to me (at least as someone who doesn't know DRM SVM
intimately) - the folio metadata initialisation should only happen after the
VRAM allocation has occurred.

IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
physical range you just initialise the folio/pages associated with that range
with zone_device_folio_(re)initialise() and you're done.

Is the concern that reinitialisation would touch pages outside of the allocated
VRAM range if it was previously a large folio?

> > > > > +{
> > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > +	int order, i;
> > > > > +
> > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > +
> > > > > +	folio->mapping = NULL;
> > > > > +	order = folio_order(folio);
> > > > > +	if (!order)
> > > > > +		return;
> > > > > +
> > > > > +	folio_reset_order(folio);
> > > > > +
> > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > +		struct page *page = folio_page(folio, i);
> > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > +
> > > > > +		ClearPageHead(page);
> > > > > +		clear_compound_head(page);
> > > > > +
> > > > > +		new_folio->mapping = NULL;
> > > > > +		/*
> > > > > +		 * Reset pgmap which was over-written by
> > > > > +		 * prep_compound_page().
> > > > > +		 */
> > > > > +		new_folio->pgmap = pgmap;
> > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > 
> > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > that PMD_ORDER more frees than we'd like?
> > > > 
> > > 
> > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > would make my implementation explode. I explained this in detail here [1]
> > > to Zi.
> > > 
> > > To recap [1], my memory allocator has no visibility into individual
> > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > allows VRAM to be allocated or evicted for both traditional GPU
> > > allocations (GEMs) and SVM allocations.
> > > 
> > > Now, to recap the actual issue: if device folios are not split upon free
> > > and are later reallocated with a different order in
> > > zone_device_page_init, the implementation breaks. This problem is not
> > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > it works by coincidence. Reallocating at a different order is valid
> > > behavior and must be supported.
> > 
> > I agree it's probably by coincidence but it is a perfectly valid design to
> > always just (re)allocate at the same order and not worry about having to
> > reinitialise things to different orders.
> > 
> 
> I would agree with this statement too — it’s perfectly valid if a driver
> always wants to (re)allocate at the same order.
> 
> Matt
> 
> >  - Alistair
> > 
> > > Matt
> > > 
> > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > 
> > > > > +	}
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > +
> > > > >  void free_zone_device_folio(struct folio *folio)
> > > > >  {
> > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > >  			break;
> > > > > +		free_zone_device_folio_prepare(folio);
> > > > >  		pgmap->ops->folio_free(folio, order);
> > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > >  		break;
> > > > 
> > > > Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 3 weeks, 6 days ago
On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> [snip]
> Well of course :) There are only two hard problems in programming and
> I forget the other one. But I didn't want to just say I don't like
> free_zone_device_folio_prepare() without offering an alternative, I'd be open
> to others.
> 

zone_device_folio_reinitialise is good with me.

> > 
> > > roll this into a zone_device_folio_init() type function (similar to
> > > zone_device_page_init()) that just deals with everything at allocation time?
> > > 
> > 
> > I don’t think doing this at allocation actually works without a big lock
> > per pgmap. Consider the case where a VRAM allocator allocates two
> > distinct subsets of a large folio and you have a multi-threaded GPU page
> > fault handler (Xe does). It’s possible two threads could call
> > zone_device_folio_reinitialise at the same time, racing and causing all
> > sorts of issues. My plan is to just call this function in the driver’s
> > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> 
> This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> intimately) - the folio metadata initialisation should only happen after the
> VRAM allocation has occurred.
> 
> IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
> physical range you just initialise the folio/pages associated with that range
> with zone_device_folio_(re)initialise() and you're done.
> 

Our VRAM allocator does have locking (via DRM buddy), but that layer
doesn’t have visibility into the folio or its pages. By the time we
handle the folio/pages in the GPU fault handler, there are no global
locks preventing two GPU faults from each having, say, 16 pages from the
same order-9 folio. I believe if both threads call
zone_device_folio_reinitialise/init at the same time, bad things could
happen.

> Is the concern that reinitialisation would touch pages outside of the allocated
> VRAM range if it was previously a large folio?

No, just two threads calling zone_device_folio_reinitialise/init at the
same time, on the same folio.

If we call zone_device_folio_reinitialise in ->folio_free this problem
goes away. We could solve this with split_lock or something but I'd
prefer not to add a lock for this (although some prior revs did do
this, maybe we will revisit this later).

Anyways - this falls under driver detail / choice IMO.

Matt

> [snip]
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Alistair Popple 3 weeks, 6 days ago
On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > > > > > On 1/12/26 06:55, Francois Dugast wrote:
> > > > > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > > > > 
> > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large
> > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them.
> > > > > > > 
> > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > > > > > > compound metadata). Before returning such pages to the device pgmap
> > > > > > > allocator, each constituent page must be reset to a standalone
> > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state.
> > > > > > > 
> > > > > > > Use this helper prior to folio_free() for device-private and
> > > > > > > device-coherent folios to ensure consistent device page state for
> > > > > > > subsequent allocations.
> > > > > > > 
> > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > > > > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > > > > Cc: David Hildenbrand <david@kernel.org>
> > > > > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > > Cc: Balbir Singh <balbirs@nvidia.com>
> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > Cc: Alistair Popple <apopple@nvidia.com>
> > > > > > > Cc: linux-mm@kvack.org
> > > > > > > Cc: linux-cxl@vger.kernel.org
> > > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com>
> > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > > > > > > ---
> > > > > > >  include/linux/memremap.h |  1 +
> > > > > > >  mm/memremap.c            | 55 ++++++++++++++++++++++++++++++++++++++++
> > > > > > >  2 files changed, 56 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > > index 97fcffeb1c1e..88e1d4707296 100644
> > > > > > > --- a/include/linux/memremap.h
> > > > > > > +++ b/include/linux/memremap.h
> > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> > > > > > >  
> > > > > > >  #ifdef CONFIG_ZONE_DEVICE
> > > > > > >  void zone_device_page_init(struct page *page, unsigned int order);
> > > > > > > +void free_zone_device_folio_prepare(struct folio *folio);
> > > > > > >  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > > > > > >  void memunmap_pages(struct dev_pagemap *pgmap);
> > > > > > >  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > > > index 39dc4bd190d0..375a61e18858 100644
> > > > > > > --- a/mm/memremap.c
> > > > > > > +++ b/mm/memremap.c
> > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > > > > > >  
> > > > > > > +/**
> > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release.
> > > > > > > + *
> > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released.
> > > > > > > + *
> > > > > > > + * This helper:
> > > > > > > + *   - Clears @folio->mapping and, for compound folios, clears each page's
> > > > > > > + *     compound-head state (ClearPageHead()/clear_compound_head()).
> > > > > > > + *   - Resets the compound order metadata (folio_reset_order()) and then
> > > > > > > + *     initializes each constituent page as a standalone ZONE_DEVICE folio:
> > > > > > > + *       * clears ->mapping
> > > > > > > + *       * restores ->pgmap (prep_compound_page() overwrites it)
> > > > > > > + *       * clears ->share (only relevant for fsdax; unused for device-private)
> > > > > > > + *
> > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is
> > > > > > > + * required.
> > > > > > > + */
> > > > > > > +void free_zone_device_folio_prepare(struct folio *folio)
> > > > 
> > > > I don't really like the naming here - we're not preparing a folio to be
> > > > freed; from the core-mm perspective the folio is already free. This is just
> > > > reinitialising the folio metadata ready for the driver to reuse it, which may
> > > > actually involve just recreating a compound folio.
> > > > 
> > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> > > 
> > > zone_device_folio_reinitialise - that works for me... but it seems like
> > > everyone has an opinion.
> > 
> > Well of course :) There are only two hard problems in programming and
> > I forget the other one. But I didn't want to just say I don't like
> > free_zone_device_folio_prepare() without offering an alternative, I'd be open
> > to others.
> > 
> 
> zone_device_folio_reinitialise is good with me.
> 
> > > 
> > > > roll this into a zone_device_folio_init() type function (similar to
> > > > zone_device_page_init()) that just deals with everything at allocation time?
> > > > 
> > > 
> > > I don’t think doing this at allocation actually works without a big lock
> > > per pgmap. Consider the case where a VRAM allocator allocates two
> > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > fault handler (Xe does). It’s possible two threads could call
> > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > sorts of issues. My plan is to just call this function in the driver’s
> > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > 
> > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > intimately) - the folio metadata initialisation should only happen after the
> > VRAM allocation has occurred.
> > 
> > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
> > physical range you just initialise the folio/pages associated with that range
> > with zone_device_folio_(re)initialise() and you're done.
> > 
> 
> Our VRAM allocator does have locking (via DRM buddy), but that layer

I mean I assumed it did :-)

> doesn’t have visibility into the folio or its pages. By the time we
> handle the folio/pages in the GPU fault handler, there are no global
> locks preventing two GPU faults from each having, say, 16 pages from the
> same order-9 folio. I believe if both threads call
> zone_device_folio_reinitialise/init at the same time, bad things could
> happen.

This is confusing to me. If you are getting a GPU fault it implies no page is
mapped at a particular virtual address. The normal process (or at least the
process I'm familiar with) for handling this is to allocate and map a page at
the faulting virtual address. So in the scenario of two GPUs faulting on the
same VA each thread would allocate VRAM using DRM buddy, presumably getting
different physical pages, and so the zone_device_folio_init() call would be to
different folios/pages.

Then eventually one thread would succeed in creating the mapping from VA->VRAM
and the losing thread would free the VRAM allocation back to DRM buddy.
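
In code, the flow I have in mind is roughly the sketch below. Everything
driver-side here (gpu_vm, drm_buddy_alloc_vram(), gpu_map_range(),
drm_buddy_free_vram()) is a made-up name purely for illustration; only
zone_device_page_init() is a real interface:

static int handle_gpu_fault(struct gpu_vm *vm, unsigned long addr)
{
	struct page *page;

	/* Each faulting thread gets its own physical VRAM range. */
	page = drm_buddy_alloc_vram(vm, PAGE_SIZE);	/* hypothetical */
	if (!page)
		return -ENOMEM;

	/* Initialise the ZONE_DEVICE metadata for the allocated page. */
	zone_device_page_init(page, 0);

	/* Only one thread wins the VA->VRAM mapping... */
	if (!gpu_map_range(vm, addr, page)) {		/* hypothetical */
		/* ...and the loser hands its allocation back to DRM buddy. */
		drm_buddy_free_vram(vm, page);		/* hypothetical */
		return -EAGAIN;
	}

	return 0;
}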

So I'm a bit confused by the above statement that two GPU faults could each
have the same pages or be calling zone_device_folio_init() on the same pages.
How would that happen?

> > Is the concern that reinitialisation would touch pages outside of the allocated
> > VRAM range if it was previously a large folio?
> 
> No just two threads call zone_device_folio_reinitialise/init at the same
> time, on the same folio.
> 
> If we call zone_device_folio_reinitialise in ->folio_free this problem
> goes away. We could solve this with split_lock or something but I'd
> prefer not to add a lock for this (although some of the prior revs did do
> this; maybe we will revisit this later).
> 
> Anyways - this falls in driver detail / choice IMO.

Agreed.

 - Alistair

> Matt
> 
> > 
> > > > > > > +{
> > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > +	int order, i;
> > > > > > > +
> > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > +
> > > > > > > +	folio->mapping = NULL;
> > > > > > > +	order = folio_order(folio);
> > > > > > > +	if (!order)
> > > > > > > +		return;
> > > > > > > +
> > > > > > > +	folio_reset_order(folio);
> > > > > > > +
> > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > +
> > > > > > > +		ClearPageHead(page);
> > > > > > > +		clear_compound_head(page);
> > > > > > > +
> > > > > > > +		new_folio->mapping = NULL;
> > > > > > > +		/*
> > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > +		 * prep_compound_page().
> > > > > > > +		 */
> > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > 
> > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > 
> > > > > 
> > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > to Zi.
> > > > > 
> > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > allocations (GEMs) and SVM allocations.
> > > > > 
> > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > and are later reallocated with a different order in
> > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > behavior and must be supported.
> > > > 
> > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > always just (re)allocate at the same order and not worry about having to
> > > > reinitialise things to different orders.
> > > > 
> > > 
> > > I would agree with this statement too — it’s perfectly valid if a driver
> > > always wants to (re)allocate at the same order.
> > > 
> > > Matt
> > > 
> > > >  - Alistair
> > > > 
> > > > > Matt
> > > > > 
> > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > 
> > > > > > > +	}
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > +
> > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > >  {
> > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > >  			break;
> > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > >  		break;
> > > > > > 
> > > > > > Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 3 weeks, 6 days ago
On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote:
> On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > > > > > > On 1/12/26 06:55, Francois Dugast wrote:
> > > > > > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > 
> > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large
> > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them.
> > > > > > > > 
> > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > > > > > > > compound metadata). Before returning such pages to the device pgmap
> > > > > > > > allocator, each constituent page must be reset to a standalone
> > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state.
> > > > > > > > 
> > > > > > > > Use this helper prior to folio_free() for device-private and
> > > > > > > > device-coherent folios to ensure consistent device page state for
> > > > > > > > subsequent allocations.
> > > > > > > > 
> > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > > > > > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > > > > > Cc: David Hildenbrand <david@kernel.org>
> > > > > > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com>
> > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > > Cc: Alistair Popple <apopple@nvidia.com>
> > > > > > > > Cc: linux-mm@kvack.org
> > > > > > > > Cc: linux-cxl@vger.kernel.org
> > > > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com>
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > > > > > > > ---
> > > > > > > >  include/linux/memremap.h |  1 +
> > > > > > > >  mm/memremap.c            | 55 ++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  2 files changed, 56 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644
> > > > > > > > --- a/include/linux/memremap.h
> > > > > > > > +++ b/include/linux/memremap.h
> > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> > > > > > > >  
> > > > > > > >  #ifdef CONFIG_ZONE_DEVICE
> > > > > > > >  void zone_device_page_init(struct page *page, unsigned int order);
> > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio);
> > > > > > > >  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > > > > > > >  void memunmap_pages(struct dev_pagemap *pgmap);
> > > > > > > >  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > > > > index 39dc4bd190d0..375a61e18858 100644
> > > > > > > > --- a/mm/memremap.c
> > > > > > > > +++ b/mm/memremap.c
> > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > > > > > > >  
> > > > > > > > +/**
> > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release.
> > > > > > > > + *
> > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released.
> > > > > > > > + *
> > > > > > > > + * This helper:
> > > > > > > > + *   - Clears @folio->mapping and, for compound folios, clears each page's
> > > > > > > > + *     compound-head state (ClearPageHead()/clear_compound_head()).
> > > > > > > > + *   - Resets the compound order metadata (folio_reset_order()) and then
> > > > > > > > + *     initializes each constituent page as a standalone ZONE_DEVICE folio:
> > > > > > > > + *       * clears ->mapping
> > > > > > > > + *       * restores ->pgmap (prep_compound_page() overwrites it)
> > > > > > > > + *       * clears ->share (only relevant for fsdax; unused for device-private)
> > > > > > > > + *
> > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is
> > > > > > > > + * required.
> > > > > > > > + */
> > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio)
> > > > > 
> > > > > I don't really like the naming here - we're not preparing a folio to be
> > > > > freed; from the core-mm perspective the folio is already free. This is just
> > > > > reinitialising the folio metadata ready for the driver to reuse it, which may
> > > > > actually involve just recreating a compound folio.
> > > > > 
> > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> > > > 
> > > > zone_device_folio_reinitialise - that works for me... but it seems like
> > > > everyone has an opinion.
> > > 
> > > Well of course :) There are only two hard problems in programming and
> > > I forget the other one. But I didn't want to just say I don't like
> > > free_zone_device_folio_prepare() without offering an alternative, I'd be open
> > > to others.
> > > 
> > 
> > zone_device_folio_reinitialise is good with me.
> > 
> > > > 
> > > > > roll this into a zone_device_folio_init() type function (similar to
> > > > > zone_device_page_init()) that just deals with everything at allocation time?
> > > > > 
> > > > 
> > > > I don’t think doing this at allocation actually works without a big lock
> > > > per pgmap. Consider the case where a VRAM allocator allocates two
> > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > fault handler (Xe does). It’s possible two threads could call
> > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > sorts of issues. My plan is to just call this function in the driver’s
> > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > 
> > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > intimately) - the folio metadata initialisation should only happen after the
> > > VRAM allocation has occurred.
> > > 
> > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
> > > physical range you just initialise the folio/pages associated with that range
> > > with zone_device_folio_(re)initialise() and you're done.
> > > 
> > 
> > Our VRAM allocator does have locking (via DRM buddy), but that layer
> 
> I mean I assumed it did :-)
> 
> > doesn’t have visibility into the folio or its pages. By the time we
> > handle the folio/pages in the GPU fault handler, there are no global
> > locks preventing two GPU faults from each having, say, 16 pages from the
> > same order-9 folio. I believe if both threads call
> > zone_device_folio_reinitialise/init at the same time, bad things could
> > happen.
> 
> This is confusing to me. If you are getting a GPU fault it implies no page is
> mapped at a particular virtual address. The normal process (or at least the
> process I'm familiar with) for handling this is to allocate and map a page at
> the faulting virtual address. So in the scenario of two GPUs faulting on the
> same VA each thread would allocate VRAM using DRM buddy, presumably getting

Different VAs.

> different physical pages, and so the zone_device_folio_init() call would be to

Yes, different physical pages but the same folio, which is possible if it
hasn't been split yet (i.e., both threads have a different subset of pages in
the same folio, try to split it at the same time, and boom, something bad
happens).
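
Something like this interleaving, purely for illustration (assume neither
thread's chunk of folio_A has been split out yet, and that the reinit
helper walks the whole folio as in this patch):

  Thread 1 (owns pages 0-15)           Thread 2 (owns pages 16-31)
  --------------------------           ---------------------------
  order = folio_order(folio_A); /* 9 */
                                       order = folio_order(folio_A); /* 9 */
  folio_reset_order(folio_A);
  loop: clear compound heads,
        rewrite ->pgmap, ...
                                       folio_reset_order(folio_A);
                                       same loop, racing with thread 1's
                                       writes to the same struct pages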

> different folios/pages.
> 
> Then eventually one thread would succeed in creating the mapping from VA->VRAM
> and the losing thread would free the VRAM allocation back to DRM buddy.
> 
> So I'm a bit confused by the above statement that two GPU faults could each
> have the same pages or be calling zone_device_folio_init() on the same pages.
> How would that happen?
> 

See above. I hope my above statements make this clear.

Matt

> > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > VRAM range if it was previously a large folio?
> > 
> > No just two threads call zone_device_folio_reinitialise/init at the same
> > time, on the same folio.
> > 
> > If we call zone_device_folio_reinitialise in ->folio_free this problem
> > goes away. We could solve this with split_lock or something but I'd
> > prefer not to add a lock for this (although some of the prior revs did do
> > this; maybe we will revisit this later).
> > 
> > Anyways - this falls in driver detail / choice IMO.
> 
> Agreed.
> 
>  - Alistair
> 
> > Matt
> > 
> > > 
> > > > > > > > +{
> > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > +	int order, i;
> > > > > > > > +
> > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > +
> > > > > > > > +	folio->mapping = NULL;
> > > > > > > > +	order = folio_order(folio);
> > > > > > > > +	if (!order)
> > > > > > > > +		return;
> > > > > > > > +
> > > > > > > > +	folio_reset_order(folio);
> > > > > > > > +
> > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > +
> > > > > > > > +		ClearPageHead(page);
> > > > > > > > +		clear_compound_head(page);
> > > > > > > > +
> > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > +		/*
> > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > +		 * prep_compound_page().
> > > > > > > > +		 */
> > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > 
> > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > 
> > > > > > 
> > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > to Zi.
> > > > > > 
> > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > allocations (GEMs) and SVM allocations.
> > > > > > 
> > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > and are later reallocated with a different order in
> > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > behavior and must be supported.
> > > > > 
> > > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > > always just (re)allocate at the same order and not worry about having to
> > > > > reinitialise things to different orders.
> > > > > 
> > > > 
> > > > I would agree with this statement too — it’s perfectly valid if a driver
> > > > always wants to (re)allocate at the same order.
> > > > 
> > > > Matt
> > > > 
> > > > >  - Alistair
> > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > 
> > > > > > > > +	}
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > > +
> > > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > > >  {
> > > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > > >  			break;
> > > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > > >  		break;
> > > > > > > 
> > > > > > > Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Alistair Popple 3 weeks, 6 days ago
On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote:
> > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > > > > > > > On 1/12/26 06:55, Francois Dugast wrote:
> > > > > > > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > 
> > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large
> > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them.
> > > > > > > > > 
> > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > > > > > > > > compound metadata). Before returning such pages to the device pgmap
> > > > > > > > > allocator, each constituent page must be reset to a standalone
> > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state.
> > > > > > > > > 
> > > > > > > > > Use this helper prior to folio_free() for device-private and
> > > > > > > > > device-coherent folios to ensure consistent device page state for
> > > > > > > > > subsequent allocations.
> > > > > > > > > 
> > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > > > > > > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > > > > > > Cc: David Hildenbrand <david@kernel.org>
> > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com>
> > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com>
> > > > > > > > > Cc: linux-mm@kvack.org
> > > > > > > > > Cc: linux-cxl@vger.kernel.org
> > > > > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com>
> > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > > > > > > > > ---
> > > > > > > > >  include/linux/memremap.h |  1 +
> > > > > > > > >  mm/memremap.c            | 55 ++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  2 files changed, 56 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644
> > > > > > > > > --- a/include/linux/memremap.h
> > > > > > > > > +++ b/include/linux/memremap.h
> > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> > > > > > > > >  
> > > > > > > > >  #ifdef CONFIG_ZONE_DEVICE
> > > > > > > > >  void zone_device_page_init(struct page *page, unsigned int order);
> > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio);
> > > > > > > > >  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > > > > > > > >  void memunmap_pages(struct dev_pagemap *pgmap);
> > > > > > > > >  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > > > > > index 39dc4bd190d0..375a61e18858 100644
> > > > > > > > > --- a/mm/memremap.c
> > > > > > > > > +++ b/mm/memremap.c
> > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > > > > > > > >  }
> > > > > > > > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > > > > > > > >  
> > > > > > > > > +/**
> > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release.
> > > > > > > > > + *
> > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released.
> > > > > > > > > + *
> > > > > > > > > + * This helper:
> > > > > > > > > + *   - Clears @folio->mapping and, for compound folios, clears each page's
> > > > > > > > > + *     compound-head state (ClearPageHead()/clear_compound_head()).
> > > > > > > > > + *   - Resets the compound order metadata (folio_reset_order()) and then
> > > > > > > > > + *     initializes each constituent page as a standalone ZONE_DEVICE folio:
> > > > > > > > > + *       * clears ->mapping
> > > > > > > > > + *       * restores ->pgmap (prep_compound_page() overwrites it)
> > > > > > > > > + *       * clears ->share (only relevant for fsdax; unused for device-private)
> > > > > > > > > + *
> > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is
> > > > > > > > > + * required.
> > > > > > > > > + */
> > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio)
> > > > > > 
> > > > > > I don't really like the naming here - we're not preparing a folio to be
> > > > > > freed; from the core-mm perspective the folio is already free. This is just
> > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may
> > > > > > actually involve just recreating a compound folio.
> > > > > > 
> > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> > > > > 
> > > > > zone_device_folio_reinitialise - that works for me... but it seems like
> > > > > everyone has an opinion.
> > > > 
> > > > Well of course :) There are only two hard problems in programming and
> > > > I forget the other one. But I didn't want to just say I don't like
> > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open
> > > > to others.
> > > > 
> > > 
> > > zone_device_folio_reinitialise is good with me.
> > > 
> > > > > 
> > > > > > roll this into a zone_device_folio_init() type function (similar to
> > > > > > zone_device_page_init()) that just deals with everything at allocation time?
> > > > > > 
> > > > > 
> > > > > I don’t think doing this at allocation actually works without a big lock
> > > > > per pgmap. Consider the case where a VRAM allocator allocates two
> > > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > > fault handler (Xe does). It’s possible two threads could call
> > > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > > sorts of issues. My plan is to just call this function in the driver’s
> > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > > 
> > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > > intimately) - the folio metadata initialisation should only happen after the
> > > > VRAM allocation has occurred.
> > > > 
> > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
> > > > physical range you just initialise the folio/pages associated with that range
> > > > with zone_device_folio_(re)initialise() and you're done.
> > > > 
> > > 
> > > Our VRAM allocator does have locking (via DRM buddy), but that layer
> > 
> > I mean I assumed it did :-)
> > 
> > > doesn’t have visibility into the folio or its pages. By the time we
> > > handle the folio/pages in the GPU fault handler, there are no global
> > > locks preventing two GPU faults from each having, say, 16 pages from the
> > > same order-9 folio. I believe if both threads call
> > > zone_device_folio_reinitialise/init at the same time, bad things could
> > > happen.
> > 
> > This is confusing to me. If you are getting a GPU fault it implies no page is
> > mapped at a particular virtual address. The normal process (or at least the
> > process I'm familiar with) for handling this is to allocate and map a page at
> > the faulting virtual address. So in the scenario of two GPUs faulting on the
> > same VA each thread would allocate VRAM using DRM buddy, presumably getting
> 
> Different VAs.
> 
> > different physical pages, and so the zone_device_folio_init() call would be to
> 
> Yes, different physical pages but the same folio, which is possible if it
> hasn't been split yet (i.e., both threads have a different subset of pages in
> the same folio, try to split it at the same time, and boom, something bad
> happens).

So is your concern something like this:

1) There is a free folio A of order 9, starting at physical address 0.
2) You have two GPU faults, both call into DRM Buddy to get a 4K page.
3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0))
4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1))
5) Both call zone_device_folio_init(), which splits the folio, meaning the
   call in step (4) would touch folio_page(folio_A, 0) even though that thread
   was not allocated physical address 0.

If that's the concern then what I'm saying (and what I think Jason was getting
at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the
compound head (ie. folio_page(folio_A, 0)) - zone_device_folio_init() should
just overwrite all the metadata in the struct pages it has been allocated. We're
not really splitting folios, because it makes no sense to talk of splitting a
free folio, which I think is why some core-mm people took notice.

Also, it doesn't matter that you are leaving the previous compound head struct
pages in some weird state; the core-mm doesn't care about them anymore, and the
struct page/folio is only used by the core-mm, not drivers. They will get
properly (re)initialised when needed by the core-mm in zone_device_folio_init(),
which in this case would happen in step 3.

 - Alistair

> > different folios/pages.
> > 
> > Then eventually one thread would succeed in creating the mapping from VA->VRAM
> > and the losing thread would free the VRAM allocation back to DRM buddy.
> > 
> > So I'm a bit confused by the above statement that two GPU faults could each
> > have the same pages or be calling zone_device_folio_init() on the same pages.
> > How would that happen?
> > 
> 
> See above. I hope my above statements make this clear.
> 
> Matt
> 
> > > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > > VRAM range if it was previously a large folio?
> > > 
> > > No just two threads call zone_device_folio_reinitialise/init at the same
> > > time, on the same folio.
> > > 
> > > If we call zone_device_folio_reinitialise in ->folio_free this problem
> > > goes away. We could solve this with split_lock or something but I'd
> > > prefer not to add a lock for this (although some of the prior revs did do
> > > this; maybe we will revisit this later).
> > > 
> > > Anyways - this falls in driver detail / choice IMO.
> > 
> > Agreed.
> > 
> >  - Alistair
> > 
> > > Matt
> > > 
> > > > 
> > > > > > > > > +{
> > > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > > +	int order, i;
> > > > > > > > > +
> > > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > > +
> > > > > > > > > +	folio->mapping = NULL;
> > > > > > > > > +	order = folio_order(folio);
> > > > > > > > > +	if (!order)
> > > > > > > > > +		return;
> > > > > > > > > +
> > > > > > > > > +	folio_reset_order(folio);
> > > > > > > > > +
> > > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > > +
> > > > > > > > > +		ClearPageHead(page);
> > > > > > > > > +		clear_compound_head(page);
> > > > > > > > > +
> > > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > > +		/*
> > > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > > +		 * prep_compound_page().
> > > > > > > > > +		 */
> > > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > 
> > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > > 
> > > > > > > 
> > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > > to Zi.
> > > > > > > 
> > > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > > allocations (GEMs) and SVM allocations.
> > > > > > > 
> > > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > > and are later reallocated with a different order in
> > > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > > behavior and must be supported.
> > > > > > 
> > > > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > > > always just (re)allocate at the same order and not worry about having to
> > > > > > reinitialise things to different orders.
> > > > > > 
> > > > > 
> > > > > I would agree with this statement too — it’s perfectly valid if a driver
> > > > > always wants to (re)allocate at the same order.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > >  - Alistair
> > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > > 
> > > > > > > > > +	}
> > > > > > > > > +}
> > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > > > +
> > > > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > > > >  {
> > > > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > > > >  			break;
> > > > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > > > >  		break;
> > > > > > > > 
> > > > > > > > Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 3 weeks, 6 days ago
On Tue, Jan 13, 2026 at 01:06:02PM +1100, Alistair Popple wrote:
> On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote:
> > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote:
> > > > > > > > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > 
> > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large
> > > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them.
> > > > > > > > > > 
> > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > > > > > > > > > compound metadata). Before returning such pages to the device pgmap
> > > > > > > > > > allocator, each constituent page must be reset to a standalone
> > > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state.
> > > > > > > > > > 
> > > > > > > > > > Use this helper prior to folio_free() for device-private and
> > > > > > > > > > device-coherent folios to ensure consistent device page state for
> > > > > > > > > > subsequent allocations.
> > > > > > > > > > 
> > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > > > > > > > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > > > > > > > Cc: David Hildenbrand <david@kernel.org>
> > > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com>
> > > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com>
> > > > > > > > > > Cc: linux-mm@kvack.org
> > > > > > > > > > Cc: linux-cxl@vger.kernel.org
> > > > > > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com>
> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > > > > > > > > > ---
> > > > > > > > > >  include/linux/memremap.h |  1 +
> > > > > > > > > >  mm/memremap.c            | 55 ++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >  2 files changed, 56 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644
> > > > > > > > > > --- a/include/linux/memremap.h
> > > > > > > > > > +++ b/include/linux/memremap.h
> > > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> > > > > > > > > >  
> > > > > > > > > >  #ifdef CONFIG_ZONE_DEVICE
> > > > > > > > > >  void zone_device_page_init(struct page *page, unsigned int order);
> > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio);
> > > > > > > > > >  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > > > > > > > > >  void memunmap_pages(struct dev_pagemap *pgmap);
> > > > > > > > > >  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > > > > > > index 39dc4bd190d0..375a61e18858 100644
> > > > > > > > > > --- a/mm/memremap.c
> > > > > > > > > > +++ b/mm/memremap.c
> > > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > > > > > > > > >  }
> > > > > > > > > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > > > > > > > > >  
> > > > > > > > > > +/**
> > > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release.
> > > > > > > > > > + *
> > > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released.
> > > > > > > > > > + *
> > > > > > > > > > + * This helper:
> > > > > > > > > > + *   - Clears @folio->mapping and, for compound folios, clears each page's
> > > > > > > > > > + *     compound-head state (ClearPageHead()/clear_compound_head()).
> > > > > > > > > > + *   - Resets the compound order metadata (folio_reset_order()) and then
> > > > > > > > > > + *     initializes each constituent page as a standalone ZONE_DEVICE folio:
> > > > > > > > > > + *       * clears ->mapping
> > > > > > > > > > + *       * restores ->pgmap (prep_compound_page() overwrites it)
> > > > > > > > > > + *       * clears ->share (only relevant for fsdax; unused for device-private)
> > > > > > > > > > + *
> > > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is
> > > > > > > > > > + * required.
> > > > > > > > > > + */
> > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio)
> > > > > > > 
> > > > > > > I don't really like the naming here - we're not preparing a folio to be
> > > > > > > freed; from the core-mm perspective the folio is already free. This is just
> > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may
> > > > > > > actually involve just recreating a compound folio.
> > > > > > > 
> > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> > > > > > 
> > > > > > zone_device_folio_reinitialise - that works for me... but it seems like
> > > > > > everyone has an opinion.
> > > > > 
> > > > > Well of course :) There are only two hard problems in programming and
> > > > > I forget the other one. But I didn't want to just say I don't like
> > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open
> > > > > to others.
> > > > > 
> > > > 
> > > > zone_device_folio_reinitialise is good with me.
> > > > 
> > > > > > 
> > > > > > > roll this into a zone_device_folio_init() type function (similar to
> > > > > > > zone_device_page_init()) that just deals with everything at allocation time?
> > > > > > > 
> > > > > > 
> > > > > > I don’t think doing this at allocation actually works without a big lock
> > > > > > per pgmap. Consider the case where a VRAM allocator allocates two
> > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > > > fault handler (Xe does). It’s possible two threads could call
> > > > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > > > sorts of issues. My plan is to just call this function in the driver’s
> > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > > > 
> > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > > > intimately) - the folio metadata initialisation should only happen after the
> > > > > VRAM allocation has occurred.
> > > > > 
> > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
> > > > > physical range you just initialise the folio/pages associated with that range
> > > > > with zone_device_folio_(re)initialise() and you're done.
> > > > > 
> > > > 
> > > > Our VRAM allocator does have locking (via DRM buddy), but that layer
> > > 
> > > I mean I assumed it did :-)
> > > 
> > > > doesn’t have visibility into the folio or its pages. By the time we
> > > > handle the folio/pages in the GPU fault handler, there are no global
> > > > locks preventing two GPU faults from each having, say, 16 pages from the
> > > > same order-9 folio. I believe if both threads call
> > > > zone_device_folio_reinitialise/init at the same time, bad things could
> > > > happen.
> > > 
> > > This is confusing to me. If you are getting a GPU fault it implies no page is
> > > mapped at a particular virtual address. The normal process (or at least the
> > > process I'm familiar with) for handling this is to allocate and map a page at
> > > the faulting virtual address. So in the scenario of two GPUs faulting on the
> > > same VA each thread would allocate VRAM using DRM buddy, presumably getting
> > 
> > Different VAs.
> > 
> > > different physical pages, and so the zone_device_folio_init() call would be to
> > 
> > Yes, different physical pages but the same folio, which is possible if it
> > hasn't been split yet (i.e., both threads have a different subset of pages in
> > the same folio, try to split it at the same time, and boom, something bad
> > happens).
> 
> So is your concern something like this:
> 
> 1) There is a free folio A of order 9, starting at physical address 0.
> 2) You have two GPU faults, both call into DRM Buddy to get a 4K page.
> 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0))
> 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1))
> 5) Both call zone_device_folio_init(), which splits the folio, meaning the
>    call in step (4) would touch folio_page(folio_A, 0) even though that thread
>    was not allocated physical address 0.
> 

Yes.

> If that's the concern then what I'm saying (and what I think Jason was getting
> at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the
> compound head (ie. folio_page(folio_A, 0)) - zone_device_folio_init() should
> just overwrite all the metadata in the struct pages it has been allocated. We're
> not really splitting folios, because it makes no sense to talk of splitting a
> free folio, which I think is why some core-mm people took notice.
> 
> Also, it doesn't matter that you are leaving the previous compound head struct
> pages in some weird state; the core-mm doesn't care about them anymore, and the
> struct page/folio is only used by the core-mm, not drivers. They will get
> properly (re)initialised when needed by the core-mm in zone_device_folio_init(),
> which in this case would happen in step 3.
>

Something like this should work too. I started implementing it on my
side earlier today, but of course, I was hitting hangs. From an API
point of view, zone_device_folio_init() would need to be updated to accept
a pgmap argument. In this example, folio_page(folio_A, 1) wouldn’t have
a valid pgmap to retrieve. It could look at the folio’s pgmap, but that
also seems like it could race under the right conditions.
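
Ie. something along these lines for the declaration (a sketch only, not
what I've actually tested yet):

	/* Pass the pgmap explicitly rather than reading it from a
	 * possibly-stale (tail) page. */
	void zone_device_folio_init(struct page *page, unsigned int order,
				    struct dev_pagemap *pgmap);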

Let me see what this looks like and whether I can get it working.

Matt
 
>  - Alistair
> 
> > > different folios/pages.
> > > 
> > > Then eventually one thread would succeed in creating the mapping from VA->VRAM
> > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > > 
> > > So I'm a bit confused by the above statement that two GPU faults could each
> > > have the same pages or be calling zone_device_folio_init() on the same pages.
> > > How would that happen?
> > > 
> > 
> > See above. I hope my above statements make this clear.
> > 
> > Matt
> > 
> > > > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > > > VRAM range if it was previously a large folio?
> > > > 
> > > > No just two threads call zone_device_folio_reinitialise/init at the same
> > > > time, on the same folio.
> > > > 
> > > > If we call zone_device_folio_reinitialise in ->folio_free this problem
> > > > goes away. We could solve this with split_lock or something but I'd
> > > > prefer not to add a lock for this (although some of the prior revs did do
> > > > this; maybe we will revisit this later).
> > > > 
> > > > Anyways - this falls in driver detail / choice IMO.
> > > 
> > > Agreed.
> > > 
> > >  - Alistair
> > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > > > > > > +{
> > > > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > > > +	int order, i;
> > > > > > > > > > +
> > > > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > > > +
> > > > > > > > > > +	folio->mapping = NULL;
> > > > > > > > > > +	order = folio_order(folio);
> > > > > > > > > > +	if (!order)
> > > > > > > > > > +		return;
> > > > > > > > > > +
> > > > > > > > > > +	folio_reset_order(folio);
> > > > > > > > > > +
> > > > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > > > +
> > > > > > > > > > +		ClearPageHead(page);
> > > > > > > > > > +		clear_compound_head(page);
> > > > > > > > > > +
> > > > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > > > +		/*
> > > > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > > > +		 * prep_compound_page().
> > > > > > > > > > +		 */
> > > > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > > 
> > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > > > to Zi.
> > > > > > > > 
> > > > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > > > allocations (GEMs) and SVM allocations.
> > > > > > > > 
> > > > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > > > and are later reallocated with a different order in
> > > > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > > > behavior and must be supported.
> > > > > > > 
> > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > > > > always just (re)allocate at the same order and not worry about having to
> > > > > > > reinitialise things to different orders.
> > > > > > > 
> > > > > > 
> > > > > > I would agree with this statement too — it’s perfectly valid if a driver
> > > > > > always wants to (re)allocate at the same order.
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > >  - Alistair
> > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > > > 
> > > > > > > > > > +	}
> > > > > > > > > > +}
> > > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > > > > +
> > > > > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  {
> > > > > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > > > > >  			break;
> > > > > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > > > > >  		break;
> > > > > > > > > 
> > > > > > > > > Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Alistair Popple 3 weeks, 6 days ago
On 2026-01-13 at 13:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Tue, Jan 13, 2026 at 01:06:02PM +1100, Alistair Popple wrote:
> > On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote:
> > > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> > > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> > > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote:
> > > > > > > > > > > [snip]
> > > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio)
> > > > > > > > 
> > > > > > > > I don't really like the naming here - we're not preparing a folio to be
> > > > > > > > freed; from the core-mm perspective, the folio is already free. This is just
> > > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may
> > > > > > > > actually involve just recreating a compound folio.
> > > > > > > > 
> > > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> > > > > > > 
> > > > > > > zone_device_folio_reinitialise - that works for me... but it seems like
> > > > > > > everyone has an opinion.
> > > > > > 
> > > > > > Well of course :) There are only two hard problems in programming and
> > > > > > I forget the other one. But I didn't want to just say I don't like
> > > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open
> > > > > > to others.
> > > > > > 
> > > > > 
> > > > > zone_device_folio_reinitialise is good with me.
> > > > > 
> > > > > > > 
> > > > > > > > roll this into a zone_device_folio_init() type function (similar to
> > > > > > > > zone_device_page_init()) that just deals with everything at allocation time?
> > > > > > > > 
> > > > > > > 
> > > > > > > I don’t think doing this at allocation actually works without a big lock
> > > > > > > per pgmap. Consider the case where a VRAM allocator allocates two
> > > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > > > > fault handler (Xe does). It’s possible two threads could call
> > > > > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > > > > sorts of issues. My plan is to just call this function in the driver’s
> > > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > > > > 
> > > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > > > > intimately) - the folio metadata initialisation should only happen after the
> > > > > > VRAM allocation has occurred.
> > > > > > 
> > > > > > IOW the VRAM allocator needs to deal with the locking, once you have the VRAM
> > > > > > physical range you just initialise the folio/pages associated with that range
> > > > > > with zone_device_folio_(re)initialise() and you're done.
> > > > > > 
> > > > > 
> > > > > Our VRAM allocator does have locking (via DRM buddy), but that layer
> > > > 
> > > > I mean I assumed it did :-)
> > > > 
> > > > > doesn’t have visibility into the folio or its pages. By the time we
> > > > > handle the folio/pages in the GPU fault handler, there are no global
> > > > > locks preventing two GPU faults from each having, say, 16 pages from the
> > > > > same order-9 folio. I believe if both threads call
> > > > > zone_device_folio_reinitialise/init at the same time, bad things could
> > > > > happen.
> > > > 
> > > > This is confusing to me. If you are getting a GPU fault it implies no page is
> > > > mapped at a particular virtual address. The normal process (or at least the
> > > > process I'm familiar with) for handling this is to allocate and map a page at
> > > > the faulting virtual address. So in the scenario of two GPUs faulting on the
> > > > same VA, each thread would allocate VRAM using DRM Buddy, presumably getting
> > > 
> > > Different VAs.
> > > 
> > > > different physical pages, and so the zone_device_folio_init() call would be to
> > > 
> > > Yes, different physical pages but the same folio, which is possible if it
> > > hasn't been split yet (i.e., both threads own a different subset of pages in
> > > the same folio, try to split it at the same time, and boom, something bad
> > > happens).
> > 
> > So is your concern something like this:
> > 
> > 1) There is a free folio A of order 9, starting at physical address 0.
> > 2) You have two GPU faults, both call into DRM Buddy to get a 4K page.
> > 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0))
> > 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1))
> > 5) Both call zone_device_folio_init(), which splits the folio, meaning GPU 2's
> >    call would touch folio_page(folio_A, 0) even though GPU 2 was not allocated
> >    physical address 0.
> > 
> 
> Yes.
> 
> > If that's the concern, then what I'm saying (and what I think Jason was getting
> > at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the
> > compound head (ie. folio_page(folio_A, 0)) - zone_device_folio_init() should
> > just overwrite all the metadata in the struct pages it has been allocated. We're
> > not really splitting folios, because it makes no sense to talk of splitting a
> > free folio, which I think is why some core-mm people took notice.
> > 
> > Also, it doesn't matter that you are leaving the previous compound-head struct
> > pages in some weird state; the core-mm doesn't care about them anymore, and the
> > struct page/folio is only used by the core-mm, not by drivers. They will get
> > properly (re)initialised when needed by the core-mm in zone_device_folio_init(),
> > which in this case would happen in step 3.
> >
> 
> Something like this should work too. I started implementing it on my
> side earlier today, but of course, I was hitting hangs. From an API
> point of view, zone_device_folio_init would need to be updated to accept
> a pgmap argument. In this example, folio_page(folio_A, 1) wouldn’t have
> a valid pgmap to retrieve. It could look at the folio’s pgmap, but that
> also seems like it could race under the right conditions.

I think passing a pgmap argument in would be fine - it allows us to maintain
the concept that zone_device_folio_init() does exactly what it says on the tin.
That is, it initialises a ZONE_DEVICE folio ready for use by the core-mm without
placing any assumptions or restrictions on the current state of the folio/page
structs.
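
A rough sketch of the shape being discussed (hypothetical signature and body,
not actual kernel code; the prep_compound_page()/set_page_count() usage here is
an assumption):

/*
 * Hypothetical: initialise a free ZONE_DEVICE range as a folio of the
 * given order, trusting nothing about the prior state of the struct pages.
 */
void zone_device_folio_init(struct page *page, unsigned int order,
			    struct dev_pagemap *pgmap)
{
	struct folio *folio = (struct folio *)page;

	if (order)
		prep_compound_page(page, order);

	folio->mapping = NULL;
	/* prep_compound_page() knows nothing of ->pgmap; set it here. */
	folio->pgmap = pgmap;
	folio->share = 0;	/* fsdax only, unused for device-private */
	set_page_count(page, 1);
}

With an explicit @pgmap, each fault-handler thread initialises only the struct
pages backing its own VRAM allocation, so neither thread in the scenario above
ever touches pages it was not allocated.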

> Let me see what this looks like and whether I can get it working.
> 
> Matt
>  
> >  - Alistair
> > 
> > > > different folios/pages.
> > > > 
> > > > Then eventually one thread would succeed in creating the mapping from VA->VRAM
> > > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > > > 
> > > > So I'm a bit confused by the above statement that two GPU faults could each
> > > > have the same pages or be calling zone_device_folio_init() on the same pages.
> > > > How would that happen?
> > > > 
> > > 
> > > See above. I hope my above statements make this clear.
> > > 
> > > Matt
> > > 
> > > > > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > > > > VRAM range if it was previously a large folio?
> > > > > 
> > > > > No, just two threads calling zone_device_folio_reinitialise/init at the same
> > > > > time, on the same folio.
> > > > > 
> > > > > If we call zone_device_folio_reinitialise in ->folio_free, this problem
> > > > > goes away. We could solve this with a split_lock or something, but I'd
> > > > > prefer not to add a lock for this (although some prior revs did do
> > > > > this; maybe we will revisit it later).
> > > > > 
> > > > > Anyways, this falls under driver detail / choice IMO.
> > > > 
> > > > Agreed.
> > > > 
> > > >  - Alistair
> > > > 
> > > > > Matt
> > > > > 
> > > > > > 
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > > > > +	int order, i;
> > > > > > > > > > > +
> > > > > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > > > > +
> > > > > > > > > > > +	folio->mapping = NULL;
> > > > > > > > > > > +	order = folio_order(folio);
> > > > > > > > > > > +	if (!order)
> > > > > > > > > > > +		return;
> > > > > > > > > > > +
> > > > > > > > > > > +	folio_reset_order(folio);
> > > > > > > > > > > +
> > > > > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > > > > +
> > > > > > > > > > > +		ClearPageHead(page);
> > > > > > > > > > > +		clear_compound_head(page);
> > > > > > > > > > > +
> > > > > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > > > > +		/*
> > > > > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > > > > +		 * prep_compound_page().
> > > > > > > > > > > +		 */
> > > > > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > > > 
> > > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > > > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > > > > to Zi.
> > > > > > > > > 
> > > > > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > > > > allocations (GEMs) and SVM allocations.
> > > > > > > > > 
> > > > > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > > > > and are later reallocated with a different order in
> > > > > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > > > > behavior and must be supported.
> > > > > > > > 
> > > > > > > > I agree it's probably by coincidence, but it is a perfectly valid design to
> > > > > > > > always just (re)allocate at the same order and not worry about having to
> > > > > > > > reinitialise things to different orders.
> > > > > > > > 
> > > > > > > 
> > > > > > > I would agree with this statement too — it’s perfectly valid if a driver
> > > > > > > always wants to (re)allocate at the same order.
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > >  - Alistair
> > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > > > > 
> > > > > > > > > > > [snip]
> > > > > > > > > > 
> > > > > > > > > > Balbir
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Balbir Singh 4 weeks ago
On 1/12/26 11:16, Matthew Brost wrote:
> On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
>> On 1/12/26 06:55, Francois Dugast wrote:
>>> [snip]
>>> +void free_zone_device_folio_prepare(struct folio *folio)
>>> +{
>>> +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
>>> +	int order, i;
>>> +
>>> +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
>>> +
>>> +	folio->mapping = NULL;
>>> +	order = folio_order(folio);
>>> +	if (!order)
>>> +		return;
>>> +
>>> +	folio_reset_order(folio);
>>> +
>>> +	for (i = 0; i < (1UL << order); i++) {
>>> +		struct page *page = folio_page(folio, i);
>>> +		struct folio *new_folio = (struct folio *)page;
>>> +
>>> +		ClearPageHead(page);
>>> +		clear_compound_head(page);
>>> +
>>> +		new_folio->mapping = NULL;
>>> +		/*
>>> +		 * Reset pgmap which was over-written by
>>> +		 * prep_compound_page().
>>> +		 */
>>> +		new_folio->pgmap = pgmap;
>>> +		new_folio->share = 0;	/* fsdax only, unused for device private */
>>> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
>>> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
>>
>> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
>> that PMD_ORDER more frees than we'd like?
>>
> 
> No, calling free_folio() more often doesn’t solve anything—in fact, that
> would make my implementation explode. I explained this in detail here [1]
> to Zi.
> 
> To recap [1], my memory allocator has no visibility into individual
> pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> allows VRAM to be allocated or evicted for both traditional GPU
> allocations (GEMs) and SVM allocations.
> 

I assume it is still backed by pages that are ref counted? I suspect you'd
need to convert one reference count to PMD_ORDER reference counts to make
this change work, or are the references not at page granularity? 

I followed the code through drm_zdd_pagemap_put(), and zdd->refcount seemed
like a per-folio refcount.

> Now, to recap the actual issue: if device folios are not split upon free
> and are later reallocated with a different order in
> zone_device_page_init, the implementation breaks. This problem is not
> specific to Xe—Nouveau happens to always allocate at the same order, so
> it works by coincidence. Reallocating at a different order is valid
> behavior and must be supported.
> 

Agreed
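
To make the failure mode concrete, a hypothetical sketch (not actual driver
code) of what goes wrong if folios are not split on free:

/*
 * A 2MB (order-9) device folio was freed without being split, so its
 * former tail pages still carry compound state.
 */
static void reuse_at_lower_order(struct folio *old_order9_folio)
{
	/* Former tail page, reallocated as a standalone 4K page. */
	struct page *page = folio_page(old_order9_folio, 1);

	/*
	 * Without resetting the page first, @page still has PageTail set
	 * and ->compound_head pointing at the old head, so this order-0
	 * init operates on stale compound metadata.
	 */
	zone_device_page_init(page, 0);
}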

> Matt
> 
> [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> 
>>> [snip]
>>
>> Balbir

Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 4 weeks ago
On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote:
> On 1/12/26 11:16, Matthew Brost wrote:
> > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> >> On 1/12/26 06:55, Francois Dugast wrote:
> >>> [snip]
> >>> +void free_zone_device_folio_prepare(struct folio *folio)
> >>> +{
> >>> +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> >>> +	int order, i;
> >>> +
> >>> +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> >>> +
> >>> +	folio->mapping = NULL;
> >>> +	order = folio_order(folio);
> >>> +	if (!order)
> >>> +		return;
> >>> +
> >>> +	folio_reset_order(folio);
> >>> +
> >>> +	for (i = 0; i < (1UL << order); i++) {
> >>> +		struct page *page = folio_page(folio, i);
> >>> +		struct folio *new_folio = (struct folio *)page;
> >>> +
> >>> +		ClearPageHead(page);
> >>> +		clear_compound_head(page);
> >>> +
> >>> +		new_folio->mapping = NULL;
> >>> +		/*
> >>> +		 * Reset pgmap which was over-written by
> >>> +		 * prep_compound_page().
> >>> +		 */
> >>> +		new_folio->pgmap = pgmap;
> >>> +		new_folio->share = 0;	/* fsdax only, unused for device private */
> >>> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> >>> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> >>
> >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> >> that PMD_ORDER more frees than we'd like?
> >>
> > 
> > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > would make my implementation explode. I explained this in detail here [1]
> > to Zi.
> > 
> > To recap [1], my memory allocator has no visibility into individual
> > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > allows VRAM to be allocated or evicted for both traditional GPU
> > allocations (GEMs) and SVM allocations.
> > 
> 
> I assume it is still backed by pages that are ref counted? I suspect you'd

Yes.

> need to convert one reference count to PMD_ORDER reference counts to make
> this change work, or are the references not at page granularity? 
> 
> I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed
> like a per folio refcount
> 

The refcount is incremented by 1 for each call to
folio_set_zone_device_data. If we have a 2MB device folio backing a
2MB allocation, the refcount is 1. If we have 512 4KB device pages
backing a 2MB allocation, the refcount is 512. The refcount matches the
number of folio_free calls we expect to receive for the size of the
backing allocation. Right now, in Xe, we allocate either 4K, 64K, or 2M,
but this is all configurable via a driver-side table (Xe) in GPU SVM
(the DRM common layer).
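
As a quick sketch of that invariant (illustrative helper, not actual GPU SVM
code):

/*
 * Hypothetical: number of ->folio_free() calls (and thus refs) expected
 * for a backing allocation, one per backing folio of the given order.
 */
static unsigned int expected_refs(size_t alloc_size, unsigned int order)
{
	return alloc_size >> (PAGE_SHIFT + order);
}

/* expected_refs(SZ_2M, 9) == 1, expected_refs(SZ_2M, 0) == 512 */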

Matt

> > Now, to recap the actual issue: if device folios are not split upon free
> > and are later reallocated with a different order in
> > zone_device_page_init, the implementation breaks. This problem is not
> > specific to Xe—Nouveau happens to always allocate at the same order, so
> > it works by coincidence. Reallocating at a different order is valid
> > behavior and must be supported.
> > 
> 
> Agreed
> 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > 
> >>> [snip]
> >>
> >> Balbir
> 
Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Posted by Matthew Brost 4 weeks ago
On Sun, Jan 11, 2026 at 06:37:06PM -0800, Matthew Brost wrote:
> On Mon, Jan 12, 2026 at 01:15:12PM +1100, Balbir Singh wrote:
> > On 1/12/26 11:16, Matthew Brost wrote:
> > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > >> On 1/12/26 06:55, Francois Dugast wrote:
> > >>> [snip]
> > >>> +void free_zone_device_folio_prepare(struct folio *folio)
> > >>> +{
> > >>> +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > >>> +	int order, i;
> > >>> +
> > >>> +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > >>> +
> > >>> +	folio->mapping = NULL;
> > >>> +	order = folio_order(folio);
> > >>> +	if (!order)
> > >>> +		return;
> > >>> +
> > >>> +	folio_reset_order(folio);
> > >>> +
> > >>> +	for (i = 0; i < (1UL << order); i++) {
> > >>> +		struct page *page = folio_page(folio, i);
> > >>> +		struct folio *new_folio = (struct folio *)page;
> > >>> +
> > >>> +		ClearPageHead(page);
> > >>> +		clear_compound_head(page);
> > >>> +
> > >>> +		new_folio->mapping = NULL;
> > >>> +		/*
> > >>> +		 * Reset pgmap which was over-written by
> > >>> +		 * prep_compound_page().
> > >>> +		 */
> > >>> +		new_folio->pgmap = pgmap;
> > >>> +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > >>> +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > >>> +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > >>
> > >> Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > >> that PMD_ORDER more frees than we'd like?
> > >>
> > > 
> > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > would make my implementation explode. I explained this in detail here [1]
> > > to Zi.
> > > 
> > > To recap [1], my memory allocator has no visibility into individual
> > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > allows VRAM to be allocated or evicted for both traditional GPU
> > > allocations (GEMs) and SVM allocations.
> > > 
> > 
> > I assume it is still backed by pages that are ref counted? I suspect you'd
> 
> Yes.
> 

Let me clarify this a bit. We don’t track individual pages in our
refcounting; instead, we maintain a reference count for the original
allocation (i.e., there are no partial frees of the original
allocation). This refcounting is handled in GPU SVM (DRM common), and
when the allocation’s refcount reaches zero, GPU SVM calls into the
driver to indicate that the memory can be released. In Xe, the backing
memory is a TTM BO (think of this as an eviction hook), layered on top of
DRM Buddy, which actually controls the VRAM allocation and from which the
device pages can be determined.
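
The rough shape of that flow, with hypothetical names standing in for the
GPU SVM and Xe internals:

#include <linux/kref.h>
#include <linux/slab.h>

/* One refcount per backing allocation; no partial frees. */
struct svm_devmem_allocation {
	struct kref refcount;		/* one ref per constituent folio */
	void *driver_private;		/* e.g. a TTM BO in Xe */
};

static void svm_devmem_release(struct kref *kref)
{
	struct svm_devmem_allocation *alloc =
		container_of(kref, struct svm_devmem_allocation, refcount);

	/* Last backing folio freed: the driver may now release the VRAM. */
	driver_release_vram(alloc->driver_private);	/* hypothetical hook */
	kfree(alloc);
}

/* Called once per folio from the driver's ->folio_free(). */
static void svm_devmem_put(struct svm_devmem_allocation *alloc)
{
	kref_put(&alloc->refcount, svm_devmem_release);
}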

I suspect AMD, when using GPU SVM (they have indicated this is the
plan), will also use TTM BO here. Nova, assuming they eventually adopt
SVM and use GPU SVM, will likely implement something very similar to TTM
in Rust, but with DRM Buddy also controlling the actual allocation (they
have already written bindings for DRM Buddy).

Matt

> > need to convert one reference count to PMD_ORDER reference counts to make
> > this change work, or are the references not at page granularity? 
> > 
> > I followed the code through drm_zdd_pagemap_put() and zdd->refcount seemed
> > like a per folio refcount
> > 
> 
> The refcount is incremented by 1 for each call to
> folio_set_zone_device_data. If we have a 2MB device folio backing a
> 2MB allocation, the refcount is 1. If we have 512 4KB device pages
> backing a 2MB allocation, the refcount is 512. The refcount matches the
> number of folio_free calls we expect to receive for the size of the
> backing allocation. Right now, in Xe, we allocate either 4K, 64K, or 2M,
> but this is all configurable via a driver-side table (Xe) in GPU SVM
> (the DRM common layer).
> 
> Matt
> 
> > > Now, to recap the actual issue: if device folios are not split upon free
> > > and are later reallocated with a different order in
> > > zone_device_page_init, the implementation breaks. This problem is not
> > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > it works by coincidence. Reallocating at a different order is valid
> > > behavior and must be supported.
> > > 
> > 
> > Agreed
> > 
> > > Matt
> > > 
> > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > 
> > >>> [snip]
> > >>
> > >> Balbir
> >