[v6 01/15] mm/zone_device: support large zone device private folios

Posted by Balbir Singh 2 weeks, 2 days ago
Add routines to support allocation of large order zone device folios
and helper functions for zone device folios: to check whether a folio
is device private and to set zone device data.

When large folios are used, the existing page_free() callback in
pgmap is called when the folio is freed; this is true for both
PAGE_SIZE and higher order folios.

Zone device private large folios do not support deferred split and
scan the way normal THP folios do.

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
---
 include/linux/memremap.h | 10 +++++++++-
 mm/memremap.c            | 34 +++++++++++++++++++++-------------
 mm/rmap.c                |  6 +++++-
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5951ba12a28..9c20327c2be5 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_folio_init(struct folio *folio, unsigned int order);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	zone_device_folio_init(folio, 0);
+}
+
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index 46cb1b0b6f72..a8481ebf94cc 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_zone_device_folio(struct folio *folio)
 {
 	struct dev_pagemap *pgmap = folio->pgmap;
+	unsigned long nr = folio_nr_pages(folio);
+	int i;
 
 	if (WARN_ON_ONCE(!pgmap))
 		return;
 
 	mem_cgroup_uncharge(folio);
 
-	/*
-	 * Note: we don't expect anonymous compound pages yet. Once supported
-	 * and we could PTE-map them similar to THP, we'd have to clear
-	 * PG_anon_exclusive on all tail pages.
-	 */
 	if (folio_test_anon(folio)) {
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-		__ClearPageAnonExclusive(folio_page(folio, 0));
+		for (i = 0; i < nr; i++)
+			__ClearPageAnonExclusive(folio_page(folio, i));
+	} else {
+		VM_WARN_ON_ONCE(folio_test_large(folio));
 	}
 
 	/*
@@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
 	case MEMORY_DEVICE_COHERENT:
 		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
 			break;
-		pgmap->ops->page_free(folio_page(folio, 0));
-		put_dev_pagemap(pgmap);
+		pgmap->ops->page_free(&folio->page);
+		percpu_ref_put_many(&folio->pgmap->ref, nr);
 		break;
 
 	case MEMORY_DEVICE_GENERIC:
@@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio)
 	}
 }
 
-void zone_device_page_init(struct page *page)
+void zone_device_folio_init(struct folio *folio, unsigned int order)
 {
+	struct page *page = folio_page(folio, 0);
+
+	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
 	/*
 	 * Drivers shouldn't be allocating pages after calling
 	 * memunmap_pages().
 	 */
-	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
-	set_page_count(page, 1);
+	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+	folio_set_count(folio, 1);
 	lock_page(page);
+
+	if (order > 1) {
+		prep_compound_page(page, order);
+		folio_set_large_rmappable(folio);
+	}
 }
-EXPORT_SYMBOL_GPL(zone_device_page_init);
+EXPORT_SYMBOL_GPL(zone_device_folio_init);
diff --git a/mm/rmap.c b/mm/rmap.c
index 34333ae3bd80..9a2aabfaea6f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1769,9 +1769,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 	 * the folio is unmapped and at least one page is still mapped.
 	 *
 	 * Check partially_mapped first to ensure it is a large folio.
+	 *
+	 * Device private folios do not support deferred splitting and
+	 * shrinker based scanning of the folios to free.
 	 */
 	if (partially_mapped && folio_test_anon(folio) &&
-	    !folio_test_partially_mapped(folio))
+	    !folio_test_partially_mapped(folio) &&
+	    !folio_is_device_private(folio))
 		deferred_split_folio(folio, true);
 
 	__folio_mod_stat(folio, -nr, -nr_pmdmapped);
-- 
2.50.1
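
For readers new to this API, below is a minimal driver-side sketch of
how the new helper might be used. It is not part of the patch, and
device_alloc_pfn() and DEVICE_FOLIO_ORDER are made-up names standing in
for a driver's own PFN allocator and its chosen allocation order.

/*
 * Illustrative sketch only: device_alloc_pfn() and DEVICE_FOLIO_ORDER
 * are hypothetical driver-side names, not part of this series.
 */
static struct folio *drv_alloc_device_private_folio(struct dev_pagemap *pgmap)
{
	unsigned long pfn = device_alloc_pfn(pgmap, DEVICE_FOLIO_ORDER);
	struct folio *folio = page_folio(pfn_to_page(pfn));

	/* takes 1 << order pgmap refs, sets the refcount to 1, locks the folio */
	zone_device_folio_init(folio, DEVICE_FOLIO_ORDER);
	return folio;
}

/* existing order-0 users are unchanged and keep calling the old entry point */
static struct page *drv_alloc_device_private_page(struct dev_pagemap *pgmap)
{
	struct page *page = pfn_to_page(device_alloc_pfn(pgmap, 0));

	zone_device_page_init(page);
	return page;
}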

Re: [v6 01/15] mm/zone_device: support large zone device private folios
Posted by Zi Yan 2 weeks ago
On 16 Sep 2025, at 8:21, Balbir Singh wrote:

[...]
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index e5951ba12a28..9c20327c2be5 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> -void zone_device_page_init(struct page *page);
> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void memunmap_pages(struct dev_pagemap *pgmap);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
>  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>
>  unsigned long memremap_compat_align(void);
> +
> +static inline void zone_device_page_init(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	zone_device_folio_init(folio, 0);

I assume this is for legacy code, where only non-compound pages exist?

It seems that you assume @page is always order-0, but there is no check
for it. Adding VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio)
above it would be useful to detect misuse.
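That is, roughly (just the wrapper above with the suggested check added):

static inline void zone_device_page_init(struct page *page)
{
	struct folio *folio = page_folio(page);

	/* this wrapper is only meant for the order-0 (non-compound) case */
	VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio);
	zone_device_folio_init(folio, 0);
}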

> +}
> +
>  #else
>  static inline void *devm_memremap_pages(struct device *dev,
>  		struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 46cb1b0b6f72..a8481ebf94cc 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  void free_zone_device_folio(struct folio *folio)
>  {
>  	struct dev_pagemap *pgmap = folio->pgmap;
> +	unsigned long nr = folio_nr_pages(folio);
> +	int i;
>
>  	if (WARN_ON_ONCE(!pgmap))
>  		return;
>
>  	mem_cgroup_uncharge(folio);
>
> -	/*
> -	 * Note: we don't expect anonymous compound pages yet. Once supported
> -	 * and we could PTE-map them similar to THP, we'd have to clear
> -	 * PG_anon_exclusive on all tail pages.
> -	 */
>  	if (folio_test_anon(folio)) {
> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -		__ClearPageAnonExclusive(folio_page(folio, 0));
> +		for (i = 0; i < nr; i++)
> +			__ClearPageAnonExclusive(folio_page(folio, i));
> +	} else {
> +		VM_WARN_ON_ONCE(folio_test_large(folio));
>  	}
>
>  	/*
> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
>  	case MEMORY_DEVICE_COHERENT:
>  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>  			break;
> -		pgmap->ops->page_free(folio_page(folio, 0));
> -		put_dev_pagemap(pgmap);
> +		pgmap->ops->page_free(&folio->page);
> +		percpu_ref_put_many(&folio->pgmap->ref, nr);
>  		break;
>
>  	case MEMORY_DEVICE_GENERIC:
> @@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>  	}
>  }
>
> -void zone_device_page_init(struct page *page)
> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>  {
> +	struct page *page = folio_page(folio, 0);

It is strange to see a folio converted back to a page in
a function called zone_device_folio_init().

> +
> +	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +
>  	/*
>  	 * Drivers shouldn't be allocating pages after calling
>  	 * memunmap_pages().
>  	 */
> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
> -	set_page_count(page, 1);
> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> +	folio_set_count(folio, 1);
>  	lock_page(page);
> +
> +	if (order > 1) {
> +		prep_compound_page(page, order);
> +		folio_set_large_rmappable(folio);
> +	}

OK, so basically, @folio is not a compound page yet when zone_device_folio_init()
is called.

I feel that your zone_device_page_init() and zone_device_folio_init()
implementations are inverted. They should follow the same pattern
as __alloc_pages_noprof() and __folio_alloc_noprof(), where
zone_device_page_init() does the actual initialization and
zone_device_folio_init() just converts a page to a folio.

Something like:

void zone_device_page_init(struct page *page, unsigned int order)
{
	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);

	/*
	 * Drivers shouldn't be allocating pages after calling
	 * memunmap_pages().
	 */

    WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
	
	/*
	 * anonymous folio does not support order-1, high order file-backed folio
	 * is not supported at all.
	 */
	VM_WARN_ON_ONCE(order == 1);

	if (order > 1)
		prep_compound_page(page, order);

	/* page has to be compound head here */
	set_page_count(page, 1);
	lock_page(page);
}

void zone_device_folio_init(struct folio *folio, unsigned int order)
{
	struct page *page = folio_page(folio, 0);

	zone_device_page_init(page, order);
	page_rmappable_folio(page);
}

Or

struct folio *zone_device_folio_init(struct page *page, unsigned int order)
{
	zone_device_page_init(page, order);
	return page_rmappable_folio(page);
}


Then, coming to free_zone_device_folio() above,
I feel that pgmap->ops->page_free() should take an additional order
parameter to free a compound page, like free_frozen_pages().
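Something along these lines, as a sketch of the suggested change only
(the callback in this series still takes just the page):

	/* suggested shape for the op in struct dev_pagemap_ops */
	void (*page_free)(struct page *page, unsigned int order);

with free_zone_device_folio() then passing the order along:

	pgmap->ops->page_free(folio_page(folio, 0), folio_order(folio));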


This is my impression after reading the patch and zone device page code.

Alistair and David can correct me if this is wrong, since I am new to
zone device page code.
	
>  }
> -EXPORT_SYMBOL_GPL(zone_device_page_init);
> +EXPORT_SYMBOL_GPL(zone_device_folio_init);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 34333ae3bd80..9a2aabfaea6f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1769,9 +1769,13 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>  	 * the folio is unmapped and at least one page is still mapped.
>  	 *
>  	 * Check partially_mapped first to ensure it is a large folio.
> +	 *
> +	 * Device private folios do not support deferred splitting and
> +	 * shrinker based scanning of the folios to free.
>  	 */
>  	if (partially_mapped && folio_test_anon(folio) &&
> -	    !folio_test_partially_mapped(folio))
> +	    !folio_test_partially_mapped(folio) &&
> +	    !folio_is_device_private(folio))
>  		deferred_split_folio(folio, true);
>
>  	__folio_mod_stat(folio, -nr, -nr_pmdmapped);
> -- 
> 2.50.1


--
Best Regards,
Yan, Zi
Re: [v6 01/15] mm/zone_device: support large zone device private folios
Posted by David Hildenbrand 1 week, 1 day ago
On 18.09.25 04:49, Zi Yan wrote:
[...]
> 
> OK, so basically, @folio is not a compound page yet when zone_device_folio_init()
> is called.
> 
> I feel that your zone_device_page_init() and zone_device_folio_init()
> implementations are inverted. They should follow the same pattern
> as __alloc_pages_noprof() and __folio_alloc_noprof(), where
> zone_device_page_init() does the actual initialization and
> zone_device_folio_init() just converts a page to a folio.
> 
> Something like:
> 
> void zone_device_page_init(struct page *page, unsigned int order)
> {
> 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> 
> 	/*
> 	 * Drivers shouldn't be allocating pages after calling
> 	 * memunmap_pages().
> 	 */
> 
>      WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> 	
> 	/*
> 	 * anonymous folio does not support order-1, high order file-backed folio
> 	 * is not supported at all.
> 	 */
> 	VM_WARN_ON_ONCE(order == 1);
> 
> 	if (order > 1)
> 		prep_compound_page(page, order);
> 
> 	/* page has to be compound head here */
> 	set_page_count(page, 1);
> 	lock_page(page);
> }
> 
> void zone_device_folio_init(struct folio *folio, unsigned int order)
> {
> 	struct page *page = folio_page(folio, 0);
> 
> 	zone_device_page_init(page, order);
> 	page_rmappable_folio(page);
> }
> 
> Or
> 
> struct folio *zone_device_folio_init(struct page *page, unsigned int order)
> {
> 	zone_device_page_init(page, order);
> 	return page_rmappable_folio(page);
> }

I think the problem is that it will all be weird once we dynamically 
allocate "struct folio".

I don't yet have a clear understanding of how that would really work.

For example, should it be pgmap->ops->page_folio() ?

Who allocates the folio? Do we allocate all order-0 folios initially, to 
then merge them when constructing large folios? How do we manage the 
"struct folio" during such merging splitting?

With that in mind, I don't really know what the proper interface should 
be today.


zone_device_folio_init(struct page *page, unsigned int order)

looks cleaner, agreed.

> 
> 
> Then, coming to free_zone_device_folio() above,
> I feel that pgmap->ops->page_free() should take an additional order
> parameter to free a compound page, like free_frozen_pages().
> 

IIRC free_frozen_pages() does not operate on compound pages. If we know 
that we are operating on a compound page (or single page) then passing 
in the page (or better the folio) should work.
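A folio-based variant could then look like the sketch below; folio_free
is a hypothetical name, just to illustrate the direction:

	/* hypothetical folio-based callback in struct dev_pagemap_ops */
	void (*folio_free)(struct folio *folio);

	/* free_zone_device_folio() would hand over the whole folio */
	pgmap->ops->folio_free(folio);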

-- 
Cheers

David / dhildenb

Re: [v6 01/15] mm/zone_device: support large zone device private folios
Posted by Balbir Singh 1 week, 6 days ago
On 9/18/25 12:49, Zi Yan wrote:
[...]
> I feel that your zone_device_page_init() and zone_device_folio_init()
> implementations are inverted. They should follow the same pattern
> as __alloc_pages_noprof() and __folio_alloc_noprof(), where
> zone_device_page_init() does the actual initialization and
> zone_device_folio_init() just converts a page to a folio.
[...]

Thanks, I did not want to change zone_device_page_init() for several
drivers (outside my test scope) that already assume it has an order size of 0.

Balbir Singh
Re: [v6 01/15] mm/zone_device: support large zone device private folios
Posted by Zi Yan 1 week, 5 days ago
On 19 Sep 2025, at 1:01, Balbir Singh wrote:

[...]
> Thanks, I did not want to change zone_device_page_init() for several
> drivers (outside my test scope) that already assume it has an order size of 0.

But my proposed zone_device_page_init() should still work for order-0
pages. You just need to change the call sites to pass 0 for the new parameter.
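With that signature, an existing order-0 call site would simply become,
for example:

	/* order-0 callers just pass 0 for the new order argument */
	zone_device_page_init(page, 0);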


One strange thing I found in the original zone_device_page_init() is
the use of page_pgmap() in
WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order)).
page_pgmap() calls page_folio() on the given page to access the pgmap field,
and the pgmap field is only available in struct folio. The code initializes
a struct page, but in the middle it suddenly finds the page is actually a folio,
then treats it as a page afterwards. I wonder if it can be done better.
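For reference, page_pgmap() is roughly the following (a simplified
sketch based on the description above):

static inline struct dev_pagemap *page_pgmap(const struct page *page)
{
	/* pgmap lives in struct folio, so the page is viewed as a folio here */
	return page_folio(page)->pgmap;
}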

This might be a question to Alistair, since he made the change.

Best Regards,
Yan, Zi
Re: [v6 01/15] mm/zone_device: support large zone device private folios
Posted by Balbir Singh 1 week, 2 days ago
On 9/19/25 23:26, Zi Yan wrote:
> On 19 Sep 2025, at 1:01, Balbir Singh wrote:
> 
[...]
>>
>> Thanks, I did not want to change zone_device_page_init() for several
>> drivers (outside my test scope) that already assume it has an order size of 0.
> 
> But my proposed zone_device_page_init() should still work for order-0
> pages. You just need to change the call sites to pass 0 for the new parameter.
> 

I did not want to change existing callers (increases testing impact)
without a strong reason.

> 
> One strange thing I found in the original zone_device_page_init() is
> the use of page_pgmap() in
> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order)).
> page_pgmap() calls page_folio() on the given page to access the pgmap field,
> and the pgmap field is only available in struct folio. The code initializes
> a struct page, but in the middle it suddenly finds the page is actually a folio,
> then treats it as a page afterwards. I wonder if it can be done better.
> 
> This might be a question to Alistair, since he made the change.
> 

I'll let him answer it :)

Thanks for the review
Balbir
Re: [v6 01/15] mm/zone_device: support large zone device private folios
Posted by David Hildenbrand 1 week, 1 day ago
On 23.09.25 05:47, Balbir Singh wrote:
> On 9/19/25 23:26, Zi Yan wrote:
>> On 19 Sep 2025, at 1:01, Balbir Singh wrote:
>>
[...]
>>>
>>> Thanks, I did not want to change zone_device_page_init() for several
>>> drivers (outside my test scope) that already assume it has an order size of 0.
>>
>> But my proposed zone_device_page_init() should still work for order-0
>> pages. You just need to change the call sites to pass 0 for the new parameter.
>>
> 
> I did not want to change existing callers (increases testing impact)
> without a strong reason.
> 
>>
>> One strange thing I found in the original zone_device_page_init() is
>> the use of page_pgmap() in
>> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order)).
>> page_pgmap() calls page_folio() on the given page to access the pgmap field,
>> and the pgmap field is only available in struct folio. The code initializes
>> a struct page, but in the middle it suddenly finds the page is actually a folio,
>> then treats it as a page afterwards. I wonder if it can be done better.
>>
>> This might be a question to Alistair, since he made the change.
>>
> 
> I'll let him answer it :)

Not him, but I think this goes back to the question I raised in my other
reply: when would we allocate "struct folio" in the future?

If it's "always" then actually most of the zone-device code would only 
ever operate on folios and never on pages in the future.

I recall during a discussion at LSF/MM I raised that, and the answer was 
(IIRC) that we will allocate "struct folio" as we will initialize the 
memmap for dax.

So essentially, we'd always have folios and would never really have to 
operate on pages.

-- 
Cheers

David / dhildenb