[PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special

David Hildenbrand posted 9 patches 2 months, 3 weeks ago
Posted by David Hildenbrand 2 months, 3 weeks ago
The huge zero folio is refcounted (+mapcounted -- is that a word?)
differently than "normal" folios, similarly to (but differently from) the
ordinary shared zeropage.

For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to
still use them (e.g., GUP can still take a reference on them).

vm_normal_page_pmd() already filters out the huge zero folio. However,
so far we are not marking it as special like we do with the ordinary
shared zeropage. Let's mark it as special, so we can further refactor
vm_normal_page_pmd() and vm_normal_page().

While at it, update the doc regarding the shared zero folios.

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c |  5 ++++-
 mm/memory.c      | 14 +++++++++-----
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db08c37b87077..3f9a27812a590 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 {
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
+	entry = pmd_mkspecial(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	mm_inc_nr_ptes(mm);
@@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (fop.is_folio) {
 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
 
-		if (!is_huge_zero_folio(fop.folio)) {
+		if (is_huge_zero_folio(fop.folio)) {
+			entry = pmd_mkspecial(entry);
+		} else {
 			folio_get(fop.folio);
 			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
 			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1f..173eb6267e0ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
- * case, NULL is returned here. "Normal" mappings do have a struct page.
+ * case, NULL is returned here. "Normal" mappings do have a struct page and
+ * are ordinarily refcounted.
+ *
+ * Page mappings of the shared zero folios are always considered "special", as
+ * they are not ordinarily refcounted. However, selected page table walkers
+ * (such as GUP) can still identify these mappings and work with the
+ * underlying "struct page".
  *
  * There are 2 broad cases. Firstly, an architecture may define a pte_special()
  * pte bit, in which case this function is trivial. Secondly, an architecture
@@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  *
  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
  * page" backing, however the difference is that _all_ pages with a struct
- * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The only exception are zeropages, which are
- * *never* refcounted.
+ * page (that is, those where pfn_valid is true, except the shared zero
+ * folios) are refcounted and considered normal pages by the VM.
  *
  * The disadvantage is that pages are refcounted (which can be slower and
  * simply not an option for some PFNMAP users). The advantage is that we
@@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	/* Currently it's only used for huge pfnmaps */
 	if (unlikely(pmd_special(pmd)))
 		return NULL;
 
-- 
2.50.1
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Posted by Wei Yang 2 months, 1 week ago
On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
>The huge zero folio is refcounted (+mapcounted -- is that a word?)
>differently than "normal" folios, similarly (but different) to the ordinary
>shared zeropage.
>
>For this reason, we special-case these pages in
>vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>still use them (e.g., GUP can still take a reference on them).
>
>vm_normal_page_pmd() already filters out the huge zero folio. However,
>so far we are not marking it as special like we do with the ordinary
>shared zeropage. Let's mark it as special, so we can further refactor
>vm_normal_page_pmd() and vm_normal_page().
>
>While at it, update the doc regarding the shared zero folios.
>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Posted by Lorenzo Stoakes 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> The huge zero folio is refcounted (+mapcounted -- is that a word?)
> differently than "normal" folios, similarly (but different) to the ordinary
> shared zeropage.

Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
pages?

But for some reason the huge zero page wants to exist or not exist based on
usage for one. Which is stupid to me.

>
> For this reason, we special-case these pages in
> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> still use them (e.g., GUP can still take a reference on them).
>
> vm_normal_page_pmd() already filters out the huge zero folio. However,
> so far we are not marking it as special like we do with the ordinary
> shared zeropage. Let's mark it as special, so we can further refactor
> vm_normal_page_pmd() and vm_normal_page().
>
> While at it, update the doc regarding the shared zero folios.

Hmm I wonder how this will interact with the static PMD series at [0]?

I wonder if more use of that might result in some weirdness with refcounting
etc.?

Also, that series was (though I reviewed against it) moving stuff that
references the huge zero folio out of there, but also generally allows
access and mapping of this folio via largest_zero_folio() so not only via
insert_pmd().

So we're going to end up with mappings of this that are not marked special
that are potentially going to have refcount/mapcount manipulation that
contradict what you're doing here perhaps?

[0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/

>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I looked through places that use vm_normal_page_pmd() (other than the
function declaration):

fs/proc/task_mmu.c - seems to handle NULL page correctly + still understands zero page
mm/pagewalk.c - correctly handles NULL page + huge zero page
mm/huge_memory.c - can_change_pmd_writable() correctly returns false.

And all seems to work with this change.

Overall, other than concerns above + nits below LGTM, we should treat all
the zero folios the same in this regard, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c |  5 ++++-
>  mm/memory.c      | 14 +++++++++-----
>  2 files changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index db08c37b87077..3f9a27812a590 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>  {
>  	pmd_t entry;
>  	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> +	entry = pmd_mkspecial(entry);
>  	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>  	set_pmd_at(mm, haddr, pmd, entry);
>  	mm_inc_nr_ptes(mm);
> @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (fop.is_folio) {
>  		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>
> -		if (!is_huge_zero_folio(fop.folio)) {
> +		if (is_huge_zero_folio(fop.folio)) {
> +			entry = pmd_mkspecial(entry);
> +		} else {
>  			folio_get(fop.folio);
>  			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>  			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> diff --git a/mm/memory.c b/mm/memory.c
> index 92fd18a5d8d1f..173eb6267e0ac 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
> - * case, NULL is returned here. "Normal" mappings do have a struct page.
> + * case, NULL is returned here. "Normal" mappings do have a struct page and
> + * are ordinarily refcounted.
> + *
> + * Page mappings of the shared zero folios are always considered "special", as
> + * they are not ordinarily refcounted. However, selected page table walkers
> + * (such as GUP) can still identify these mappings and work with the
> + * underlying "struct page".

I feel like we need more detail or something more explicit about what 'not
ordinary' refcounting constitutes. This is a bit vague.

>   *
>   * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>   * pte bit, in which case this function is trivial. Secondly, an architecture
> @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>   * page" backing, however the difference is that _all_ pages with a struct
> - * page (that is, those where pfn_valid is true) are refcounted and considered
> - * normal pages by the VM. The only exception are zeropages, which are
> - * *never* refcounted.
> + * page (that is, those where pfn_valid is true, except the shared zero
> + * folios) are refcounted and considered normal pages by the VM.
>   *
>   * The disadvantage is that pages are refcounted (which can be slower and
>   * simply not an option for some PFNMAP users). The advantage is that we
> @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

You know, I'm semi-ashamed to admit I didn't even know this function
exists. But yikes that we have a separate function like this just for PMDs.

>  {
>  	unsigned long pfn = pmd_pfn(pmd);
>
> -	/* Currently it's only used for huge pfnmaps */
>  	if (unlikely(pmd_special(pmd)))
>  		return NULL;
>
> --
> 2.50.1
>
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Posted by David Hildenbrand 2 months, 2 weeks ago
On 17.07.25 20:29, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
>> The huge zero folio is refcounted (+mapcounted -- is that a word?)
>> differently than "normal" folios, similarly (but different) to the ordinary
>> shared zeropage.
> 
> Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> pages?

I wish we could get rid of the weird refcounting of the huge zero folio 
and get rid of the shrinker. But as long as the shrinker exists, I'm 
afraid that weird per-process refcounting must stay.

> 
> But for some reason the huge zero page wants to exist or not exist based on
> usage for one. Which is stupid to me.

Yes, I will try at some point (once we have the static huge zero folio) 
to remove the shrinker part and make it always static. Well, at least 
for reasonable architectures.

> 
>>
>> For this reason, we special-case these pages in
>> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>> still use them (e.g., GUP can still take a reference on them).
>>
>> vm_normal_page_pmd() already filters out the huge zero folio. However,
>> so far we are not marking it as special like we do with the ordinary
>> shared zeropage. Let's mark it as special, so we can further refactor
>> vm_normal_page_pmd() and vm_normal_page().
>>
>> While at it, update the doc regarding the shared zero folios.
> 
> Hmm I wonder how this will interact with the static PMD series at [0]?

No, it shouldn't.

> 
> I wonder if more use of that might result in some weirdness with refcounting
> etc.?

I don't think so.

> 
> Also, that series was (though I reviewed against it) moving stuff that
> references the huge zero folio out of there, but also generally allows
> access and mapping of this folio via largest_zero_folio() so not only via
> insert_pmd().
> 
> So we're going to end up with mappings of this that are not marked special
> that are potentially going to have refcount/mapcount manipulation that
> contradict what you're doing here perhaps?

I don't think so. It's just like having the existing static (small) 
shared zeropage where the same rules about refcounting+mapcounting apply.

> 
> [0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/
> 
>>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I looked through places that use vm_normal_page_pmd() (other than the
> function declaration):
> 
> fs/proc/task_mmu.c - seems to handle NULL page correctly + still understands zero page
> mm/pagewalk.c - correctly handles NULL page + huge zero page
> mm/huge_memory.c - can_change_pmd_writable() correctly returns false.
> 
> And all seems to work with this change.
> 
> Overall, other than concerns above + nits below LGTM, we should treat all
> the zero folios the same in this regard, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> 
>> ---
>>   mm/huge_memory.c |  5 ++++-
>>   mm/memory.c      | 14 +++++++++-----
>>   2 files changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index db08c37b87077..3f9a27812a590 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>>   {
>>   	pmd_t entry;
>>   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>> +	entry = pmd_mkspecial(entry);
>>   	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>   	set_pmd_at(mm, haddr, pmd, entry);
>>   	mm_inc_nr_ptes(mm);
>> @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   	if (fop.is_folio) {
>>   		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>>
>> -		if (!is_huge_zero_folio(fop.folio)) {
>> +		if (is_huge_zero_folio(fop.folio)) {
>> +			entry = pmd_mkspecial(entry);
>> +		} else {
>>   			folio_get(fop.folio);
>>   			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>>   			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 92fd18a5d8d1f..173eb6267e0ac 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>    *
>>    * "Special" mappings do not wish to be associated with a "struct page" (either
>>    * it doesn't exist, or it exists but they don't want to touch it). In this
>> - * case, NULL is returned here. "Normal" mappings do have a struct page.
>> + * case, NULL is returned here. "Normal" mappings do have a struct page and
>> + * are ordinarily refcounted.
>> + *
>> + * Page mappings of the shared zero folios are always considered "special", as
>> + * they are not ordinarily refcounted. However, selected page table walkers
>> + * (such as GUP) can still identify these mappings and work with the
>> + * underlying "struct page".
> 
> I feel like we need more detail or something more explicit about what 'not
> ordinary' refcounting constitutes. This is a bit vague.

Hm, I am not sure this is the correct place to document that. But let me 
see if I can come up with something reasonable

(like, the refcount and mapcount of these folios is never adjusted when 
mapping them into page tables)

> 
>>    *
>>    * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>>    * pte bit, in which case this function is trivial. Secondly, an architecture
>> @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>    *
>>    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>>    * page" backing, however the difference is that _all_ pages with a struct
>> - * page (that is, those where pfn_valid is true) are refcounted and considered
>> - * normal pages by the VM. The only exception are zeropages, which are
>> - * *never* refcounted.
>> + * page (that is, those where pfn_valid is true, except the shared zero
>> + * folios) are refcounted and considered normal pages by the VM.
>>    *
>>    * The disadvantage is that pages are refcounted (which can be slower and
>>    * simply not an option for some PFNMAP users). The advantage is that we
>> @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> 
> You know I"m semi-ashamed to admit I didn't even know this function
> exists. But yikes that we have a separate function like this just for PMDs.

It's a bit new-ish :)


-- 
Cheers,

David / dhildenb
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Posted by Lorenzo Stoakes 2 months, 2 weeks ago
On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
> On 17.07.25 20:29, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> > > The huge zero folio is refcounted (+mapcounted -- is that a word?)
> > > differently than "normal" folios, similarly (but different) to the ordinary
> > > shared zeropage.
> >
> > Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> > pages?
>
> I wish we could get rid of the weird refcounting of the huge zero folio and
> get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
> weird per-process refcounting must stay.

Does this change of yours cause any issue with it? I mean now nothing can grab
this page using vm_normal_page_pmd(), so it won't be able to manipulate
refcounts.

Does this impact the shrink stuff at all? Haven't traced it all through.

>
> >
> > But for some reason the huge zero page wants to exist or not exist based on
> > usage for one. Which is stupid to me.
>
> Yes, I will try at some point (once we have the static huge zero folio) to
> remove the shrinker part and make it always static. Well, at least for
> reasonable architectures.

Yes, this seems the correct way to move forward.

And we need to get rid of (or neuter) the 'unreasonable' architectures...

>
> >
> > >
> > > For this reason, we special-case these pages in
> > > vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> > > still use them (e.g., GUP can still take a reference on them).
> > >
> > > vm_normal_page_pmd() already filters out the huge zero folio. However,
> > > so far we are not marking it as special like we do with the ordinary
> > > shared zeropage. Let's mark it as special, so we can further refactor
> > > vm_normal_page_pmd() and vm_normal_page().
> > >
> > > While at it, update the doc regarding the shared zero folios.
> >
> > Hmm I wonder how this will interact with the static PMD series at [0]?
>
> No, it shouldn't.

I'm always nervous about these kinds of things :)

I'm assuming the reference/map counting will still work properly with the static
page?

I think I raised that in that series :P

>
> >
> > I wonder if more use of that might result in some weirdness with refcounting
> > etc.?
>
> I don't think so.

Good :>)

>
> >
> > Also, that series was (though I reviewed against it) moving stuff that
> > references the huge zero folio out of there, but also generally allows
> > access and mapping of this folio via largest_zero_folio() so not only via
> > insert_pmd().
> >
> > So we're going to end up with mappings of this that are not marked special
> > that are potentially going to have refcount/mapcount manipulation that
> > contradict what you're doing here perhaps?
>
> I don't think so. It's just like having the existing static (small) shared
> zeropage where the same rules about refcounting+mapcounting apply.

It feels like all this is a mess... am I missing something that would make it
all make sense?

It's not sane to make zero pages disappear based on non-usage in 2025. Sorry,
if that little bit of RAM hurts you, use a microcontroller... at which point it
doesn't really mean anything to have ref/map counts...

>
> >
> > [0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/
> >
> > >
> > > Reviewed-by: Oscar Salvador <osalvador@suse.de>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > I looked through places that use vm_normal_page_pmd() (other than the
> > function declaration):
> >
> > fs/proc/task_mmu.c - seems to handle NULL page correctly + still understands zero page
> > mm/pagewalk.c - correctly handles NULL page + huge zero page
> > mm/huge_memory.c - can_change_pmd_writable() correctly returns false.
> >
> > And all seems to work with this change.
> >
> > Overall, other than concerns above + nits below LGTM, we should treat all
> > the zero folios the same in this regard, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks!

No problem! Thanks for the cleanups, these are great...

>
> >
> > > ---
> > >   mm/huge_memory.c |  5 ++++-
> > >   mm/memory.c      | 14 +++++++++-----
> > >   2 files changed, 13 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index db08c37b87077..3f9a27812a590 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> > >   {
> > >   	pmd_t entry;
> > >   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> > > +	entry = pmd_mkspecial(entry);
> > >   	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > >   	set_pmd_at(mm, haddr, pmd, entry);
> > >   	mm_inc_nr_ptes(mm);
> > > @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >   	if (fop.is_folio) {
> > >   		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
> > >
> > > -		if (!is_huge_zero_folio(fop.folio)) {
> > > +		if (is_huge_zero_folio(fop.folio)) {
> > > +			entry = pmd_mkspecial(entry);
> > > +		} else {
> > >   			folio_get(fop.folio);
> > >   			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> > >   			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 92fd18a5d8d1f..173eb6267e0ac 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >    *
> > >    * "Special" mappings do not wish to be associated with a "struct page" (either
> > >    * it doesn't exist, or it exists but they don't want to touch it). In this
> > > - * case, NULL is returned here. "Normal" mappings do have a struct page.
> > > + * case, NULL is returned here. "Normal" mappings do have a struct page and
> > > + * are ordinarily refcounted.
> > > + *
> > > + * Page mappings of the shared zero folios are always considered "special", as
> > > + * they are not ordinarily refcounted. However, selected page table walkers
> > > + * (such as GUP) can still identify these mappings and work with the
> > > + * underlying "struct page".
> >
> > I feel like we need more detail or something more explicit about what 'not
> > ordinary' refcounting constitutes. This is a bit vague.
>
> Hm, I am not sure this is the correct place to document that. But let me see
> if I can come up with something reasonable
>
> (like, the refcount and mapcount of these folios is never adjusted when
> mapping them into page tables)

I think having _something_ is good even if perhaps not ideally situated... :)

>
> >
> > >    *
> > >    * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> > >    * pte bit, in which case this function is trivial. Secondly, an architecture
> > > @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >    *
> > >    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
> > >    * page" backing, however the difference is that _all_ pages with a struct
> > > - * page (that is, those where pfn_valid is true) are refcounted and considered
> > > - * normal pages by the VM. The only exception are zeropages, which are
> > > - * *never* refcounted.
> > > + * page (that is, those where pfn_valid is true, except the shared zero
> > > + * folios) are refcounted and considered normal pages by the VM.
> > >    *
> > >    * The disadvantage is that pages are refcounted (which can be slower and
> > >    * simply not an option for some PFNMAP users). The advantage is that we
> > > @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> >
> > You know, I'm semi-ashamed to admit I didn't even know this function
> > exists. But yikes that we have a separate function like this just for PMDs.
>
> It's a bit new-ish :)

OK I feel less bad about it then... -ish ;)

>
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Posted by David Hildenbrand 2 months, 2 weeks ago
On 18.07.25 12:41, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
>> On 17.07.25 20:29, Lorenzo Stoakes wrote:
>>> On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
>>>> The huge zero folio is refcounted (+mapcounted -- is that a word?)
>>>> differently than "normal" folios, similarly (but different) to the ordinary
>>>> shared zeropage.
>>>
>>> Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
>>> pages?
>>
>> I wish we could get rid of the weird refcounting of the huge zero folio and
>> get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
>> weird per-process refcounting must stay.
> 
> Does this change of yours cause any issue with it? I mean now nothing can grab
> this page using vm_normal_page_pmd(), so it won't be able to manipulate
> refcounts.

Please look again at vm_normal_page_pmd(): we have a manual 
huge_zero_pfn() check in there! There is no change in behavior. :)

It's not obvious from the diff below. But the huge zero folio was considered 
special before this change, just not marked accordingly.

>>
>>>
>>>>
>>>> For this reason, we special-case these pages in
>>>> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>>>> still use them (e.g., GUP can still take a reference on them).
>>>>
>>>> vm_normal_page_pmd() already filters out the huge zero folio. However,
>>>> so far we are not marking it as special like we do with the ordinary
>>>> shared zeropage. Let's mark it as special, so we can further refactor
>>>> vm_normal_page_pmd() and vm_normal_page().
>>>>
>>>> While at it, update the doc regarding the shared zero folios.
>>>
>>> Hmm I wonder how this will interact with the static PMD series at [0]?
>>
>> No, it shouldn't.
> 
> I'm always nervous about these kinds of things :)
> 
> I'm assuming the reference/map counting will still work properly with the static
> page?

Let me stress again: no change in behavior besides setting the special 
flag in this patch. Return value of vm_normal_page_pmd() is not changed.

>>>
>>> Also, that series was (though I reviewed against it) moving stuff that
>>> references the huge zero folio out of there, but also generally allows
>>> access and mapping of this folio via largest_zero_folio() so not only via
>>> insert_pmd().
>>>
>>> So we're going to end up with mappings of this that are not marked special
>>> that are potentially going to have refcount/mapcount manipulation that
>>> contradict what you're doing here perhaps?
>>
>> I don't think so. It's just like having the existing static (small) shared
>> zeropage where the same rules about refcounting+mapcounting apply.
> 
> It feels like all this is a mess... am I missing something that would make it
> all make sense?

Let me clarify:

The small zeropage is never refcounted+mapcounted when mapped into page 
tables.

The huge zero folio is never refcounted+mapcounted when mapped into page 
tables EXCEPT that it is refcounted in a weird, different way when first 
mapped into a process.

The whole reason is the shrinker. I don't like it, but there was a 
reason it was added. Maybe that reason no longer exists, but that's 
nothing that this patch series is concerned with, really. :)

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
Posted by Lorenzo Stoakes 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 12:54:38PM +0200, David Hildenbrand wrote:
> On 18.07.25 12:41, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
> > > On 17.07.25 20:29, Lorenzo Stoakes wrote:
> > > > On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> > > > > The huge zero folio is refcounted (+mapcounted -- is that a word?)
> > > > > differently than "normal" folios, similarly (but different) to the ordinary
> > > > > shared zeropage.
> > > >
> > > > Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> > > > pages?
> > >
> > > I wish we could get rid of the weird refcounting of the huge zero folio and
> > > get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
> > > weird per-process refcounting must stay.
> >
> > Does this change of yours cause any issue with it? I mean now nothing can grab
> > this page using vm_normal_page_pmd(), so it won't be able to manipulate
> > refcounts.
>
> Please look again at vm_normal_page_pmd(): we have a manual huge_zero_pfn()
> check in there! There is no change in behavior. :)
>
> It's not obvious from the diff below. But huge zero folio was considered
> special before this change, just not marked accordingly.

Yeah I think the delta screwed me up here.

Previously:

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	...
	if (unlikely(pmd_special(pmd)))
		return NULL;
	...
	if (is_huge_zero_pmd(pmd))
		return NULL;
	...
}

Now:

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	...
	if (unlikely(pmd_special(pmd))) {
		...
		if (is_huge_zero_pfn(pfn))
			return NULL;
		...
	}
	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
}

And:

static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
		unsigned long addr, unsigned long pfn, unsigned long long entry)
{
	...
	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
		...
		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
			return NULL;
	}
	...
}

So it is equivalent, thanks, satisfied with that now!

Sorry for being a pain, just keen to really be confident in this.

>
> > >
> > > >
> > > > >
> > > > > For this reason, we special-case these pages in
> > > > > vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> > > > > still use them (e.g., GUP can still take a reference on them).
> > > > >
> > > > > vm_normal_page_pmd() already filters out the huge zero folio. However,
> > > > > so far we are not marking it as special like we do with the ordinary
> > > > > shared zeropage. Let's mark it as special, so we can further refactor
> > > > > vm_normal_page_pmd() and vm_normal_page().
> > > > >
> > > > > While at it, update the doc regarding the shared zero folios.
> > > >
> > > > Hmm I wonder how this will interact with the static PMD series at [0]?
> > >
> > > No, it shouldn't.
> >
> > I'm always nervous about these kinds of things :)
> >
> > I'm assuming the reference/map counting will still work properly with the static
> > page?
>
> Let me stress again: no change in behavior besides setting the special flag
> in this patch. Return value of vm_normal_page_pmd() is not changed.

OK yeah I think this is the issue here then - I had assumed that it did
_somehow_ modify behaviour.

See above; based on your responses I've satisfied myself it's all good, thank
you!

>
> > > >
> > > > Also, that series was (though I reviewed against it) moving stuff that
> > > > references the huge zero folio out of there, but also generally allows
> > > > access and mapping of this folio via largest_zero_folio() so not only via
> > > > insert_pmd().
> > > >
> > > > So we're going to end up with mappings of this that are not marked special
> > > > that are potentially going to have refcount/mapcount manipulation that
> > > > contradict what you're doing here perhaps?
> > >
> > > I don't think so. It's just like having the existing static (small) shared
> > > zeropage where the same rules about refcounting+mapcounting apply.
> >
> > It feels like all this is a mess... am I missing something that would make it
> > all make sense?
>
> Let me clarify:
>
> The small zeropage is never refcounted+mapcounted when mapped into page
> tables.
>
> The huge zero folio is never refcounted+mapcounted when mapped into page
> tables EXCEPT that it is refcounted in a weird, different way when first
> mapped into a process.

Thanks that's great!

>
> The whole reason is the shrinker. I don't like it, but there was a reason it
> was added. Maybe that reason no longer exists, but that's nothing that this
> patch series is concerned with, really. :)

I hate that so much but yes out of scope.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo