[PATCH v2 02/14] mm: Filter zone device pages returned from folio_walk_start()

Alistair Popple posted 14 patches 3 months, 3 weeks ago
There is a newer version of this series
[PATCH v2 02/14] mm: Filter zone device pages returned from folio_walk_start()
Posted by Alistair Popple 3 months, 3 weeks ago
Previously dax pages were skipped by the pagewalk code as pud_special() or
vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are
refcounted normally that is no longer the case, so the pagewalk code will
start returning them.

Most callers already explicitly filter for DAX or zone device pages so
don't need updating. However some don't, so add checks to those callers.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

---

Changes since v1:

 - Dropped "mm/pagewalk: Skip dax pages in pagewalk" and replaced it
   with this new patch for v2

 - As suggested by David and Jason we can filter the folios in the
   callers instead of doing it in folio_start_walk(). Most callers
   already do this (see below).

I audited all callers of folio_walk_start() and found the following:

mm/ksm.c:

break_ksm() - doesn't need to filter zone_device pages because the can
never be KSM pages.

get_mergeable_page() - already filters out zone_device pages.
scan_get_next_rmap_iterm() - already filters out zone_device_pages.

mm/huge_memory.c:

split_huge_pages_pid() - already checks for DAX with
vma_not_suitable_for_thp_split()

mm/rmap.c:

make_device_exclusive() - only works on anonymous pages, although
there'd be no issue with finding a DAX page even if support was extended
to file-backed pages.

mm/migrate.c:

add_folio_for_migration() - already checks the vma with vma_migratable()
do_pages_stat_array() - explicitly checks for zone_device folios

kernel/event/uprobes.c:

uprobe_write_opcode() - only works on anonymous pages, not sure if
zone_device could ever work so add an explicit check

arch/s390/mm/fault.c:

do_secure_storage_access() - not sure so be conservative and add a check

arch/s390/kernel/uv.c:

make_hva_secure() - not sure so be conservative and add a check
---
 arch/s390/kernel/uv.c   | 2 +-
 arch/s390/mm/fault.c    | 2 +-
 kernel/events/uprobes.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index b99478e..55aa280 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -424,7 +424,7 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
 		return -EFAULT;
 	}
 	folio = folio_walk_start(&fw, vma, hva, 0);
-	if (!folio) {
+	if (!folio || folio_is_zone_device(folio)) {
 		mmap_read_unlock(mm);
 		return -ENXIO;
 	}
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index e1ad05b..df1a067 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -449,7 +449,7 @@ void do_secure_storage_access(struct pt_regs *regs)
 		if (!vma)
 			return handle_fault_error(regs, SEGV_MAPERR);
 		folio = folio_walk_start(&fw, vma, addr, 0);
-		if (!folio) {
+		if (!folio || folio_is_zone_device(folio)) {
 			mmap_read_unlock(mm);
 			return;
 		}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8a601df..f774367 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -539,7 +539,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
 	}
 
 	ret = 0;
-	if (unlikely(!folio_test_anon(folio))) {
+	if (unlikely(!folio_test_anon(folio) || folio_is_zone_device(folio))) {
 		VM_WARN_ON_ONCE(is_register);
 		folio_put(folio);
 		goto out;
-- 
git-series 0.9.1
Re: [PATCH v2 02/14] mm: Filter zone device pages returned from folio_walk_start()
Posted by David Hildenbrand 3 months, 3 weeks ago
On 16.06.25 13:58, Alistair Popple wrote:
> Previously dax pages were skipped by the pagewalk code as pud_special() or
> vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are
> refcounted normally that is no longer the case, so the pagewalk code will
> start returning them.
> 
> Most callers already explicitly filter for DAX or zone device pages so
> don't need updating. However some don't, so add checks to those callers.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> 
> ---
> 
> Changes since v1:
> 
>   - Dropped "mm/pagewalk: Skip dax pages in pagewalk" and replaced it
>     with this new patch for v2
> 
>   - As suggested by David and Jason we can filter the folios in the
>     callers instead of doing it in folio_start_walk(). Most callers
>     already do this (see below).
> 
> I audited all callers of folio_walk_start() and found the following:
> 
> mm/ksm.c:
> 
> break_ksm() - doesn't need to filter zone_device pages because the can
> never be KSM pages.
> 
> get_mergeable_page() - already filters out zone_device pages.
> scan_get_next_rmap_iterm() - already filters out zone_device_pages.
> 
> mm/huge_memory.c:
> 
> split_huge_pages_pid() - already checks for DAX with
> vma_not_suitable_for_thp_split()
> 
> mm/rmap.c:
> 
> make_device_exclusive() - only works on anonymous pages, although
> there'd be no issue with finding a DAX page even if support was extended
> to file-backed pages.
> 
> mm/migrate.c:
> 
> add_folio_for_migration() - already checks the vma with vma_migratable()
> do_pages_stat_array() - explicitly checks for zone_device folios
> 
> kernel/event/uprobes.c:
> 
> uprobe_write_opcode() - only works on anonymous pages, not sure if
> zone_device could ever work so add an explicit check
> 
> arch/s390/mm/fault.c:
> 
> do_secure_storage_access() - not sure so be conservative and add a check
> 
> arch/s390/kernel/uv.c:
> 
> make_hva_secure() - not sure so be conservative and add a check
> ---
>   arch/s390/kernel/uv.c   | 2 +-
>   arch/s390/mm/fault.c    | 2 +-
>   kernel/events/uprobes.c | 2 +-
>   3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index b99478e..55aa280 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -424,7 +424,7 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
>   		return -EFAULT;
>   	}
>   	folio = folio_walk_start(&fw, vma, hva, 0);
> -	if (!folio) {
> +	if (!folio || folio_is_zone_device(folio)) {
>   		mmap_read_unlock(mm);
>   		return -ENXIO;
>   	}
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index e1ad05b..df1a067 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -449,7 +449,7 @@ void do_secure_storage_access(struct pt_regs *regs)
>   		if (!vma)
>   			return handle_fault_error(regs, SEGV_MAPERR);
>   		folio = folio_walk_start(&fw, vma, addr, 0);
> -		if (!folio) {
> +		if (!folio || folio_is_zone_device(folio)) {
>   			mmap_read_unlock(mm);
>   			return;
>   		}

Curious, does s390 even support ZONE_DEVICE and could trigger this?

> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 8a601df..f774367 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -539,7 +539,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>   	}
>   
>   	ret = 0;
> -	if (unlikely(!folio_test_anon(folio))) {
> +	if (unlikely(!folio_test_anon(folio) || folio_is_zone_device(folio))) {
>   		VM_WARN_ON_ONCE(is_register);
>   		folio_put(folio);
>   		goto out;

I wonder if __uprobe_write_opcode() would just work with anon device folios?

We only modify page content, and conditionally zap the page. Would there 
be a problem with anon device folios?

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 02/14] mm: Filter zone device pages returned from folio_walk_start()
Posted by David Hildenbrand 3 months, 3 weeks ago
On 17.06.25 11:25, David Hildenbrand wrote:
> On 16.06.25 13:58, Alistair Popple wrote:
>> Previously dax pages were skipped by the pagewalk code as pud_special() or
>> vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are
>> refcounted normally that is no longer the case, so the pagewalk code will
>> start returning them.
>>
>> Most callers already explicitly filter for DAX or zone device pages so
>> don't need updating. However some don't, so add checks to those callers.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>
>> ---
>>
>> Changes since v1:
>>
>>    - Dropped "mm/pagewalk: Skip dax pages in pagewalk" and replaced it
>>      with this new patch for v2
>>
>>    - As suggested by David and Jason we can filter the folios in the
>>      callers instead of doing it in folio_start_walk(). Most callers
>>      already do this (see below).
>>
>> I audited all callers of folio_walk_start() and found the following:
>>
>> mm/ksm.c:
>>
>> break_ksm() - doesn't need to filter zone_device pages because the can
>> never be KSM pages.
>>
>> get_mergeable_page() - already filters out zone_device pages.
>> scan_get_next_rmap_iterm() - already filters out zone_device_pages.
>>
>> mm/huge_memory.c:
>>
>> split_huge_pages_pid() - already checks for DAX with
>> vma_not_suitable_for_thp_split()
>>
>> mm/rmap.c:
>>
>> make_device_exclusive() - only works on anonymous pages, although
>> there'd be no issue with finding a DAX page even if support was extended
>> to file-backed pages.
>>
>> mm/migrate.c:
>>
>> add_folio_for_migration() - already checks the vma with vma_migratable()
>> do_pages_stat_array() - explicitly checks for zone_device folios
>>
>> kernel/event/uprobes.c:
>>
>> uprobe_write_opcode() - only works on anonymous pages, not sure if
>> zone_device could ever work so add an explicit check
>>
>> arch/s390/mm/fault.c:
>>
>> do_secure_storage_access() - not sure so be conservative and add a check
>>
>> arch/s390/kernel/uv.c:
>>
>> make_hva_secure() - not sure so be conservative and add a check
>> ---
>>    arch/s390/kernel/uv.c   | 2 +-
>>    arch/s390/mm/fault.c    | 2 +-
>>    kernel/events/uprobes.c | 2 +-
>>    3 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
>> index b99478e..55aa280 100644
>> --- a/arch/s390/kernel/uv.c
>> +++ b/arch/s390/kernel/uv.c
>> @@ -424,7 +424,7 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
>>    		return -EFAULT;
>>    	}
>>    	folio = folio_walk_start(&fw, vma, hva, 0);
>> -	if (!folio) {
>> +	if (!folio || folio_is_zone_device(folio)) {
>>    		mmap_read_unlock(mm);
>>    		return -ENXIO;
>>    	}
>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>> index e1ad05b..df1a067 100644
>> --- a/arch/s390/mm/fault.c
>> +++ b/arch/s390/mm/fault.c
>> @@ -449,7 +449,7 @@ void do_secure_storage_access(struct pt_regs *regs)
>>    		if (!vma)
>>    			return handle_fault_error(regs, SEGV_MAPERR);
>>    		folio = folio_walk_start(&fw, vma, addr, 0);
>> -		if (!folio) {
>> +		if (!folio || folio_is_zone_device(folio)) {
>>    			mmap_read_unlock(mm);
>>    			return;
>>    		}
> 
> Curious, does s390 even support ZONE_DEVICE and could trigger this?

Ah, I see you raised this above. Even if it could be triggered (which I 
don't think), I wonder if there would actually be a problem with 
zone_device folios in here?

I think these two can be dropped for now

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 02/14] mm: Filter zone device pages returned from folio_walk_start()
Posted by Alistair Popple 3 months, 3 weeks ago
On Tue, Jun 17, 2025 at 11:30:20AM +0200, David Hildenbrand wrote:
> On 17.06.25 11:25, David Hildenbrand wrote:
> > On 16.06.25 13:58, Alistair Popple wrote:
> > > Previously dax pages were skipped by the pagewalk code as pud_special() or
> > > vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are
> > > refcounted normally that is no longer the case, so the pagewalk code will
> > > start returning them.
> > > 
> > > Most callers already explicitly filter for DAX or zone device pages so
> > > don't need updating. However some don't, so add checks to those callers.
> > > 
> > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > > 
> > > ---
> > > 
> > > Changes since v1:
> > > 
> > >    - Dropped "mm/pagewalk: Skip dax pages in pagewalk" and replaced it
> > >      with this new patch for v2
> > > 
> > >    - As suggested by David and Jason we can filter the folios in the
> > >      callers instead of doing it in folio_start_walk(). Most callers
> > >      already do this (see below).
> > > 
> > > I audited all callers of folio_walk_start() and found the following:
> > > 
> > > mm/ksm.c:
> > > 
> > > break_ksm() - doesn't need to filter zone_device pages because the can
> > > never be KSM pages.
> > > 
> > > get_mergeable_page() - already filters out zone_device pages.
> > > scan_get_next_rmap_iterm() - already filters out zone_device_pages.
> > > 
> > > mm/huge_memory.c:
> > > 
> > > split_huge_pages_pid() - already checks for DAX with
> > > vma_not_suitable_for_thp_split()
> > > 
> > > mm/rmap.c:
> > > 
> > > make_device_exclusive() - only works on anonymous pages, although
> > > there'd be no issue with finding a DAX page even if support was extended
> > > to file-backed pages.
> > > 
> > > mm/migrate.c:
> > > 
> > > add_folio_for_migration() - already checks the vma with vma_migratable()
> > > do_pages_stat_array() - explicitly checks for zone_device folios
> > > 
> > > kernel/event/uprobes.c:
> > > 
> > > uprobe_write_opcode() - only works on anonymous pages, not sure if
> > > zone_device could ever work so add an explicit check
> > > 
> > > arch/s390/mm/fault.c:
> > > 
> > > do_secure_storage_access() - not sure so be conservative and add a check
> > > 
> > > arch/s390/kernel/uv.c:
> > > 
> > > make_hva_secure() - not sure so be conservative and add a check
> > > ---
> > >    arch/s390/kernel/uv.c   | 2 +-
> > >    arch/s390/mm/fault.c    | 2 +-
> > >    kernel/events/uprobes.c | 2 +-
> > >    3 files changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > > index b99478e..55aa280 100644
> > > --- a/arch/s390/kernel/uv.c
> > > +++ b/arch/s390/kernel/uv.c
> > > @@ -424,7 +424,7 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
> > >    		return -EFAULT;
> > >    	}
> > >    	folio = folio_walk_start(&fw, vma, hva, 0);
> > > -	if (!folio) {
> > > +	if (!folio || folio_is_zone_device(folio)) {
> > >    		mmap_read_unlock(mm);
> > >    		return -ENXIO;
> > >    	}
> > > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > > index e1ad05b..df1a067 100644
> > > --- a/arch/s390/mm/fault.c
> > > +++ b/arch/s390/mm/fault.c
> > > @@ -449,7 +449,7 @@ void do_secure_storage_access(struct pt_regs *regs)
> > >    		if (!vma)
> > >    			return handle_fault_error(regs, SEGV_MAPERR);
> > >    		folio = folio_walk_start(&fw, vma, addr, 0);
> > > -		if (!folio) {
> > > +		if (!folio || folio_is_zone_device(folio)) {
> > >    			mmap_read_unlock(mm);
> > >    			return;
> > >    		}
> > 
> > Curious, does s390 even support ZONE_DEVICE and could trigger this?

In thoery yes. Now that we don't need the DEVMAP PTE bit someone could enable
ZONE_DEVICE on s390 as it supports the rest of the prerequisites AFAICT:

config ZONE_DEVICE
        bool "Device memory (pmem, HMM, etc...) hotplug support"
        depends on MEMORY_HOTPLUG
        depends on MEMORY_HOTREMOVE
        depends on SPARSEMEM_VMEMMAP
 
> Ah, I see you raised this above. Even if it could be triggered (which I
> don't think), I wonder if there would actually be a problem with zone_device
> folios in here?

Yes, I'm not sure either - it seems unlikely but I know nothing about how secure
storage works on s390 so was trying to be be conservative.

> I think these two can be dropped for now

Ok.

> > I wonder if __uprobe_write_opcode() would just work with anon device folios?
> >
> > We only modify page content, and conditionally zap the page. Would there 
> > be a problem with anon device folios?

The two main types of anon device folios I know of are DEVICE_COHERENT
and DEVICE_PRIVATE. I doubt it would be a problem for the former, but it
would definitely be a problem for the latter as the actual page content is
unaddressable from the CPU.

So we could probably make the check specific to DEVICE_PRIVATE, although it's
hard to imagine anyone caring about uprobes from DEVICE_COHERENT memory.

> -- 
> Cheers,
> 
> David / dhildenb
>