[PATCH v4 19/25] proc/task_mmu: Ignore ZONE_DEVICE pages

Alistair Popple posted 25 patches 1 year ago
There is a newer version of this series
[PATCH v4 19/25] proc/task_mmu: Ignore ZONE_DEVICE pages
Posted by Alistair Popple 1 year ago
The procfs mmu files such as smaps currently ignore device dax and fs
dax pages because these pages are considered special. To maintain
existing behaviour once these pages are treated as normal pages and
returned from vm_normal_page() add tests to explicitly skip them.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 fs/proc/task_mmu.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38a5a3e..c9b227a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (pte_present(ptent)) {
 		page = vm_normal_page(vma, addr, ptent);
+		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
+			page = NULL;
 		young = pte_young(ptent);
 		dirty = pte_dirty(ptent);
 		present = true;
@@ -849,6 +851,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 
 	if (pmd_present(*pmd)) {
 		page = vm_normal_page_pmd(vma, addr, *pmd);
+		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
+			page = NULL;
 		present = true;
 	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
@@ -1378,7 +1382,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
 		return false;
 	folio = vm_normal_folio(vma, addr, pte);
-	if (!folio)
+	if (!folio || folio_is_device_dax(folio) || folio_is_fsdax(folio))
 		return false;
 	return folio_maybe_dma_pinned(folio);
 }
@@ -1703,6 +1707,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
 		page = vm_normal_page(vma, addr, pte);
+		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
+			page = NULL;
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
@@ -2089,7 +2095,9 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
 			page = vm_normal_page(vma, addr, pte);
-			if (page && !PageAnon(page))
+			if (page && !PageAnon(page) &&
+			    !is_device_dax_page(page) &&
+			    !is_fsdax_page(page))
 				categories |= PAGE_IS_FILE;
 		}
 
@@ -2151,7 +2159,9 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
 			page = vm_normal_page_pmd(vma, addr, pmd);
-			if (page && !PageAnon(page))
+			if (page && !PageAnon(page) &&
+			    !is_device_dax_page(page) &&
+			    !is_fsdax_page(page))
 				categories |= PAGE_IS_FILE;
 		}
 
@@ -2914,7 +2924,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 		return NULL;
 
 	page = vm_normal_page_pmd(vma, addr, pmd);
-	if (!page)
+	if (!page || is_device_dax_page(page) || is_fsdax_page(page))
 		return NULL;
 
 	if (PageReserved(page))
-- 
git-series 0.9.1
Re: [PATCH v4 19/25] proc/task_mmu: Ignore ZONE_DEVICE pages
Posted by David Hildenbrand 12 months ago
On 17.12.24 06:13, Alistair Popple wrote:
> The procfs mmu files such as smaps currently ignore device dax and fs
> dax pages because these pages are considered special. To maintain
> existing behaviour once these pages are treated as normal pages and
> returned from vm_normal_page() add tests to explicitly skip them.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
>   fs/proc/task_mmu.c | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 38a5a3e..c9b227a 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>   
>   	if (pte_present(ptent)) {
>   		page = vm_normal_page(vma, addr, ptent);
> +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))

This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme 
here, likely we should have a special helper?


But, don't we actually want to include them in the smaps output now? I 
think we want.

The rmap code will indicate these pages in /proc/meminfo, per-node info, 
in the memcg ... as "Mapped:" etc.

So likely we just want to also indicate them here, or is there any 
downsides we know of?

-- 
Cheers,

David / dhildenb
Re: [PATCH v4 19/25] proc/task_mmu: Ignore ZONE_DEVICE pages
Posted by Alistair Popple 12 months ago
On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote:
> On 17.12.24 06:13, Alistair Popple wrote:
> > The procfs mmu files such as smaps currently ignore device dax and fs
> > dax pages because these pages are considered special. To maintain
> > existing behaviour once these pages are treated as normal pages and
> > returned from vm_normal_page() add tests to explicitly skip them.
> > 
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > ---
> >   fs/proc/task_mmu.c | 18 ++++++++++++++----
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 38a5a3e..c9b227a 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> >   	if (pte_present(ptent)) {
> >   		page = vm_normal_page(vma, addr, ptent);
> > +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
> 
> This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme
> here, likely we should have a special helper?

Sounds good, will add is_dax_page() if there are enough callers left after any
review comments.
 
> But, don't we actually want to include them in the smaps output now? I think
> we want.

I'm not an expert in what callers of vm_normal_page() think of as a "normal"
page. So my philosphy here was to ensure anything calling vm_normal_page()
didn't accidentally start seeing DAX pages, either by checking existing filters
(lots of callers already call vma_is_special_huge() or some equivalent) or
explicitly filtering them out in the hope someone smarter than me could tell me
it was unneccssary.

That stategy seems to have worked, and so I agree we likely do want them in
smaps. I just didn't want to silently do it without this kind of discussion
first.

> The rmap code will indicate these pages in /proc/meminfo, per-node info, in
> the memcg ... as "Mapped:" etc.
> 
> So likely we just want to also indicate them here, or is there any downsides
> we know of?

I don't know of any, and I think it makes sense to also indicate them so will
drop this check in the respin.

> -- 
> Cheers,
> 
> David / dhildenb
>
Re: [PATCH v4 19/25] proc/task_mmu: Ignore ZONE_DEVICE pages
Posted by David Hildenbrand 12 months ago
On 19.12.24 00:11, Alistair Popple wrote:
> On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote:
>> On 17.12.24 06:13, Alistair Popple wrote:
>>> The procfs mmu files such as smaps currently ignore device dax and fs
>>> dax pages because these pages are considered special. To maintain
>>> existing behaviour once these pages are treated as normal pages and
>>> returned from vm_normal_page() add tests to explicitly skip them.
>>>
>>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>>> ---
>>>    fs/proc/task_mmu.c | 18 ++++++++++++++----
>>>    1 file changed, 14 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 38a5a3e..c9b227a 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>>>    	if (pte_present(ptent)) {
>>>    		page = vm_normal_page(vma, addr, ptent);
>>> +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
>>
>> This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme
>> here, likely we should have a special helper?
> 
> Sounds good, will add is_dax_page() if there are enough callers left after any
> review comments.

:)

>   
>> But, don't we actually want to include them in the smaps output now? I think
>> we want.
> 
> I'm not an expert in what callers of vm_normal_page() think of as a "normal"
> page. 

Yeah, it's tricky. It means "this is abnormal, don't look at the struct 
page". We're moving away from that, such that these folios/pages will be 
... mostly normal :)

> So my philosphy here was to ensure anything calling vm_normal_page()
> didn't accidentally start seeing DAX pages, either by checking existing filters
> (lots of callers already call vma_is_special_huge() or some equivalent) or
> explicitly filtering them out in the hope someone smarter than me could tell me
> it was unneccssary.
> 
> That stategy seems to have worked, and so I agree we likely do want them in
> smaps. I just didn't want to silently do it without this kind of discussion
> first.

Yes, absolutely.

> 
>> The rmap code will indicate these pages in /proc/meminfo, per-node info, in
>> the memcg ... as "Mapped:" etc.
>>
>> So likely we just want to also indicate them here, or is there any downsides
>> we know of?
> 
> I don't know of any, and I think it makes sense to also indicate them so will
> drop this check in the respin.

It will be easy to hide them later, at least we talked about it. Thanks 
for doing all this!

-- 
Cheers,

David / dhildenb
Re: [PATCH v4 19/25] proc/task_mmu: Ignore ZONE_DEVICE pages
Posted by Alistair Popple 11 months, 2 weeks ago
On Fri, Dec 20, 2024 at 07:32:52PM +0100, David Hildenbrand wrote:
> On 19.12.24 00:11, Alistair Popple wrote:
> > On Tue, Dec 17, 2024 at 11:31:25PM +0100, David Hildenbrand wrote:
> > > On 17.12.24 06:13, Alistair Popple wrote:
> > > > The procfs mmu files such as smaps currently ignore device dax and fs
> > > > dax pages because these pages are considered special. To maintain
> > > > existing behaviour once these pages are treated as normal pages and
> > > > returned from vm_normal_page() add tests to explicitly skip them.
> > > > 
> > > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > > > ---
> > > >    fs/proc/task_mmu.c | 18 ++++++++++++++----
> > > >    1 file changed, 14 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 38a5a3e..c9b227a 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -801,6 +801,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> > > >    	if (pte_present(ptent)) {
> > > >    		page = vm_normal_page(vma, addr, ptent);
> > > > +		if (page && (is_device_dax_page(page) || is_fsdax_page(page)))
> > > 
> > > This "is_device_dax_page(page) || is_fsdax_page(page)" is a common theme
> > > here, likely we should have a special helper?
> > 
> > Sounds good, will add is_dax_page() if there are enough callers left after any
> > review comments.
> 
> :)

In the end there was only a single caller so I will leave this open-coded.

> > > But, don't we actually want to include them in the smaps output now? I think
> > > we want.
> > 
> > I'm not an expert in what callers of vm_normal_page() think of as a "normal"
> > page.
> 
> Yeah, it's tricky. It means "this is abnormal, don't look at the struct
> page". We're moving away from that, such that these folios/pages will be ...
> mostly normal :)
> 
> > So my philosphy here was to ensure anything calling vm_normal_page()
> > didn't accidentally start seeing DAX pages, either by checking existing filters
> > (lots of callers already call vma_is_special_huge() or some equivalent) or
> > explicitly filtering them out in the hope someone smarter than me could tell me
> > it was unneccssary.
> > 
> > That stategy seems to have worked, and so I agree we likely do want them in
> > smaps. I just didn't want to silently do it without this kind of discussion
> > first.
> 
> Yes, absolutely.
> 
> > 
> > > The rmap code will indicate these pages in /proc/meminfo, per-node info, in
> > > the memcg ... as "Mapped:" etc.
> > > 
> > > So likely we just want to also indicate them here, or is there any downsides
> > > we know of?
> > 
> > I don't know of any, and I think it makes sense to also indicate them so will
> > drop this check in the respin.
> 
> It will be easy to hide them later, at least we talked about it. Thanks for
> doing all this!

Not a problem. The other main thing in this patch is also hiding them from
/proc/<PID>/pagemap. Based on this discussion I can't think of any good reason
why we would want to hide them there so will also remove the checks in the
pagemap walker.

> -- 
> Cheers,
> 
> David / dhildenb
>