[PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()

Posted by David Hildenbrand 9 months, 2 weeks ago
... by factoring it out from track_pfn_remap().

For PMDs/PUDs, actually check the full range, and trigger a fallback
if we run into this "different memory types / cachemodes" scenario.

Add some documentation.

Will checking each page result in undesired overhead? We'll have to
learn. Not checking each page looks wrong, though. Maybe we could
optimize the lookup internally.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/mm/pat/memtype.c | 24 ++++++++----------------
 include/linux/pgtable.h   | 28 ++++++++++++++++++++--------
 mm/huge_memory.c          |  7 +++++--
 mm/memory.c               |  4 ++--
 4 files changed, 35 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index edec5859651d6..193e33251b18f 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 		    unsigned long pfn, unsigned long addr, unsigned long size)
 {
 	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
-	enum page_cache_mode pcm;
 
 	/* reserve the whole chunk starting from paddr */
 	if (!vma || (addr == vma->vm_start
@@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 		return ret;
 	}
 
+	return pfnmap_sanitize_pgprot(pfn, size, prot);
+}
+
+int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot)
+{
+	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
+	enum page_cache_mode pcm;
+
 	if (!pat_enabled())
 		return 0;
 
-	/*
-	 * For anything smaller than the vma size we set prot based on the
-	 * lookup.
-	 */
 	pcm = lookup_memtype(paddr);
 
 	/* Check memtype for the remaining pages */
@@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 	return 0;
 }
 
-void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
-{
-	enum page_cache_mode pcm;
-
-	if (!pat_enabled())
-		return;
-
-	pcm = lookup_memtype(pfn_t_to_phys(pfn));
-	pgprot_set_cachemode(prot, pcm);
-}
-
 /*
  * untrack_pfn is called while unmapping a pfnmap for a region.
  * untrack can be called for a specific region indicated by pfn and size or
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c921..91aadfe2515a5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1500,13 +1500,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 	return 0;
 }
 
-/*
- * track_pfn_insert is called when a _new_ single pfn is established
- * by vmf_insert_pfn().
- */
-static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-				    pfn_t pfn)
+static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
+		pgprot_t *prot)
 {
+	return 0;
 }
 
 /*
@@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
 extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
 			   unsigned long pfn, unsigned long addr,
 			   unsigned long size);
-extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
-			     pfn_t pfn);
+
+/**
+ * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
+ * @pfn: the start of the pfn range
+ * @size: the size of the pfn range
+ * @prot: the pgprot to sanitize
+ *
+ * Sanitize the given pgprot for a pfn range, for example, adjusting the
+ * cachemode.
+ *
+ * This function cannot fail for a single page, but can fail for multiple
+ * pages.
+ *
+ * Returns 0 on success and -EINVAL on error.
+ */
+int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
+		pgprot_t *prot);
 extern int track_pfn_copy(struct vm_area_struct *dst_vma,
 		struct vm_area_struct *src_vma, unsigned long *pfn);
 extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fdcf0a6049b9f..b8ae5e1493315 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
 			return VM_FAULT_OOM;
 	}
 
-	track_pfn_insert(vma, &pgprot, pfn);
+	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
+		return VM_FAULT_FALLBACK;
+
 	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
 	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
 			pgtable);
@@ -1577,7 +1579,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	track_pfn_insert(vma, &pgprot, pfn);
+	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
+		return VM_FAULT_FALLBACK;
 
 	ptl = pud_lock(vma->vm_mm, vmf->pud);
 	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
diff --git a/mm/memory.c b/mm/memory.c
index 424420349bd3c..c737a8625866a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2563,7 +2563,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	if (!pfn_modify_allowed(pfn, pgprot))
 		return VM_FAULT_SIGBUS;
 
-	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
+	pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, &pgprot);
 
 	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
 			false);
@@ -2626,7 +2626,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return VM_FAULT_SIGBUS;
 
-	track_pfn_insert(vma, &pgprot, pfn);
+	pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
 
 	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
 		return VM_FAULT_SIGBUS;
-- 
2.49.0
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by Peter Xu 9 months, 2 weeks ago
On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
> ... by factoring it out from track_pfn_remap().
> 
> For PMDs/PUDs, actually check the full range, and trigger a fallback
> if we run into this "different memory types / cachemodes" scenario.

The current patch looks like it still passes PAGE_SIZE into the new helper at
all track_pfn_insert() call sites, so this comment does not seem to 100%
match the code?  Or I may have misread somewhere.

Maybe it's still easier to keep the single-pfn lookup to never fail..  more
below.

> 
> Add some documentation.
> 
> Will checking each page result in undesired overhead? We'll have to
> learn. Not checking each page looks wrong, though. Maybe we could
> optimize the lookup internally.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/x86/mm/pat/memtype.c | 24 ++++++++----------------
>  include/linux/pgtable.h   | 28 ++++++++++++++++++++--------
>  mm/huge_memory.c          |  7 +++++--
>  mm/memory.c               |  4 ++--
>  4 files changed, 35 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index edec5859651d6..193e33251b18f 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1031,7 +1031,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  		    unsigned long pfn, unsigned long addr, unsigned long size)
>  {
>  	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> -	enum page_cache_mode pcm;
>  
>  	/* reserve the whole chunk starting from paddr */
>  	if (!vma || (addr == vma->vm_start
> @@ -1044,13 +1043,17 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  		return ret;
>  	}
>  
> +	return pfnmap_sanitize_pgprot(pfn, size, prot);
> +}
> +
> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size, pgprot_t *prot)
> +{
> +	resource_size_t paddr = (resource_size_t)pfn << PAGE_SHIFT;
> +	enum page_cache_mode pcm;
> +
>  	if (!pat_enabled())
>  		return 0;
>  
> -	/*
> -	 * For anything smaller than the vma size we set prot based on the
> -	 * lookup.
> -	 */
>  	pcm = lookup_memtype(paddr);
>  
>  	/* Check memtype for the remaining pages */
> @@ -1065,17 +1068,6 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  	return 0;
>  }
>  
> -void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
> -{
> -	enum page_cache_mode pcm;
> -
> -	if (!pat_enabled())
> -		return;
> -
> -	pcm = lookup_memtype(pfn_t_to_phys(pfn));
> -	pgprot_set_cachemode(prot, pcm);
> -}
> -
>  /*
>   * untrack_pfn is called while unmapping a pfnmap for a region.
>   * untrack can be called for a specific region indicated by pfn and size or
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c921..91aadfe2515a5 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1500,13 +1500,10 @@ static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  	return 0;
>  }
>  
> -/*
> - * track_pfn_insert is called when a _new_ single pfn is established
> - * by vmf_insert_pfn().
> - */
> -static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> -				    pfn_t pfn)
> +static inline int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> +		pgprot_t *prot)
>  {
> +	return 0;
>  }
>  
>  /*
> @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
>  extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>  			   unsigned long pfn, unsigned long addr,
>  			   unsigned long size);
> -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> -			     pfn_t pfn);
> +
> +/**
> + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range

Nit: s/sanitize/update|setup|.../?

But maybe you have good reason to use sanitize.  No strong opinions.

> + * @pfn: the start of the pfn range
> + * @size: the size of the pfn range
> + * @prot: the pgprot to sanitize
> + *
> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> + * cachemode.
> + *
> + * This function cannot fail for a single page, but can fail for multiple
> + * pages.
> + *
> + * Returns 0 on success and -EINVAL on error.
> + */
> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> +		pgprot_t *prot);
>  extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>  		struct vm_area_struct *src_vma, unsigned long *pfn);
>  extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fdcf0a6049b9f..b8ae5e1493315 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>  			return VM_FAULT_OOM;
>  	}
>  
> -	track_pfn_insert(vma, &pgprot, pfn);
> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> +		return VM_FAULT_FALLBACK;

Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
trigger, though.

Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
a win already in all cases.

> +
>  	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>  	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
>  			pgtable);
> @@ -1577,7 +1579,8 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return VM_FAULT_SIGBUS;
>  
> -	track_pfn_insert(vma, &pgprot, pfn);
> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> +		return VM_FAULT_FALLBACK;
>  
>  	ptl = pud_lock(vma->vm_mm, vmf->pud);
>  	insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
> diff --git a/mm/memory.c b/mm/memory.c
> index 424420349bd3c..c737a8625866a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2563,7 +2563,7 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
>  	if (!pfn_modify_allowed(pfn, pgprot))
>  		return VM_FAULT_SIGBUS;
>  
> -	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
> +	pfnmap_sanitize_pgprot(pfn, PAGE_SIZE, &pgprot);
>  
>  	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
>  			false);
> @@ -2626,7 +2626,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return VM_FAULT_SIGBUS;
>  
> -	track_pfn_insert(vma, &pgprot, pfn);
> +	pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot);
>  
>  	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
>  		return VM_FAULT_SIGBUS;
> -- 
> 2.49.0
> 

-- 
Peter Xu
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by David Hildenbrand 9 months, 2 weeks ago
>>   
>> -	track_pfn_insert(vma, &pgprot, pfn);
>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>> +		return VM_FAULT_FALLBACK;
> 
> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> trigger, though.

Missed that comment. I can document that pgprot will only be touched if 
the function succeeds.

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 25.04.25 21:31, Peter Xu wrote:
> On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
>> ... by factoring it out from track_pfn_remap().
>>
>> For PMDs/PUDs, actually check the full range, and trigger a fallback
>> if we run into this "different memory types / cachemodes" scenario.
> 
> The current patch looks like to still pass PAGE_SIZE into the new helper at
> all track_pfn_insert() call sites, so it seems this comment does not 100%
> match with the code?  Or I may have misread somewhere.

No, you're right, while reshuffling the patches I forgot to add the 
actual PMD/PUD size.
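
So the PMD/PUD paths should presumably end up passing the actual mapping
size, i.e. something like:

	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), HPAGE_PMD_SIZE, &pgprot))
		return VM_FAULT_FALLBACK;

(and HPAGE_PUD_SIZE in vmf_insert_pfn_pud(), respectively).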

> 
> Maybe it's still easier to keep the single-pfn lookup to never fail..  more
> below.
> 

[...]

>>   /*
>> @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
>>   extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>   			   unsigned long pfn, unsigned long addr,
>>   			   unsigned long size);
>> -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
>> -			     pfn_t pfn);
>> +
>> +/**
>> + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
> 
> Nit: s/sanitize/update|setup|.../?
> 
> But maybe you have good reason to use sanitize.  No strong opinions.

What it does on PAT (the only implementation so far ...) is look up the
memory type to select the caching mode that can be used.

"sanitize" was IMHO a good fit, because we must make sure that we don't 
use the wrong caching mode.

update/setup/... don't make that quite clear. Any other suggestions?

> 
>> + * @pfn: the start of the pfn range
>> + * @size: the size of the pfn range
>> + * @prot: the pgprot to sanitize
>> + *
>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>> + * cachemode.
>> + *
>> + * This function cannot fail for a single page, but can fail for multiple
>> + * pages.
>> + *
>> + * Returns 0 on success and -EINVAL on error.
>> + */
>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>> +		pgprot_t *prot);
>>   extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>   		struct vm_area_struct *src_vma, unsigned long *pfn);
>>   extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index fdcf0a6049b9f..b8ae5e1493315 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>   			return VM_FAULT_OOM;
>>   	}
>>   
>> -	track_pfn_insert(vma, &pgprot, pfn);
>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>> +		return VM_FAULT_FALLBACK;
> 
> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> trigger, though.
> 
> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> a win already in all cases.

It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. 
That's certainly helpful for the single-page case.

Regarding never failing here: we should check the whole range. We have 
to make sure that none of the pages has a memory type / caching mode 
that is incompatible with what we setup.
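
For reference, the check on the PAT side is roughly the following (the
per-page loop as in arch/x86/mm/pat/memtype.c, simplified):

	pcm = lookup_memtype(paddr);

	/* Check memtype for the remaining pages */
	while (size > PAGE_SIZE) {
		size -= PAGE_SIZE;
		paddr += PAGE_SIZE;
		if (pcm != lookup_memtype(paddr))
			return -EINVAL;
	}

	pgprot_set_cachemode(prot, pcm);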


Thanks a bunch for the review!
-- 
Cheers,

David / dhildenb
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by Peter Xu 9 months, 2 weeks ago
On Fri, Apr 25, 2025 at 09:48:50PM +0200, David Hildenbrand wrote:
> On 25.04.25 21:31, Peter Xu wrote:
> > On Fri, Apr 25, 2025 at 10:17:06AM +0200, David Hildenbrand wrote:
> > > ... by factoring it out from track_pfn_remap().
> > > 
> > > For PMDs/PUDs, actually check the full range, and trigger a fallback
> > > if we run into this "different memory types / cachemodes" scenario.
> > 
> > The current patch looks like to still pass PAGE_SIZE into the new helper at
> > all track_pfn_insert() call sites, so it seems this comment does not 100%
> > match with the code?  Or I may have misread somewhere.
> 
> No, you're right, while reshuffling the patches I forgot to add the actual
> PMD/PUD size.
> 
> > 
> > Maybe it's still easier to keep the single-pfn lookup to never fail..  more
> > below.
> > 
> 
> [...]
> 
> > >   /*
> > > @@ -1556,8 +1553,23 @@ static inline void untrack_pfn_clear(struct vm_area_struct *vma)
> > >   extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > >   			   unsigned long pfn, unsigned long addr,
> > >   			   unsigned long size);
> > > -extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
> > > -			     pfn_t pfn);
> > > +
> > > +/**
> > > + * pfnmap_sanitize_pgprot - sanitize the pgprot for a pfn range
> > 
> > Nit: s/sanitize/update|setup|.../?
> > 
> > But maybe you have good reason to use sanitize.  No strong opinions.
> 
> What it does on PAT (only implementation so far ...) is looking up the
> memory type to select the caching mode that can be use.
> 
> "sanitize" was IMHO a good fit, because we must make sure that we don't use
> the wrong caching mode.
> 
> update/setup/... don't make that quite clear. Any other suggestions?

I'm very poor on naming.. :( So far anything seems slightly better than
sanitize to me, as the word "sanitize" is actually also used in memtype.c
for another purpose.. see sanitize_phys().

> 
> > 
> > > + * @pfn: the start of the pfn range
> > > + * @size: the size of the pfn range
> > > + * @prot: the pgprot to sanitize
> > > + *
> > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > + * cachemode.
> > > + *
> > > + * This function cannot fail for a single page, but can fail for multiple
> > > + * pages.
> > > + *
> > > + * Returns 0 on success and -EINVAL on error.
> > > + */
> > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > +		pgprot_t *prot);
> > >   extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > >   		struct vm_area_struct *src_vma, unsigned long *pfn);
> > >   extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > >   			return VM_FAULT_OOM;
> > >   	}
> > > -	track_pfn_insert(vma, &pgprot, pfn);
> > > +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > +		return VM_FAULT_FALLBACK;
> > 
> > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > trigger, though.
> > 
> > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > a win already in all cases.
> 
> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> certainly helpful for the single-page case.
> 
> Regarding never failing here: we should check the whole range. We have to
> make sure that none of the pages has a memory type / caching mode that is
> incompatible with what we setup.

Would it happen in real world?

IIUC per-vma registration needs to happen first, which checks for memtype
conflicts in the first place, or reserve_pfn_range() could already have
failed.

Here it's the fault path looking up the memtype, so I would expect it is
guaranteed all pfns under the same vma follow the verified (and same)
memtype?

Thanks,

-- 
Peter Xu
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by David Hildenbrand 9 months, 2 weeks ago
>> What it does on PAT (only implementation so far ...) is looking up the
>> memory type to select the caching mode that can be use.
>>
>> "sanitize" was IMHO a good fit, because we must make sure that we don't use
>> the wrong caching mode.
>>
>> update/setup/... don't make that quite clear. Any other suggestions?
> 
> I'm very poor on naming.. :( So far anything seems slightly better than
> sanitize to me, as the word "sanitize" is actually also used in memtype.c
> for other purpose.. see sanitize_phys().

Sure, one can sanitize a lot of things. Here it's the cachemode/pgprot,
in the other functions it's an address.

Likely we should just call it pfnmap_X_cachemode().

Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.

pfnmap_setup_cachemode() ? Hm.

> 
>>
>>>
>>>> + * @pfn: the start of the pfn range
>>>> + * @size: the size of the pfn range
>>>> + * @prot: the pgprot to sanitize
>>>> + *
>>>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>>>> + * cachemode.
>>>> + *
>>>> + * This function cannot fail for a single page, but can fail for multiple
>>>> + * pages.
>>>> + *
>>>> + * Returns 0 on success and -EINVAL on error.
>>>> + */
>>>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>>>> +		pgprot_t *prot);
>>>>    extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>>>    		struct vm_area_struct *src_vma, unsigned long *pfn);
>>>>    extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index fdcf0a6049b9f..b8ae5e1493315 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>>>    			return VM_FAULT_OOM;
>>>>    	}
>>>> -	track_pfn_insert(vma, &pgprot, pfn);
>>>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>>>> +		return VM_FAULT_FALLBACK;
>>>
>>> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
>>> trigger, though.
>>>
>>> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
>>> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
>>> a win already in all cases.
>>
>> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
>> certainly helpful for the single-page case.
>>
>> Regarding never failing here: we should check the whole range. We have to
>> make sure that none of the pages has a memory type / caching mode that is
>> incompatible with what we setup.
> 
> Would it happen in real world?
>
> IIUC per-vma registration needs to happen first, which checks for memtype
> conflicts in the first place, or reserve_pfn_range() could already have
> failed.
>
> Here it's the fault path looking up the memtype, so I would expect it is
> guaranteed all pfns under the same vma is following the verified (and same)
> memtype?

The whole point of track_pfn_insert() is that it is used when we *don't* 
use reserve_pfn_range()->track_pfn_remap(), no?

track_pfn_remap() would check the whole range that gets mapped, so 
track_pfn_insert() user must similarly check the whole range that gets 
mapped.

Note that even track_pfn_insert() is already pretty clear on the 
intended usage: "called when a _new_ single pfn is established"

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by Peter Xu 9 months, 2 weeks ago
On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> 
> > > What it does on PAT (only implementation so far ...) is looking up the
> > > memory type to select the caching mode that can be use.
> > > 
> > > "sanitize" was IMHO a good fit, because we must make sure that we don't use
> > > the wrong caching mode.
> > > 
> > > update/setup/... don't make that quite clear. Any other suggestions?
> > 
> > I'm very poor on naming.. :( So far anything seems slightly better than
> > sanitize to me, as the word "sanitize" is actually also used in memtype.c
> > for other purpose.. see sanitize_phys().
> 
> Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
> the other functions it's an address.
> 
> Likely we should just call it pfnmap_X_cachemode()/
> 
> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
> 
> pfnmap_setup_cachemode() ? Hm.

Sounds good here.

> 
> > 
> > > 
> > > > 
> > > > > + * @pfn: the start of the pfn range
> > > > > + * @size: the size of the pfn range
> > > > > + * @prot: the pgprot to sanitize
> > > > > + *
> > > > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > > > + * cachemode.
> > > > > + *
> > > > > + * This function cannot fail for a single page, but can fail for multiple
> > > > > + * pages.
> > > > > + *
> > > > > + * Returns 0 on success and -EINVAL on error.
> > > > > + */
> > > > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > > > +		pgprot_t *prot);
> > > > >    extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > > >    		struct vm_area_struct *src_vma, unsigned long *pfn);
> > > > >    extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > > > >    			return VM_FAULT_OOM;
> > > > >    	}
> > > > > -	track_pfn_insert(vma, &pgprot, pfn);
> > > > > +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > > > +		return VM_FAULT_FALLBACK;
> > > > 
> > > > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > > > trigger, though.
> > > > 
> > > > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > > > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > > > a win already in all cases.
> > > 
> > > It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> > > certainly helpful for the single-page case.
> > > 
> > > Regarding never failing here: we should check the whole range. We have to
> > > make sure that none of the pages has a memory type / caching mode that is
> > > incompatible with what we setup.
> > 
> > Would it happen in real world?
> >
> > IIUC per-vma registration needs to happen first, which checks for memtype
> > conflicts in the first place, or reserve_pfn_range() could already have
> > failed.
> >
> > Here it's the fault path looking up the memtype, so I would expect it is
> > guaranteed all pfns under the same vma is following the verified (and same)
> > memtype?
> 
> The whole point of track_pfn_insert() is that it is used when we *don't* use
> reserve_pfn_range()->track_pfn_remap(), no?
> 
> track_pfn_remap() would check the whole range that gets mapped, so
> track_pfn_insert() user must similarly check the whole range that gets
> mapped.
> 
> Note that even track_pfn_insert() is already pretty clear on the intended
> usage: "called when a _new_ single pfn is established"

We need to define "new" then..  But I agree it's not crystal clear at
least.  I think I just wasn't the first to assume it was reserved, see this
(especially, the "Expectation" part..):

commit 5180da410db6369d1f95c9014da1c9bc33fb043e
Author: Suresh Siddha <suresh.b.siddha@intel.com>
Date:   Mon Oct 8 16:28:29 2012 -0700

    x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
    
    With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
    attribute and uses it.  Expectation is that the driver reserves the
    memory attributes for the pfn before calling vm_insert_pfn().

-- 
Peter Xu
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 28.04.25 18:21, Peter Xu wrote:
> On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
>>
>>>> What it does on PAT (only implementation so far ...) is looking up the
>>>> memory type to select the caching mode that can be use.
>>>>
>>>> "sanitize" was IMHO a good fit, because we must make sure that we don't use
>>>> the wrong caching mode.
>>>>
>>>> update/setup/... don't make that quite clear. Any other suggestions?
>>>
>>> I'm very poor on naming.. :( So far anything seems slightly better than
>>> sanitize to me, as the word "sanitize" is actually also used in memtype.c
>>> for other purpose.. see sanitize_phys().
>>
>> Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
>> the other functions it's an address.
>>
>> Likely we should just call it pfnmap_X_cachemode()/
>>
>> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
>>
>> pfnmap_setup_cachemode() ? Hm.
> 
> Sounds good here.

Okay, I'll use that one. If ever something else besides PAT would 
require different semantics, they can bother with finding a better name :)

> 
>>
>>>
>>>>
>>>>>
>>>>>> + * @pfn: the start of the pfn range
>>>>>> + * @size: the size of the pfn range
>>>>>> + * @prot: the pgprot to sanitize
>>>>>> + *
>>>>>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>>>>>> + * cachemode.
>>>>>> + *
>>>>>> + * This function cannot fail for a single page, but can fail for multiple
>>>>>> + * pages.
>>>>>> + *
>>>>>> + * Returns 0 on success and -EINVAL on error.
>>>>>> + */
>>>>>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>>>>>> +		pgprot_t *prot);
>>>>>>     extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>>     		struct vm_area_struct *src_vma, unsigned long *pfn);
>>>>>>     extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index fdcf0a6049b9f..b8ae5e1493315 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>>>>>     			return VM_FAULT_OOM;
>>>>>>     	}
>>>>>> -	track_pfn_insert(vma, &pgprot, pfn);
>>>>>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>>>>>> +		return VM_FAULT_FALLBACK;
>>>>>
>>>>> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
>>>>> trigger, though.
>>>>>
>>>>> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
>>>>> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
>>>>> a win already in all cases.
>>>>
>>>> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
>>>> certainly helpful for the single-page case.
>>>>
>>>> Regarding never failing here: we should check the whole range. We have to
>>>> make sure that none of the pages has a memory type / caching mode that is
>>>> incompatible with what we setup.
>>>
>>> Would it happen in real world?
>>>
>>> IIUC per-vma registration needs to happen first, which checks for memtype
>>> conflicts in the first place, or reserve_pfn_range() could already have
>>> failed.
>>>
>>> Here it's the fault path looking up the memtype, so I would expect it is
>>> guaranteed all pfns under the same vma is following the verified (and same)
>>> memtype?
>>
>> The whole point of track_pfn_insert() is that it is used when we *don't* use
>> reserve_pfn_range()->track_pfn_remap(), no?
>>
>> track_pfn_remap() would check the whole range that gets mapped, so
>> track_pfn_insert() user must similarly check the whole range that gets
>> mapped.
>>
>> Note that even track_pfn_insert() is already pretty clear on the intended
>> usage: "called when a _new_ single pfn is established"
> 
> We need to define "new" then..  But I agree it's not crystal clear at
> least.  I think I just wasn't the first to assume it was reserved, see this
> (especially, the "Expectation" part..):
> 
> commit 5180da410db6369d1f95c9014da1c9bc33fb043e
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
> Date:   Mon Oct 8 16:28:29 2012 -0700
> 
>      x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
>      
>      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
>      attribute and uses it.  Expectation is that the driver reserves the
>      memory attributes for the pfn before calling vm_insert_pfn().

It's all confusing.

We do have the following functions relevant in pat code:

(1) memtype_reserve(): used by ioremap and set_memory_XX

(2) memtype_reserve_io(): used by iomap

(3) reserve_pfn_range(): only remap_pfn_range() calls it

(4) arch_io_reserve_memtype_wc()


Which one would perform the reservation for, say, vfio?


I agree that if there would be a guarantee/expectation that all PFNs 
have the same memtype (from previous reservation), it would be 
sufficient to check a single PFN, and we could document that. I just 
don't easily see where that reservation is happening.

So a pointer to that would be appreciated!

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by Peter Xu 9 months, 2 weeks ago
On Mon, Apr 28, 2025 at 10:37:49PM +0200, David Hildenbrand wrote:
> On 28.04.25 18:21, Peter Xu wrote:
> > On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
> > > 
> > > > > What it does on PAT (only implementation so far ...) is looking up the
> > > > > memory type to select the caching mode that can be use.
> > > > > 
> > > > > "sanitize" was IMHO a good fit, because we must make sure that we don't use
> > > > > the wrong caching mode.
> > > > > 
> > > > > update/setup/... don't make that quite clear. Any other suggestions?
> > > > 
> > > > I'm very poor on naming.. :( So far anything seems slightly better than
> > > > sanitize to me, as the word "sanitize" is actually also used in memtype.c
> > > > for other purpose.. see sanitize_phys().
> > > 
> > > Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
> > > the other functions it's an address.
> > > 
> > > Likely we should just call it pfnmap_X_cachemode()/
> > > 
> > > Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
> > > 
> > > pfnmap_setup_cachemode() ? Hm.
> > 
> > Sounds good here.
> 
> Okay, I'll use that one. If ever something else besides PAT would require
> different semantics, they can bother with finding a better name :)
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > + * @pfn: the start of the pfn range
> > > > > > > + * @size: the size of the pfn range
> > > > > > > + * @prot: the pgprot to sanitize
> > > > > > > + *
> > > > > > > + * Sanitize the given pgprot for a pfn range, for example, adjusting the
> > > > > > > + * cachemode.
> > > > > > > + *
> > > > > > > + * This function cannot fail for a single page, but can fail for multiple
> > > > > > > + * pages.
> > > > > > > + *
> > > > > > > + * Returns 0 on success and -EINVAL on error.
> > > > > > > + */
> > > > > > > +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
> > > > > > > +		pgprot_t *prot);
> > > > > > >     extern int track_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > > >     		struct vm_area_struct *src_vma, unsigned long *pfn);
> > > > > > >     extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
> > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > > index fdcf0a6049b9f..b8ae5e1493315 100644
> > > > > > > --- a/mm/huge_memory.c
> > > > > > > +++ b/mm/huge_memory.c
> > > > > > > @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> > > > > > >     			return VM_FAULT_OOM;
> > > > > > >     	}
> > > > > > > -	track_pfn_insert(vma, &pgprot, pfn);
> > > > > > > +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
> > > > > > > +		return VM_FAULT_FALLBACK;
> > > > > > 
> > > > > > Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
> > > > > > trigger, though.
> > > > > > 
> > > > > > Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
> > > > > > replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
> > > > > > a win already in all cases.
> > > > > 
> > > > > It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
> > > > > certainly helpful for the single-page case.
> > > > > 
> > > > > Regarding never failing here: we should check the whole range. We have to
> > > > > make sure that none of the pages has a memory type / caching mode that is
> > > > > incompatible with what we setup.
> > > > 
> > > > Would it happen in real world?
> > > >
> > > > IIUC per-vma registration needs to happen first, which checks for memtype
> > > > conflicts in the first place, or reserve_pfn_range() could already have
> > > > failed.
> > > >
> > > > Here it's the fault path looking up the memtype, so I would expect it is
> > > > guaranteed all pfns under the same vma is following the verified (and same)
> > > > memtype?
> > > 
> > > The whole point of track_pfn_insert() is that it is used when we *don't* use
> > > reserve_pfn_range()->track_pfn_remap(), no?
> > > 
> > > track_pfn_remap() would check the whole range that gets mapped, so
> > > track_pfn_insert() user must similarly check the whole range that gets
> > > mapped.
> > > 
> > > Note that even track_pfn_insert() is already pretty clear on the intended
> > > usage: "called when a _new_ single pfn is established"
> > 
> > We need to define "new" then..  But I agree it's not crystal clear at
> > least.  I think I just wasn't the first to assume it was reserved, see this
> > (especially, the "Expectation" part..):
> > 
> > commit 5180da410db6369d1f95c9014da1c9bc33fb043e
> > Author: Suresh Siddha <suresh.b.siddha@intel.com>
> > Date:   Mon Oct 8 16:28:29 2012 -0700
> > 
> >      x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
> >      With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
> >      attribute and uses it.  Expectation is that the driver reserves the
> >      memory attributes for the pfn before calling vm_insert_pfn().
> 
> It's all confusing.
> 
> We do have the following functions relevant in pat code:
> 
> (1) memtype_reserve(): used by ioremap and set_memory_XX
> 
> (2) memtype_reserve_io(): used by iomap
> 
> (3) reserve_pfn_range(): only remap_pfn_range() calls it
> 
> (4) arch_io_reserve_memtype_wc()
> 
> 
> Which one would perform the reservation for, say, vfio?

My understanding is it was done via barmap.  See this stack:

vfio_pci_core_mmap
  pci_iomap
    pci_iomap_range
      ... 
        __ioremap_caller
          memtype_reserve

> 
> 
> I agree that if there would be a guarantee/expectation that all PFNs have
> the same memtype (from previous reservation), it would be sufficient to
> check a single PFN, and we could document that. I just don't easily see
> where that reservation is happening.
> 
> So a pointer to that would be appreciated!

I am not aware of any pointer.. maybe others could chime in.

IMHO, if there's anything uncertain, for this one we could always decouple
this issue from the core issue you're working on, so at least it keeps the
old behavior (which is pure lookup on pfn injections) until a solid issue
occurs?  It avoids the case where we could introduce unnecessary code but
then it's much harder to justify a removal.  What do you think?

Thanks,

-- 
Peter Xu
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 29.04.25 15:44, Peter Xu wrote:
> On Mon, Apr 28, 2025 at 10:37:49PM +0200, David Hildenbrand wrote:
>> On 28.04.25 18:21, Peter Xu wrote:
>>> On Mon, Apr 28, 2025 at 04:58:46PM +0200, David Hildenbrand wrote:
>>>>
>>>>>> What it does on PAT (only implementation so far ...) is looking up the
>>>>>> memory type to select the caching mode that can be use.
>>>>>>
>>>>>> "sanitize" was IMHO a good fit, because we must make sure that we don't use
>>>>>> the wrong caching mode.
>>>>>>
>>>>>> update/setup/... don't make that quite clear. Any other suggestions?
>>>>>
>>>>> I'm very poor on naming.. :( So far anything seems slightly better than
>>>>> sanitize to me, as the word "sanitize" is actually also used in memtype.c
>>>>> for other purpose.. see sanitize_phys().
>>>>
>>>> Sure, one can sanitize a lot of things. Here it's the cachemode/pgrpot, in
>>>> the other functions it's an address.
>>>>
>>>> Likely we should just call it pfnmap_X_cachemode()/
>>>>
>>>> Set/update don't really fit for X in case pfnmap_X_cachemode() is a NOP.
>>>>
>>>> pfnmap_setup_cachemode() ? Hm.
>>>
>>> Sounds good here.
>>
>> Okay, I'll use that one. If ever something else besides PAT would require
>> different semantics, they can bother with finding a better name :)
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> + * @pfn: the start of the pfn range
>>>>>>>> + * @size: the size of the pfn range
>>>>>>>> + * @prot: the pgprot to sanitize
>>>>>>>> + *
>>>>>>>> + * Sanitize the given pgprot for a pfn range, for example, adjusting the
>>>>>>>> + * cachemode.
>>>>>>>> + *
>>>>>>>> + * This function cannot fail for a single page, but can fail for multiple
>>>>>>>> + * pages.
>>>>>>>> + *
>>>>>>>> + * Returns 0 on success and -EINVAL on error.
>>>>>>>> + */
>>>>>>>> +int pfnmap_sanitize_pgprot(unsigned long pfn, unsigned long size,
>>>>>>>> +		pgprot_t *prot);
>>>>>>>>      extern int track_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>>>>      		struct vm_area_struct *src_vma, unsigned long *pfn);
>>>>>>>>      extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index fdcf0a6049b9f..b8ae5e1493315 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -1455,7 +1455,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>>>>>>>>      			return VM_FAULT_OOM;
>>>>>>>>      	}
>>>>>>>> -	track_pfn_insert(vma, &pgprot, pfn);
>>>>>>>> +	if (pfnmap_sanitize_pgprot(pfn_t_to_pfn(pfn), PAGE_SIZE, &pgprot))
>>>>>>>> +		return VM_FAULT_FALLBACK;
>>>>>>>
>>>>>>> Would "pgtable" leak if it fails?  If it's PAGE_SIZE, IIUC it won't ever
>>>>>>> trigger, though.
>>>>>>>
>>>>>>> Maybe we could have a "void pfnmap_sanitize_pgprot_pfn(&pgprot, pfn)" to
>>>>>>> replace track_pfn_insert() and never fail?  Dropping vma ref is definitely
>>>>>>> a win already in all cases.
>>>>>>
>>>>>> It could be a simple wrapper around pfnmap_sanitize_pgprot(), yes. That's
>>>>>> certainly helpful for the single-page case.
>>>>>>
>>>>>> Regarding never failing here: we should check the whole range. We have to
>>>>>> make sure that none of the pages has a memory type / caching mode that is
>>>>>> incompatible with what we setup.
>>>>>
>>>>> Would it happen in real world?
>>>>>
>>>>> IIUC per-vma registration needs to happen first, which checks for memtype
>>>>> conflicts in the first place, or reserve_pfn_range() could already have
>>>>> failed.
>>>>>
>>>>> Here it's the fault path looking up the memtype, so I would expect it is
>>>>> guaranteed all pfns under the same vma is following the verified (and same)
>>>>> memtype?
>>>>
>>>> The whole point of track_pfn_insert() is that it is used when we *don't* use
>>>> reserve_pfn_range()->track_pfn_remap(), no?
>>>>
>>>> track_pfn_remap() would check the whole range that gets mapped, so
>>>> track_pfn_insert() user must similarly check the whole range that gets
>>>> mapped.
>>>>
>>>> Note that even track_pfn_insert() is already pretty clear on the intended
>>>> usage: "called when a _new_ single pfn is established"
>>>
>>> We need to define "new" then..  But I agree it's not crystal clear at
>>> least.  I think I just wasn't the first to assume it was reserved, see this
>>> (especially, the "Expectation" part..):
>>>
>>> commit 5180da410db6369d1f95c9014da1c9bc33fb043e
>>> Author: Suresh Siddha <suresh.b.siddha@intel.com>
>>> Date:   Mon Oct 8 16:28:29 2012 -0700
>>>
>>>       x86, pat: separate the pfn attribute tracking for remap_pfn_range and vm_insert_pfn
>>>       With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
>>>       attribute and uses it.  Expectation is that the driver reserves the
>>>       memory attributes for the pfn before calling vm_insert_pfn().
>>
>> It's all confusing.
>>
>> We do have the following functions relevant in pat code:
>>
>> (1) memtype_reserve(): used by ioremap and set_memory_XX
>>
>> (2) memtype_reserve_io(): used by iomap
>>
>> (3) reserve_pfn_range(): only remap_pfn_range() calls it
>>
>> (4) arch_io_reserve_memtype_wc()
>>
>>
>> Which one would perform the reservation for, say, vfio?
> 
> My understanding is it was done via barmap.  See this stack:
> 
> vfio_pci_core_mmap
>    pci_iomap
>      pci_iomap_range
>        ...
>          __ioremap_caller
>            memtype_reserve
> 
>>
>>
>> I agree that if there would be a guarantee/expectation that all PFNs have
>> the same memtype (from previous reservation), it would be sufficient to
>> check a single PFN, and we could document that. I just don't easily see
>> where that reservation is happening.
>>
>> So a pointer to that would be appreciated!
> 
> I am not aware of any pointer.. maybe others could chime in.
> 
> IMHO, if there's anything uncertain, for this one we could always decouple
> this issue from the core issue you're working on, so at least it keeps the
> old behavior (which is pure lookup on pfn injections) until a solid issue
> occurs?  It avoids the case where we could introduce unnecessary code but
> then it's much harder to justify a removal.  What do you think?

I'll use the _pfn variant and document the behavior.
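
Roughly what I have in mind for that variant (just a sketch; name as
discussed above):

	/*
	 * Single-pfn variant that cannot fail; @prot is only updated
	 * once the lookup succeeded.
	 */
	static inline void pfnmap_setup_cachemode_pfn(unsigned long pfn,
			pgprot_t *prot)
	{
		pfnmap_setup_cachemode(pfn, PAGE_SIZE, prot);
	}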

I do wonder why we even have to lookup the memtype again if the caller 
apparently reserved it (which implied checking it). All a bit weird.

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 02/11] mm: convert track_pfn_insert() to pfnmap_sanitize_pgprot()
Posted by Peter Xu 9 months, 2 weeks ago
On Tue, Apr 29, 2025 at 06:25:06PM +0200, David Hildenbrand wrote:
> I do wonder why we even have to lookup the memtype again if the caller
> apparently reserved it (which implied checking it). All a bit weird.

Maybe it's because the memtype info isn't always visible to the upper
layers, e.g. the default pci_iomap() for MMIOs doesn't need to specify
anything about the cache mode.  There's the pci_iomap_wc() variant, but it
still looks like only the internals have full control..

-- 
Peter Xu