[PATCH 2/3] mm: Add generic helper to hint a large folio

Posted by Dev Jain 7 months, 2 weeks ago
To use PTE batching, we want to determine whether the folio mapped by
the PTE is large, thus requiring the use of vm_normal_folio(). We want
to avoid the cost of vm_normal_folio() if the code path doesn't already
require the folio. For arm64, pte_batch_hint() does the job. To generalize
this hint, add a helper which will determine whether two consecutive PTEs
point to consecutive PFNs, in which case there is a high probability that
the underlying folio is large.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..28e21fcc7837 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
 }
 #endif
 
+/* Caller must ensure that ptep + 1 exists */
+static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
+{
+	pte_t *next_ptep, next_pte;
+
+	if (pte_batch_hint(ptep, pte) != 1)
+		return true;
+
+	next_ptep = ptep + 1;
+	next_pte = ptep_get(next_ptep);
+	if (!pte_present(next_pte))
+		return false;
+
+	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
+}
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
-- 
2.30.2
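For context: the generic pte_batch_hint() fallback that this helper builds
on simply reports a batch of one, so on architectures without a contpte-style
hint the helper always falls through to reading the next PTE. A rough sketch
of the generic fallback in include/linux/pgtable.h (paraphrased, not quoted
verbatim from the tree):

#ifndef pte_batch_hint
/*
 * No architecture-specific hint available: report a batch of a single
 * PTE, so callers fall back to examining PTEs one by one.
 */
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
{
	return 1;
}
#endif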
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by David Hildenbrand 7 months, 1 week ago
On 06.05.25 07:00, Dev Jain wrote:
> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>   include/linux/pgtable.h | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..28e21fcc7837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>   }
>   #endif
>   
> +/* Caller must ensure that ptep + 1 exists */
> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> +{
> +	pte_t *next_ptep, next_pte;
> +
> +	if (pte_batch_hint(ptep, pte) != 1)
> +		return true;
> +
> +	next_ptep = ptep + 1;
> +	next_pte = ptep_get(next_ptep);
> +	if (!pte_present(next_pte))
> +		return false;
> +
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> +}

So, where we want to use that is:

if (pte_present(old_pte)) {
	if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
		struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);

		if (folio && folio_test_large(folio))
			nr = folio_pte_batch(folio, old_addr, old_ptep,
					     old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
	}
}

where we won't need the folio later. But do we want it all to be part of the same folio?


And the simpler version would be


if (pte_present(old_pte)) {
	if (max_nr != 1) {
		struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);

		if (folio && folio_test_large(folio))
			nr = folio_pte_batch(folio, old_addr, old_ptep,
					     old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
	}
}


Two things come to mind:

(1) Do we *really* care about the vm_normal_folio() + folio_test_large() call
so much that you have to add this optimization ahead of time? :)

(2) Do we really need "must be part of the same folio", or could we just batch
over present ptes that map consecutive PFNs? In that case, a helper that avoids
folio_pte_batch() completely might be better.

-- 
Cheers,

David / dhildenb
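A minimal sketch of what (2) could look like: batching purely over present
PTEs that map consecutive PFNs, without the folio lookup. The helper name and
details here are illustrative, not an existing kernel API:

/*
 * Hypothetical non-folio batching helper: count consecutive present
 * PTEs whose PFNs increase by one, never touching the folio.
 */
static inline int pfn_pte_batch(pte_t *start_ptep, pte_t pte, int max_nr)
{
	unsigned long pfn = pte_pfn(pte);
	int nr = 1;

	while (nr < max_nr) {
		pte_t next = ptep_get(start_ptep + nr);

		if (!pte_present(next) || pte_pfn(next) != pfn + nr)
			break;
		nr++;
	}
	return nr;
}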
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by Dev Jain 7 months, 1 week ago

On 07/05/25 3:33 pm, David Hildenbrand wrote:
> On 06.05.25 07:00, Dev Jain wrote:
>> To use PTE batching, we want to determine whether the folio mapped by
>> the PTE is large, thus requiring the use of vm_normal_folio(). We want
>> to avoid the cost of vm_normal_folio() if the code path doesn't already
>> require the folio. For arm64, pte_batch_hint() does the job. To 
>> generalize
>> this hint, add a helper which will determine whether two consecutive PTEs
>> point to consecutive PFNs, in which case there is a high probability that
>> the underlying folio is large.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..28e21fcc7837 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>>   }
>>   #endif
>> +/* Caller must ensure that ptep + 1 exists */
>> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
>> +{
>> +    pte_t *next_ptep, next_pte;
>> +
>> +    if (pte_batch_hint(ptep, pte) != 1)
>> +        return true;
>> +
>> +    next_ptep = ptep + 1;
>> +    next_pte = ptep_get(next_ptep);
>> +    if (!pte_present(next_pte))
>> +        return false;
>> +
>> +    return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
>> +}
> 
> So, where we want to use that is:
> 
> if (pte_present(old_pte)) {
>      if ((max_nr != 1) && maybe_contiguous_pte_pfns(old_ptep, old_pte)) {
>          struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> 
>          if (folio && folio_test_large(folio))
>              nr = folio_pte_batch(folio, old_addr, old_ptep,
>                           old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
>      }
> }
> 
> where we won't need the folio later. But do we want it all to be part of
> the same folio?
> 
> 
> And the simpler version would be
> 
> 
> if (pte_present(old_pte)) {
>      if (max_nr != 1) {
>          struct folio *folio = vm_normal_folio(vma, old_addr, old_pte);
> 
>          if (folio && folio_test_large(folio))
>              nr = folio_pte_batch(folio, old_addr, old_ptep,
>                           old_pte, max_nr, fpb_flags, NULL, NULL, NULL);
>      }
> }
> 
> 
> Two things come to mind:
> 
> (1) Do we *really* care about the vm_normal_folio() + folio_test_large()
> call so much that you have to add this optimization ahead of time? :)

For my mprotect series, I see a regression of almost (7.7 - 7.65)/7.7 = 
0.65% for the small folio case. I am happy to remove this 
micro-optimization if that is the preference.

> 
> (2) Do we really need "must be part of the same folio", or could we just
> batch over present ptes that map consecutive PFNs? In that case, a helper
> that avoids folio_pte_batch() completely might be better.
> 
I am not sure I get you here. folio_pte_batch() seems to be the simplest
thing we can do, as is done elsewhere in the code; I am not aware of any
alternative.



Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by David Hildenbrand 7 months, 1 week ago
>> (2) Do we really need "must be part of the same folio", or could we just
>> batch over present ptes that map consecutive PFNs? In that case, a helper
>> that avoids folio_pte_batch() completely might be better.
>>
> I am not sure I get you here. folio_pte_batch() seems to be the simplest
> thing we can do, as is done elsewhere in the code; I am not aware of any
> alternative.

If we don't need the folio, then we can have a batching function that
doesn't require the folio.

Likely, we could even factor that (non-folio batching) out from folio_pte_batch().
The recent fix [1] might make that easier. See below.


So my question is: is something relying on all of these PTEs to point at the same folio?

[1] https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz


Something like this: (would need kerneldoc, probably remove "addr" parameter from folio_pte_batch(),
and look into other related cleanups as discussed with Andrew)


 From f56f67ee5ae9879adb99a8da37fa7ec848c4d256 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Thu, 8 May 2025 12:53:52 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/internal.h | 84 ++++++++++++++++++++++++++++-----------------------
  1 file changed, 46 insertions(+), 38 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 25a29872c634b..53ff8f8a7c8f9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -217,36 +217,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  	return pte_wrprotect(pte_mkold(pte));
  }
  
-/**
- * folio_pte_batch - detect a PTE batch for a large folio
- * @folio: The large folio to detect a PTE batch for.
- * @addr: The user virtual address the first page is mapped at.
- * @start_ptep: Page table pointer for the first entry.
- * @pte: Page table entry for the first page.
- * @max_nr: The maximum number of table entries to consider.
- * @flags: Flags to modify the PTE batch semantics.
- * @any_writable: Optional pointer to indicate whether any entry except the
- *		  first one is writable.
- * @any_young: Optional pointer to indicate whether any entry except the
- *		  first one is young.
- * @any_dirty: Optional pointer to indicate whether any entry except the
- *		  first one is dirty.
- *
- * Detect a PTE batch: consecutive (present) PTEs that map consecutive
- * pages of the same large folio.
- *
- * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
- * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
- * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
- *
- * start_ptep must map any page of the folio. max_nr must be at least one and
- * must be limited by the caller so scanning cannot exceed a single page table.
- *
- * Return: the number of table entries in the batch.
- */
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
-		bool *any_writable, bool *any_young, bool *any_dirty)
+static inline int pte_batch(pte_t *start_ptep, pte_t pte, int max_nr,
+		fpb_t flags, bool *any_writable, bool *any_young, bool *any_dirty)
  {
  	pte_t expected_pte, *ptep;
  	bool writable, young, dirty;
@@ -259,14 +231,6 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
  	if (any_dirty)
  		*any_dirty = false;
  
-	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
-	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
-	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
-
-	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
-	max_nr = min_t(unsigned long, max_nr,
-		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
-
  	nr = pte_batch_hint(start_ptep, pte);
  	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
  	ptep = start_ptep + nr;
@@ -300,6 +264,50 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
  	return min(nr, max_nr);
  }
  
+/**
+ * folio_pte_batch - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @addr: The user virtual address the first page is mapped at.
+ * @start_ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @max_nr: The maximum number of table entries to consider.
+ * @flags: Flags to modify the PTE batch semantics.
+ * @any_writable: Optional pointer to indicate whether any entry except the
+ *		  first one is writable.
+ * @any_young: Optional pointer to indicate whether any entry except the
+ *		  first one is young.
+ * @any_dirty: Optional pointer to indicate whether any entry except the
+ *		  first one is dirty.
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same large folio.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, writable bit, dirty bit (with FPB_IGNORE_DIRTY) and
+ * soft-dirty bit (with FPB_IGNORE_SOFT_DIRTY).
+ *
+ * start_ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+		bool *any_writable, bool *any_young, bool *any_dirty)
+{
+
+	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
+	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
+
+	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
+	max_nr = min_t(unsigned long, max_nr,
+		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+
+	return pte_batch(start_ptep, pte, max_nr, flags, any_writable, any_young,
+			 any_dirty);
+}
+
  /**
   * pte_move_swp_offset - Move the swap entry offset field of a swap pte
   *	 forward or backward by delta
-- 
2.49.0


-- 
Cheers,

David / dhildenb
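With that factoring, a caller that does not need the folio could combine the
hint with the non-folio pte_batch() along these lines (illustrative sketch
against the draft above, using the mremap-style variables from earlier in
the thread):

	if (pte_present(old_pte) && max_nr != 1 &&
	    maybe_contiguous_pte_pfns(old_ptep, old_pte))
		nr = pte_batch(old_ptep, old_pte, max_nr, fpb_flags,
			       NULL, NULL, NULL);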
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by Dev Jain 7 months, 1 week ago

On 08/05/25 4:25 pm, David Hildenbrand wrote:
> 
>>> (2) Do we really need "must be part of the same folio", or could we just
>>> batch over present ptes that map consecutive PFNs? In that case, a helper
>>> that avoids folio_pte_batch() completely might be better.
>>>
>> I am not sure I get you here. folio_pte_batch() seems to be the simplest
>> thing we can do, as is done elsewhere in the code; I am not aware of any
>> alternative.
> 
> If we don't need the folio, then we can have a batching function that
> doesn't require the folio.
> 
> Likely, we could even factor that (non-folio batching) out from 
> folio_pte_batch().
> The recent fix [1] might make that easier. See below.
> 
> 
> So my question is: is something relying on all of these PTEs to point at 
> the same folio?

Hmm... get_and_clear_full_ptes(), as you say in another mail, will require
that...

> 
> [1] https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz
> 
> 
> Something like this: (would need kerneldoc, probably remove "addr" 
> parameter from folio_pte_batch(),
> and look into other related cleanups as discussed with Andrew)

I like this refactoring! Can you tell me the commit hash the patch is based
on? I cannot apply it.

So we need to collect/not collect the a/d bits according to whether the PTE
batch belongs to a large folio or to small folios. Seems complicated :)

> [...]

Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by David Hildenbrand 7 months, 1 week ago
On 09.05.25 07:25, Dev Jain wrote:
> 
> 
> On 08/05/25 4:25 pm, David Hildenbrand wrote:
>>
>>>> (2) Do we really need "must be part of the same folio", or could we just
>>>> batch over present ptes that map consecutive PFNs? In that case, a helper
>>>> that avoids folio_pte_batch() completely might be better.
>>>>
>>> I am not sure I get you here. folio_pte_batch() seems to be the simplest
>>> thing we can do, as is done elsewhere in the code; I am not aware of any
>>> alternative.
>>
>> If we don't need the folio, then we can have a batching function that
>> doesn't require the folio.
>>
>> Likely, we could even factor that (non-folio batching) out from
>> folio_pte_batch().
>> The recent fix [1] might make that easier. See below.
>>
>>
>> So my question is: is something relying on all of these PTEs to point at
>> the same folio?
> 
> Hmm... get_and_clear_full_ptes(), as you say in another mail, will require
> that...
> 
>>
>> [1] https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz
>>
>>
>> Something like this: (would need kerneldoc, probably remove "addr"
>> parameter from folio_pte_batch(),
>> and look into other related cleanups as discussed with Andrew)
> 
> I like this refactoring! Can you tell the commit hash on which you make
> the patch, I cannot apply it.

Oh, it was just on top of my private version of [1]. It should now be in 
mm-new (or mm-unstable, did not check).

But as raised in my other mail, get_and_clear_full_ptes() might be 
problematic across folios.


-- 
Cheers,

David / dhildenb
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by Matthew Wilcox 7 months, 2 weeks ago
On Tue, May 06, 2025 at 10:30:55AM +0530, Dev Jain wrote:
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);

umm ... PFN is in units of PAGE_SIZE.  Did you mean == 1?
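
The corrected final check, comparing the PFN delta against 1 rather than
PAGE_SIZE, would be:

	/* PFNs are page frame numbers: adjacent pages differ by exactly 1 */
	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);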
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by Dev Jain 7 months, 1 week ago

On 06/05/25 9:16 pm, Matthew Wilcox wrote:
> On Tue, May 06, 2025 at 10:30:55AM +0530, Dev Jain wrote:
>> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> 
> umm ... PFN is in units of PAGE_SIZE.  Did you mean == 1?

My bad, thanks for pointing that out.
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by Anshuman Khandual 7 months, 2 weeks ago
On 5/6/25 10:30, Dev Jain wrote:
> To use PTE batching, we want to determine whether the folio mapped by
> the PTE is large, thus requiring the use of vm_normal_folio(). We want
> to avoid the cost of vm_normal_folio() if the code path doesn't already
> require the folio. For arm64, pte_batch_hint() does the job. To generalize
> this hint, add a helper which will determine whether two consecutive PTEs
> point to consecutive PFNs, in which case there is a high probability that
> the underlying folio is large.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..28e21fcc7837 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -369,6 +369,22 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>  
> +/* Caller must ensure that ptep + 1 exists */
> +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> +{
> +	pte_t *next_ptep, next_pte;
> +
> +	if (pte_batch_hint(ptep, pte) != 1)
> +		return true;
> +
> +	next_ptep = ptep + 1;
> +	next_pte = ptep_get(next_ptep);
> +	if (!pte_present(next_pte))
> +		return false;
> +
> +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == PAGE_SIZE);
> +}
> +
>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					    unsigned long address,

As mentioned earlier, this new helper maybe_contiguous_pte_pfns() does not
have a caller in this patch. Hence please do consider folding it into the
next patch, where move_ptes() starts using the helper. Besides, it is also
difficult to review this patch without proper caller context.
Re: [PATCH 2/3] mm: Add generic helper to hint a large folio
Posted by Lorenzo Stoakes 7 months, 2 weeks ago
On Tue, May 06, 2025 at 02:40:44PM +0530, Anshuman Khandual wrote:
> On 5/6/25 10:30, Dev Jain wrote:
> > To use PTE batching, we want to determine whether the folio mapped by
> > the PTE is large, thus requiring the use of vm_normal_folio(). We want
> > to avoid the cost of vm_normal_folio() if the code path doesn't already
> > require the folio. For arm64, pte_batch_hint() does the job. To generalize
> > this hint, add a helper which will determine whether two consecutive PTEs
> > point to consecutive PFNs, in which case there is a high probability that
> > the underlying folio is large.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> >  include/linux/pgtable.h | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > [...]
>
> As mentioned earlier, this new helper maybe_contiguous_pte_pfns() does not
> have a caller in this patch. Hence please do consider folding it into the
> next patch, where move_ptes() starts using the helper. Besides, it is also
> difficult to review this patch without proper caller context.

Agreed, please combine the two.