[PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()

Posted by David Hildenbrand 3 months, 1 week ago
Many users (including upcoming ones) don't really need the flags etc,
and can live with a function call.

So let's provide a basic, non-inlined folio_pte_batch().

In zap_present_ptes(), where we care about performance, the compiler
already seems to generate a call to a common inlined folio_pte_batch()
variant, shared with fork() code. So calling the new non-inlined variant
should not make a difference.

While at it, drop the "addr" parameter that is unused.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h  | 11 ++++++++---
 mm/madvise.c   |  4 ++--
 mm/memory.c    |  6 ++----
 mm/mempolicy.c |  3 +--
 mm/mlock.c     |  3 +--
 mm/mremap.c    |  3 +--
 mm/rmap.c      |  3 +--
 mm/util.c      | 29 +++++++++++++++++++++++++++++
 8 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index ca6590c6d9eab..6000b683f68ee 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
 }
 
 /**
- * folio_pte_batch - detect a PTE batch for a large folio
+ * folio_pte_batch_ext - detect a PTE batch for a large folio
  * @folio: The large folio to detect a PTE batch for.
- * @addr: The user virtual address the first page is mapped at.
  * @ptep: Page table pointer for the first entry.
  * @pte: Page table entry for the first page.
  * @max_nr: The maximum number of table entries to consider.
@@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * must be limited by the caller so scanning cannot exceed a single VMA and
  * a single page table.
  *
+ * This function will be inlined to optimize based on the input parameters;
+ * consider using folio_pte_batch() instead if applicable.
+ *
  * Return: the number of table entries in the batch.
  */
-static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
+static inline unsigned int folio_pte_batch_ext(struct folio *folio,
 		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
@@ -293,6 +295,9 @@ static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long ad
 	return min(nr, max_nr);
 }
 
+unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
+		unsigned int max_nr);
+
 /**
  * pte_move_swp_offset - Move the swap entry offset field of a swap pte
  *	 forward or backward by delta
diff --git a/mm/madvise.c b/mm/madvise.c
index 661bb743d2216..9b9c35a398ed0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -349,8 +349,8 @@ static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
 {
 	int max_nr = (end - addr) / PAGE_SIZE;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       any_young, any_dirty);
+	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL,
+				   any_young, any_dirty);
 }
 
 static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
diff --git a/mm/memory.c b/mm/memory.c
index ab2d6c1425691..43d35d6675f2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -995,7 +995,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		if (vma_soft_dirty_enabled(src_vma))
 			flags |= FPB_HONOR_SOFT_DIRTY;
 
-		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
+		nr = folio_pte_batch_ext(folio, src_pte, pte, max_nr, flags,
 				     &any_writable, NULL, NULL);
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
@@ -1564,9 +1564,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, 0,
-				     NULL, NULL, NULL);
-
+		nr = folio_pte_batch(folio, pte, ptent, max_nr);
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
 				       addr, details, rss, force_flush,
 				       force_break, any_skipped);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2a25eedc3b1c0..eb83cff7db8c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -711,8 +711,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!folio || folio_is_zone_device(folio))
 			continue;
 		if (folio_test_large(folio) && max_nr != 1)
-			nr = folio_pte_batch(folio, addr, pte, ptent,
-					     max_nr, 0, NULL, NULL, NULL);
+			nr = folio_pte_batch(folio, pte, ptent, max_nr);
 		/*
 		 * vm_normal_folio() filters out zero pages, but there might
 		 * still be reserved folios to skip, perhaps in a VDSO.
diff --git a/mm/mlock.c b/mm/mlock.c
index 2238cdc5eb1c1..a1d93ad33c6db 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -313,8 +313,7 @@ static inline unsigned int folio_mlock_step(struct folio *folio,
 	if (!folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, pte, ptent, count, 0, NULL,
-			       NULL, NULL);
+	return folio_pte_batch(folio, pte, ptent, count);
 }
 
 static inline bool allow_mlock_munlock(struct folio *folio,
diff --git a/mm/mremap.c b/mm/mremap.c
index d4d3ffc931502..1f5bebbb9c0cb 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -182,8 +182,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr
 	if (!folio || !folio_test_large(folio))
 		return 1;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       NULL, NULL);
+	return folio_pte_batch(folio, ptep, pte, max_nr);
 }
 
 static int move_ptes(struct pagetable_move_control *pmc,
diff --git a/mm/rmap.c b/mm/rmap.c
index a29d7d29c7283..6658968600b72 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1859,8 +1859,7 @@ static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
 	if (pte_pfn(pte) != folio_pfn(folio))
 		return false;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, 0, NULL,
-			       NULL, NULL) == max_nr;
+	return folio_pte_batch(folio, ptep, pte, max_nr) == max_nr;
 }
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 0b270c43d7d12..d29dcc135ad28 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1171,3 +1171,32 @@ int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 EXPORT_SYMBOL(compat_vma_mmap_prepare);
+
+#ifdef CONFIG_MMU
+/**
+ * folio_pte_batch - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @max_nr: The maximum number of table entries to consider.
+ *
+ * This is a simplified variant of folio_pte_batch_ext().
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same large folio in a single VMA and a single page table.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
+ * the accessed bit, writable bit, dirty bit and soft-dirty bit.
+ *
+ * ptep must map any page of the folio. max_nr must be at least one and
+ * must be limited by the caller so scanning cannot exceed a single VMA and
+ * a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
+		unsigned int max_nr)
+{
+	return folio_pte_batch_ext(folio, ptep, pte, max_nr, 0, NULL, NULL, NULL);
+}
+#endif /* CONFIG_MMU */
-- 
2.49.0
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Oscar Salvador 3 months, 1 week ago
On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
> 
> So let's provide a basic, non-inlined folio_pte_batch().
> 
> In zap_present_ptes(), where we care about performance, the compiler
> already seem to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
> 
> While at it, drop the "addr" parameter that is unused.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

FWIW, folio_pte_batch_flags seems more appealing to me as well.

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Oscar Salvador 3 months, 1 week ago
On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
> 
> So let's provide a basic, non-inlined folio_pte_batch().
> 
> In zap_present_ptes(), where we care about performance, the compiler
> already seem to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
> 
> While at it, drop the "addr" parameter that is unused.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

So, let me see if I get this.

Before this, every other user that doesn't use the extra flags (dirty,
etc...) will end up with the code, optimized out, inlined within its body?

With this change, a single function, folio_pte_batch(), identical to folio_pte_batch_ext
but without the runtime checks for those arguments will be created (folio_pte_batch()),
and so the users of it won't have it inlined in their body ?


-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by David Hildenbrand 3 months, 1 week ago
On 02.07.25 11:02, Oscar Salvador wrote:
> On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
>> Many users (including upcoming ones) don't really need the flags etc,
>> and can live with a function call.
>>
>> So let's provide a basic, non-inlined folio_pte_batch().
>>
>> In zap_present_ptes(), where we care about performance, the compiler
>> already seem to generate a call to a common inlined folio_pte_batch()
>> variant, shared with fork() code. So calling the new non-inlined variant
>> should not make a difference.
>>
>> While at it, drop the "addr" parameter that is unused.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> So, let me see if I get this.
> 
> Before this, every other user that doesn't use the extra flags (dirty,
> etc...) will end up with the code, optimized out, inlined within its body?

Not necessarily inlined into the body (there might still be a function 
call, depending on what the compiler decides), but inlined into the 
object file and optimized by propagating constants.

> 
> With this change, a single function, folio_pte_batch(), identical to folio_pte_batch_ext
> but without the runtime checks for those arguments will be created (folio_pte_batch()),
> and so the users of it won't have it inlined in their body ?

Right. We have a single folio_pte_batch() that is optimized by 
propagating all constants. Instead of having one per object file, we 
have a single shared one.

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Oscar Salvador 3 months, 1 week ago
On Wed, Jul 02, 2025 at 11:05:17AM +0200, David Hildenbrand wrote:
> Not necessarily inlined into the body (there might still be a function call,
> depending on what the compiler decides), but inlined into the object file
> and optimized by propagating constants.

I see.

> > With this change, a single function, folio_pte_batch(), identical to folio_pte_batch_ext
> > but without the runtime checks for those arguments will be created (folio_pte_batch()),
> > and so the users of it won't have it inlined in their body ?
> 
> Right. We have a single folio_pte_batch() that is optimized by propagating
> all constants. Instead of having one per object file, we have a single
> shared one.

Alright, clear to me now, thanks for clarifying ;-)!

 

-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by David Hildenbrand 3 months, 1 week ago
On 02.07.25 11:07, Oscar Salvador wrote:
> On Wed, Jul 02, 2025 at 11:05:17AM +0200, David Hildenbrand wrote:
>> Not necessarily inlined into the body (there might still be a function call,
>> depending on what the compiler decides), but inlined into the object file
>> and optimized by propagating constants.
> 
> I see.
> 
>>> With this change, a single function, folio_pte_batch(), identical to folio_pte_batch_ext
>>> but without the runtime checks for those arguments will be created (folio_pte_batch()),
>>> and so the users of it won't have it inlined in their body ?
>>
>> Right. We have a single folio_pte_batch() that is optimized by propagating
>> all constants. Instead of having one per object file, we have a single
>> shared one.
> 
> Alright, clear to me now, thanks for claryfing ;-)!

Will clarify that in the patch description, thanks!

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Zi Yan 3 months, 1 week ago
On 27 Jun 2025, at 7:55, David Hildenbrand wrote:

> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch().
>
> In zap_present_ptes(), where we care about performance, the compiler
> already seem to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
>
> While at it, drop the "addr" parameter that is unused.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h  | 11 ++++++++---
>  mm/madvise.c   |  4 ++--
>  mm/memory.c    |  6 ++----
>  mm/mempolicy.c |  3 +--
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  mm/util.c      | 29 +++++++++++++++++++++++++++++
>  8 files changed, 45 insertions(+), 17 deletions(-)
>
Nice cleanup. Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch().

Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?

>
> In zap_present_ptes(), where we care about performance, the compiler
> already seem to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.
>
> While at it, drop the "addr" parameter that is unused.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Other than the query above + nit on name below, this is really nice!

> ---
>  mm/internal.h  | 11 ++++++++---
>  mm/madvise.c   |  4 ++--
>  mm/memory.c    |  6 ++----
>  mm/mempolicy.c |  3 +--
>  mm/mlock.c     |  3 +--
>  mm/mremap.c    |  3 +--
>  mm/rmap.c      |  3 +--
>  mm/util.c      | 29 +++++++++++++++++++++++++++++
>  8 files changed, 45 insertions(+), 17 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index ca6590c6d9eab..6000b683f68ee 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>  }
>
>  /**
> - * folio_pte_batch - detect a PTE batch for a large folio
> + * folio_pte_batch_ext - detect a PTE batch for a large folio
>   * @folio: The large folio to detect a PTE batch for.
> - * @addr: The user virtual address the first page is mapped at.
>   * @ptep: Page table pointer for the first entry.
>   * @pte: Page table entry for the first page.
>   * @max_nr: The maximum number of table entries to consider.
> @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * must be limited by the caller so scanning cannot exceed a single VMA and
>   * a single page table.
>   *
> + * This function will be inlined to optimize based on the input parameters;
> + * consider using folio_pte_batch() instead if applicable.
> + *
>   * Return: the number of table entries in the batch.
>   */
> -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>  		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>  		bool *any_writable, bool *any_young, bool *any_dirty)

Sorry this is really really annoying feedback :P but _ext() makes me think of
page_ext and ugh :))

Wonder if __folio_pte_batch() is better?

This is obviously, not a big deal (TM)

> [...]
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by David Hildenbrand 3 months, 1 week ago
On 27.06.25 20:48, Lorenzo Stoakes wrote:
> On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
>> Many users (including upcoming ones) don't really need the flags etc,
>> and can live with a function call.
>>
>> So let's provide a basic, non-inlined folio_pte_batch().
> 
> Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?

We want to always generate a function that uses as few runtime checks
as possible. Essentially, optimize out the "flags" as much as possible.

In case of folio_pte_batch(), where we won't use any flags, any checks 
will be optimized out by the compiler.

So we get a single, specialized, non-inlined function.
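
A minimal userspace analogue of that (illustrative names, not the actual
kernel implementation): because the non-inlined wrapper hard-codes
flags == 0 and a NULL out-pointer, constant propagation lets the compiler
prove the per-iteration check dead and emit one lean, specialized function.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BATCH_TRACK_DIRTY	0x1u

/* Header-style inline helper with optional work behind flags/out-pointer. */
static inline unsigned int batch_ext(const unsigned int *pfns,
				     unsigned int max_nr, unsigned int flags,
				     bool *any_dirty)
{
	unsigned int nr = 1;

	while (nr < max_nr && pfns[nr] == pfns[0] + nr) {
		/*
		 * With flags == 0 and any_dirty == NULL propagated in from
		 * batch() below, this branch is compiled out entirely.
		 */
		if ((flags & BATCH_TRACK_DIRTY) && any_dirty && (pfns[nr] & 1))
			*any_dirty = true;
		nr++;
	}
	return nr;
}

/* The non-inlined wrapper: one specialized copy, no runtime flag checks. */
unsigned int batch(const unsigned int *pfns, unsigned int max_nr)
{
	return batch_ext(pfns, max_nr, 0, NULL);
}

int main(void)
{
	unsigned int pfns[] = { 100, 101, 102, 200 };

	printf("%u\n", batch(pfns, 4));	/* prints 3 */
	return 0;
}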

> 
>>
>> In zap_present_ptes(), where we care about performance, the compiler
>> already seem to generate a call to a common inlined folio_pte_batch()
>> variant, shared with fork() code. So calling the new non-inlined variant
>> should not make a difference.
>>
>> While at it, drop the "addr" parameter that is unused.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Other than the query above + nit on name below, this is really nice!
> 
>> ---
>>   mm/internal.h  | 11 ++++++++---
>>   mm/madvise.c   |  4 ++--
>>   mm/memory.c    |  6 ++----
>>   mm/mempolicy.c |  3 +--
>>   mm/mlock.c     |  3 +--
>>   mm/mremap.c    |  3 +--
>>   mm/rmap.c      |  3 +--
>>   mm/util.c      | 29 +++++++++++++++++++++++++++++
>>   8 files changed, 45 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index ca6590c6d9eab..6000b683f68ee 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   }
>>
>>   /**
>> - * folio_pte_batch - detect a PTE batch for a large folio
>> + * folio_pte_batch_ext - detect a PTE batch for a large folio
>>    * @folio: The large folio to detect a PTE batch for.
>> - * @addr: The user virtual address the first page is mapped at.
>>    * @ptep: Page table pointer for the first entry.
>>    * @pte: Page table entry for the first page.
>>    * @max_nr: The maximum number of table entries to consider.
>> @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>    * must be limited by the caller so scanning cannot exceed a single VMA and
>>    * a single page table.
>>    *
>> + * This function will be inlined to optimize based on the input parameters;
>> + * consider using folio_pte_batch() instead if applicable.
>> + *
>>    * Return: the number of table entries in the batch.
>>    */
>> -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
>> +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
>>   		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
>>   		bool *any_writable, bool *any_young, bool *any_dirty)
> 
> Sorry this is really really annoying feedback :P but _ext() makes me think of
> page_ext and ugh :))
> 
> Wonder if __folio_pte_batch() is better?
> 
> This is obviously, not a big deal (TM)

Obviously, I had that as part of the development, and decided against it 
at some point. :)

Yeah, _ext() is not common in MM yet, in contrast to other subsystems. 
The only user is indeed page_ext. On arm we seem to have set_pte_ext(). 
But it's really "page_ext", that's the problematic part, not "_ext" :P

No strong opinion, but I tend to dislike here "__", because often it 
means "internal helper you're not supposed to used", which isn't really 
the case here.

E.g.,

alloc_frozen_pages() -> alloc_frozen_pages_noprof() -> 
__alloc_frozen_pages_noprof()

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Lorenzo Stoakes 3 months, 1 week ago
On Mon, Jun 30, 2025 at 11:19:13AM +0200, David Hildenbrand wrote:
> On 27.06.25 20:48, Lorenzo Stoakes wrote:
> > On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
> > > Many users (including upcoming ones) don't really need the flags etc,
> > > and can live with a function call.
> > >
> > > So let's provide a basic, non-inlined folio_pte_batch().
> >
> > Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?
>
> We want to always generate a function that uses as little runtime checks as
> possible. Essentially, optimize out the "flags" as much as possible.
>
> In case of folio_pte_batch(), where we won't use any flags, any checks will
> be optimized out by the compiler.
>
> So we get a single, specialized, non-inlined function.

I mean I suppose code bloat is a thing too. Would the compiler not also optimise
out checks if it were inlined though?

>
> >
> > >
> > > In zap_present_ptes(), where we care about performance, the compiler
> > > already seem to generate a call to a common inlined folio_pte_batch()
> > > variant, shared with fork() code. So calling the new non-inlined variant
> > > should not make a difference.
> > >
> > > While at it, drop the "addr" parameter that is unused.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > Other than the query above + nit on name below, this is really nice!
> >
> > > ---
> > >   mm/internal.h  | 11 ++++++++---
> > >   mm/madvise.c   |  4 ++--
> > >   mm/memory.c    |  6 ++----
> > >   mm/mempolicy.c |  3 +--
> > >   mm/mlock.c     |  3 +--
> > >   mm/mremap.c    |  3 +--
> > >   mm/rmap.c      |  3 +--
> > >   mm/util.c      | 29 +++++++++++++++++++++++++++++
> > >   8 files changed, 45 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index ca6590c6d9eab..6000b683f68ee 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -218,9 +218,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >   }
> > >
> > >   /**
> > > - * folio_pte_batch - detect a PTE batch for a large folio
> > > + * folio_pte_batch_ext - detect a PTE batch for a large folio
> > >    * @folio: The large folio to detect a PTE batch for.
> > > - * @addr: The user virtual address the first page is mapped at.
> > >    * @ptep: Page table pointer for the first entry.
> > >    * @pte: Page table entry for the first page.
> > >    * @max_nr: The maximum number of table entries to consider.
> > > @@ -243,9 +242,12 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >    * must be limited by the caller so scanning cannot exceed a single VMA and
> > >    * a single page table.
> > >    *
> > > + * This function will be inlined to optimize based on the input parameters;
> > > + * consider using folio_pte_batch() instead if applicable.
> > > + *
> > >    * Return: the number of table entries in the batch.
> > >    */
> > > -static inline unsigned int folio_pte_batch(struct folio *folio, unsigned long addr,
> > > +static inline unsigned int folio_pte_batch_ext(struct folio *folio,
> > >   		pte_t *ptep, pte_t pte, unsigned int max_nr, fpb_t flags,
> > >   		bool *any_writable, bool *any_young, bool *any_dirty)
> >
> > Sorry this is really really annoying feedback :P but _ext() makes me think of
> > page_ext and ugh :))
> >
> > Wonder if __folio_pte_batch() is better?
> >
> > This is obviously, not a big deal (TM)
>
> Obviously, I had that as part of the development, and decided against it at
> some point. :)
>
> Yeah, _ext() is not common in MM yet, in contrast to other subsystems. The
> only user is indeed page_ext. On arm we seem to have set_pte_ext(). But it's
> really "page_ext", that's the problematic part, not "_ext" :P
>
> No strong opinion, but I tend to dislike here "__", because often it means
> "internal helper you're not supposed to used", which isn't really the case
> here.

Yeah, and of course we break this convention all over the place :)

Maybe folio_pte_batch_flags()?

>
> E.g.,
>
> alloc_frozen_pages() -> alloc_frozen_pages_noprof() ->
> __alloc_frozen_pages_noprof()
>
> --
> Cheers,
>
> David / dhildenb
>
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by David Hildenbrand 3 months, 1 week ago
On 30.06.25 12:41, Lorenzo Stoakes wrote:
> On Mon, Jun 30, 2025 at 11:19:13AM +0200, David Hildenbrand wrote:
>> On 27.06.25 20:48, Lorenzo Stoakes wrote:
>>> On Fri, Jun 27, 2025 at 01:55:09PM +0200, David Hildenbrand wrote:
>>>> Many users (including upcoming ones) don't really need the flags etc,
>>>> and can live with a function call.
>>>>
>>>> So let's provide a basic, non-inlined folio_pte_batch().
>>>
>>> Hm, but why non-inlined, when it invokes an inlined function? Seems odd no?
>>
>> We want to always generate a function that uses as little runtime checks as
>> possible. Essentially, optimize out the "flags" as much as possible.
>>
>> In case of folio_pte_batch(), where we won't use any flags, any checks will
>> be optimized out by the compiler.
>>
>> So we get a single, specialized, non-inlined function.
> 
> I mean I suppose code bloat is a thing too. Would the compiler not also optimise
> out checks if it were inlined though?

The compiler will optimize all (most) inlined variants, yes.

But before this change, we end up creating the same optimized variant
for each folio_pte_batch() user.

And as Andrew put it

"And why the heck is folio_pte_batch() inlined?  It's larger then my 
first hard disk" [1]

I should probably add a suggested-by + link to that discussion.

[1] 
https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/

[...]

>>
>> Obviously, I had that as part of the development, and decided against it at
>> some point. :)
>>
>> Yeah, _ext() is not common in MM yet, in contrast to other subsystems. The
>> only user is indeed page_ext. On arm we seem to have set_pte_ext(). But it's
>> really "page_ext", that's the problematic part, not "_ext" :P
>>
>> No strong opinion, but I tend to dislike here "__", because often it means
>> "internal helper you're not supposed to used", which isn't really the case
>> here.
> 
> Yeah, and of course we break this convention all over the place :)
> 
> Maybe folio_pte_batch_flags()?

Works for me as well.

-- 
Cheers,

David / dhildenb
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Lance Yang 3 months, 1 week ago
On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> Many users (including upcoming ones) don't really need the flags etc,
> and can live with a function call.
>
> So let's provide a basic, non-inlined folio_pte_batch().
>
> In zap_present_ptes(), where we care about performance, the compiler
> already seem to generate a call to a common inlined folio_pte_batch()
> variant, shared with fork() code. So calling the new non-inlined variant
> should not make a difference.

It's always an interesting dance with the compiler when it comes to inlining,
isn't it? We want the speed of 'inline' for critical paths, but also a compact
binary for the common case ...

This split is a nice solution to the classic 'inline' vs. code size dilemma ;p

Thanks,
Lance

> [...]
Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by David Hildenbrand 3 months, 1 week ago
On 27.06.25 16:19, Lance Yang wrote:
> On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> Many users (including upcoming ones) don't really need the flags etc,
>> and can live with a function call.
>>
>> So let's provide a basic, non-inlined folio_pte_batch().
>>
>> In zap_present_ptes(), where we care about performance, the compiler
>> already seem to generate a call to a common inlined folio_pte_batch()
>> variant, shared with fork() code. So calling the new non-inlined variant
>> should not make a difference.
> 
> It's always an interesting dance with the compiler when it comes to inlining,
> isn't it? We want the speed of 'inline' for critical paths, but also a compact
> binary for the common case ...
> 
> This split is a nice solution to the classic 'inline' vs. code size dilemma ;p

Yeah, in particular when we primarily care about optimizing out all the 
unnecessary checks inside the function, not necessarily also inlining 
the function call itself.

If we ever realize we absolutely must inline it into a caller, we could
turn folio_pte_batch_ext() into an "__always_inline", but for now it 
does not seem like this is really required from my experiments.
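
For reference, that escape hatch would just be an annotation on the helper;
a hypothetical userspace sketch (the kernel's __always_inline boils down to
this attribute):

#include <stdio.h>

#define __always_inline	inline __attribute__((__always_inline__))

/*
 * Hypothetical: marking the helper __always_inline would force per-caller
 * inlining (and per-caller specialization) again, trading code size for
 * removing even the shared function call.
 */
static __always_inline unsigned int batch_ext(const unsigned int *pfns,
					      unsigned int max_nr)
{
	unsigned int nr = 1;

	while (nr < max_nr && pfns[nr] == pfns[0] + nr)
		nr++;
	return nr;
}

int main(void)
{
	unsigned int pfns[] = { 7, 8, 9, 42 };

	printf("%u\n", batch_ext(pfns, 4));	/* prints 3 */
	return 0;
}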

-- 
Cheers,

David / dhildenb

Re: [PATCH v1 3/4] mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_ext()
Posted by Lance Yang 3 months, 1 week ago

On 2025/6/27 23:09, David Hildenbrand wrote:
> On 27.06.25 16:19, Lance Yang wrote:
>> On Fri, Jun 27, 2025 at 7:55 PM David Hildenbrand <david@redhat.com> 
>> wrote:
>>>
>>> Many users (including upcoming ones) don't really need the flags etc,
>>> and can live with a function call.
>>>
>>> So let's provide a basic, non-inlined folio_pte_batch().
>>>
>>> In zap_present_ptes(), where we care about performance, the compiler
>>> already seem to generate a call to a common inlined folio_pte_batch()
>>> variant, shared with fork() code. So calling the new non-inlined variant
>>> should not make a difference.
>>
>> It's always an interesting dance with the compiler when it comes to 
>> inlining,
>> isn't it? We want the speed of 'inline' for critical paths, but also a 
>> compact
>> binary for the common case ...
>>
>> This split is a nice solution to the classic 'inline' vs. code size 
>> dilemma ;p
> 
> Yeah, in particular when we primarily care about optimizing out all the 
> unnecessary checks inside the function, not necessarily also inlining 
> the function call itself.
> 
> If we ever realize we absolute must inline it into a caller, we could 
> turn folio_pte_batch_ext() into an "__always_inline", but for now it 
> does not seem like this is really required from my experiments.

Right, that makes sense. No need to force "__always_inline" prematurely.