[PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Dev Jain 7 months ago
Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
Architectures can override these helpers; if they do not, the helpers are
implemented as a simple loop over the corresponding single-pte helpers.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c           |  4 +--
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..e40ed57e034d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+/**
+ * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
+ * over a batch of ptes, which protects against asynchronous hardware modifications
+ * to the ptes. The intention is not to prevent the hardware from making pte
+ * updates, but to prevent any updates it may make from being lost.
+ * Please see the comment above ptep_modify_prot_start() for full description.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
+ * folio.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_modify_prot_start(vma, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+#endif
+
+/**
+ * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
+ * hardware-controlled bits in the PTE unmodified.
+ *
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_modify_prot_commit().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; ++i) {
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+		ptep++;
+		addr += PAGE_SIZE;
+		old_pte = pte_next_pfn(old_pte);
+		pte = pte_next_pfn(pte);
+	}
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 /*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1ee160ed0b14..124612ce3d24 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 						jiffies_to_msecs(jiffies));
 			}
 
-			oldpte = ptep_modify_prot_start(vma, addr, pte);
+			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
 			ptent = pte_modify(oldpte, newprot);
 
 			if (uffd_wp)
@@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			    can_change_pte_writable(vma, addr, ptent))
 				ptent = pte_mkwrite(ptent, vma);
 
-			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
 			if (pte_needs_flush(oldpte, ptent))
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 			pages++;
-- 
2.30.2
Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
Posted by Ryan Roberts 7 months ago
On 19/05/2025 08:48, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architectures can override these helpers; if they do not, the helpers are
> implemented as a simple loop over the corresponding single-pte helpers.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
>  mm/mprotect.c           |  4 +--
>  2 files changed, 77 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..e40ed57e034d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +/**
> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
> + * over a batch of ptes, which protects against asynchronous hardware modifications

nit: This overflows the 80 char soft limit.

> + * to the ptes. The intention is not to prevent the hardware from making pte
> + * updates, but to prevent any updates it may make from being lost.
> + * Please see the comment above ptep_modify_prot_start() for full description.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
> + * folio.

nit: "mapped folio" is a bit confusing given we are operating on ptes. Perhaps
"collecting the a/d bits from each pte in the batch" is clearer.

> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ.

nit: Perhaps "batch" would be more consistent than "range"?

> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)

I thought David H suggested modify_prot_ptes_start() and
modify_prot_ptes_commit(), which we settled on? I'm personally fine with either
though.

> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {

I thought we agreed to make the loop logic a bit more standard. I don't recall
exactly what was finally agreed, but I would think something like this would be
better:

	for (i = 1; i < nr; i++) {
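
Spelled out, an untested sketch of the same default implementation, just with
the more conventional loop shape (behaviour unchanged):

	static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
			unsigned long addr, pte_t *ptep, unsigned int nr)
	{
		pte_t pte, tmp_pte;
		unsigned int i;

		/* The first pte seeds the accumulated result. */
		pte = ptep_modify_prot_start(vma, addr, ptep);
		for (i = 1; i < nr; i++) {
			ptep++;
			addr += PAGE_SIZE;
			tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
			/* Collect the a/d bits from each pte in the batch. */
			if (pte_dirty(tmp_pte))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp_pte))
				pte = pte_mkyoung(pte);
		}
		return pte;
	}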

> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/**
> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
> + * hardware-controlled bits in the PTE unmodified.
> + *
> + * @vma: The virtual memory area the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.

You've missed pte and old_pte params here.

> + * @nr: Number of entries.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_modify_prot_commit().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ.

How can it? All the applied bits other than the PFN will be exactly the same for
the range because they all come from pte. I think this line can be dropped.

> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.

The middle sentence doesn't apply; the PTEs will all initially be none if using
the default version of modify_prot_start_ptes(). I think that can be dropped.
But I think you need to explain that this will be the case on exit.

> + */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; ++i) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif

I have some general concerns about the correctness of batching these functions.
The support was originally added by Commit 1ea0704e0da6 ("mm: add a
ptep_modify_prot transaction abstraction"), and the intent was to make it easier
to defer the pte updates for XEN on x86.

Your default implementations of the batched versions will match the number of
ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
of the batch used for modify_prot_start_ptes(). That's a requirement and you've
met it. But in the batched case, there are 2 differences;

  - You can now have multiple PTEs within a start-commit block at one time. I
hope none of the specialized implementations care about that (i.e. XEN).

  - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
and according to your docs "PTE bits in the PTE range besides the PFN can
differ" when calling modify_prot_start_ptes() so R/W and other things could
differ here.
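
For illustration (a hypothetical caller fragment, not from this series,
assuming change_pte_range()-style locals), the batched semantics now permit a
sequence like:

	old_pte = modify_prot_start_ptes(vma, addr, ptep, 4);
	ptent = pte_modify(old_pte, newprot);

	/* Commit the first two entries of the batch... */
	modify_prot_commit_ptes(vma, addr, ptep, old_pte, ptent, 2);

	/* ...then the rest, with the pfns advanced to match. */
	old_pte = pte_next_pfn(pte_next_pfn(old_pte));
	ptent = pte_next_pfn(pte_next_pfn(ptent));
	modify_prot_commit_ptes(vma, addr + 2 * PAGE_SIZE, ptep + 2,
				old_pte, ptent, 2);

The underlying start/commit call counts still match, but all four ptes sit
inside the transaction at once and the commits arrive as sub-batches.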

I'm not sure if these are problems in practice; they probably are not. But have
you checked the XEN implementation (and any other specialized implementations)
are definitely compatible with your batched semantics?

Thanks,
Ryan

> +
>  #endif /* CONFIG_MMU */
>  
>  /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1ee160ed0b14..124612ce3d24 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  						jiffies_to_msecs(jiffies));
>  			}
>  
> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>  			ptent = pte_modify(oldpte, newprot);
>  
>  			if (uffd_wp)
> @@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>  			    can_change_pte_writable(vma, addr, ptent))
>  				ptent = pte_mkwrite(ptent, vma);
>  
> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>  			if (pte_needs_flush(oldpte, ptent))
>  				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>  			pages++;
Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
Posted by Dev Jain 6 months ago
On 21/05/25 4:46 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architectures can override these helpers; if they do not, the helpers are
>> implemented as a simple loop over the corresponding single-pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
>>   mm/mprotect.c           |  4 +--
>>   2 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..e40ed57e034d 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>>   }
>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>> +
>> +/**
>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
>> + * over a batch of ptes, which protects against asynchronous hardware modifications
> nit: This overflows the 80 char soft limit.
>
>> + * to the ptes. The intention is not to prevent the hardware from making pte
>> + * updates, but to prevent any updates it may make from being lost.
>> + * Please see the comment above ptep_modify_prot_start() for full description.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
>> + * folio.
> nit: "mapped folio" is a bit confusing given we are operating on ptes. Perhaps
> "collecting the a/d bits from each pte in the batch" is clearer.
>
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> nit: Perhaps "batch" would be more consistent than "range"?
>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> I thought David H suggested modify_prot_ptes_start() and
> modify_prot_ptes_commit(), which we settled on? I'm personally fine with either
> though.

Nothing was decided upon strongly. I will personally keep it the same for the
sake of consistency.

>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
> I thought we agreed to make the loop logic a bit more standard. I don't recall
> exactly what was finally agreed, but I would think something like this would be
> better:

Again, nothing was particularly decided on as far as I remember : ) Let us
keep it the same - https://lore.kernel.org/all/d048366b-eb6a-4fea-9b60-af834182b1b9@redhat.com/

> 	for (i = 1; i < nr; i++) {
>
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/**
>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>> + * hardware-controlled bits in the PTE unmodified.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
> You've missed pte and old_pte params here.
>
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_commit().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> How can it? All the applied bits other than the PFN will be exactly the same for
> the range because they all come from pte. I think this line can be dropped.
>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> The middle sentence doesn't apply; the PTEs will all initially be none if using
> the default version of modify_prot_start_ptes(). I think that can be dropped.
> But I think you need to explain that this will be the case on exit.
>
>> + */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; ++i) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
> I have some general concerns about the correctness of batching these functions.
> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
> to defer the pte updates for XEN on x86.
>
> Your default implementations of the batched versions will match the number of
> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
> met it. But in the batched case, there are 2 differences;
>
>    - You can now have multiple PTEs within a start-commit block at one time. I
> hope none of the specialized implementations care about that (i.e. XEN).
>
>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
> and according to your docs "PTE bits in the PTE range besides the PFN can
> differ" when calling modify_prot_start_ptes() so R/W and other things could
> differ here.
>
> I'm not sure if these are problems in practice; they probably are not. But have
> you checked the XEN implementation (and any other specialized implementations)
> are definitely compatible with your batched semantics?
>
> Thanks,
> Ryan
>
>> +
>>   #endif /* CONFIG_MMU */
>>   
>>   /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 1ee160ed0b14..124612ce3d24 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>>   			if (uffd_wp)
>> @@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			    can_change_pte_writable(vma, addr, ptent))
>>   				ptent = pte_mkwrite(ptent, vma);
>>   
>> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>   			if (pte_needs_flush(oldpte, ptent))
>>   				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>   			pages++;
Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
Posted by Dev Jain 7 months ago
On 21/05/25 4:46 pm, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architectures can override these helpers; if they do not, the helpers are
>> implemented as a simple loop over the corresponding single-pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 75 +++++++++++++++++++++++++++++++++++++++++
>>   mm/mprotect.c           |  4 +--
>>   2 files changed, 77 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..e40ed57e034d 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1333,6 +1333,81 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>>   	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>>   }
>>   #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>> +
>> +/**
>> + * modify_prot_start_ptes - Start a pte protection read-modify-write transaction
>> + * over a batch of ptes, which protects against asynchronous hardware modifications
> nit: This overflows the 80 char soft limit.
>
>> + * to the ptes. The intention is not to prevent the hardware from making pte
>> + * updates, but to prevent any updates it may make from being lost.
>> + * Please see the comment above ptep_modify_prot_start() for full description.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_start(), collecting the a/d bits of the mapped
>> + * folio.
> nit: "mapped folio" is a bit confusing given we are operating on ptes. Perhaps
> "collecting the a/d bits from each pte in the batch" is clearer.


Sure.


>
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> nit: Perhaps "batch" would be more consistent than "range"?


Sure.


>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> I thought David H suggested modify_prot_ptes_start() and
> modify_prot_ptes_commit(), which we settled on? I'm personally fine with either
> though.


No strong opinion, I'll do that.


>
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
> I thought we agreed to make the loop logic a bit more standard. I don't recall
> exactly what was finally agreed, but I would think something like this would be
> better:
>
> 	for (i = 1; i < nr; i++) {


Sure.


>
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/**
>> + * modify_prot_commit_ptes - Commit an update to a batch of ptes, leaving any
>> + * hardware-controlled bits in the PTE unmodified.
>> + *
>> + * @vma: The virtual memory area the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
> You've missed pte and old_pte params here.


My bad.


>
>> + * @nr: Number of entries.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_modify_prot_commit().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ.
> How can it? All the applied bits other than the PFN will be exactly the same for
> the range because they all come from pte. I think this line can be dropped.


Copy pasted, then forgot to remove :)


>
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> The middle sentence doesn't apply; the PTEs will all initially be none if using
> the default version of modify_prot_start_ptes(). I think that can be dropped.
> But I think you need to explain that this will be the case on exit.


Ah got it. "On exit, the set ptes will map the same folio."


>
>> + */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr; ++i) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
> I have some general concerns about the correctness of batching these functions.
> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
> to defer the pte updates for XEN on x86.
>
> Your default implementations of the batched versions will match the number of
> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
> met it. But in the batched case, there are 2 differences;
>
>    - You can now have multiple PTEs within a start-commit block at one time. I
> hope none of the specialized implementations care about that (i.e. XEN).
>
>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
> and according to your docs "PTE bits in the PTE range besides the PFN can
> differ" when calling modify_prot_start_ptes() so R/W and other things could
> differ here.
>
> I'm not sure if these are problems in practice; they probably are not. But have
> you checked the XEN implementation (and any other specialized implementations)
> are definitely compatible with your batched semantics?
>
> Thanks,
> Ryan
>
>> +
>>   #endif /* CONFIG_MMU */
>>   
>>   /*
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 1ee160ed0b14..124612ce3d24 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -188,7 +188,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   						jiffies_to_msecs(jiffies));
>>   			}
>>   
>> -			oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
>>   			ptent = pte_modify(oldpte, newprot);
>>   
>>   			if (uffd_wp)
>> @@ -214,7 +214,7 @@ static long change_pte_range(struct mmu_gather *tlb,
>>   			    can_change_pte_writable(vma, addr, ptent))
>>   				ptent = pte_mkwrite(ptent, vma);
>>   
>> -			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> +			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
>>   			if (pte_needs_flush(oldpte, ptent))
>>   				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>   			pages++;
Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
Posted by Ryan Roberts 7 months ago
On 21/05/2025 12:16, Ryan Roberts wrote:
> On 19/05/2025 08:48, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architectures can override these helpers; if they do not, the helpers are
>> implemented as a simple loop over the corresponding single-pte helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>

[...]

> 
> I have some general concerns about the correctness of batching these functions.
> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
> to defer the pte updates for XEN on x86.
> 
> Your default implementations of the batched versions will match the number of
> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
> met it. But in the batched case, there are 2 differences;
> 
>   - You can now have multiple PTEs within a start-commit block at one time. I
> hope none of the specialized implementations care about that (i.e. XEN).

I had a look; this isn't a problem.

> 
>   - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
> and according to your docs "PTE bits in the PTE range besides the PFN can
> differ" when calling modify_prot_start_ptes() so R/W and other things could
> differ here.

It looks like powerpc will break if you provide old_pte which has different
permissions to the "real" old_pte, see radix__ptep_modify_prot_commit(). So I
think you need to at least spec modify_prot_start_ptes() to require that all
bits of the PTE except the PFN, access and dirty are identical. And perhaps you
can VM_WARN if found to be otherwise? And perhaps modify
ptep_modify_prot_commit()'s documentation to explicitly allow old_pte's
access/dirty to be "upgraded" from what was actually read in
ptep_modify_prot_start()?
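
As an untested sketch (assuming the generic pte_mkold()/pte_mkclean() helpers
are acceptable for masking off the a/d bits before comparing), the default
modify_prot_start_ptes() loop could encode that check like so:

	pte = ptep_modify_prot_start(vma, addr, ptep);
	expected = pte;
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		expected = pte_next_pfn(expected);
		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
		/* All bits except PFN and a/d must match the first pte. */
		VM_WARN_ON(pte_val(pte_mkold(pte_mkclean(tmp_pte))) !=
			   pte_val(pte_mkold(pte_mkclean(expected))));
		if (pte_dirty(tmp_pte))
			pte = pte_mkdirty(pte);
		if (pte_young(tmp_pte))
			pte = pte_mkyoung(pte);
	}
	return pte;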

XEN/x86 and arm64 don't care about old_pte.

Thanks,
Ryan

> 
> I'm not sure if these are problems in practice; they probably are not. But have
> you checked the XEN implementation (and any other specialized implementations)
> are definitely compatible with your batched semantics?
>
Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
Posted by Dev Jain 7 months ago
On 21/05/25 5:15 pm, Ryan Roberts wrote:
> On 21/05/2025 12:16, Ryan Roberts wrote:
>> On 19/05/2025 08:48, Dev Jain wrote:
>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>> Architectures can override these helpers; if they do not, the helpers are
>>> implemented as a simple loop over the corresponding single-pte helpers.
>>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> [...]
>
>> I have some general concerns about the correctness of batching these functions.
>> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
>> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
>> to defer the pte updates for XEN on x86.
>>
>> Your default implementations of the batched versions will match the number of
>> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
>> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
>> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
>> met it. But in the batched case, there are 2 differences;
>>
>>    - You can now have multiple PTEs within a start-commit block at one time. I
>> hope none of the specialized implementations care about that (i.e. XEN).
> I had a look; this isn't a problem.
>
>>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
>> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
>> and according to your docs "PTE bits in the PTE range besides the PFN can
>> differ" when calling modify_prot_start_ptes() so R/W and other things could
>> differ here.
> It looks like powerpc will break if you provide old_pte which has different
> permissions to the "real" old_pte, see radix__ptep_modify_prot_commit(). So I
> think you need to at least spec modify_prot_start_ptes() to require that all
> bits of the PTE except the PFN, access and dirty are identical. And perhaps you
> can VM_WARN if found to be otherwise? And perhaps modify
> ptep_modify_prot_commit()'s documentation to explicitly allow old_pte's
> access/dirty to be "upgraded" from what was actually read in
> ptep_modify_prot_start()?


Got it, so we just need to document that the permissions for all ptes must be
identical when using modify_prot_start_ptes(), and that we may be smearing
extra a/d bits in modify_prot_commit_ptes().


>
> XEN/x86 and arm64 don't care about old_pte.
>
> Thanks,
> Ryan
>
>> I'm not sure if these are problems in practice; they probably are not. But have
>> you checked the XEN implementation (and any other specialized implementations)
>> are definitely compatible with your batched semantics?
>>
Re: [PATCH v3 2/5] mm: Add batched versions of ptep_modify_prot_start/commit
Posted by Ryan Roberts 7 months ago
On 22/05/2025 07:33, Dev Jain wrote:
> 
> On 21/05/25 5:15 pm, Ryan Roberts wrote:
>> On 21/05/2025 12:16, Ryan Roberts wrote:
>>> On 19/05/2025 08:48, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architectures can override these helpers; if they do not, the helpers are
>>>> implemented as a simple loop over the corresponding single-pte helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> [...]
>>
>>> I have some general concerns about the correctness of batching these functions.
>>> The support was originally added by Commit 1ea0704e0da6 ("mm: add a
>>> ptep_modify_prot transaction abstraction"), and the intent was to make it easier
>>> to defer the pte updates for XEN on x86.
>>>
>>> Your default implementations of the batched versions will match the number of
>>> ptep_modify_prot_start() calls with the same number of ptep_modify_prot_commit()
>>> calls, even if modify_prot_commit_ptes() is called incrementally for sub-batches
>>> of the batch used for modify_prot_start_ptes(). That's a requirement and you've
>>> met it. But in the batched case, there are 2 differences;
>>>
>>>    - You can now have multiple PTEs within a start-commit block at one time. I
>>> hope none of the specialized implementations care about that (i.e. XEN).
>> I had a look; this isn't a problem.
>>
>>>    - when calling ptep_modify_prot_commit(), old_pte may not be exactly what
>>> ptep_modify_prot_start() returned for that pte. You have collected the A/D bits,
>>> and according to your docs "PTE bits in the PTE range besides the PFN can
>>> differ" when calling modify_prot_start_ptes() so R/W and other things could
>>> differ here.
>> It looks like powerpc will break if you provide old_pte which has different
>> permissions to the "real" old_pte, see radix__ptep_modify_prot_commit(). So I
>> think you need to at least spec modify_prot_start_ptes() to require that all
>> bits of the PTE except the PFN, access and dirty are identical. And perhaps you
>> can VM_WARN if found to be otherwise? And perhaps modify
>> ptep_modify_prot_commit()'s documentation to explicitly allow old_pte's
>> access/dirty to be "upgraded" from what was actually read in
>> ptep_modify_prot_start()?
> 
> 
> Got it, so we just need to document that the permissions for all ptes must be
> identical

Not just permissions; all bits (inc SW bits) except PFN and A/D.
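
i.e. the Context: block for modify_prot_start_ptes() could then read something
like (one possible wording, for illustration):

	 * Context: The caller holds the page table lock.  The PTEs map
	 * consecutive pages that belong to the same folio.  The PTEs are
	 * all in the same PMD.  On entry, the PTEs in the batch must be
	 * identical in every bit (including SW bits) except the PFN and
	 * the access/dirty bits.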

> 
> when using modify_prot_start_ptes(), and that we may be smearing extra a/d
> bits in modify_prot_commit_ptes().
> 
> 
>>
>> XEN/x86 and arm64 don't care about old_pte.
>>
>> Thanks,
>> Ryan
>>
>>> I'm not sure if these are problems in practice; they probably are not. But have
>>> you checked the XEN implementation (and any other specialized implementations)
>>> are definitely compatible with your batched semantics?
>>>