Use folio_pte_batch to batch process a large folio. Note that PTE
batching here will save a few function calls, and in certain cases
(not this one) this strategy batches atomic operations in general, so we
get a performance win for all arches. This patch paves the way for patch 7,
which will help us elide the TLBI per contig block on arm64.
The correctness of this patch rests on the correctness of setting the
new ptes based only upon information from the first pte of the batch
(which may also have accumulated a/d bits via modify_prot_start_ptes()).
Observe that the flag combination we pass to mprotect_folio_pte_batch()
guarantees that the batch is uniform w.r.t. the soft-dirty bit and the
writable bit. Therefore, the only bits which may differ are the a/d bits,
so we only need to worry about code which is concerned with the a/d bits
of the PTEs.
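To make the batch-stop rule above concrete, here is a standalone toy model
(plain C, not kernel code): the struct and helper names below are invented,
and the real folio_pte_batch_flags() does more work (for instance, it only
batches ptes mapping consecutive pages of the same folio).

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the pte bits we care about here. */
struct pte_info {
        bool write;
        bool soft_dirty;
        bool young;     /* accessed */
        bool dirty;
};

/*
 * Length of the batch starting at index 0: stop as soon as the write or
 * soft-dirty bit differs from the first entry. The a/d bits are ignored,
 * so they are the only bits that may vary within the returned batch.
 */
static int batch_len(const struct pte_info *ptes, int max_nr)
{
        int nr;

        for (nr = 1; nr < max_nr; nr++) {
                if (ptes[nr].write != ptes[0].write ||
                    ptes[nr].soft_dirty != ptes[0].soft_dirty)
                        break;
        }
        return nr;
}

int main(void)
{
        struct pte_info ptes[4] = {
                { .write = false, .young = true },
                { .write = false, .dirty = true },      /* only a/d differ: keep batching */
                { .write = true },                      /* write bit differs: stop */
                { .write = false },
        };

        printf("batch length: %d\n", batch_len(ptes, 4));       /* prints 2 */
        return 0;
}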
Setting extra a/d bits on the new ptes where they were previously not set
is fine. Setting the access bit when it was not set is not a correctness
problem; it will only possibly delay reclaim of the page mapped by the pte
(which is in fact intended, because the kernel just operated on this
region via mprotect()!). Setting the dirty bit when it was not set is
again not a correctness problem; it will only possibly force an
unnecessary writeback.
So now we need to reason about whether something can go wrong via
can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp,
and userfaultfd_pte_wp cases are handled by the uniformity of the
corresponding bits guaranteed by the flag combination. The ptes all
belong to the same VMA (since callers guarantee that [start, end) will
lie within the VMA), so the conditional based on the VMA is also safe
to batch around.
The dirty bit on the PTE really is just an indication that the folio got
written to, so even if a given PTE is not actually dirty but one of the
PTEs in the batch is, the wp-fault optimization can still be made.
Therefore, it is safe to batch around pte_dirty() in
can_change_shared_pte_writable() (in fact this is better: without
batching, it may happen that some ptes aren't changed to writable just
because they are not dirty, even though the other ptes mapping the same
large folio are dirty).
To batch around the PageAnonExclusive case, we must check the corresponding
condition for every single page. Therefore, from the large folio batch,
we determine a sub-batch of ptes mapping pages with the same
PageAnonExclusive state, process that sub-batch, then determine and
process the next sub-batch, and so on. Note that this does not cause any
extra overhead: if, say, the size of the folio batch is 512, then the
sub-batch processing in total will take 512 iterations, which is the
same as what we would have done before.
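As an illustration of the sub-batching above, a minimal standalone model
(plain C, not the code added by this patch; the names are made up, and a
plain bool array stands in for the per-page PageAnonExclusive() state):

#include <stdbool.h>
#include <stdio.h>

/* Length of the run starting at 'start' whose value matches 'expected'. */
static int sub_batch_len(const bool *anon_excl, int start, int max_len,
                         bool expected)
{
        int idx;

        for (idx = start + 1; idx < start + max_len; idx++) {
                if (anon_excl[idx] != expected)
                        break;
        }
        return idx - start;
}

int main(void)
{
        /* Per-page "anon exclusive" state for a batch of 8 ptes. */
        bool anon_excl[8] = { true, true, false, false, false, true, true, true };
        int nr_ptes = 8, idx = 0;

        /* The total work across all sub-batches is still nr_ptes iterations. */
        while (nr_ptes) {
                bool expected = anon_excl[idx];
                int len = sub_batch_len(anon_excl, idx, nr_ptes, expected);

                printf("sub-batch at %d, len %d, anon-exclusive %d\n",
                       idx, len, expected);
                idx += len;
                nr_ptes -= len;
        }
        return 0;
}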
For pte_needs_flush():
ppc does not care about the a/d bits.
For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits
get cleared; since batching can only set extra a/d bits, at worst we do
an extra flush, and we never elide a flush that we should have done.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 113 insertions(+), 12 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a1c7d8a4648d..2ddd37b2f462 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
}
static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
- pte_t pte, int max_nr_ptes)
+ pte_t pte, int max_nr_ptes, fpb_t flags)
{
/* No underlying folio, so cannot batch */
if (!folio)
@@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
if (!folio_test_large(folio))
return 1;
- return folio_pte_batch(folio, ptep, pte, max_nr_ptes);
+ return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
}
static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
@@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr,
return ret;
}
+/* Set nr_ptes number of ptes, starting from idx */
+static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
+ int idx, bool set_write, struct mmu_gather *tlb)
+{
+ /*
+ * Advance the position in the batch by idx; note that if idx > 0,
+ * then the nr_ptes passed here is <= batch size - idx.
+ */
+ addr += idx * PAGE_SIZE;
+ ptep += idx;
+ oldpte = pte_advance_pfn(oldpte, idx);
+ ptent = pte_advance_pfn(ptent, idx);
+
+ if (set_write)
+ ptent = pte_mkwrite(ptent, vma);
+
+ modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes);
+ if (pte_needs_flush(oldpte, ptent))
+ tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE);
+}
+
+/*
+ * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or
+ * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
+ * that the ptes point to consecutive pages of the same anon large folio.
+ */
+static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+ struct page *first_page, bool expected_anon_exclusive)
+{
+ int idx;
+
+ for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) {
+ if (expected_anon_exclusive != PageAnonExclusive(first_page + idx))
+ break;
+ }
+ return idx - start_idx;
+}
+
+/*
+ * This function is a result of trying our very best to retain the
+ * "avoid the write-fault handler" optimization. In can_change_pte_writable(),
+ * if the vma is a private vma, and we cannot determine whether to change
+ * the pte to writable just from the vma and the pte, we then need to look
+ * at the actual page pointed to by the pte. Unfortunately, if we have a
+ * batch of ptes pointing to consecutive pages of the same anon large folio,
+ * the anon-exclusivity (or the negation) of the first page does not guarantee
+ * the anon-exclusivity (or the negation) of the other pages corresponding to
+ * the pte batch; hence in this case it is incorrect to decide to change or
+ * not change the ptes to writable just by using information from the first
+ * pte of the batch. Therefore, we must individually check all pages and
+ * retrieve sub-batches.
+ */
+static void commit_anon_folio_batch(struct vm_area_struct *vma,
+ struct folio *folio, unsigned long addr, pte_t *ptep,
+ pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
+{
+ struct page *first_page = folio_page(folio, 0);
+ bool expected_anon_exclusive;
+ int sub_batch_idx = 0;
+ int len;
+
+ while (nr_ptes) {
+ expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
+ len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
+ first_page, expected_anon_exclusive);
+ prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
+ sub_batch_idx, expected_anon_exclusive, tlb);
+ sub_batch_idx += len;
+ nr_ptes -= len;
+ }
+}
+
+static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
+ struct folio *folio, unsigned long addr, pte_t *ptep,
+ pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
+{
+ bool set_write;
+
+ if (vma->vm_flags & VM_SHARED) {
+ set_write = can_change_shared_pte_writable(vma, ptent);
+ prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
+ /* idx = */ 0, set_write, tlb);
+ return;
+ }
+
+ set_write = maybe_change_pte_writable(vma, ptent) &&
+ (folio && folio_test_anon(folio));
+ if (!set_write) {
+ prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes,
+ /* idx = */ 0, set_write, tlb);
+ return;
+ }
+ commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb);
+}
+
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb,
nr_ptes = 1;
oldpte = ptep_get(pte);
if (pte_present(oldpte)) {
+ const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
- struct folio *folio;
+ struct folio *folio = NULL;
pte_t ptent;
/*
@@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb,
/* determine batch to skip */
nr_ptes = mprotect_folio_pte_batch(folio,
- pte, oldpte, max_nr_ptes);
+ pte, oldpte, max_nr_ptes, /* flags = */ 0);
continue;
}
}
+ if (!folio)
+ folio = vm_normal_folio(vma, addr, oldpte);
+
+ nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
+
oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
ptent = pte_modify(oldpte, newprot);
@@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb,
* COW or special handling is required.
*/
if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
- !pte_write(ptent) &&
- can_change_pte_writable(vma, addr, ptent))
- ptent = pte_mkwrite(ptent, vma);
-
- modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes);
- if (pte_needs_flush(oldpte, ptent))
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
- pages++;
+ !pte_write(ptent))
+ set_write_prot_commit_flush_ptes(vma, folio,
+ addr, pte, oldpte, ptent, nr_ptes, tlb);
+ else
+ prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
+ nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+ pages += nr_ptes;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;
--
2.30.2
On 18.07.25 11:02, Dev Jain wrote: > Use folio_pte_batch to batch process a large folio. [...] > > Signed-off-by: Dev Jain <dev.jain@arm.com> I wanted to review this, but looks like it's already upstream and I suspect it's buggy (see the upstream report I cc'ed you on) [...] > + > +/* > + * This function is a result of trying our very best to retain the > + * "avoid the write-fault handler" optimization.
In can_change_pte_writable(), > + * if the vma is a private vma, and we cannot determine whether to change > + * the pte to writable just from the vma and the pte, we then need to look > + * at the actual page pointed to by the pte. Unfortunately, if we have a > + * batch of ptes pointing to consecutive pages of the same anon large folio, > + * the anon-exclusivity (or the negation) of the first page does not guarantee > + * the anon-exclusivity (or the negation) of the other pages corresponding to > + * the pte batch; hence in this case it is incorrect to decide to change or > + * not change the ptes to writable just by using information from the first > + * pte of the batch. Therefore, we must individually check all pages and > + * retrieve sub-batches. > + */ > +static void commit_anon_folio_batch(struct vm_area_struct *vma, > + struct folio *folio, unsigned long addr, pte_t *ptep, > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > +{ > + struct page *first_page = folio_page(folio, 0); Who says that we have the first page of the folio mapped into the first PTE of the batch? > + bool expected_anon_exclusive; > + int sub_batch_idx = 0; > + int len; > + > + while (nr_ptes) { > + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); > + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, > + first_page, expected_anon_exclusive); > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, > + sub_batch_idx, expected_anon_exclusive, tlb); > + sub_batch_idx += len; > + nr_ptes -= len; > + } > +} > + > +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > + struct folio *folio, unsigned long addr, pte_t *ptep, > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > +{ > + bool set_write; > + > + if (vma->vm_flags & VM_SHARED) { > + set_write = can_change_shared_pte_writable(vma, ptent); > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, > + /* idx = */ 0, set_write, tlb); > + return; > + } > + > + set_write = maybe_change_pte_writable(vma, ptent) && > + (folio && folio_test_anon(folio)); > + if (!set_write) { > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, > + /* idx = */ 0, set_write, tlb); > + return; > + } > + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); > +} > + > static long change_pte_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, > unsigned long end, pgprot_t newprot, unsigned long cp_flags) > @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb, > nr_ptes = 1; > oldpte = ptep_get(pte); > if (pte_present(oldpte)) { > + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; > int max_nr_ptes = (end - addr) >> PAGE_SHIFT; > - struct folio *folio; > + struct folio *folio = NULL; > pte_t ptent; > > /* > @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb, > > /* determine batch to skip */ > nr_ptes = mprotect_folio_pte_batch(folio, > - pte, oldpte, max_nr_ptes); > + pte, oldpte, max_nr_ptes, /* flags = */ 0); > continue; > } > } > > + if (!folio) > + folio = vm_normal_folio(vma, addr, oldpte); > + > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); > + > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); > ptent = pte_modify(oldpte, newprot); > > @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb, > * COW or special handling is required. 
> */ > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > - !pte_write(ptent) && > - can_change_pte_writable(vma, addr, ptent)) > - ptent = pte_mkwrite(ptent, vma); > - > - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); > - if (pte_needs_flush(oldpte, ptent)) > - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); > - pages++; > + !pte_write(ptent)) > + set_write_prot_commit_flush_ptes(vma, folio, > + addr, pte, oldpte, ptent, nr_ptes, tlb); While staring at this: Very broken indentation. > + else > + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); Semi-broken intendation. > + pages += nr_ptes; > } else if (is_swap_pte(oldpte)) { > swp_entry_t entry = pte_to_swp_entry(oldpte); > pte_t newpte; -- Cheers, David / dhildenb
On 06.08.25 10:08, David Hildenbrand wrote: > On 18.07.25 11:02, Dev Jain wrote: >> Use folio_pte_batch to batch process a large folio. [...] >> >> Signed-off-by: Dev Jain <dev.jain@arm.com> > > I wanted to review this, but looks like it's already upstream and I > suspect it's buggy (see the upstream report I cc'ed you on) > > [...]
> >> + >> +/* >> + * This function is a result of trying our very best to retain the >> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), >> + * if the vma is a private vma, and we cannot determine whether to change >> + * the pte to writable just from the vma and the pte, we then need to look >> + * at the actual page pointed to by the pte. Unfortunately, if we have a >> + * batch of ptes pointing to consecutive pages of the same anon large folio, >> + * the anon-exclusivity (or the negation) of the first page does not guarantee >> + * the anon-exclusivity (or the negation) of the other pages corresponding to >> + * the pte batch; hence in this case it is incorrect to decide to change or >> + * not change the ptes to writable just by using information from the first >> + * pte of the batch. Therefore, we must individually check all pages and >> + * retrieve sub-batches. >> + */ >> +static void commit_anon_folio_batch(struct vm_area_struct *vma, >> + struct folio *folio, unsigned long addr, pte_t *ptep, >> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >> +{ >> + struct page *first_page = folio_page(folio, 0); > > Who says that we have the first page of the folio mapped into the first > PTE of the batch? For the record, I *hate* that we moved from vm_normal_folio() to vm_normal_page(). Please undo that and forward the proper mapped page. -- Cheers, David / dhildenb
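For illustration, a standalone toy model (plain C, not kernel code) of the
problem pointed out above: if the pte batch starts in the middle of a large
folio, indexing from the folio's head page checks the PageAnonExclusive()
state of the wrong pages. All the values below are made up.

#include <stdbool.h>
#include <stdio.h>

#define FOLIO_NR_PAGES 16

int main(void)
{
        /* Per-page anon-exclusive state of one large folio. */
        bool anon_exclusive[FOLIO_NR_PAGES] = {
                [0] = true, [1] = true, [2] = true, [3] = true,
                /* pages 4..15 are not anon-exclusive */
        };

        /* Suppose the pte batch maps folio pages 4..11, not 0..7. */
        int batch_start_page = 4, nr_ptes = 8;

        for (int i = 0; i < nr_ptes; i++) {
                /* head-page indexing, i.e. folio_page(folio, 0) + i */
                bool wrong = anon_exclusive[i];
                /* indexing from the page actually mapped by the first pte */
                bool right = anon_exclusive[batch_start_page + i];

                if (wrong != right)
                        printf("pte %d: head-page says %d, mapped page says %d\n",
                               i, wrong, right);
        }
        return 0;
}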
On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote: > On 18.07.25 11:02, Dev Jain wrote: > > Use folio_pte_batch to batch process a large folio. [...] > > > > Signed-off-by: Dev Jain <dev.jain@arm.com> > > I wanted to review this, but looks like it's already upstream and I suspect > it's buggy (see the upstream report I cc'ed you on) Please excuse my laziness, but do you have a link to the report? I've been looking at some oddities on arm64 coming back from some of the CI systems and was heading in the direction of a recent mm regression judging by the first-known-bad-build in linux-next. https://lore.kernel.org/r/CA+G9fYumD2MGjECCv0wx2V_96_FKNtFQpT63qVNrrCmomoPYVQ@mail.gmail.com Will
On 06.08.25 10:15, Will Deacon wrote: > On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote: >> On 18.07.25 11:02, Dev Jain wrote: >>> Use folio_pte_batch to batch process a large folio. [...] >>> >>> Signed-off-by: Dev Jain <dev.jain@arm.com> >> >> I wanted to review this, but looks like it's already upstream and I suspect >> it's buggy (see the upstream report I cc'ed you on) > > Please excuse my laziness, but do you have a link to the report? I was lazy :) https://lkml.kernel.org/r/68930511.050a0220.7f033.003a.GAE@google.com > I've > been looking at some oddities on arm64 coming back from some of the CI > systems and was heading in the direction of a recent mm regression > judging by the first-known-bad-build in linux-next. > > https://lore.kernel.org/r/CA+G9fYumD2MGjECCv0wx2V_96_FKNtFQpT63qVNrrCmomoPYVQ@mail.gmail.com Hm, mprotect seems to be involved. So it might or might not correlate. -- Cheers, David / dhildenb
On 06/08/25 1:38 pm, David Hildenbrand wrote: > On 18.07.25 11:02, Dev Jain wrote: >> Use folio_pte_batch to batch process a large folio. [...]
>> >> Signed-off-by: Dev Jain <dev.jain@arm.com> > > > I wanted to review this, but looks like it's already upstream and I > suspect it's buggy (see the upstream report I cc'ed you on) Thank you for CCing, and quickly spotting the problem! > > [...] > >> + >> +/* >> + * This function is a result of trying our very best to retain the >> + * "avoid the write-fault handler" optimization. In >> can_change_pte_writable(), >> + * if the vma is a private vma, and we cannot determine whether to >> change >> + * the pte to writable just from the vma and the pte, we then need >> to look >> + * at the actual page pointed to by the pte. Unfortunately, if we >> have a >> + * batch of ptes pointing to consecutive pages of the same anon >> large folio, >> + * the anon-exclusivity (or the negation) of the first page does not >> guarantee >> + * the anon-exclusivity (or the negation) of the other pages >> corresponding to >> + * the pte batch; hence in this case it is incorrect to decide to >> change or >> + * not change the ptes to writable just by using information from >> the first >> + * pte of the batch. Therefore, we must individually check all pages >> and >> + * retrieve sub-batches. >> + */ >> +static void commit_anon_folio_batch(struct vm_area_struct *vma, >> + struct folio *folio, unsigned long addr, pte_t *ptep, >> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >> +{ >> + struct page *first_page = folio_page(folio, 0); > > Who says that we have the first page of the folio mapped into the > first PTE of the batch? Oops, I thought I had taken care of this. We make the folio from vm_normal_folio(vma, addr, oldpte), which makes the struct page from the actual pte, but then I missed that page_folio() will reset the folio->page to the head page, I was under the impression it will retain the original page. And indeed, if it didn't do that then the purpose of vm_normal_folio() is nullified, should have thought about that :( Lemme send a fix patch. 
> >> + bool expected_anon_exclusive; >> + int sub_batch_idx = 0; >> + int len; >> + >> + while (nr_ptes) { >> + expected_anon_exclusive = PageAnonExclusive(first_page + >> sub_batch_idx); >> + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, >> + first_page, expected_anon_exclusive); >> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, >> + sub_batch_idx, expected_anon_exclusive, tlb); >> + sub_batch_idx += len; >> + nr_ptes -= len; >> + } >> +} >> + >> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct >> *vma, >> + struct folio *folio, unsigned long addr, pte_t *ptep, >> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >> +{ >> + bool set_write; >> + >> + if (vma->vm_flags & VM_SHARED) { >> + set_write = can_change_shared_pte_writable(vma, ptent); >> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, >> + /* idx = */ 0, set_write, tlb); >> + return; >> + } >> + >> + set_write = maybe_change_pte_writable(vma, ptent) && >> + (folio && folio_test_anon(folio)); >> + if (!set_write) { >> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, >> + /* idx = */ 0, set_write, tlb); >> + return; >> + } >> + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, >> nr_ptes, tlb); >> +} >> + >> static long change_pte_range(struct mmu_gather *tlb, >> struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, >> unsigned long end, pgprot_t newprot, unsigned long cp_flags) >> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb, >> nr_ptes = 1; >> oldpte = ptep_get(pte); >> if (pte_present(oldpte)) { >> + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | >> FPB_RESPECT_WRITE; >> int max_nr_ptes = (end - addr) >> PAGE_SHIFT; >> - struct folio *folio; >> + struct folio *folio = NULL; >> pte_t ptent; >> /* >> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather >> *tlb, >> /* determine batch to skip */ >> nr_ptes = mprotect_folio_pte_batch(folio, >> - pte, oldpte, max_nr_ptes); >> + pte, oldpte, max_nr_ptes, /* flags = */ 0); >> continue; >> } >> } >> + if (!folio) >> + folio = vm_normal_folio(vma, addr, oldpte); >> + >> + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, >> max_nr_ptes, flags); >> + >> oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); >> ptent = pte_modify(oldpte, newprot); >> @@ -248,14 +350,13 @@ static long change_pte_range(struct >> mmu_gather *tlb, >> * COW or special handling is required. >> */ >> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && >> - !pte_write(ptent) && >> - can_change_pte_writable(vma, addr, ptent)) >> - ptent = pte_mkwrite(ptent, vma); >> - >> - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, >> nr_ptes); >> - if (pte_needs_flush(oldpte, ptent)) >> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); >> - pages++; >> + !pte_write(ptent)) >> + set_write_prot_commit_flush_ptes(vma, folio, >> + addr, pte, oldpte, ptent, nr_ptes, tlb); > > While staring at this: > > Very broken indentation. Indeed, but then in order to fix the indentation if we start positioning every argument just below the opening bracket then it will take at least four lines :) > >> + else >> + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, >> + nr_ptes, /* idx = */ 0, /* set_write = */ false, >> tlb); > > Semi-broken intendation. Yup I can position nr_ptes below vma and set_write below nr_ptes. But this should probably not be a part of the fix patch. 
> >> + pages += nr_ptes; >> } else if (is_swap_pte(oldpte)) { >> swp_entry_t entry = pte_to_swp_entry(oldpte); >> pte_t newpte; > >
>> While staring at this: >> >> Very broken indentation. > > Indeed, but then in order to fix the indentation if we start positioning > > every argument just below the opening bracket then it will take at least > four lines :) You are allowed to use more than 80 chars if there is good reason :) -- Cheers, David / dhildenb
On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote: > On 18.07.25 11:02, Dev Jain wrote: > > Signed-off-by: Dev Jain <dev.jain@arm.com> > > > I wanted to review this, but looks like it's already upstream and I suspect > it's buggy (see the upstream report I cc'ed you on) > > [...] > > > + > > +/* > > + * This function is a result of trying our very best to retain the > > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), > > + * if the vma is a private vma, and we cannot determine whether to change > > + * the pte to writable just from the vma and the pte, we then need to look > > + * at the actual page pointed to by the pte. Unfortunately, if we have a > > + * batch of ptes pointing to consecutive pages of the same anon large folio, > > + * the anon-exclusivity (or the negation) of the first page does not guarantee > > + * the anon-exclusivity (or the negation) of the other pages corresponding to > > + * the pte batch; hence in this case it is incorrect to decide to change or > > + * not change the ptes to writable just by using information from the first > > + * pte of the batch. Therefore, we must individually check all pages and > > + * retrieve sub-batches. > > + */ > > +static void commit_anon_folio_batch(struct vm_area_struct *vma, > > + struct folio *folio, unsigned long addr, pte_t *ptep, > > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > > +{ > > + struct page *first_page = folio_page(folio, 0); > > Who says that we have the first page of the folio mapped into the first PTE > of the batch? Yikes, missed this sorry. Got too tied up in alogrithm here. You mean in _this_ PTE of the batch right? As we're invoking these on each part of the PTE table. I mean I guess we can simply do: struct page *first_page = pte_page(ptent); Right? Presuming ptent is the PTE entry we are curently examining (and applying the PAE sub-batching to etc.) [snip] > > @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb, > > /* determine batch to skip */ > > nr_ptes = mprotect_folio_pte_batch(folio, > > - pte, oldpte, max_nr_ptes); > > + pte, oldpte, max_nr_ptes, /* flags = */ 0); > > continue; > > } > > } > > + if (!folio) > > + folio = vm_normal_folio(vma, addr, oldpte); > > + > > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); > > + > > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); > > ptent = pte_modify(oldpte, newprot); > > @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb, > > * COW or special handling is required. > > */ > > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > > - !pte_write(ptent) && > > - can_change_pte_writable(vma, addr, ptent)) > > - ptent = pte_mkwrite(ptent, vma); > > - > > - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); > > - if (pte_needs_flush(oldpte, ptent)) > > - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); > > - pages++; > > + !pte_write(ptent)) > > + set_write_prot_commit_flush_ptes(vma, folio, > > + addr, pte, oldpte, ptent, nr_ptes, tlb); > > While staring at this: > > Very broken indentation. Hmm... I wonder if checkpatch.pl would have picked this up actually. Yeah that 2nd line is just wrong... > > > + else > > + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > > + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); > > Semi-broken intendation. Because of else then 2 lines after? 
> > > + pages += nr_ptes; > > } else if (is_swap_pte(oldpte)) { > > swp_entry_t entry = pte_to_swp_entry(oldpte); > > pte_t newpte; > > > -- > Cheers, > > David / dhildenb > I think on this series I was so happy to see the implementation evolve to something that was so much nicer than it originally was, I did not scrutinise the impl. details of this one close enough, but as Jamie Lannister said, "there are always lessons in failures" :) Cheers, Lorenzo
On 06.08.25 11:12, Lorenzo Stoakes wrote: > On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote: >> On 18.07.25 11:02, Dev Jain wrote: >>> Signed-off-by: Dev Jain <dev.jain@arm.com> >> >> >> I wanted to review this, but looks like it's already upstream and I suspect >> it's buggy (see the upstream report I cc'ed you on) >> >> [...] >> >>> + >>> +/* >>> + * This function is a result of trying our very best to retain the >>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), >>> + * if the vma is a private vma, and we cannot determine whether to change >>> + * the pte to writable just from the vma and the pte, we then need to look >>> + * at the actual page pointed to by the pte. Unfortunately, if we have a >>> + * batch of ptes pointing to consecutive pages of the same anon large folio, >>> + * the anon-exclusivity (or the negation) of the first page does not guarantee >>> + * the anon-exclusivity (or the negation) of the other pages corresponding to >>> + * the pte batch; hence in this case it is incorrect to decide to change or >>> + * not change the ptes to writable just by using information from the first >>> + * pte of the batch. Therefore, we must individually check all pages and >>> + * retrieve sub-batches. >>> + */ >>> +static void commit_anon_folio_batch(struct vm_area_struct *vma, >>> + struct folio *folio, unsigned long addr, pte_t *ptep, >>> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >>> +{ >>> + struct page *first_page = folio_page(folio, 0); >> >> Who says that we have the first page of the folio mapped into the first PTE >> of the batch? > > Yikes, missed this sorry. Got too tied up in alogrithm here. > > You mean in _this_ PTE of the batch right? As we're invoking these on each part > of the PTE table. > > I mean I guess we can simply do: > > struct page *first_page = pte_page(ptent); > > Right? Yes, but we should forward the result from vm_normal_page(), which does exactly that for you, and increment the page accordingly as required, just like with the pte we are processing. ... >> >>> + else >>> + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, >>> + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); >> >> Semi-broken intendation. > > Because of else then 2 lines after? prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); Is what I would have expected. I think a smart man once said, that if you need more than one line per statement in an if/else clause, a set of {} can aid readability. But I don't particularly care :) -- Cheers, David / dhildenb
On 06/08/25 2:51 pm, David Hildenbrand wrote: > On 06.08.25 11:12, Lorenzo Stoakes wrote: >> On Wed, Aug 06, 2025 at 10:08:33AM +0200, David Hildenbrand wrote: >>> On 18.07.25 11:02, Dev Jain wrote: >>>> Signed-off-by: Dev Jain <dev.jain@arm.com> >>> >>> >>> I wanted to review this, but looks like it's already upstream and I >>> suspect >>> it's buggy (see the upstream report I cc'ed you on) >>> >>> [...] >>> >>>> + >>>> +/* >>>> + * This function is a result of trying our very best to retain the >>>> + * "avoid the write-fault handler" optimization. In >>>> can_change_pte_writable(), >>>> + * if the vma is a private vma, and we cannot determine whether to >>>> change >>>> + * the pte to writable just from the vma and the pte, we then need >>>> to look >>>> + * at the actual page pointed to by the pte. Unfortunately, if we >>>> have a >>>> + * batch of ptes pointing to consecutive pages of the same anon >>>> large folio, >>>> + * the anon-exclusivity (or the negation) of the first page does >>>> not guarantee >>>> + * the anon-exclusivity (or the negation) of the other pages >>>> corresponding to >>>> + * the pte batch; hence in this case it is incorrect to decide to >>>> change or >>>> + * not change the ptes to writable just by using information from >>>> the first >>>> + * pte of the batch. Therefore, we must individually check all >>>> pages and >>>> + * retrieve sub-batches. >>>> + */ >>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma, >>>> + struct folio *folio, unsigned long addr, pte_t *ptep, >>>> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather >>>> *tlb) >>>> +{ >>>> + struct page *first_page = folio_page(folio, 0); >>> >>> Who says that we have the first page of the folio mapped into the >>> first PTE >>> of the batch? >> >> Yikes, missed this sorry. Got too tied up in alogrithm here. >> >> You mean in _this_ PTE of the batch right? As we're invoking these on >> each part >> of the PTE table. >> >> I mean I guess we can simply do: >> >> struct page *first_page = pte_page(ptent); >> >> Right? > > Yes, but we should forward the result from vm_normal_page(), which does > exactly that for you, and increment the page accordingly as required, > just like with the pte we are processing. Makes sense, so I guess I will have to change the signature of prot_numa_skip() to pass a double ptr to a page instead of folio and derive the folio in the caller, and pass down both the folio and the page to set_write_prot_commit_flush_ptes. > > ... > >>> >>>> + else >>>> + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, >>>> + nr_ptes, /* idx = */ 0, /* set_write = */ >>>> false, tlb); >>> >>> Semi-broken intendation. >> >> Because of else then 2 lines after? > > prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); > > Is what I would have expected. > > > I think a smart man once said, that if you need more than one line per > statement in > an if/else clause, a set of {} can aid readability. But I don't > particularly care :) >
On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote: > > > > > > You mean in _this_ PTE of the batch right? As we're invoking these > > > on each part > > > of the PTE table. > > > > > > I mean I guess we can simply do: > > > > > > struct page *first_page = pte_page(ptent); > > > > > > Right? > > > > Yes, but we should forward the result from vm_normal_page(), which does > > exactly that for you, and increment the page accordingly as required, > > just like with the pte we are processing. > > Makes sense, so I guess I will have to change the signature of > prot_numa_skip() > > to pass a double ptr to a page instead of folio and derive the folio in the > caller, > > and pass down both the folio and the page to > set_write_prot_commit_flush_ptes. I already don't love how we psas the folio back from there for very dubious benefit. I really hate the idea of having a struct **page parameter... I wonder if we should just have a quick fixup for hotfix, and refine this more later? I foresee some debate otherwise... > > > > > > ... > > > > > > > > > > > + else > > > > > + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > > > > > + nr_ptes, /* idx = */ 0, /* set_write = > > > > > */ false, tlb); > > > > > > > > Semi-broken intendation. > > > > > > Because of else then 2 lines after? > > > > prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > > nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); > > > > Is what I would have expected. > > > > > > I think a smart man once said, that if you need more than one line per > > statement in > > an if/else clause, a set of {} can aid readability. But I don't > > particularly care :) > > Can do this in follow up too.
On 06.08.25 11:50, Lorenzo Stoakes wrote: > On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote: >>>> >>>> You mean in _this_ PTE of the batch right? As we're invoking these >>>> on each part >>>> of the PTE table. >>>> >>>> I mean I guess we can simply do: >>>> >>>> struct page *first_page = pte_page(ptent); >>>> >>>> Right? >>> >>> Yes, but we should forward the result from vm_normal_page(), which does >>> exactly that for you, and increment the page accordingly as required, >>> just like with the pte we are processing. >> >> Makes sense, so I guess I will have to change the signature of >> prot_numa_skip() >> >> to pass a double ptr to a page instead of folio and derive the folio in the >> caller, >> >> and pass down both the folio and the page to >> set_write_prot_commit_flush_ptes. > > I already don't love how we psas the folio back from there for very dubious > benefit. I really hate the idea of having a struct **page parameter... > > I wonder if we should just have a quick fixup for hotfix, and refine this more > later? This is not an issue in any released kernel, so we can do this properly. We should just remove that nested vm_normal_folio(). Untested, but should give an idea what we can do. diff --git a/mm/mprotect.c b/mm/mprotect.c index 78bded7acf795..4e0a22f7db495 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -120,7 +120,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, pte_t oldpte, pte_t *pte, int target_node, - struct folio **foliop) + struct folio *folio) { struct folio *folio = NULL; bool ret = true; @@ -131,7 +131,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, if (pte_protnone(oldpte)) goto skip; - folio = vm_normal_folio(vma, addr, oldpte); if (!folio) goto skip; @@ -172,8 +171,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, if (folio_use_access_time(folio)) folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); -skip: - *foliop = folio; return ret; } @@ -231,10 +228,9 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, * retrieve sub-batches. 
*/ static void commit_anon_folio_batch(struct vm_area_struct *vma, - struct folio *folio, unsigned long addr, pte_t *ptep, + struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) { - struct page *first_page = folio_page(folio, 0); bool expected_anon_exclusive; int sub_batch_idx = 0; int len; @@ -243,7 +239,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, first_page, expected_anon_exclusive); - prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, + prot_commit_flush_ptes(vma, addr, ptep, page, oldpte, ptent, len, sub_batch_idx, expected_anon_exclusive, tlb); sub_batch_idx += len; nr_ptes -= len; @@ -251,7 +247,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, } static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, - struct folio *folio, unsigned long addr, pte_t *ptep, + struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) { bool set_write; @@ -270,7 +266,7 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, /* idx = */ 0, set_write, tlb); return; } - commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); + commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); } static long change_pte_range(struct mmu_gather *tlb, @@ -305,15 +301,20 @@ static long change_pte_range(struct mmu_gather *tlb, const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; int max_nr_ptes = (end - addr) >> PAGE_SHIFT; struct folio *folio = NULL; + struct page *page; pte_t ptent; + page = vm_normal_folio(vma, addr, oldpte); + if (page) + folio = page_folio(page); + /* * Avoid trapping faults against the zero or KSM * pages. See similar comment in change_huge_pmd. */ if (prot_numa) { int ret = prot_numa_skip(vma, addr, oldpte, pte, - target_node, &folio); + target_node, folio); if (ret) { /* determine batch to skip */ @@ -323,9 +324,6 @@ static long change_pte_range(struct mmu_gather *tlb, } } - if (!folio) - folio = vm_normal_folio(vma, addr, oldpte); - nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); @@ -351,7 +349,7 @@ static long change_pte_range(struct mmu_gather *tlb, */ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pte_write(ptent)) - set_write_prot_commit_flush_ptes(vma, folio, + set_write_prot_commit_flush_ptes(vma, folio, page, addr, pte, oldpte, ptent, nr_ptes, tlb); else prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, -- 2.50.1 -- Cheers, David / dhildenb
On 06/08/25 3:41 pm, David Hildenbrand wrote: > On 06.08.25 11:50, Lorenzo Stoakes wrote: >> On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote: >>>>> >>>>> You mean in _this_ PTE of the batch right? As we're invoking these >>>>> on each part >>>>> of the PTE table. >>>>> >>>>> I mean I guess we can simply do: >>>>> >>>>> struct page *first_page = pte_page(ptent); >>>>> >>>>> Right? >>>> >>>> Yes, but we should forward the result from vm_normal_page(), which >>>> does >>>> exactly that for you, and increment the page accordingly as required, >>>> just like with the pte we are processing. >>> >>> Makes sense, so I guess I will have to change the signature of >>> prot_numa_skip() >>> >>> to pass a double ptr to a page instead of folio and derive the folio >>> in the >>> caller, >>> >>> and pass down both the folio and the page to >>> set_write_prot_commit_flush_ptes. >> >> I already don't love how we psas the folio back from there for very >> dubious >> benefit. I really hate the idea of having a struct **page parameter... >> >> I wonder if we should just have a quick fixup for hotfix, and refine >> this more >> later? > > This is not an issue in any released kernel, so we can do this properly. > > We should just remove that nested vm_normal_folio(). > > Untested, but should give an idea what we can do. This puts the overhead of vm_normal_folio() unconditionally into the pte_present path. Although I am guessing that is already happening assuming prot_numa case is not the hot path. This is fine by me. So I guess I shouldn't have done that "reuse the folio from prot_numa case if possible" thingy at all :) > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 78bded7acf795..4e0a22f7db495 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -120,7 +120,7 @@ static int mprotect_folio_pte_batch(struct folio > *folio, pte_t *ptep, > > static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long > addr, > pte_t oldpte, pte_t *pte, int target_node, > - struct folio **foliop) > + struct folio *folio) > { > struct folio *folio = NULL; > bool ret = true; > @@ -131,7 +131,6 @@ static bool prot_numa_skip(struct vm_area_struct > *vma, unsigned long addr, > if (pte_protnone(oldpte)) > goto skip; > > - folio = vm_normal_folio(vma, addr, oldpte); > if (!folio) > goto skip; > > @@ -172,8 +171,6 @@ static bool prot_numa_skip(struct vm_area_struct > *vma, unsigned long addr, > if (folio_use_access_time(folio)) > folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); > > -skip: > - *foliop = folio; > return ret; > } > > @@ -231,10 +228,9 @@ static int page_anon_exclusive_sub_batch(int > start_idx, int max_len, > * retrieve sub-batches. 
> */ > static void commit_anon_folio_batch(struct vm_area_struct *vma, > - struct folio *folio, unsigned long addr, pte_t *ptep, > + struct folio *folio, struct page *first_page, unsigned long > addr, pte_t *ptep, > pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > { > - struct page *first_page = folio_page(folio, 0); > bool expected_anon_exclusive; > int sub_batch_idx = 0; > int len; > @@ -243,7 +239,7 @@ static void commit_anon_folio_batch(struct > vm_area_struct *vma, > expected_anon_exclusive = PageAnonExclusive(first_page + > sub_batch_idx); > len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, > first_page, expected_anon_exclusive); > - prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, > + prot_commit_flush_ptes(vma, addr, ptep, page, oldpte, ptent, > len, > sub_batch_idx, expected_anon_exclusive, tlb); > sub_batch_idx += len; > nr_ptes -= len; > @@ -251,7 +247,7 @@ static void commit_anon_folio_batch(struct > vm_area_struct *vma, > } > > static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > - struct folio *folio, unsigned long addr, pte_t *ptep, > + struct folio *folio, struct page *page, unsigned long addr, > pte_t *ptep, > pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > { > bool set_write; > @@ -270,7 +266,7 @@ static void > set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > /* idx = */ 0, set_write, tlb); > return; > } > - commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, > nr_ptes, tlb); > + commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, > ptent, nr_ptes, tlb); > } > > static long change_pte_range(struct mmu_gather *tlb, > @@ -305,15 +301,20 @@ static long change_pte_range(struct mmu_gather > *tlb, > const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | > FPB_RESPECT_WRITE; > int max_nr_ptes = (end - addr) >> PAGE_SHIFT; > struct folio *folio = NULL; > + struct page *page; > pte_t ptent; > > + page = vm_normal_folio(vma, addr, oldpte); > + if (page) > + folio = page_folio(page); > + > /* > * Avoid trapping faults against the zero or KSM > * pages. See similar comment in change_huge_pmd. > */ > if (prot_numa) { > int ret = prot_numa_skip(vma, addr, oldpte, pte, > - target_node, &folio); > + target_node, folio); > if (ret) { > > /* determine batch to skip */ > @@ -323,9 +324,6 @@ static long change_pte_range(struct mmu_gather *tlb, > } > } > > - if (!folio) > - folio = vm_normal_folio(vma, addr, oldpte); > - > nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, > max_nr_ptes, flags); > > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); > @@ -351,7 +349,7 @@ static long change_pte_range(struct mmu_gather *tlb, > */ > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > !pte_write(ptent)) > - set_write_prot_commit_flush_ptes(vma, folio, > + set_write_prot_commit_flush_ptes(vma, folio, page, > addr, pte, oldpte, ptent, nr_ptes, tlb); > else > prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
On Wed, Aug 06, 2025 at 03:50:21PM +0530, Dev Jain wrote: > > On 06/08/25 3:41 pm, David Hildenbrand wrote: > > On 06.08.25 11:50, Lorenzo Stoakes wrote: > > > On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote: > > > > > > > > > > > > You mean in _this_ PTE of the batch right? As we're invoking these > > > > > > on each part > > > > > > of the PTE table. > > > > > > > > > > > > I mean I guess we can simply do: > > > > > > > > > > > > struct page *first_page = pte_page(ptent); > > > > > > > > > > > > Right? > > > > > > > > > > Yes, but we should forward the result from vm_normal_page(), > > > > > which does > > > > > exactly that for you, and increment the page accordingly as required, > > > > > just like with the pte we are processing. > > > > > > > > Makes sense, so I guess I will have to change the signature of > > > > prot_numa_skip() > > > > > > > > to pass a double ptr to a page instead of folio and derive the > > > > folio in the > > > > caller, > > > > > > > > and pass down both the folio and the page to > > > > set_write_prot_commit_flush_ptes. > > > > > > I already don't love how we psas the folio back from there for very > > > dubious > > > benefit. I really hate the idea of having a struct **page parameter... > > > > > > I wonder if we should just have a quick fixup for hotfix, and refine > > > this more > > > later? > > > > This is not an issue in any released kernel, so we can do this properly. > > > > We should just remove that nested vm_normal_folio(). > > > > Untested, but should give an idea what we can do. > > This puts the overhead of vm_normal_folio() unconditionally into the > pte_present path. > > Although I am guessing that is already happening assuming prot_numa case is > not the > > hot path. This is fine by me. So I guess I shouldn't have done that "reuse > the folio > > from prot_numa case if possible" thingy at all :) Don't worry it was a reasonable suggestion at the time, I think let's just make our lives easier here :) Can you check David's patch and make sure it's ok? It looks reasonable to me (other than the eobvious typo I pointed out).
On 06.08.25 12:20, Dev Jain wrote: > > On 06/08/25 3:41 pm, David Hildenbrand wrote: >> On 06.08.25 11:50, Lorenzo Stoakes wrote: >>> On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote: >>>>>> >>>>>> You mean in _this_ PTE of the batch right? As we're invoking these >>>>>> on each part >>>>>> of the PTE table. >>>>>> >>>>>> I mean I guess we can simply do: >>>>>> >>>>>> struct page *first_page = pte_page(ptent); >>>>>> >>>>>> Right? >>>>> >>>>> Yes, but we should forward the result from vm_normal_page(), which >>>>> does >>>>> exactly that for you, and increment the page accordingly as required, >>>>> just like with the pte we are processing. >>>> >>>> Makes sense, so I guess I will have to change the signature of >>>> prot_numa_skip() >>>> >>>> to pass a double ptr to a page instead of folio and derive the folio >>>> in the >>>> caller, >>>> >>>> and pass down both the folio and the page to >>>> set_write_prot_commit_flush_ptes. >>> >>> I already don't love how we psas the folio back from there for very >>> dubious >>> benefit. I really hate the idea of having a struct **page parameter... >>> >>> I wonder if we should just have a quick fixup for hotfix, and refine >>> this more >>> later? >> >> This is not an issue in any released kernel, so we can do this properly. >> >> We should just remove that nested vm_normal_folio(). >> >> Untested, but should give an idea what we can do. > > This puts the overhead of vm_normal_folio() unconditionally into the > pte_present path. > > Although I am guessing that is already happening assuming prot_numa case > is not the > > hot path. This is fine by me. So I guess I shouldn't have done that > "reuse the folio > > from prot_numa case if possible" thingy at all :) I mean, it only applies when trying to NUMA-protect something that is already protected. Not sure how relevant that is in practice. As we don't even batch these today, we could just do: diff --git a/mm/mprotect.c b/mm/mprotect.c index 4e0a22f7db495..2154a1a3c6656 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -127,10 +127,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, bool toptier; int nid; - /* Avoid TLB flush if possible */ - if (pte_protnone(oldpte)) - goto skip; - if (!folio) goto skip; @@ -304,6 +300,9 @@ static long change_pte_range(struct mmu_gather *tlb, struct page *page; pte_t ptent; + if (prot_numa && pte_protnone(oldpte)) + continue; + page = vm_normal_folio(vma, addr, oldpte); if (page) folio = page_folio(page); But with my change, we could actually batch-skip such large folios, because mprotect_folio_pte_batch() would see the folio. -- Cheers, David / dhildenb
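To make David's closing point concrete, here is a hedged, annotated excerpt of the prot_numa skip path as it would look with his first (untested) diff applied; the code lines mirror the hunks quoted in this thread, and only the comment is added for illustration:

		if (prot_numa) {
			int ret = prot_numa_skip(vma, addr, oldpte, pte,
						 target_node, folio);
			if (ret) {
				/*
				 * The folio is now looked up before the prot_numa
				 * check, so even PTEs that are already protnone can
				 * be skipped a whole batch at a time instead of one
				 * PTE per loop iteration.
				 */
				nr_ptes = mprotect_folio_pte_batch(folio,
						pte, oldpte, max_nr_ptes,
						/* flags = */ 0);
				continue;
			}
		}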
On Wed, Aug 06, 2025 at 12:11:33PM +0200, David Hildenbrand wrote: > On 06.08.25 11:50, Lorenzo Stoakes wrote: > > On Wed, Aug 06, 2025 at 03:07:49PM +0530, Dev Jain wrote: > > > > > > > > > > You mean in _this_ PTE of the batch right? As we're invoking these > > > > > on each part > > > > > of the PTE table. > > > > > > > > > > I mean I guess we can simply do: > > > > > > > > > > struct page *first_page = pte_page(ptent); > > > > > > > > > > Right? > > > > > > > > Yes, but we should forward the result from vm_normal_page(), which does > > > > exactly that for you, and increment the page accordingly as required, > > > > just like with the pte we are processing. > > > > > > Makes sense, so I guess I will have to change the signature of > > > prot_numa_skip() > > > > > > to pass a double ptr to a page instead of folio and derive the folio in the > > > caller, > > > > > > and pass down both the folio and the page to > > > set_write_prot_commit_flush_ptes. > > > > I already don't love how we psas the folio back from there for very dubious > > benefit. I really hate the idea of having a struct **page parameter... > > > > I wonder if we should just have a quick fixup for hotfix, and refine this more > > later? > > This is not an issue in any released kernel, so we can do this properly. > > We should just remove that nested vm_normal_folio(). > > Untested, but should give an idea what we can do. > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 78bded7acf795..4e0a22f7db495 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -120,7 +120,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > pte_t oldpte, pte_t *pte, int target_node, > - struct folio **foliop) > + struct folio *folio) > { > struct folio *folio = NULL; > bool ret = true; > @@ -131,7 +131,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > if (pte_protnone(oldpte)) > goto skip; > - folio = vm_normal_folio(vma, addr, oldpte); > if (!folio) > goto skip; > @@ -172,8 +171,6 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > if (folio_use_access_time(folio)) > folio_xchg_access_time(folio, jiffies_to_msecs(jiffies)); > -skip: > - *foliop = folio; > return ret; > } > @@ -231,10 +228,9 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len, > * retrieve sub-batches. 
> */ > static void commit_anon_folio_batch(struct vm_area_struct *vma, > - struct folio *folio, unsigned long addr, pte_t *ptep, > + struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep, > pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > { > - struct page *first_page = folio_page(folio, 0); > bool expected_anon_exclusive; > int sub_batch_idx = 0; > int len; > @@ -243,7 +239,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, > expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); > len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, > first_page, expected_anon_exclusive); > - prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, > + prot_commit_flush_ptes(vma, addr, ptep, page, oldpte, ptent, len, > sub_batch_idx, expected_anon_exclusive, tlb); > sub_batch_idx += len; > nr_ptes -= len; > @@ -251,7 +247,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma, > } > static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > - struct folio *folio, unsigned long addr, pte_t *ptep, > + struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep, > pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > { > bool set_write; > @@ -270,7 +266,7 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > /* idx = */ 0, set_write, tlb); > return; > } > - commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); > + commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb); > } > static long change_pte_range(struct mmu_gather *tlb, > @@ -305,15 +301,20 @@ static long change_pte_range(struct mmu_gather *tlb, > const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; > int max_nr_ptes = (end - addr) >> PAGE_SHIFT; > struct folio *folio = NULL; > + struct page *page; > pte_t ptent; > + page = vm_normal_folio(vma, addr, oldpte); Surely vm_normal_page()? :P > + if (page) > + folio = page_folio(page); > + > /* > * Avoid trapping faults against the zero or KSM > * pages. See similar comment in change_huge_pmd. > */ > if (prot_numa) { > int ret = prot_numa_skip(vma, addr, oldpte, pte, > - target_node, &folio); > + target_node, folio); > if (ret) { > /* determine batch to skip */ > @@ -323,9 +324,6 @@ static long change_pte_range(struct mmu_gather *tlb, > } > } > - if (!folio) > - folio = vm_normal_folio(vma, addr, oldpte); > - Yes :) thanks :>) > nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); > @@ -351,7 +349,7 @@ static long change_pte_range(struct mmu_gather *tlb, > */ > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > !pte_write(ptent)) > - set_write_prot_commit_flush_ptes(vma, folio, > + set_write_prot_commit_flush_ptes(vma, folio, page, > addr, pte, oldpte, ptent, nr_ptes, tlb); > else > prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > -- > 2.50.1 > > > -- > Cheers, > > David / dhildenb >
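Folding in the typo Lorenzo flags above, the hoisted lookup in change_pte_range() would presumably read as follows; this is a sketch of David's untested diff with vm_normal_page() substituted, keeping the surrounding declarations as quoted:

		const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE;
		int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
		struct folio *folio = NULL;
		struct page *page;
		pte_t ptent;

		/* vm_normal_page(), not vm_normal_folio(): we want the page itself */
		page = vm_normal_page(vma, addr, oldpte);
		if (page)
			folio = page_folio(page);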
On 18 Jul 2025, at 5:02, Dev Jain wrote: > Use folio_pte_batch to batch process a large folio. Note that, PTE > batching here will save a few function calls, and this strategy in certain > cases (not this one) batches atomic operations in general, so we have > a performance win for all arches. This patch paves the way for patch 7 > which will help us elide the TLBI per contig block on arm64. > > The correctness of this patch lies on the correctness of setting the > new ptes based upon information only from the first pte of the batch > (which may also have accumulated a/d bits via modify_prot_start_ptes()). > > Observe that the flag combination we pass to mprotect_folio_pte_batch() > guarantees that the batch is uniform w.r.t the soft-dirty bit and the > writable bit. Therefore, the only bits which may differ are the a/d bits. > So we only need to worry about code which is concerned about the a/d bits > of the PTEs. > > Setting extra a/d bits on the new ptes where previously they were not set, > is fine - setting access bit when it was not set is not an incorrectness > problem but will only possibly delay the reclaim of the page mapped by > the pte (which is in fact intended because the kernel just operated on this > region via mprotect()!). Setting dirty bit when it was not set is again > not an incorrectness problem but will only possibly force an unnecessary > writeback. > > So now we need to reason whether something can go wrong via > can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp, > and userfaultfd_pte_wp cases are solved due to uniformity in the > corresponding bits guaranteed by the flag combination. The ptes all > belong to the same VMA (since callers guarantee that [start, end) will > lie within the VMA) therefore the conditional based on the VMA is also > safe to batch around. > > Since the dirty bit on the PTE really is just an indication that the folio > got written to - even if the PTE is not actually dirty but one of the PTEs > in the batch is, the wp-fault optimization can be made. Therefore, it is > safe to batch around pte_dirty() in can_change_shared_pte_writable() > (in fact this is better since without batching, it may happen that > some ptes aren't changed to writable just because they are not dirty, > even though the other ptes mapping the same large folio are dirty). > > To batch around the PageAnonExclusive case, we must check the corresponding > condition for every single page. Therefore, from the large folio batch, > we process sub batches of ptes mapping pages with the same > PageAnonExclusive condition, and process that sub batch, then determine > and process the next sub batch, and so on. Note that this does not cause > any extra overhead; if suppose the size of the folio batch is 512, then > the sub batch processing in total will take 512 iterations, which is the > same as what we would have done before. > > For pte_needs_flush(): > > ppc does not care about the a/d bits. > > For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits > get cleared; since we can only have extra a/d bits due to batching, > we will only have an extra flush, not a case where we elide a flush due > to batching when we shouldn't have. 
> > Signed-off-by: Dev Jain <dev.jain@arm.com> > --- > mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 113 insertions(+), 12 deletions(-) > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index a1c7d8a4648d..2ddd37b2f462 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > } > > static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > - pte_t pte, int max_nr_ptes) > + pte_t pte, int max_nr_ptes, fpb_t flags) > { > /* No underlying folio, so cannot batch */ > if (!folio) > @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > if (!folio_test_large(folio)) > return 1; > > - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); > + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); > } > > static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > return ret; > } > > +/* Set nr_ptes number of ptes, starting from idx */ > +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, > + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, > + int idx, bool set_write, struct mmu_gather *tlb) > +{ > + /* > + * Advance the position in the batch by idx; note that if idx > 0, > + * then the nr_ptes passed here is <= batch size - idx. > + */ > + addr += idx * PAGE_SIZE; > + ptep += idx; > + oldpte = pte_advance_pfn(oldpte, idx); > + ptent = pte_advance_pfn(ptent, idx); > + > + if (set_write) > + ptent = pte_mkwrite(ptent, vma); > + > + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); > + if (pte_needs_flush(oldpte, ptent)) > + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); > +} > + > +/* > + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or > + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce > + * that the ptes point to consecutive pages of the same anon large folio. > + */ > +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, > + struct page *first_page, bool expected_anon_exclusive) > +{ > + int idx; > + > + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { > + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) > + break; > + } > + return idx - start_idx; > +} > + > +/* > + * This function is a result of trying our very best to retain the > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), > + * if the vma is a private vma, and we cannot determine whether to change > + * the pte to writable just from the vma and the pte, we then need to look > + * at the actual page pointed to by the pte. Unfortunately, if we have a > + * batch of ptes pointing to consecutive pages of the same anon large folio, > + * the anon-exclusivity (or the negation) of the first page does not guarantee > + * the anon-exclusivity (or the negation) of the other pages corresponding to > + * the pte batch; hence in this case it is incorrect to decide to change or > + * not change the ptes to writable just by using information from the first > + * pte of the batch. Therefore, we must individually check all pages and > + * retrieve sub-batches. 
> + */ > +static void commit_anon_folio_batch(struct vm_area_struct *vma, > + struct folio *folio, unsigned long addr, pte_t *ptep, > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > +{ > + struct page *first_page = folio_page(folio, 0); > + bool expected_anon_exclusive; > + int sub_batch_idx = 0; > + int len; > + > + while (nr_ptes) { > + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); > + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, > + first_page, expected_anon_exclusive); > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, > + sub_batch_idx, expected_anon_exclusive, tlb); > + sub_batch_idx += len; > + nr_ptes -= len; > + } > +} > + > +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > + struct folio *folio, unsigned long addr, pte_t *ptep, > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > +{ > + bool set_write; > + > + if (vma->vm_flags & VM_SHARED) { > + set_write = can_change_shared_pte_writable(vma, ptent); > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, > + /* idx = */ 0, set_write, tlb); > + return; > + } > + > + set_write = maybe_change_pte_writable(vma, ptent) && > + (folio && folio_test_anon(folio)); > + if (!set_write) { > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, > + /* idx = */ 0, set_write, tlb); > + return; > + } > + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); > +} > + > static long change_pte_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, > unsigned long end, pgprot_t newprot, unsigned long cp_flags) > @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb, > nr_ptes = 1; > oldpte = ptep_get(pte); > if (pte_present(oldpte)) { > + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; > int max_nr_ptes = (end - addr) >> PAGE_SHIFT; > - struct folio *folio; > + struct folio *folio = NULL; > pte_t ptent; > > /* > @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb, > > /* determine batch to skip */ > nr_ptes = mprotect_folio_pte_batch(folio, > - pte, oldpte, max_nr_ptes); > + pte, oldpte, max_nr_ptes, /* flags = */ 0); > continue; > } > } > > + if (!folio) > + folio = vm_normal_folio(vma, addr, oldpte); > + > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); > + > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); Here both mprotect_folio_pte_batch() and modify_prot_start_ptes() can collect dirty and access bits from PTEs in the batch. But modify_prot_start_ptes() clears PTEs to prevent concurrent hardware modification. It took me a while to figure this out, since I was wondering why oldpte is not used after mprotect_folio_pte_batch(). I wish folio_pte_batch_flags() used by mprotect_folio_pte_batch() can take a function pointer to allow either read PTEs or clear PTEs, so that there could be one function. But that could complicate folio_pte_batch_flags(). Just a comment. :) Reviewed-by: Zi Yan <ziy@nvidia.com> Best Regards, Yan, Zi
On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote: > Use folio_pte_batch to batch process a large folio. Note that, PTE > batching here will save a few function calls, and this strategy in certain > cases (not this one) batches atomic operations in general, so we have > a performance win for all arches. This patch paves the way for patch 7 > which will help us elide the TLBI per contig block on arm64. > > The correctness of this patch lies on the correctness of setting the > new ptes based upon information only from the first pte of the batch > (which may also have accumulated a/d bits via modify_prot_start_ptes()). > > Observe that the flag combination we pass to mprotect_folio_pte_batch() > guarantees that the batch is uniform w.r.t the soft-dirty bit and the > writable bit. Therefore, the only bits which may differ are the a/d bits. > So we only need to worry about code which is concerned about the a/d bits > of the PTEs. > > Setting extra a/d bits on the new ptes where previously they were not set, > is fine - setting access bit when it was not set is not an incorrectness > problem but will only possibly delay the reclaim of the page mapped by > the pte (which is in fact intended because the kernel just operated on this > region via mprotect()!). Setting dirty bit when it was not set is again > not an incorrectness problem but will only possibly force an unnecessary > writeback. > > So now we need to reason whether something can go wrong via > can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp, > and userfaultfd_pte_wp cases are solved due to uniformity in the > corresponding bits guaranteed by the flag combination. The ptes all > belong to the same VMA (since callers guarantee that [start, end) will > lie within the VMA) therefore the conditional based on the VMA is also > safe to batch around. > > Since the dirty bit on the PTE really is just an indication that the folio > got written to - even if the PTE is not actually dirty but one of the PTEs > in the batch is, the wp-fault optimization can be made. Therefore, it is > safe to batch around pte_dirty() in can_change_shared_pte_writable() > (in fact this is better since without batching, it may happen that > some ptes aren't changed to writable just because they are not dirty, > even though the other ptes mapping the same large folio are dirty). > > To batch around the PageAnonExclusive case, we must check the corresponding > condition for every single page. Therefore, from the large folio batch, > we process sub batches of ptes mapping pages with the same > PageAnonExclusive condition, and process that sub batch, then determine > and process the next sub batch, and so on. Note that this does not cause > any extra overhead; if suppose the size of the folio batch is 512, then > the sub batch processing in total will take 512 iterations, which is the > same as what we would have done before. > > For pte_needs_flush(): > > ppc does not care about the a/d bits. > > For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits > get cleared; since we can only have extra a/d bits due to batching, > we will only have an extra flush, not a case where we elide a flush due > to batching when we shouldn't have. > Thanks for great commit message! > Signed-off-by: Dev Jain <dev.jain@arm.com> This is looking MUCH better :) Thanks! 
Some nits below, but I've gone through this carefully and can't find anything that seems obviously wrong here, so: Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > --- > mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 113 insertions(+), 12 deletions(-) > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index a1c7d8a4648d..2ddd37b2f462 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > } > > static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > - pte_t pte, int max_nr_ptes) > + pte_t pte, int max_nr_ptes, fpb_t flags) > { > /* No underlying folio, so cannot batch */ > if (!folio) > @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > if (!folio_test_large(folio)) > return 1; > > - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); > + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); > } > > static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > return ret; > } > > +/* Set nr_ptes number of ptes, starting from idx */ > +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, > + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, > + int idx, bool set_write, struct mmu_gather *tlb) > +{ > + /* > + * Advance the position in the batch by idx; note that if idx > 0, > + * then the nr_ptes passed here is <= batch size - idx. > + */ > + addr += idx * PAGE_SIZE; > + ptep += idx; > + oldpte = pte_advance_pfn(oldpte, idx); > + ptent = pte_advance_pfn(ptent, idx); > + > + if (set_write) > + ptent = pte_mkwrite(ptent, vma); > + > + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); > + if (pte_needs_flush(oldpte, ptent)) > + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); > +} > + > +/* > + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or > + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce > + * that the ptes point to consecutive pages of the same anon large folio. > + */ > +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, > + struct page *first_page, bool expected_anon_exclusive) > +{ > + int idx; Nit but: int end = start_idx + max_len; for (idx = start_idx + 1; idx < end; idx++) { Would be a little neater here. > + > + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { Nitty again but the below might be a little clearer? struct page *page = &firstpage[idx]; if (expected_anon_exclusive != PageAnonExclusive(page)) > + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) > + break; > + } > + return idx - start_idx; > +} > + > +/* > + * This function is a result of trying our very best to retain the > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), > + * if the vma is a private vma, and we cannot determine whether to change > + * the pte to writable just from the vma and the pte, we then need to look > + * at the actual page pointed to by the pte. 
Unfortunately, if we have a > + * batch of ptes pointing to consecutive pages of the same anon large folio, > + * the anon-exclusivity (or the negation) of the first page does not guarantee > + * the anon-exclusivity (or the negation) of the other pages corresponding to > + * the pte batch; hence in this case it is incorrect to decide to change or > + * not change the ptes to writable just by using information from the first > + * pte of the batch. Therefore, we must individually check all pages and > + * retrieve sub-batches. > + */ Nice comment thanks. > +static void commit_anon_folio_batch(struct vm_area_struct *vma, > + struct folio *folio, unsigned long addr, pte_t *ptep, > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > +{ > + struct page *first_page = folio_page(folio, 0); > + bool expected_anon_exclusive; > + int sub_batch_idx = 0; > + int len; > + > + while (nr_ptes) { I'd prefer this to be: int i; ... for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) { > + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); Nit but would prefer: struct page *page = &first_page[sub_batch_idx]; expected_anon_exclusive = PageAnonExclusive(page); > + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, > + first_page, expected_anon_exclusive); > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, > + sub_batch_idx, expected_anon_exclusive, tlb); > + sub_batch_idx += len; > + nr_ptes -= len; > + } > +} > + > +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, > + struct folio *folio, unsigned long addr, pte_t *ptep, > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > +{ > + bool set_write; > + > + if (vma->vm_flags & VM_SHARED) { > + set_write = can_change_shared_pte_writable(vma, ptent); > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, > + /* idx = */ 0, set_write, tlb); > + return; > + } > + > + set_write = maybe_change_pte_writable(vma, ptent) && > + (folio && folio_test_anon(folio)); > + if (!set_write) { > + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, > + /* idx = */ 0, set_write, tlb); > + return; > + } > + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); > +} > + > static long change_pte_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, > unsigned long end, pgprot_t newprot, unsigned long cp_flags) > @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb, > nr_ptes = 1; > oldpte = ptep_get(pte); > if (pte_present(oldpte)) { > + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; > int max_nr_ptes = (end - addr) >> PAGE_SHIFT; > - struct folio *folio; > + struct folio *folio = NULL; > pte_t ptent; > > /* > @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb, > > /* determine batch to skip */ > nr_ptes = mprotect_folio_pte_batch(folio, > - pte, oldpte, max_nr_ptes); > + pte, oldpte, max_nr_ptes, /* flags = */ 0); > continue; > } > } > > + if (!folio) > + folio = vm_normal_folio(vma, addr, oldpte); > + > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); > + > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); > ptent = pte_modify(oldpte, newprot); > > @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb, > * COW or special handling is required. 
> */ > if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && > - !pte_write(ptent) && > - can_change_pte_writable(vma, addr, ptent)) > - ptent = pte_mkwrite(ptent, vma); > - > - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); > - if (pte_needs_flush(oldpte, ptent)) > - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); > - pages++; > + !pte_write(ptent)) > + set_write_prot_commit_flush_ptes(vma, folio, > + addr, pte, oldpte, ptent, nr_ptes, tlb); > + else > + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, > + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); > + pages += nr_ptes; > } else if (is_swap_pte(oldpte)) { > swp_entry_t entry = pte_to_swp_entry(oldpte); > pte_t newpte; > -- > 2.30.2 >
On 19/07/25 12:19 am, Lorenzo Stoakes wrote: > On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote: >> Use folio_pte_batch to batch process a large folio. Note that, PTE >> batching here will save a few function calls, and this strategy in certain >> cases (not this one) batches atomic operations in general, so we have >> a performance win for all arches. This patch paves the way for patch 7 >> which will help us elide the TLBI per contig block on arm64. >> >> The correctness of this patch lies on the correctness of setting the >> new ptes based upon information only from the first pte of the batch >> (which may also have accumulated a/d bits via modify_prot_start_ptes()). >> >> Observe that the flag combination we pass to mprotect_folio_pte_batch() >> guarantees that the batch is uniform w.r.t the soft-dirty bit and the >> writable bit. Therefore, the only bits which may differ are the a/d bits. >> So we only need to worry about code which is concerned about the a/d bits >> of the PTEs. >> >> Setting extra a/d bits on the new ptes where previously they were not set, >> is fine - setting access bit when it was not set is not an incorrectness >> problem but will only possibly delay the reclaim of the page mapped by >> the pte (which is in fact intended because the kernel just operated on this >> region via mprotect()!). Setting dirty bit when it was not set is again >> not an incorrectness problem but will only possibly force an unnecessary >> writeback. >> >> So now we need to reason whether something can go wrong via >> can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp, >> and userfaultfd_pte_wp cases are solved due to uniformity in the >> corresponding bits guaranteed by the flag combination. The ptes all >> belong to the same VMA (since callers guarantee that [start, end) will >> lie within the VMA) therefore the conditional based on the VMA is also >> safe to batch around. >> >> Since the dirty bit on the PTE really is just an indication that the folio >> got written to - even if the PTE is not actually dirty but one of the PTEs >> in the batch is, the wp-fault optimization can be made. Therefore, it is >> safe to batch around pte_dirty() in can_change_shared_pte_writable() >> (in fact this is better since without batching, it may happen that >> some ptes aren't changed to writable just because they are not dirty, >> even though the other ptes mapping the same large folio are dirty). >> >> To batch around the PageAnonExclusive case, we must check the corresponding >> condition for every single page. Therefore, from the large folio batch, >> we process sub batches of ptes mapping pages with the same >> PageAnonExclusive condition, and process that sub batch, then determine >> and process the next sub batch, and so on. Note that this does not cause >> any extra overhead; if suppose the size of the folio batch is 512, then >> the sub batch processing in total will take 512 iterations, which is the >> same as what we would have done before. >> >> For pte_needs_flush(): >> >> ppc does not care about the a/d bits. >> >> For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits >> get cleared; since we can only have extra a/d bits due to batching, >> we will only have an extra flush, not a case where we elide a flush due >> to batching when we shouldn't have. >> > Thanks for great commit message! > >> Signed-off-by: Dev Jain <dev.jain@arm.com> > This is looking MUCH better :) Thanks! 
> > Some nits below, but I've gone through this carefully and can't find > anything that seems obviously wrong here, so: > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Thanks! > >> --- >> mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++----- >> 1 file changed, 113 insertions(+), 12 deletions(-) >> >> diff --git a/mm/mprotect.c b/mm/mprotect.c >> index a1c7d8a4648d..2ddd37b2f462 100644 >> --- a/mm/mprotect.c >> +++ b/mm/mprotect.c >> @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, >> } >> >> static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >> - pte_t pte, int max_nr_ptes) >> + pte_t pte, int max_nr_ptes, fpb_t flags) >> { >> /* No underlying folio, so cannot batch */ >> if (!folio) >> @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >> if (!folio_test_large(folio)) >> return 1; >> >> - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); >> + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); >> } >> >> static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, >> @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, >> return ret; >> } >> >> +/* Set nr_ptes number of ptes, starting from idx */ >> +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, >> + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, >> + int idx, bool set_write, struct mmu_gather *tlb) >> +{ >> + /* >> + * Advance the position in the batch by idx; note that if idx > 0, >> + * then the nr_ptes passed here is <= batch size - idx. >> + */ >> + addr += idx * PAGE_SIZE; >> + ptep += idx; >> + oldpte = pte_advance_pfn(oldpte, idx); >> + ptent = pte_advance_pfn(ptent, idx); >> + >> + if (set_write) >> + ptent = pte_mkwrite(ptent, vma); >> + >> + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); >> + if (pte_needs_flush(oldpte, ptent)) >> + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); >> +} >> + >> +/* >> + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or >> + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce >> + * that the ptes point to consecutive pages of the same anon large folio. >> + */ >> +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, >> + struct page *first_page, bool expected_anon_exclusive) >> +{ >> + int idx; > Nit but: > > int end = start_idx + max_len; > > for (idx = start_idx + 1; idx < end; idx++) { > > Would be a little neater here. I politely disagree :) start_idx + max_len is *obviously* the end index, no need to add one more line of code asserting that. > >> + >> + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { > Nitty again but the below might be a little clearer? > > struct page *page = &firstpage[idx]; > > if (expected_anon_exclusive != PageAnonExclusive(page)) I don't think so. first_page[idx] may confuse us into thinking that we have an array of pages. Also, the way you define it assigns a stack address to struct page *page; this is not a problem in theory and the code will still be correct, but I will prefer struct page *page containing the actual address of the linear map struct page, which is vmemmap + PFN. 
The way I write it is, I initialize first_page from folio_page() which will derive the address from folio->page, and folio was derived from vm_normal_folio() (which was derived from the PFN in the PTE), therefore first_page will contain the actual vmemmap address of corresponding struct page, hence it is guaranteed that first_page + x will give me the x'th page in the folio. > > >> + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) >> + break; >> + } >> + return idx - start_idx; >> +} >> + >> +/* >> + * This function is a result of trying our very best to retain the >> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), >> + * if the vma is a private vma, and we cannot determine whether to change >> + * the pte to writable just from the vma and the pte, we then need to look >> + * at the actual page pointed to by the pte. Unfortunately, if we have a >> + * batch of ptes pointing to consecutive pages of the same anon large folio, >> + * the anon-exclusivity (or the negation) of the first page does not guarantee >> + * the anon-exclusivity (or the negation) of the other pages corresponding to >> + * the pte batch; hence in this case it is incorrect to decide to change or >> + * not change the ptes to writable just by using information from the first >> + * pte of the batch. Therefore, we must individually check all pages and >> + * retrieve sub-batches. >> + */ > Nice comment thanks. > >> +static void commit_anon_folio_batch(struct vm_area_struct *vma, >> + struct folio *folio, unsigned long addr, pte_t *ptep, >> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >> +{ >> + struct page *first_page = folio_page(folio, 0); >> + bool expected_anon_exclusive; >> + int sub_batch_idx = 0; >> + int len; >> + >> + while (nr_ptes) { > I'd prefer this to be: > > int i; > > ... > > for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) { > >> + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); We won't be able to do nr_ptes -= len with this. And personally a while loop is clearer to me here. 
> Nit but would prefer: > > struct page *page = &first_page[sub_batch_idx]; > > expected_anon_exclusive = PageAnonExclusive(page); > >> + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, >> + first_page, expected_anon_exclusive); >> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, >> + sub_batch_idx, expected_anon_exclusive, tlb); >> + sub_batch_idx += len; >> + nr_ptes -= len; >> + } >> +} >> + >> +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, >> + struct folio *folio, unsigned long addr, pte_t *ptep, >> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >> +{ >> + bool set_write; >> + >> + if (vma->vm_flags & VM_SHARED) { >> + set_write = can_change_shared_pte_writable(vma, ptent); >> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, >> + /* idx = */ 0, set_write, tlb); >> + return; >> + } >> + >> + set_write = maybe_change_pte_writable(vma, ptent) && >> + (folio && folio_test_anon(folio)); >> + if (!set_write) { >> + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, >> + /* idx = */ 0, set_write, tlb); >> + return; >> + } >> + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); >> +} >> + >> static long change_pte_range(struct mmu_gather *tlb, >> struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, >> unsigned long end, pgprot_t newprot, unsigned long cp_flags) >> @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_gather *tlb, >> nr_ptes = 1; >> oldpte = ptep_get(pte); >> if (pte_present(oldpte)) { >> + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; >> int max_nr_ptes = (end - addr) >> PAGE_SHIFT; >> - struct folio *folio; >> + struct folio *folio = NULL; >> pte_t ptent; >> >> /* >> @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_gather *tlb, >> >> /* determine batch to skip */ >> nr_ptes = mprotect_folio_pte_batch(folio, >> - pte, oldpte, max_nr_ptes); >> + pte, oldpte, max_nr_ptes, /* flags = */ 0); >> continue; >> } >> } >> >> + if (!folio) >> + folio = vm_normal_folio(vma, addr, oldpte); >> + >> + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); >> + >> oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); >> ptent = pte_modify(oldpte, newprot); >> >> @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_gather *tlb, >> * COW or special handling is required. >> */ >> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && >> - !pte_write(ptent) && >> - can_change_pte_writable(vma, addr, ptent)) >> - ptent = pte_mkwrite(ptent, vma); >> - >> - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); >> - if (pte_needs_flush(oldpte, ptent)) >> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); >> - pages++; >> + !pte_write(ptent)) >> + set_write_prot_commit_flush_ptes(vma, folio, >> + addr, pte, oldpte, ptent, nr_ptes, tlb); >> + else >> + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, >> + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); >> + pages += nr_ptes; >> } else if (is_swap_pte(oldpte)) { >> swp_entry_t entry = pte_to_swp_entry(oldpte); >> pte_t newpte; >> -- >> 2.30.2 >>
On Sat, Jul 19, 2025 at 07:16:48PM +0530, Dev Jain wrote: > > On 19/07/25 12:19 am, Lorenzo Stoakes wrote: > > On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote: > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Thanks! You're welcome :) > > > > > > --- > > > mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++----- > > > 1 file changed, 113 insertions(+), 12 deletions(-) > > > > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > > index a1c7d8a4648d..2ddd37b2f462 100644 > > > --- a/mm/mprotect.c > > > +++ b/mm/mprotect.c > > > @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > > > } > > > > > > static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > > > - pte_t pte, int max_nr_ptes) > > > + pte_t pte, int max_nr_ptes, fpb_t flags) > > > { > > > /* No underlying folio, so cannot batch */ > > > if (!folio) > > > @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, > > > if (!folio_test_large(folio)) > > > return 1; > > > > > > - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); > > > + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); > > > } > > > > > > static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > > > @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, > > > return ret; > > > } > > > > > > +/* Set nr_ptes number of ptes, starting from idx */ > > > +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, > > > + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, > > > + int idx, bool set_write, struct mmu_gather *tlb) > > > +{ > > > + /* > > > + * Advance the position in the batch by idx; note that if idx > 0, > > > + * then the nr_ptes passed here is <= batch size - idx. > > > + */ > > > + addr += idx * PAGE_SIZE; > > > + ptep += idx; > > > + oldpte = pte_advance_pfn(oldpte, idx); > > > + ptent = pte_advance_pfn(ptent, idx); > > > + > > > + if (set_write) > > > + ptent = pte_mkwrite(ptent, vma); > > > + > > > + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); > > > + if (pte_needs_flush(oldpte, ptent)) > > > + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); > > > +} > > > + > > > +/* > > > + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or > > > + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce > > > + * that the ptes point to consecutive pages of the same anon large folio. > > > + */ > > > +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, > > > + struct page *first_page, bool expected_anon_exclusive) > > > +{ > > > + int idx; > > Nit but: > > > > int end = start_idx + max_len; > > > > for (idx = start_idx + 1; idx < end; idx++) { > > > > Would be a little neater here. > > I politely disagree :) start_idx + max_len is *obviously* the > end index, no need to add one more line of code asserting that. Haha, well disagreement is permitted you know ;) as long as it's polite of course... That's fine, this isn't a big deal. > > > > > > > + > > > + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { > > Nitty again but the below might be a little clearer? > > > > struct page *page = &firstpage[idx]; > > > > if (expected_anon_exclusive != PageAnonExclusive(page)) > > I don't think so. first_page[idx] may confuse us into thinking that > we have an array of pages. 
Also, the way you define it assigns a > stack address to struct page *page; this is not a problem in theory > and the code will still be correct, but I will prefer struct page *page > containing the actual address of the linear map struct page, which is > vmemmap + PFN. The way I write it is, I initialize first_page from folio_page() > which will derive the address from folio->page, and folio was derived from > vm_normal_folio() (which was derived from the PFN in the PTE), therefore > first_page will contain the actual vmemmap address of corresponding struct page, > hence it is guaranteed that first_page + x will give me the x'th page in > the folio. OK, I don't think this is an issue, but I"m not going to press this, as it's a trivial thing, it's fine as-is :) > > > > > > > > > + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) > > > + break; > > > + } > > > + return idx - start_idx; > > > +} > > > + > > > +/* > > > + * This function is a result of trying our very best to retain the > > > + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), > > > + * if the vma is a private vma, and we cannot determine whether to change > > > + * the pte to writable just from the vma and the pte, we then need to look > > > + * at the actual page pointed to by the pte. Unfortunately, if we have a > > > + * batch of ptes pointing to consecutive pages of the same anon large folio, > > > + * the anon-exclusivity (or the negation) of the first page does not guarantee > > > + * the anon-exclusivity (or the negation) of the other pages corresponding to > > > + * the pte batch; hence in this case it is incorrect to decide to change or > > > + * not change the ptes to writable just by using information from the first > > > + * pte of the batch. Therefore, we must individually check all pages and > > > + * retrieve sub-batches. > > > + */ > > Nice comment thanks. > > > > > +static void commit_anon_folio_batch(struct vm_area_struct *vma, > > > + struct folio *folio, unsigned long addr, pte_t *ptep, > > > + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) > > > +{ > > > + struct page *first_page = folio_page(folio, 0); > > > + bool expected_anon_exclusive; > > > + int sub_batch_idx = 0; > > > + int len; > > > + > > > + while (nr_ptes) { > > I'd prefer this to be: > > > > int i; > > > > ... > > > > for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) { > > > > > + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); > > We won't be able to do nr_ptes -= len with this. And personally a while loop > is clearer to me here. Well, you don't need to :) maybe rename i to pte_idx + pass nr_ptes - pte_idx. Buuuut I'm not going to press this, it's not a big deal, and I see your point! Overall the R-b tag still stands with the above unchanged. Thanks for doing this series and being open to feedback, I feel we're iterated to something nice here! Cheers, Lorenzo
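For reference, the for-loop shape Lorenzo is sketching across these two mails would look roughly like this (illustration only; as discussed, the series kept the while-loop form):

	struct page *first_page = folio_page(folio, 0);
	bool expected_anon_exclusive;
	int sub_batch_idx = 0;
	int pte_idx, len;

	for (pte_idx = 0; pte_idx < nr_ptes; pte_idx += len, sub_batch_idx += len) {
		expected_anon_exclusive = PageAnonExclusive(&first_page[sub_batch_idx]);
		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes - pte_idx,
				first_page, expected_anon_exclusive);
		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len,
				sub_batch_idx, expected_anon_exclusive, tlb);
	}

Both indices advance in lockstep here, which is part of why Dev found the decrementing while loop at least as clear; the Reviewed-by stood with the original form either way.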
On 20/07/25 4:50 pm, Lorenzo Stoakes wrote: > On Sat, Jul 19, 2025 at 07:16:48PM +0530, Dev Jain wrote: >> On 19/07/25 12:19 am, Lorenzo Stoakes wrote: >>> On Fri, Jul 18, 2025 at 02:32:43PM +0530, Dev Jain wrote: >>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >> Thanks! > You're welcome :) > >>>> --- >>>> mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++++----- >>>> 1 file changed, 113 insertions(+), 12 deletions(-) >>>> >>>> diff --git a/mm/mprotect.c b/mm/mprotect.c >>>> index a1c7d8a4648d..2ddd37b2f462 100644 >>>> --- a/mm/mprotect.c >>>> +++ b/mm/mprotect.c >>>> @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, >>>> } >>>> >>>> static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >>>> - pte_t pte, int max_nr_ptes) >>>> + pte_t pte, int max_nr_ptes, fpb_t flags) >>>> { >>>> /* No underlying folio, so cannot batch */ >>>> if (!folio) >>>> @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, >>>> if (!folio_test_large(folio)) >>>> return 1; >>>> >>>> - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); >>>> + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); >>>> } >>>> >>>> static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, >>>> @@ -177,6 +177,102 @@ static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, >>>> return ret; >>>> } >>>> >>>> +/* Set nr_ptes number of ptes, starting from idx */ >>>> +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, >>>> + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, >>>> + int idx, bool set_write, struct mmu_gather *tlb) >>>> +{ >>>> + /* >>>> + * Advance the position in the batch by idx; note that if idx > 0, >>>> + * then the nr_ptes passed here is <= batch size - idx. >>>> + */ >>>> + addr += idx * PAGE_SIZE; >>>> + ptep += idx; >>>> + oldpte = pte_advance_pfn(oldpte, idx); >>>> + ptent = pte_advance_pfn(ptent, idx); >>>> + >>>> + if (set_write) >>>> + ptent = pte_mkwrite(ptent, vma); >>>> + >>>> + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); >>>> + if (pte_needs_flush(oldpte, ptent)) >>>> + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); >>>> +} >>>> + >>>> +/* >>>> + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or >>>> + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce >>>> + * that the ptes point to consecutive pages of the same anon large folio. >>>> + */ >>>> +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, >>>> + struct page *first_page, bool expected_anon_exclusive) >>>> +{ >>>> + int idx; >>> Nit but: >>> >>> int end = start_idx + max_len; >>> >>> for (idx = start_idx + 1; idx < end; idx++) { >>> >>> Would be a little neater here. >> I politely disagree :) start_idx + max_len is *obviously* the >> end index, no need to add one more line of code asserting that. > Haha, well disagreement is permitted you know ;) as long as it's polite of > course... > > That's fine, this isn't a big deal. > >> >>>> + >>>> + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { >>> Nitty again but the below might be a little clearer? >>> >>> struct page *page = &firstpage[idx]; >>> >>> if (expected_anon_exclusive != PageAnonExclusive(page)) >> I don't think so. first_page[idx] may confuse us into thinking that >> we have an array of pages. 
Also, the way you define it assigns a >> stack address to struct page *page; this is not a problem in theory >> and the code will still be correct, but I will prefer struct page *page >> containing the actual address of the linear map struct page, which is >> vmemmap + PFN. The way I write it is, I initialize first_page from folio_page() >> which will derive the address from folio->page, and folio was derived from >> vm_normal_folio() (which was derived from the PFN in the PTE), therefore >> first_page will contain the actual vmemmap address of corresponding struct page, >> hence it is guaranteed that first_page + x will give me the x'th page in >> the folio. > OK, I don't think this is an issue, but I"m not going to press this, as it's a > trivial thing, it's fine as-is :) > >> >>> >>>> + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) >>>> + break; >>>> + } >>>> + return idx - start_idx; >>>> +} >>>> + >>>> +/* >>>> + * This function is a result of trying our very best to retain the >>>> + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), >>>> + * if the vma is a private vma, and we cannot determine whether to change >>>> + * the pte to writable just from the vma and the pte, we then need to look >>>> + * at the actual page pointed to by the pte. Unfortunately, if we have a >>>> + * batch of ptes pointing to consecutive pages of the same anon large folio, >>>> + * the anon-exclusivity (or the negation) of the first page does not guarantee >>>> + * the anon-exclusivity (or the negation) of the other pages corresponding to >>>> + * the pte batch; hence in this case it is incorrect to decide to change or >>>> + * not change the ptes to writable just by using information from the first >>>> + * pte of the batch. Therefore, we must individually check all pages and >>>> + * retrieve sub-batches. >>>> + */ >>> Nice comment thanks. >>> >>>> +static void commit_anon_folio_batch(struct vm_area_struct *vma, >>>> + struct folio *folio, unsigned long addr, pte_t *ptep, >>>> + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) >>>> +{ >>>> + struct page *first_page = folio_page(folio, 0); >>>> + bool expected_anon_exclusive; >>>> + int sub_batch_idx = 0; >>>> + int len; >>>> + >>>> + while (nr_ptes) { >>> I'd prefer this to be: >>> >>> int i; >>> >>> ... >>> >>> for (i = 0; i < nr_ptes; i += len, sub_batch_idx += len) { >>> >>>> + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); >> We won't be able to do nr_ptes -= len with this. And personally a while loop >> is clearer to me here. > Well, you don't need to :) maybe rename i to pte_idx + pass nr_ptes - pte_idx. > > Buuuut I'm not going to press this, it's not a big deal, and I see your point! > > Overall the R-b tag still stands with the above unchanged. > > Thanks for doing this series and being open to feedback, I feel we're iterated > to something nice here! We definitely iterated to something nice! I definitely found your review very useful; as contributors trying to solve our own problems, we simply keep adding new features and the code becomes a mess of if-else statements - this is what I felt when I began reading kernel code. I am reading code and suddenly some conditional handling a uffd-wp pte appears (I know you have some nice words about uffd :)) Thank you for actually forcing me to think into how I can implement my ideas into code better! > > Cheers, Lorenzo