[PATCH v2 6/7] mm: Batch around can_change_pte_writable()

Posted by Dev Jain 9 months, 2 weeks ago
In preparation for patch 7, we need to properly batch around
can_change_pte_writable(). We batch around pte_needs_soft_dirty_wp() via
the corresponding fpb flag, and we batch around the page-anon-exclusive
check using folio_maybe_mapped_shared(). modify_prot_start_ptes() collects
the dirty and access bits across the batch, thereby batching across
pte_dirty(): this is correct because the dirty bit on the PTE is really
just an indication that the folio got written to, so even if the PTE
itself is not dirty (but one of the PTEs in the batch is), the wp-fault
optimization can still be made.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/mm.h |  4 ++--
 mm/gup.c           |  2 +-
 mm/huge_memory.c   |  4 ++--
 mm/memory.c        |  6 +++---
 mm/mprotect.c      | 11 ++++++-----
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21dd110b6655..2f639f6d93f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2487,8 +2487,8 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte);
+bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, struct folio *folio, unsigned int nr);
 extern long change_protection(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, unsigned long cp_flags);
diff --git a/mm/gup.c b/mm/gup.c
index f32168339390..c39f587842a0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -614,7 +614,7 @@ static inline bool can_follow_write_common(struct page *page,
 		return false;
 
 	/*
-	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * See can_change_ptes_writable(): we broke COW and could map the page
 	 * writable if we have an exclusive anonymous page ...
 	 */
 	return page && PageAnon(page) && PageAnonExclusive(page);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..a58445fcedfc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2032,12 +2032,12 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
 		return false;
 
 	if (!(vma->vm_flags & VM_SHARED)) {
-		/* See can_change_pte_writable(). */
+		/* See can_change_ptes_writable(). */
 		page = vm_normal_page_pmd(vma, addr, pmd);
 		return page && PageAnon(page) && PageAnonExclusive(page);
 	}
 
-	/* See can_change_pte_writable(). */
+	/* See can_change_ptes_writable(). */
 	return pmd_dirty(pmd);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 68c1d962d0ad..e7ebc6b70421 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -750,7 +750,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
 		pte = pte_mkuffd_wp(pte);
 
 	if ((vma->vm_flags & VM_WRITE) &&
-	    can_change_pte_writable(vma, address, pte)) {
+	    can_change_ptes_writable(vma, address, pte, NULL, 1)) {
 		if (folio_test_dirty(folio))
 			pte = pte_mkdirty(pte);
 		pte = pte_mkwrite(pte, vma);
@@ -5796,7 +5796,7 @@ static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_stru
 			ptent = pte_modify(ptent, vma->vm_page_prot);
 			writable = pte_write(ptent);
 			if (!writable && pte_write_upgrade &&
-			    can_change_pte_writable(vma, addr, ptent))
+			    can_change_ptes_writable(vma, addr, ptent, NULL, 1))
 				writable = true;
 		}
 
@@ -5837,7 +5837,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 */
 	writable = pte_write(pte);
 	if (!writable && pte_write_upgrade &&
-	    can_change_pte_writable(vma, vmf->address, pte))
+	    can_change_ptes_writable(vma, vmf->address, pte, NULL, 1))
 		writable = true;
 
 	folio = vm_normal_folio(vma, vmf->address, pte);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ec5d17af7650..baff009fc981 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -40,8 +40,8 @@
 
 #include "internal.h"
 
-bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
-			     pte_t pte)
+bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
+			      pte_t pte, struct folio *folio, unsigned int nr)
 {
 	struct page *page;
 
@@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 		 * write-fault handler similarly would map them writable without
 		 * any additional checks while holding the PT lock.
 		 */
-		page = vm_normal_page(vma, addr, pte);
-		return page && PageAnon(page) && PageAnonExclusive(page);
+		if (!folio)
+			folio = vm_normal_folio(vma, addr, pte);
+		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
 	}
 
 	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
@@ -222,7 +223,7 @@ static long change_pte_range(struct mmu_gather *tlb,
 			 */
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
+			    can_change_ptes_writable(vma, addr, ptent, folio, 1))
 				ptent = pte_mkwrite(ptent, vma);
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-- 
2.30.2
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by kernel test robot 9 months, 2 weeks ago
Hi Dev,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on arm64/for-next/core linus/master v6.15-rc4 next-20250429]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-Refactor-code-in-mprotect/20250429-133151
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250429052336.18912-7-dev.jain%40arm.com
patch subject: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
config: arm64-randconfig-002-20250430 (https://download.01.org/0day-ci/archive/20250430/202504301306.AU2G1yvg-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250430/202504301306.AU2G1yvg-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504301306.AU2G1yvg-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/mprotect.c:46:15: warning: unused variable 'page' [-Wunused-variable]
      46 |         struct page *page;
         |                      ^~~~
   mm/mprotect.c:226:51: error: use of undeclared identifier 'folio'
     226 |                             can_change_ptes_writable(vma, addr, ptent, folio, 1))
         |                                                                        ^
   1 warning and 1 error generated.


vim +/page +46 mm/mprotect.c

36f881883c5794 Kirill A. Shutemov 2015-06-24  42  
695112a1385b39 Dev Jain           2025-04-29  43  bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
695112a1385b39 Dev Jain           2025-04-29  44  			      pte_t pte, struct folio *folio, unsigned int nr)
64fe24a3e05e5f David Hildenbrand  2022-06-14  45  {
64fe24a3e05e5f David Hildenbrand  2022-06-14 @46  	struct page *page;
64fe24a3e05e5f David Hildenbrand  2022-06-14  47  
7ea7e333842ed5 David Hildenbrand  2022-11-08  48  	if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
7ea7e333842ed5 David Hildenbrand  2022-11-08  49  		return false;
64fe24a3e05e5f David Hildenbrand  2022-06-14  50  
7ea7e333842ed5 David Hildenbrand  2022-11-08  51  	/* Don't touch entries that are not even readable. */
d84887739d5c98 Nadav Amit         2022-11-08  52  	if (pte_protnone(pte))
64fe24a3e05e5f David Hildenbrand  2022-06-14  53  		return false;
64fe24a3e05e5f David Hildenbrand  2022-06-14  54  
64fe24a3e05e5f David Hildenbrand  2022-06-14  55  	/* Do we need write faults for softdirty tracking? */
f38ee285191813 Barry Song         2024-06-08  56  	if (pte_needs_soft_dirty_wp(vma, pte))
64fe24a3e05e5f David Hildenbrand  2022-06-14  57  		return false;
64fe24a3e05e5f David Hildenbrand  2022-06-14  58  
64fe24a3e05e5f David Hildenbrand  2022-06-14  59  	/* Do we need write faults for uffd-wp tracking? */
64fe24a3e05e5f David Hildenbrand  2022-06-14  60  	if (userfaultfd_pte_wp(vma, pte))
64fe24a3e05e5f David Hildenbrand  2022-06-14  61  		return false;
64fe24a3e05e5f David Hildenbrand  2022-06-14  62  
64fe24a3e05e5f David Hildenbrand  2022-06-14  63  	if (!(vma->vm_flags & VM_SHARED)) {
64fe24a3e05e5f David Hildenbrand  2022-06-14  64  		/*
7ea7e333842ed5 David Hildenbrand  2022-11-08  65  		 * Writable MAP_PRIVATE mapping: We can only special-case on
7ea7e333842ed5 David Hildenbrand  2022-11-08  66  		 * exclusive anonymous pages, because we know that our
7ea7e333842ed5 David Hildenbrand  2022-11-08  67  		 * write-fault handler similarly would map them writable without
7ea7e333842ed5 David Hildenbrand  2022-11-08  68  		 * any additional checks while holding the PT lock.
64fe24a3e05e5f David Hildenbrand  2022-06-14  69  		 */
695112a1385b39 Dev Jain           2025-04-29  70  		if (!folio)
695112a1385b39 Dev Jain           2025-04-29  71  			folio = vm_normal_folio(vma, addr, pte);
695112a1385b39 Dev Jain           2025-04-29  72  		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
64fe24a3e05e5f David Hildenbrand  2022-06-14  73  	}
64fe24a3e05e5f David Hildenbrand  2022-06-14  74  
fce831c92092ad David Hildenbrand  2024-05-22  75  	VM_WARN_ON_ONCE(is_zero_pfn(pte_pfn(pte)) && pte_dirty(pte));
fce831c92092ad David Hildenbrand  2024-05-22  76  
7ea7e333842ed5 David Hildenbrand  2022-11-08  77  	/*
7ea7e333842ed5 David Hildenbrand  2022-11-08  78  	 * Writable MAP_SHARED mapping: "clean" might indicate that the FS still
7ea7e333842ed5 David Hildenbrand  2022-11-08  79  	 * needs a real write-fault for writenotify
7ea7e333842ed5 David Hildenbrand  2022-11-08  80  	 * (see vma_wants_writenotify()). If "dirty", the assumption is that the
7ea7e333842ed5 David Hildenbrand  2022-11-08  81  	 * FS was already notified and we can simply mark the PTE writable
7ea7e333842ed5 David Hildenbrand  2022-11-08  82  	 * just like the write-fault handler would do.
7ea7e333842ed5 David Hildenbrand  2022-11-08  83  	 */
d84887739d5c98 Nadav Amit         2022-11-08  84  	return pte_dirty(pte);
64fe24a3e05e5f David Hildenbrand  2022-06-14  85  }
64fe24a3e05e5f David Hildenbrand  2022-06-14  86  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by David Hildenbrand 9 months, 2 weeks ago
>   #include "internal.h"
>   
> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte)
> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> +			      pte_t pte, struct folio *folio, unsigned int nr)
>   {
>   	struct page *page;
>   
> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>   		 * write-fault handler similarly would map them writable without
>   		 * any additional checks while holding the PT lock.
>   		 */
> -		page = vm_normal_page(vma, addr, pte);
> -		return page && PageAnon(page) && PageAnonExclusive(page);
> +		if (!folio)
> +			folio = vm_normal_folio(vma, addr, pte);
> +		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);

Oh no, now I spot it. That is horribly wrong.

Please understand first what you are doing.

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 29.04.25 11:19, David Hildenbrand wrote:
> 
>>    #include "internal.h"
>>    
>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>> -			     pte_t pte)
>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>> +			      pte_t pte, struct folio *folio, unsigned int nr)
>>    {
>>    	struct page *page;
>>    
>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>    		 * write-fault handler similarly would map them writable without
>>    		 * any additional checks while holding the PT lock.
>>    		 */
>> -		page = vm_normal_page(vma, addr, pte);
>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>> +		if (!folio)
>> +			folio = vm_normal_folio(vma, addr, pte);
>> +		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
> 
> Oh no, now I spot it. That is horribly wrong.
> 
> Please understand first what you are doing.

Also, would expect that the cow.c selftest would catch that:

"vmsplice() + unmap in child with mprotect() optimization"

After fork() we have a R/O PTE in the parent. Our child then uses 
vmsplice() and unmaps the R/O PTE, meaning it is only left mapped by the 
parent.

ret = mprotect(mem, size, PROT_READ);
ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);

should turn the PTE writable, although it shouldn't.

If that test case does not detect the issue you're introducing, we 
should look into adding a test case that detects it.

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by Dev Jain 9 months, 1 week ago

On 29/04/25 2:57 pm, David Hildenbrand wrote:
> On 29.04.25 11:19, David Hildenbrand wrote:
>>
>>>    #include "internal.h"
>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> -                 pte_t pte)
>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                  pte_t pte, struct folio *folio, unsigned int nr)
>>>    {
>>>        struct page *page;
>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>             * write-fault handler similarly would map them writable without
>>>             * any additional checks while holding the PT lock.
>>>             */
>>> -        page = vm_normal_page(vma, addr, pte);
>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>> +        if (!folio)
>>> +            folio = vm_normal_folio(vma, addr, pte);
>>> +        return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>
>> Oh no, now I spot it. That is horribly wrong.
>>
>> Please understand first what you are doing.
> 
> Also, would expect that the cow.c selftest would catch that:
> 
> "vmsplice() + unmap in child with mprotect() optimization"
> 
> After fork() we have a R/O PTE in the parent. Our child then uses 
> vmsplice() and unmaps the R/O PTE, meaning it is only left mapped by the 
> parent.
> 
> ret = mprotect(mem, size, PROT_READ);
> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
> 
> should turn the PTE writable, although it shouldn't.
> 
> If that test case does not detect the issue you're introducing, we 
> should look into adding a test case that detects it.
> 

Hi David, I am afraid I don't understand my mistake :( PageAnon(page)
boils down to folio_test_anon(folio). Next we want to determine whether
the page underlying a PTE is mapped exclusively or not. I approximate
this with folio_maybe_mapped_shared(): if the folio, and hence every page
in it, is mapped exclusively, then I convert the entire batch to
writable. If one of the pages is mapped shared, then I do not convert
the batch to writable, thus missing out on the optimization. As far as
I understand, the test failure points out exactly this, right?

Do you suggest an alternate way? My initial approach was to add a new
flag to folio_pte_batch(): FPB_IGNORE_ANON_EXCLUSIVE, but from an API
design PoV Ryan pointed out that it looked bad.

Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by David Hildenbrand 9 months, 1 week ago
On 06.05.25 11:16, Dev Jain wrote:
> 
> 
> On 29/04/25 2:57 pm, David Hildenbrand wrote:
>> On 29.04.25 11:19, David Hildenbrand wrote:
>>>
>>>>     #include "internal.h"
>>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -                 pte_t pte)
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +                  pte_t pte, struct folio *folio, unsigned int nr)
>>>>     {
>>>>         struct page *page;
>>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>>              * write-fault handler similarly would map them writable without
>>>>              * any additional checks while holding the PT lock.
>>>>              */
>>>> -        page = vm_normal_page(vma, addr, pte);
>>>> -        return page && PageAnon(page) && PageAnonExclusive(page);
>>>> +        if (!folio)
>>>> +            folio = vm_normal_folio(vma, addr, pte);
>>>> +        return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>>
>>> Oh no, now I spot it. That is horribly wrong.
>>>
>>> Please understand first what you are doing.
>>
>> Also, would expect that the cow.c selftest would catch that:
>>
>> "vmsplice() + unmap in child with mprotect() optimization"
>>
>> After fork() we have a R/O PTE in the parent. Our child then uses
>> vmsplice() and unmaps the R/O PTE, meaning it is only left mapped by the
>> parent.
>>
>> ret = mprotect(mem, size, PROT_READ);
>> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>>
>> should turn the PTE writable, although it shouldn't.
>>
>> If that test case does not detect the issue you're introducing, we
>> should look into adding a test case that detects it.
>>
> 
> Hi David, I am afraid I don't understand my mistake :( PageAnon(page)
> boils down to folio_test_anon(folio). Next we want to determine whether
> the page underlying a PTE is mapped exclusively or not.

No. :)

There is your mistake.

We need to know whether this folio page is *exclusive*, not whether the
folio is *mapped exclusively*.

I know, it's confusing, but that's an important distinction.

You really have to test all PAE bits. I recently sent a mail on how we
could remove PAE and encode it in the PTE value itself (W || (!W && dirty)),
which would mean that we could batch more easily. For the time
being, we have to stick with what we have.

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by Lorenzo Stoakes 9 months, 2 weeks ago
On Tue, Apr 29, 2025 at 11:27:43AM +0200, David Hildenbrand wrote:
> On 29.04.25 11:19, David Hildenbrand wrote:
> >
> > >    #include "internal.h"
> > > -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > > -			     pte_t pte)
> > > +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
> > > +			      pte_t pte, struct folio *folio, unsigned int nr)
> > >    {
> > >    	struct page *page;
> > > @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > >    		 * write-fault handler similarly would map them writable without
> > >    		 * any additional checks while holding the PT lock.
> > >    		 */
> > > -		page = vm_normal_page(vma, addr, pte);
> > > -		return page && PageAnon(page) && PageAnonExclusive(page);
> > > +		if (!folio)
> > > +			folio = vm_normal_folio(vma, addr, pte);
> > > +		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
> >
> > Oh no, now I spot it. That is horribly wrong.
> >
> > Please understand first what you are doing.
>
> Also, would expect that the cow.c selftest would catch that:
>
> "vmsplice() + unmap in child with mprotect() optimization"
>
> After fork() we have a R/O PTE in the parent. Our child then uses vmsplice()
> and unmaps the R/O PTE, meaning it is only left mapped by the parent.
>
> ret = mprotect(mem, size, PROT_READ);
> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>
> should turn the PTE writable, although it shouldn't.

This makes me concerned about the stability of this series as a whole...

>
> If that test case does not detect the issue you're introducing, we should
> look into adding a test case that detects it.

There are 25 tests that fail for the cow self-test with this series
applied:

# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with base page
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (16 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (16 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (16 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (32 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (32 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (32 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (64 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (64 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (64 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (128 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (128 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (128 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (256 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (256 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (256 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (512 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (512 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (512 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (1024 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (1024 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (1024 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (2048 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (2048 kB)
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (2048 kB)


Dev, please take a little more time to test your series :) The current
patch set doesn't compile without fixes applied, we're at v2, and you've
clearly not run the self-tests, as these also fail.

Please ensure you do a smoke test and check compilation before sending
out, as well as running the self-tests.

Thanks, Lorenzo

>
> --
> Cheers,
>
> David / dhildenb
>
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by Dev Jain 9 months, 2 weeks ago

On 29/04/25 7:27 pm, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 11:27:43AM +0200, David Hildenbrand wrote:
>> On 29.04.25 11:19, David Hildenbrand wrote:
>>>
>>>>     #include "internal.h"
>>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -			     pte_t pte)
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +			      pte_t pte, struct folio *folio, unsigned int nr)
>>>>     {
>>>>     	struct page *page;
>>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>>     		 * write-fault handler similarly would map them writable without
>>>>     		 * any additional checks while holding the PT lock.
>>>>     		 */
>>>> -		page = vm_normal_page(vma, addr, pte);
>>>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>>>> +		if (!folio)
>>>> +			folio = vm_normal_folio(vma, addr, pte);
>>>> +		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>>
>>> Oh no, now I spot it. That is horribly wrong.
>>>
>>> Please understand first what you are doing.
>>
>> Also, would expect that the cow.c selftest would catch that:
>>
>> "vmsplice() + unmap in child with mprotect() optimization"
>>
>> After fork() we have a R/O PTE in the parent. Our child then uses vmsplice()
>> and unmaps the R/O PTE, meaning it is only left mapped by the parent.
>>
>> ret = mprotect(mem, size, PROT_READ);
>> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>>
>> should turn the PTE writable, although it shouldn't.
> 
> This makes me concerned about the stability of this series as a whole...
> 
>>
>> If that test case does not detect the issue you're introducing, we should
>> look into adding a test case that detects it.
> 
> There are 25 tests that fail for the cow self-test with this series
> applied:
> 
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with base page
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (2048 kB)
> 
> 
> Dev, please take a little more time to test your series :) the current
> patch set doesn't compile and needs fixes applied to do so, and we're at
> v2, and you've clearly not run self-tests as these also fail.
> 
> Please ensure you do a smoke test and check compilation before sending out,
> as well as running self tests also.

Apologies, I over-confidently skipped over selftests, and didn't build 
for x86 :( Shall take care.

> 
> Thanks, Lorenzo
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 29.04.25 15:57, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 11:27:43AM +0200, David Hildenbrand wrote:
>> On 29.04.25 11:19, David Hildenbrand wrote:
>>>
>>>>     #include "internal.h"
>>>> -bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> -			     pte_t pte)
>>>> +bool can_change_ptes_writable(struct vm_area_struct *vma, unsigned long addr,
>>>> +			      pte_t pte, struct folio *folio, unsigned int nr)
>>>>     {
>>>>     	struct page *page;
>>>> @@ -67,8 +67,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>>     		 * write-fault handler similarly would map them writable without
>>>>     		 * any additional checks while holding the PT lock.
>>>>     		 */
>>>> -		page = vm_normal_page(vma, addr, pte);
>>>> -		return page && PageAnon(page) && PageAnonExclusive(page);
>>>> +		if (!folio)
>>>> +			folio = vm_normal_folio(vma, addr, pte);
>>>> +		return folio_test_anon(folio) && !folio_maybe_mapped_shared(folio);
>>>
>>> Oh no, now I spot it. That is horribly wrong.
>>>
>>> Please understand first what you are doing.
>>
>> Also, would expect that the cow.c selftest would catch that:
>>
>> "vmsplice() + unmap in child with mprotect() optimization"
>>
>> After fork() we have a R/O PTE in the parent. Our child then uses vmsplice()
>> and unmaps the R/O PTE, meaning it is only left mapped by the parent.
>>
>> ret = mprotect(mem, size, PROT_READ);
>> ret |= mprotect(mem, size, PROT_READ|PROT_WRITE);
>>
>> should turn the PTE writable, although it shouldn't.
> 
> This makes me concerned about the stability of this series as a whole...
> 
>>
>> If that test case does not detect the issue you're introducing, we should
>> look into adding a test case that detects it.
> 
> There are 25 tests that fail for the cow self-test with this series
> applied:
> 
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with base page
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (16 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (32 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (64 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (128 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (256 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (512 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (1024 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with PTE-mapped THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with single PTE of THP (2048 kB)
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with partially shared THP (2048 kB)

As expected ... :)

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 6/7] mm: Batch around can_change_pte_writable()
Posted by David Hildenbrand 9 months, 2 weeks ago
On 29.04.25 07:23, Dev Jain wrote:
> In preparation for patch 7, we need to properly batch around
> can_change_pte_writable(). We batch around pte_needs_soft_dirty_wp() by
> the corresponding fpb flag, we batch around the page-anon exclusive check
> using folio_maybe_mapped_shared(); modify_prot_start_ptes() collects the
> dirty and access bits across the batch, therefore batching across
> pte_dirty(): this is correct since the dirty bit on the PTE really
> is just an indication that the folio got written to, so even if
> the PTE is not actually dirty (but one of the PTEs in the batch is),
> the wp-fault optimization can be made.

If you want to add a batched version of can_change_pte_writable(), do it 
right away instead of just adding parameters to functions.

Then, add a simple

#define can_change_pte_writable(...) can_change_ptes_writable(__VA_ARGS__, 1)

So you don't have to touch all callers and don't have to update the 
function name in comments.

-- 
Cheers,

David / dhildenb