[v2] Optimize mprotect for large folios

[PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Dev Jain 9 months, 2 weeks ago

Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
Architecture can override these helpers.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b50447ef1c92..ed287289335f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+/* See the comment for ptep_modify_prot_start */
+#ifndef modify_prot_start_ptes
+static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+	pte_t pte, tmp_pte;
+
+	pte = ptep_modify_prot_start(vma, addr, ptep);
+	while (--nr) {
+		ptep++;
+		addr += PAGE_SIZE;
+		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
+		if (pte_dirty(tmp_pte))
+			pte = pte_mkdirty(pte);
+		if (pte_young(tmp_pte))
+			pte = pte_mkyoung(pte);
+	}
+	return pte;
+}
+#endif
+
+/* See the comment for ptep_modify_prot_commit */
+#ifndef modify_prot_commit_ptes
+static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
+{
+	for (;;) {
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+		old_pte = pte_next_pfn(old_pte);
+		pte = pte_next_pfn(pte);
+	}
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
-- 
2.30.2

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Ryan Roberts 9 months, 1 week ago

On 29/04/2025 06:23, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers.

I would suggest merging this with patch #7 since that's where they actually get
used. Then you can add a single patch after that to specialize them for arm64,
which will give a performance win.

Thanks,
Ryan

> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>  
> +/* See the comment for ptep_modify_prot_start */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	for (;;) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by kernel test robot 9 months, 2 weeks ago

Hi Dev,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on arm64/for-next/core linus/master v6.15-rc4 next-20250429]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-Refactor-code-in-mprotect/20250429-133151
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250429052336.18912-4-dev.jain%40arm.com
patch subject: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
config: arm-randconfig-003-20250430 (https://download.01.org/0day-ci/archive/20250430/202504301328.ltLGuTxD-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250430/202504301328.ltLGuTxD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504301328.ltLGuTxD-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:31:
>> include/linux/pgtable.h:901:8: error: call to undeclared function 'ptep_modify_prot_start'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     901 |         pte = ptep_modify_prot_start(vma, addr, ptep);
         |               ^
>> include/linux/pgtable.h:921:3: error: call to undeclared function 'ptep_modify_prot_commit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     921 |                 ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
         |                 ^
>> include/linux/pgtable.h:1356:21: error: static declaration of 'ptep_modify_prot_start' follows non-static declaration
    1356 | static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
         |                     ^
   include/linux/pgtable.h:901:8: note: previous implicit declaration is here
     901 |         pte = ptep_modify_prot_start(vma, addr, ptep);
         |               ^
   include/linux/pgtable.h:1367:20: error: static declaration of 'ptep_modify_prot_commit' follows non-static declaration
    1367 | static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
         |                    ^
   include/linux/pgtable.h:921:3: note: previous implicit declaration is here
     921 |                 ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
         |                 ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:98:11: warning: array index 3 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
      98 |                 return (set->sig[3] | set->sig[2] |
         |                         ^        ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:98:25: warning: array index 2 is past the end of the array (that has type 'unsigned long[2]') [-Warray-bounds]
      98 |                 return (set->sig[3] | set->sig[2] |
         |                                       ^        ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:114:11: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
     114 |                 return  (set1->sig[3] == set2->sig[3]) &&
         |                          ^         ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:114:27: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
     114 |                 return  (set1->sig[3] == set2->sig[3]) &&
         |                                          ^         ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:115:5: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
     115 |                         (set1->sig[2] == set2->sig[2]) &&
         |                          ^         ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:115:21: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
     115 |                         (set1->sig[2] == set2->sig[2]) &&
         |                                          ^         ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:157:1: warning: array index 3 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
     157 | _SIG_SET_BINOP(sigorsets, _sig_or)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:138:8: note: expanded from macro '_SIG_SET_BINOP'
     138 |                 a3 = a->sig[3]; a2 = a->sig[2];                         \
         |                      ^      ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:157:1: warning: array index 2 is past the end of the array (that has type 'const unsigned long[2]') [-Warray-bounds]
     157 | _SIG_SET_BINOP(sigorsets, _sig_or)
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:138:24: note: expanded from macro '_SIG_SET_BINOP'
     138 |                 a3 = a->sig[3]; a2 = a->sig[2];                         \
         |                                      ^      ~
   arch/arm/include/asm/signal.h:17:2: note: array 'sig' declared here
      17 |         unsigned long sig[_NSIG_WORDS];
         |         ^
   In file included from arch/arm/kernel/asm-offsets.c:12:
   In file included from include/linux/mm.h:36:
   In file included from include/linux/rcuwait.h:6:


vim +/ptep_modify_prot_start +901 include/linux/pgtable.h

   893	
   894	/* See the comment for ptep_modify_prot_start */
   895	#ifndef modify_prot_start_ptes
   896	static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
   897			unsigned long addr, pte_t *ptep, unsigned int nr)
   898	{
   899		pte_t pte, tmp_pte;
   900	
 > 901		pte = ptep_modify_prot_start(vma, addr, ptep);
   902		while (--nr) {
   903			ptep++;
   904			addr += PAGE_SIZE;
   905			tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
   906			if (pte_dirty(tmp_pte))
   907				pte = pte_mkdirty(pte);
   908			if (pte_young(tmp_pte))
   909				pte = pte_mkyoung(pte);
   910		}
   911		return pte;
   912	}
   913	#endif
   914	
   915	/* See the comment for ptep_modify_prot_commit */
   916	#ifndef modify_prot_commit_ptes
   917	static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
   918			pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
   919	{
   920		for (;;) {
 > 921			ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
   922			if (--nr == 0)
   923				break;
   924			ptep++;
   925			addr += PAGE_SIZE;
   926			old_pte = pte_next_pfn(old_pte);
   927			pte = pte_next_pfn(pte);
   928		}
   929	}
   930	#endif
   931	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by kernel test robot 9 months, 2 weeks ago

Hi Dev,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on arm64/for-next/core linus/master v6.15-rc4 next-20250429]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-Refactor-code-in-mprotect/20250429-133151
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250429052336.18912-4-dev.jain%40arm.com
patch subject: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
config: arm-randconfig-001-20250430 (https://download.01.org/0day-ci/archive/20250430/202504301319.aph2eccP-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250430/202504301319.aph2eccP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504301319.aph2eccP-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mm.h:31,
                    from arch/arm/kernel/asm-offsets.c:12:
   include/linux/pgtable.h: In function 'modify_prot_start_ptes':
   include/linux/pgtable.h:901:8: error: implicit declaration of function 'ptep_modify_prot_start' [-Werror=implicit-function-declaration]
     901 |  pte = ptep_modify_prot_start(vma, addr, ptep);
         |        ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/pgtable.h: In function 'modify_prot_commit_ptes':
   include/linux/pgtable.h:921:3: error: implicit declaration of function 'ptep_modify_prot_commit' [-Werror=implicit-function-declaration]
     921 |   ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
         |   ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/pgtable.h: At top level:
>> include/linux/pgtable.h:1356:21: error: conflicting types for 'ptep_modify_prot_start'
    1356 | static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
         |                     ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/pgtable.h:901:8: note: previous implicit declaration of 'ptep_modify_prot_start' was here
     901 |  pte = ptep_modify_prot_start(vma, addr, ptep);
         |        ^~~~~~~~~~~~~~~~~~~~~~
>> include/linux/pgtable.h:1367:20: warning: conflicting types for 'ptep_modify_prot_commit'
    1367 | static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
         |                    ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/pgtable.h:1367:20: error: static declaration of 'ptep_modify_prot_commit' follows non-static declaration
   include/linux/pgtable.h:921:3: note: previous implicit declaration of 'ptep_modify_prot_commit' was here
     921 |   ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
         |   ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
   make[3]: *** [scripts/Makefile.build:98: arch/arm/kernel/asm-offsets.s] Error 1 shuffle=2044487564
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1280: prepare0] Error 2 shuffle=2044487564
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:248: __sub-make] Error 2 shuffle=2044487564
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:248: __sub-make] Error 2 shuffle=2044487564
   make: Target 'prepare' not remade because of errors.


vim +/ptep_modify_prot_start +1356 include/linux/pgtable.h

1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1340  
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1341  #ifndef __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1342  /*
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1343   * Start a pte protection read-modify-write transaction, which
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1344   * protects against asynchronous hardware modifications to the pte.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1345   * The intention is not to prevent the hardware from making pte
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1346   * updates, but to prevent any updates it may make from being lost.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1347   *
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1348   * This does not protect against other software modifications of the
2eb70aab25dd9b include/linux/pgtable.h       Bhaskar Chowdhury   2021-05-06  1349   * pte; the appropriate pte lock must be held over the transaction.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1350   *
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1351   * Note that this interface is intended to be batchable, meaning that
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1352   * ptep_modify_prot_commit may not actually update the pte, but merely
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1353   * queue the update to be done at some later time.  The update must be
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1354   * actually committed before the pte lock is released, however.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1355   */
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V    2019-03-05 @1356  static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1357  					   unsigned long addr,
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1358  					   pte_t *ptep)
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1359  {
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V    2019-03-05  1360  	return __ptep_modify_prot_start(vma, addr, ptep);
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1361  }
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1362  
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1363  /*
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1364   * Commit an update to a pte, leaving any hardware-controlled bits in
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1365   * the PTE unmodified.
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1366   */
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V    2019-03-05 @1367  static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1368  					   unsigned long addr,
04a8645304500b include/asm-generic/pgtable.h Aneesh Kumar K.V    2019-03-05  1369  					   pte_t *ptep, pte_t old_pte, pte_t pte)
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1370  {
0cbe3e26abe0cf include/asm-generic/pgtable.h Aneesh Kumar K.V    2019-03-05  1371  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1372  }
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1373  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
fe1a6875fcaaac include/asm-generic/pgtable.h Sebastian Siewior   2008-07-15  1374  #endif /* CONFIG_MMU */
1ea0704e0da65b include/asm-generic/pgtable.h Jeremy Fitzhardinge 2008-06-16  1375  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Lorenzo Stoakes 9 months, 2 weeks ago

On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>
> +/* See the comment for ptep_modify_prot_start */

I feel like you really should add a little more here, perhaps point out
that it's batched etc.

> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)

This name is a bit confusing, it's not any ptes, it's those pte entries
belonging to a large folio capped to the PTE table right that you are
batching right?

Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
the name?

We definitely need to mention in comment or name or _somewhere_ the intent
and motivation for this.

> +{
> +	pte_t pte, tmp_pte;
> +

are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
love this interface, where you require the user to know the number of
remaining PTE entries in a PTE table.

> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {

This loop is a bit horrible. It seems needlessly confusing and you're in
_dire_ need of comments to explain what's going on.

So my understanding is, you have the user figure out:

nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)

Then, you want to return the pte entry belonging to the start of the large
folio batch, but you want to adjust that pte value to propagate dirty and
young page table flags if any page table entries within the range contain
those page table flags, having called ptep_modify_prot_start() on all of
them?

This is quite a bit to a. put in a header like this and b. not
comment/explain.

So maybe something like:

pte = ptep_modify_prot_start(vma, addr, ptep);

/* Iterate through large folio tail PTEs. */
for (pg = 1; pg < nr; pg++) {
	pte_t inner_pte;

	ptep++;
	addr += PAGE_SIZE;

	inner_pte = ptep_modify_prot_start(vma, addr, ptep);

	/* We must propagate A/D state from tail PTEs. */
	if (pte_dirty(inner_pte))
		pte = pte_mkdirty(pte);
	if (pte_young(inner_pte))
		pte = pte_mkyoung(pte);
}

Would work better?

> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);

> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);

Why are you propagating these?

> +	}
> +	return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */

Same comments as above, needs more meat on the bones!

> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,

Again need to reference large folio, batched or something relevant here,
'ptes' is super vague.

> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)

Nit, but you put 'p' suffix on ptep but not on 'old_pte'?

I'm even more concerned about the 'nr' API here now.

So this is now a user-calculated:

min3(large_folio_pages, number of pte entries left in ptep,
	number of pte entries left in old_pte)

It really feels like something that should be calculated here, or at least
be broken out more clearly.

You definitely _at the very least_ need to document it in a comment.

> +{
> +	for (;;) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		if (--nr == 0)
> +			break;

Why are you doing an infinite loop here with a break like this? Again feels
needlessly confusing.

I think it's ok to duplicate this single line for the sake of clarity,
also.

Which gives us:

unsigned long pg;

ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
for (pg = 1; pg < nr; pg++) {
	ptep++;
	addr += PAGE_SIZE;
	old_pte = pte_next_pfn(old_pte);
	pte = pte_next_pfn(pte);

	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
}

There are alternative approaches, but I think doing an infinite loop that
breaks and especially the confusing 'if (--foo) break;' stuff is much
harder to parse than a super simple ranged loop.

> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings
> --
> 2.30.2
>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Ryan Roberts 9 months, 1 week ago

On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 38 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..ed287289335f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>  }
>>  #endif
>>
>> +/* See the comment for ptep_modify_prot_start */
> 
> I feel like you really should add a little more here, perhaps point out
> that it's batched etc.
> 
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> 
> This name is a bit confusing, 

On naming, the existing (modern) convention for single-pte helpers is to start
the function name with ptep_. When we started adding batched versions, we took
the approach of adding _ptes as a suffix. For example:

set_pte_at()
ptep_get_and_clear_full()
ptep_set_wrprotect()

set_ptes()
get_and_clear_full_ptes()
wrprotect_ptes()

In this case, we already have ptep_modify_prot_start() and
ptep_modify_prot_commit() for the existing single-pte helper versions. So
according to the convention (or at least how I interpret the convention), the
proposed names seem reasonable.

> it's not any ptes, it's those pte entries
> belonging to a large folio capped to the PTE table right that you are
> batching right?

Yes, but again by convention, that is captured in the kerneldoc comment for the
functions. We are operating on a batch of *ptes* not on a folio or batch of
folios. But it is a requirement of the function that the batch of ptes all lie
within a single large folio (i.e. the pfns are sequential).
 > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> the name?
> 
> We definitely need to mention in comment or name or _somewhere_ the intent
> and motivation for this.

Agreed!

> 
>> +{
>> +	pte_t pte, tmp_pte;
>> +
> 
> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> love this interface, where you require the user to know the number of
> remaining PTE entries in a PTE table.

For better or worse, that's the established convention. See part of comment for
set_ptes() for example:

"""
 * Context: The caller holds the page table lock.  The pages all belong
 * to the same folio.  The PTEs are all in the same PMD.
"""

> 
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
> 
> This loop is a bit horrible. It seems needlessly confusing and you're in
> _dire_ need of comments to explain what's going on.
> 
> So my understanding is, you have the user figure out:
> 
> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> 
> Then, you want to return the pte entry belonging to the start of the large
> folio batch, but you want to adjust that pte value to propagate dirty and
> young page table flags if any page table entries within the range contain
> those page table flags, having called ptep_modify_prot_start() on all of
> them?
> 
> This is quite a bit to a. put in a header like this and b. not
> comment/explain.

This style is copied from get_and_clear_full_ptes(), which has this comment,
which explains all this complexity. My vote would be to have a simple comment
for this function:

/**
 * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
 *			     the same folio, collecting dirty/accessed bits.
 * @mm: Address space the pages are mapped into.
 * @addr: Address the first page is mapped at.
 * @ptep: Page table pointer for the first entry.
 * @nr: Number of entries to clear.
 * @full: Whether we are clearing a full mm.
 *
 * May be overridden by the architecture; otherwise, implemented as a simple
 * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
 * returned PTE.
 *
 * Note that PTE bits in the PTE range besides the PFN can differ. For example,
 * some PTEs might be write-protected.
 *
 * Context: The caller holds the page table lock.  The PTEs map consecutive
 * pages that belong to the same folio.  The PTEs are all in the same PMD.
 */

> 
> So maybe something like:
> 
> pte = ptep_modify_prot_start(vma, addr, ptep);
> 
> /* Iterate through large folio tail PTEs. */
> for (pg = 1; pg < nr; pg++) {
> 	pte_t inner_pte;
> 
> 	ptep++;
> 	addr += PAGE_SIZE;
> 
> 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> 
> 	/* We must propagate A/D state from tail PTEs. */
> 	if (pte_dirty(inner_pte))
> 		pte = pte_mkdirty(pte);
> 	if (pte_young(inner_pte))
> 		pte = pte_mkyoung(pte);
> }
> 
> Would work better?
> 
> 
> 
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> 
> 
> 
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
> 
> Why are you propagating these?
> 
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/* See the comment for ptep_modify_prot_commit */
> 
> Same comments as above, needs more meat on the bones!
> 
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> 
> Again need to reference large folio, batched or something relevant here,
> 'ptes' is super vague.
> 
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> 
> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
> 
> I'm even more concerned about the 'nr' API here now.
> 
> So this is now a user-calculated:
> 
> min3(large_folio_pages, number of pte entries left in ptep,
> 	number of pte entries left in old_pte)
> 
> It really feels like something that should be calculated here, or at least
> be broken out more clearly.
> 
> You definitely _at the very least_ need to document it in a comment.
> 
>> +{
>> +	for (;;) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		if (--nr == 0)
>> +			break;
> 
> Why are you doing an infinite loop here with a break like this? Again feels
> needlessly confusing.

I agree it's not pretty to look at. But apparently it's the most efficient. This
is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
definition of set_ptes()").

For the record, I think all your comments make good sense, Lorenzo. But there is
an established style, and personally I think at this point is it more confusing
to break from that style.

Thanks,
Ryan


> 
> I think it's ok to duplicate this single line for the sake of clarity,
> also.
> 
> Which gives us:
> 
> unsigned long pg;
> 
> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> for (pg = 1; pg < nr; pg++) {
> 	ptep++;
> 	addr += PAGE_SIZE;
> 	old_pte = pte_next_pfn(old_pte);
> 	pte = pte_next_pfn(pte);
> 
> 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> }
> 
> There are alternative approaches, but I think doing an infinite loop that
> breaks and especially the confusing 'if (--foo) break;' stuff is much
> harder to parse than a super simple ranged loop.
> 
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
>> +
>>  /*
>>   * On some architectures hardware does not set page access bit when accessing
>>   * memory page, it is responsibility of software setting this bit. It brings
>> --
>> 2.30.2
>>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Lorenzo Stoakes 9 months, 1 week ago

On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> >> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >> Architecture can override these helpers.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> ---
> >>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 38 insertions(+)
> >>
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index b50447ef1c92..ed287289335f 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >>  }
> >>  #endif
> >>
> >> +/* See the comment for ptep_modify_prot_start */
> >
> > I feel like you really should add a little more here, perhaps point out
> > that it's batched etc.
> >
> >> +#ifndef modify_prot_start_ptes
> >> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >
> > This name is a bit confusing,
>
> On naming, the existing (modern) convention for single-pte helpers is to start
> the function name with ptep_. When we started adding batched versions, we took
> the approach of adding _ptes as a suffix. For example:
>
> set_pte_at()
> ptep_get_and_clear_full()
> ptep_set_wrprotect()
>
> set_ptes()
> get_and_clear_full_ptes()
> wrprotect_ptes()
>
> In this case, we already have ptep_modify_prot_start() and
> ptep_modify_prot_commit() for the existing single-pte helper versions. So
> according to the convention (or at least how I interpret the convention), the
> proposed names seem reasonable.
>

Right, I'm fine with following convention (we should), I just find 'ptes'
really ambiguous. It's not just a -set of PTE entries- it's very explicitly
for a large folio. I'd interpret some 'ptes' case to mean 'any number of
pte entries', though I suppose it'd not in practice be any different if
that were the intended use.

However, the proposed use case is large folio 'sub' PTEs and it'd be useful
in callers to know this is explicitly what you're doing.

I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
entriess you're after (not just in effect multiple PTE entries).

However, I'm less insistent on this with a comment that explains what's
going on.

I don't want to hold this up with trivialities around naming...

ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
'PxD/PTE entries' and the fact we use both as a short-hand for each
other. But that's not related to this series, just a pet peeve... :)

> > it's not any ptes, it's those pte entries
> > belonging to a large folio capped to the PTE table right that you are
> > batching right?
>
> Yes, but again by convention, that is captured in the kerneldoc comment for the
> functions. We are operating on a batch of *ptes* not on a folio or batch of
> folios. But it is a requirement of the function that the batch of ptes all lie
> within a single large folio (i.e. the pfns are sequential).

Ack, yeah don't love this nr stuff but fine if it's convention...

>  > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > the name?
> >
> > We definitely need to mention in comment or name or _somewhere_ the intent
> > and motivation for this.
>
> Agreed!

...and luckily we are aligned on this :)

>
> >
> >> +{
> >> +	pte_t pte, tmp_pte;
> >> +
> >
> > are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> > love this interface, where you require the user to know the number of
> > remaining PTE entries in a PTE table.
>
> For better or worse, that's the established convention. See part of comment for
> set_ptes() for example:
>
> """
>  * Context: The caller holds the page table lock.  The pages all belong
>  * to the same folio.  The PTEs are all in the same PMD.
> """
>
> >
> >> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> >> +	while (--nr) {
> >
> > This loop is a bit horrible. It seems needlessly confusing and you're in
> > _dire_ need of comments to explain what's going on.
> >
> > So my understanding is, you have the user figure out:
> >
> > nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >
> > Then, you want to return the pte entry belonging to the start of the large
> > folio batch, but you want to adjust that pte value to propagate dirty and
> > young page table flags if any page table entries within the range contain
> > those page table flags, having called ptep_modify_prot_start() on all of
> > them?
> >
> > This is quite a bit to a. put in a header like this and b. not
> > comment/explain.
>
> This style is copied from get_and_clear_full_ptes(), which has this comment,
> which explains all this complexity. My vote would be to have a simple comment
> for this function:
>
> /**
>  * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
>  *			     the same folio, collecting dirty/accessed bits.
>  * @mm: Address space the pages are mapped into.
>  * @addr: Address the first page is mapped at.
>  * @ptep: Page table pointer for the first entry.
>  * @nr: Number of entries to clear.
>  * @full: Whether we are clearing a full mm.
>  *
>  * May be overridden by the architecture; otherwise, implemented as a simple
>  * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
>  * returned PTE.
>  *
>  * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>  * some PTEs might be write-protected.
>  *
>  * Context: The caller holds the page table lock.  The PTEs map consecutive
>  * pages that belong to the same folio.  The PTEs are all in the same PMD.
>  */
>

OK I think the key bit here is 'consecutive pages of the same folio'.

I'd like at least a paragraph about implementation, yes the original
function doesn't have that (and should imo), something like:

	We perform the operation on the first PTE, then if any others
	follow, we invoke the ptep_modify_prot_start() for each and
	aggregate A/D bits.

Something like this.

Point taken on consistency though!

> >
> > So maybe something like:
> >
> > pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* Iterate through large folio tail PTEs. */
> > for (pg = 1; pg < nr; pg++) {
> > 	pte_t inner_pte;
> >
> > 	ptep++;
> > 	addr += PAGE_SIZE;
> >
> > 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > 	/* We must propagate A/D state from tail PTEs. */
> > 	if (pte_dirty(inner_pte))
> > 		pte = pte_mkdirty(pte);
> > 	if (pte_young(inner_pte))
> > 		pte = pte_mkyoung(pte);
> > }
> >
> > Would work better?
> >
> >
> >
> >> +		ptep++;
> >> +		addr += PAGE_SIZE;
> >> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> >
> >
> >> +		if (pte_dirty(tmp_pte))
> >> +			pte = pte_mkdirty(pte);
> >> +		if (pte_young(tmp_pte))
> >> +			pte = pte_mkyoung(pte);
> >
> > Why are you propagating these?
> >
> >> +	}
> >> +	return pte;
> >> +}
> >> +#endif
> >> +
> >> +/* See the comment for ptep_modify_prot_commit */
> >
> > Same comments as above, needs more meat on the bones!
> >
> >> +#ifndef modify_prot_commit_ptes
> >> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >
> > Again need to reference large folio, batched or something relevant here,
> > 'ptes' is super vague.
> >
> >> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >
> > Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
> >
> > I'm even more concerned about the 'nr' API here now.
> >
> > So this is now a user-calculated:
> >
> > min3(large_folio_pages, number of pte entries left in ptep,
> > 	number of pte entries left in old_pte)
> >
> > It really feels like something that should be calculated here, or at least
> > be broken out more clearly.
> >
> > You definitely _at the very least_ need to document it in a comment.
> >
> >> +{
> >> +	for (;;) {
> >> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >> +		if (--nr == 0)
> >> +			break;
> >
> > Why are you doing an infinite loop here with a break like this? Again feels
> > needlessly confusing.
>
> I agree it's not pretty to look at. But apparently it's the most efficient. This
> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
> definition of set_ptes()").
>
> For the record, I think all your comments make good sense, Lorenzo. But there is
> an established style, and personally I think at this point is it more confusing
> to break from that style.

This isn't _quite_ style, I'd say it's implementation, we're kind of
crossing over into something a little more I'd say :) but I mean I get your
point, sure.

I mean, fine, if (I presume you're referring _only_ to the for (;;) case
above) you are absolutely certain it is more performant in practice I
wouldn't want to stand in the way of that.

I would at least like a comment in the commit message about propagating an
odd loop for performance though to explain the 'qualities'... :)

>
> Thanks,
> Ryan
>
>
> >
> > I think it's ok to duplicate this single line for the sake of clarity,
> > also.
> >
> > Which gives us:
> >
> > unsigned long pg;
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > for (pg = 1; pg < nr; pg++) {
> > 	ptep++;
> > 	addr += PAGE_SIZE;
> > 	old_pte = pte_next_pfn(old_pte);
> > 	pte = pte_next_pfn(pte);
> >
> > 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > }
> >
> > There are alternative approaches, but I think doing an infinite loop that
> > breaks and especially the confusing 'if (--foo) break;' stuff is much
> > harder to parse than a super simple ranged loop.
> >
> >> +		ptep++;
> >> +		addr += PAGE_SIZE;
> >> +		old_pte = pte_next_pfn(old_pte);
> >> +		pte = pte_next_pfn(pte);
> >> +	}
> >> +}
> >> +#endif
> >> +
> >>  /*
> >>   * On some architectures hardware does not set page access bit when accessing
> >>   * memory page, it is responsibility of software setting this bit. It brings
> >> --
> >> 2.30.2
> >>
>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Ryan Roberts 9 months, 1 week ago

On 30/04/2025 15:34, Lorenzo Stoakes wrote:
> On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
>> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
>>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architecture can override these helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index b50447ef1c92..ed287289335f 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>  }
>>>>  #endif
>>>>
>>>> +/* See the comment for ptep_modify_prot_start */
>>>
>>> I feel like you really should add a little more here, perhaps point out
>>> that it's batched etc.
>>>
>>>> +#ifndef modify_prot_start_ptes
>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>>>
>>> This name is a bit confusing,
>>
>> On naming, the existing (modern) convention for single-pte helpers is to start
>> the function name with ptep_. When we started adding batched versions, we took
>> the approach of adding _ptes as a suffix. For example:
>>
>> set_pte_at()
>> ptep_get_and_clear_full()
>> ptep_set_wrprotect()
>>
>> set_ptes()
>> get_and_clear_full_ptes()
>> wrprotect_ptes()
>>
>> In this case, we already have ptep_modify_prot_start() and
>> ptep_modify_prot_commit() for the existing single-pte helper versions. So
>> according to the convention (or at least how I interpret the convention), the
>> proposed names seem reasonable.
>>
> 
> Right, I'm fine with following convention (we should), I just find 'ptes'
> really ambiguous. It's not just a -set of PTE entries- it's very explicitly
> for a large folio. I'd interpret some 'ptes' case to mean 'any number of
> pte entries', though I suppose it'd not in practice be any different if
> that were the intended use.
> 
> However, the proposed use case is large folio 'sub' PTEs and it'd be useful
> in callers to know this is explicitly what you're doing.
> 
> I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
> entriess you're after (not just in effect multiple PTE entries).

I don't mind _batched_ptes. _pte_batch would be shorter though - what do you think?

But if we go with one of these, then we should consistently apply it to all the
existing helpers IMHO - perhaps with a preparatory patch at the start of the series.

> 
> However, I'm less insistent on this with a comment that explains what's
> going on.

That would still get my vote :)

> 
> I don't want to hold this up with trivialities around naming...

There are TWO hard things in computer science; cache invalidation, naming, and
off-by-one errors :)

> 
> ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
> 'PxD/PTE entries' and the fact we use both as a short-hand for each
> other. But that's not related to this series, just a pet peeve... :)

I assume you are referring to the ambiguity between the *table* and the *entry*
(which just goes to show how ambiguous it is I guess)... I also hate this and
still trip over it all the time...

> 
>>> it's not any ptes, it's those pte entries
>>> belonging to a large folio capped to the PTE table right that you are
>>> batching right?
>>
>> Yes, but again by convention, that is captured in the kerneldoc comment for the
>> functions. We are operating on a batch of *ptes* not on a folio or batch of
>> folios. But it is a requirement of the function that the batch of ptes all lie
>> within a single large folio (i.e. the pfns are sequential).
> 
> Ack, yeah don't love this nr stuff but fine if it's convention...
> 
>>  > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
>>> the name?
>>>
>>> We definitely need to mention in comment or name or _somewhere_ the intent
>>> and motivation for this.
>>
>> Agreed!
> 
> ...and luckily we are aligned on this :)
> 
>>
>>>
>>>> +{
>>>> +	pte_t pte, tmp_pte;
>>>> +
>>>
>>> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
>>> love this interface, where you require the user to know the number of
>>> remaining PTE entries in a PTE table.
>>
>> For better or worse, that's the established convention. See part of comment for
>> set_ptes() for example:
>>
>> """
>>  * Context: The caller holds the page table lock.  The pages all belong
>>  * to the same folio.  The PTEs are all in the same PMD.
>> """
>>
>>>
>>>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>>>> +	while (--nr) {
>>>
>>> This loop is a bit horrible. It seems needlessly confusing and you're in
>>> _dire_ need of comments to explain what's going on.
>>>
>>> So my understanding is, you have the user figure out:
>>>
>>> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
>>>
>>> Then, you want to return the pte entry belonging to the start of the large
>>> folio batch, but you want to adjust that pte value to propagate dirty and
>>> young page table flags if any page table entries within the range contain
>>> those page table flags, having called ptep_modify_prot_start() on all of
>>> them?
>>>
>>> This is quite a bit to a. put in a header like this and b. not
>>> comment/explain.
>>
>> This style is copied from get_and_clear_full_ptes(), which has this comment,
>> which explains all this complexity. My vote would be to have a simple comment

Oops; I meant "similar" when my fingers somehow typed "simple"... This is not
simple :)

>> for this function:
>>
>> /**
>>  * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
>>  *			     the same folio, collecting dirty/accessed bits.
>>  * @mm: Address space the pages are mapped into.
>>  * @addr: Address the first page is mapped at.
>>  * @ptep: Page table pointer for the first entry.
>>  * @nr: Number of entries to clear.
>>  * @full: Whether we are clearing a full mm.
>>  *
>>  * May be overridden by the architecture; otherwise, implemented as a simple
>>  * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
>>  * returned PTE.
>>  *
>>  * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>>  * some PTEs might be write-protected.
>>  *
>>  * Context: The caller holds the page table lock.  The PTEs map consecutive
>>  * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>  */
>>
> 
> OK I think the key bit here is 'consecutive pages of the same folio'.
> 
> I'd like at least a paragraph about implementation, yes the original
> function doesn't have that (and should imo), something like:
> 
> 	We perform the operation on the first PTE, then if any others
> 	follow, we invoke the ptep_modify_prot_start() for each and
> 	aggregate A/D bits.
> 
> Something like this.
> 
> Point taken on consistency though!
> 
>>>
>>> So maybe something like:
>>>
>>> pte = ptep_modify_prot_start(vma, addr, ptep);
>>>
>>> /* Iterate through large folio tail PTEs. */
>>> for (pg = 1; pg < nr; pg++) {
>>> 	pte_t inner_pte;
>>>
>>> 	ptep++;
>>> 	addr += PAGE_SIZE;
>>>
>>> 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>
>>> 	/* We must propagate A/D state from tail PTEs. */
>>> 	if (pte_dirty(inner_pte))
>>> 		pte = pte_mkdirty(pte);
>>> 	if (pte_young(inner_pte))
>>> 		pte = pte_mkyoung(pte);
>>> }
>>>
>>> Would work better?
>>>
>>>
>>>
>>>> +		ptep++;
>>>> +		addr += PAGE_SIZE;
>>>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>
>>>
>>>
>>>> +		if (pte_dirty(tmp_pte))
>>>> +			pte = pte_mkdirty(pte);
>>>> +		if (pte_young(tmp_pte))
>>>> +			pte = pte_mkyoung(pte);
>>>
>>> Why are you propagating these?
>>>
>>>> +	}
>>>> +	return pte;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/* See the comment for ptep_modify_prot_commit */
>>>
>>> Same comments as above, needs more meat on the bones!
>>>
>>>> +#ifndef modify_prot_commit_ptes
>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>>>
>>> Again need to reference large folio, batched or something relevant here,
>>> 'ptes' is super vague.
>>>
>>>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>
>>> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>>>
>>> I'm even more concerned about the 'nr' API here now.
>>>
>>> So this is now a user-calculated:
>>>
>>> min3(large_folio_pages, number of pte entries left in ptep,
>>> 	number of pte entries left in old_pte)
>>>
>>> It really feels like something that should be calculated here, or at least
>>> be broken out more clearly.
>>>
>>> You definitely _at the very least_ need to document it in a comment.
>>>
>>>> +{
>>>> +	for (;;) {
>>>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>> +		if (--nr == 0)
>>>> +			break;
>>>
>>> Why are you doing an infinite loop here with a break like this? Again feels
>>> needlessly confusing.
>>
>> I agree it's not pretty to look at. But apparently it's the most efficient. This
>> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
>> definition of set_ptes()").
>>
>> For the record, I think all your comments make good sense, Lorenzo. But there is
>> an established style, and personally I think at this point is it more confusing
>> to break from that style.
> 
> This isn't _quite_ style, I'd say it's implementation, we're kind of
> crossing over into something a little more I'd say :) but I mean I get your
> point, sure.
> 
> I mean, fine, if (I presume you're referring _only_ to the for (;;) case
> above) you are absolutely certain it is more performant in practice I
> wouldn't want to stand in the way of that.

No I'm not certain at all... I'm just saying that's been the argument in the
past. I vaguely recall I even tried changing the loop style in batched helpers I
implemented in the past and David asked me to stick with the established style.

> 
> I would at least like a comment in the commit message about propagating an
> odd loop for performance though to explain the 'qualities'... :)

Just to make it clear, I'm just trying to provide some historical context, I'm
not arguing that all those decisions were perfect. How about we take these
concrete steps:

  - Stick with the _ptes naming convention
  - Add kerneldoc comments for the 2 new functions that are very clear about
    what the function does and the requirements on the batch of ptes (just like
    the other batched helpers)
  - Rework the looping styles in the 2 new functions to be more "standard";
    let's not micro-optimize unless we have real evidence that it is useful.
  - Merge this patch with the one that uses these new functions

How does that sound as a way forwards?

Thanks,
Ryan

> 
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>> I think it's ok to duplicate this single line for the sake of clarity,
>>> also.
>>>
>>> Which gives us:
>>>
>>> unsigned long pg;
>>>
>>> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> for (pg = 1; pg < nr; pg++) {
>>> 	ptep++;
>>> 	addr += PAGE_SIZE;
>>> 	old_pte = pte_next_pfn(old_pte);
>>> 	pte = pte_next_pfn(pte);
>>>
>>> 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>> }
>>>
>>> There are alternative approaches, but I think doing an infinite loop that
>>> breaks and especially the confusing 'if (--foo) break;' stuff is much
>>> harder to parse than a super simple ranged loop.
>>>
>>>> +		ptep++;
>>>> +		addr += PAGE_SIZE;
>>>> +		old_pte = pte_next_pfn(old_pte);
>>>> +		pte = pte_next_pfn(pte);
>>>> +	}
>>>> +}
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * On some architectures hardware does not set page access bit when accessing
>>>>   * memory page, it is responsibility of software setting this bit. It brings
>>>> --
>>>> 2.30.2
>>>>
>>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Lorenzo Stoakes 9 months, 1 week ago

On Thu, May 01, 2025 at 12:33:30PM +0100, Ryan Roberts wrote:
> On 30/04/2025 15:34, Lorenzo Stoakes wrote:
> > On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
> >> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> >>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> >>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >>>> Architecture can override these helpers.
> >>>>
> >>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>>> ---
> >>>>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 38 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >>>> index b50447ef1c92..ed287289335f 100644
> >>>> --- a/include/linux/pgtable.h
> >>>> +++ b/include/linux/pgtable.h
> >>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >>>>  }
> >>>>  #endif
> >>>>
> >>>> +/* See the comment for ptep_modify_prot_start */
> >>>
> >>> I feel like you really should add a little more here, perhaps point out
> >>> that it's batched etc.
> >>>
> >>>> +#ifndef modify_prot_start_ptes
> >>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >>>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >>>
> >>> This name is a bit confusing,
> >>
> >> On naming, the existing (modern) convention for single-pte helpers is to start
> >> the function name with ptep_. When we started adding batched versions, we took
> >> the approach of adding _ptes as a suffix. For example:
> >>
> >> set_pte_at()
> >> ptep_get_and_clear_full()
> >> ptep_set_wrprotect()
> >>
> >> set_ptes()
> >> get_and_clear_full_ptes()
> >> wrprotect_ptes()
> >>
> >> In this case, we already have ptep_modify_prot_start() and
> >> ptep_modify_prot_commit() for the existing single-pte helper versions. So
> >> according to the convention (or at least how I interpret the convention), the
> >> proposed names seem reasonable.
> >>
> >
> > Right, I'm fine with following convention (we should), I just find 'ptes'
> > really ambiguous. It's not just a -set of PTE entries- it's very explicitly
> > for a large folio. I'd interpret some 'ptes' case to mean 'any number of
> > pte entries', though I suppose it'd not in practice be any different if
> > that were the intended use.
> >
> > However, the proposed use case is large folio 'sub' PTEs and it'd be useful
> > in callers to know this is explicitly what you're doing.
> >
> > I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
> > entriess you're after (not just in effect multiple PTE entries).
>
> I don't mind _batched_ptes. _pte_batch would be shorter though - what do you think?

Sounds good!

>
> But if we go with one of these, then we should consistently apply it to all the
> existing helpers IMHO - perhaps with a preparatory patch at the start of the series.
>
> >
> > However, I'm less insistent on this with a comment that explains what's
> > going on.
>
> That would still get my vote :)

Awesome :)

>
> >
> > I don't want to hold this up with trivialities around naming...
>
> There are TWO hard things in computer science; cache invalidation, naming, and
> off-by-one errors :)

Haha yes... I continue to be surprised at how bloody hard it is as my
career goes on...

>
> >
> > ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
> > 'PxD/PTE entries' and the fact we use both as a short-hand for each
> > other. But that's not related to this series, just a pet peeve... :)
>
> I assume you are referring to the ambiguity between the *table* and the *entry*
> (which just goes to show how ambiguous it is I guess)... I also hate this and
> still trip over it all the time...

Yes. As do I, as does everybody I think... Sadly I think unavoidable :(

>
> >
> >>> it's not any ptes, it's those pte entries
> >>> belonging to a large folio capped to the PTE table right that you are
> >>> batching right?
> >>
> >> Yes, but again by convention, that is captured in the kerneldoc comment for the
> >> functions. We are operating on a batch of *ptes* not on a folio or batch of
> >> folios. But it is a requirement of the function that the batch of ptes all lie
> >> within a single large folio (i.e. the pfns are sequential).
> >
> > Ack, yeah don't love this nr stuff but fine if it's convention...
> >
> >>  > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> >>> the name?
> >>>
> >>> We definitely need to mention in comment or name or _somewhere_ the intent
> >>> and motivation for this.
> >>
> >> Agreed!
> >
> > ...and luckily we are aligned on this :)
> >
> >>
> >>>
> >>>> +{
> >>>> +	pte_t pte, tmp_pte;
> >>>> +
> >>>
> >>> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> >>> love this interface, where you require the user to know the number of
> >>> remaining PTE entries in a PTE table.
> >>
> >> For better or worse, that's the established convention. See part of comment for
> >> set_ptes() for example:
> >>
> >> """
> >>  * Context: The caller holds the page table lock.  The pages all belong
> >>  * to the same folio.  The PTEs are all in the same PMD.
> >> """
> >>
> >>>
> >>>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>> +	while (--nr) {
> >>>
> >>> This loop is a bit horrible. It seems needlessly confusing and you're in
> >>> _dire_ need of comments to explain what's going on.
> >>>
> >>> So my understanding is, you have the user figure out:
> >>>
> >>> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >>>
> >>> Then, you want to return the pte entry belonging to the start of the large
> >>> folio batch, but you want to adjust that pte value to propagate dirty and
> >>> young page table flags if any page table entries within the range contain
> >>> those page table flags, having called ptep_modify_prot_start() on all of
> >>> them?
> >>>
> >>> This is quite a bit to a. put in a header like this and b. not
> >>> comment/explain.
> >>
> >> This style is copied from get_and_clear_full_ptes(), which has this comment,
> >> which explains all this complexity. My vote would be to have a simple comment
>
> Oops; I meant "similar" when my fingers somehow typed "simple"... This is not
> simple :)

Ha, yeah indeed :P that makes more sense!

>
> >> for this function:
> >>
> >> /**
> >>  * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
> >>  *			     the same folio, collecting dirty/accessed bits.
> >>  * @mm: Address space the pages are mapped into.
> >>  * @addr: Address the first page is mapped at.
> >>  * @ptep: Page table pointer for the first entry.
> >>  * @nr: Number of entries to clear.
> >>  * @full: Whether we are clearing a full mm.
> >>  *
> >>  * May be overridden by the architecture; otherwise, implemented as a simple
> >>  * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
> >>  * returned PTE.
> >>  *
> >>  * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> >>  * some PTEs might be write-protected.
> >>  *
> >>  * Context: The caller holds the page table lock.  The PTEs map consecutive
> >>  * pages that belong to the same folio.  The PTEs are all in the same PMD.
> >>  */
> >>
> >
> > OK I think the key bit here is 'consecutive pages of the same folio'.
> >
> > I'd like at least a paragraph about implementation, yes the original
> > function doesn't have that (and should imo), something like:
> >
> > 	We perform the operation on the first PTE, then if any others
> > 	follow, we invoke the ptep_modify_prot_start() for each and
> > 	aggregate A/D bits.
> >
> > Something like this.
> >
> > Point taken on consistency though!
> >
> >>>
> >>> So maybe something like:
> >>>
> >>> pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>
> >>> /* Iterate through large folio tail PTEs. */
> >>> for (pg = 1; pg < nr; pg++) {
> >>> 	pte_t inner_pte;
> >>>
> >>> 	ptep++;
> >>> 	addr += PAGE_SIZE;
> >>>
> >>> 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>
> >>> 	/* We must propagate A/D state from tail PTEs. */
> >>> 	if (pte_dirty(inner_pte))
> >>> 		pte = pte_mkdirty(pte);
> >>> 	if (pte_young(inner_pte))
> >>> 		pte = pte_mkyoung(pte);
> >>> }
> >>>
> >>> Would work better?
> >>>
> >>>
> >>>
> >>>> +		ptep++;
> >>>> +		addr += PAGE_SIZE;
> >>>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >>>
> >>>
> >>>
> >>>> +		if (pte_dirty(tmp_pte))
> >>>> +			pte = pte_mkdirty(pte);
> >>>> +		if (pte_young(tmp_pte))
> >>>> +			pte = pte_mkyoung(pte);
> >>>
> >>> Why are you propagating these?
> >>>
> >>>> +	}
> >>>> +	return pte;
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>> +/* See the comment for ptep_modify_prot_commit */
> >>>
> >>> Same comments as above, needs more meat on the bones!
> >>>
> >>>> +#ifndef modify_prot_commit_ptes
> >>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >>>
> >>> Again need to reference large folio, batched or something relevant here,
> >>> 'ptes' is super vague.
> >>>
> >>>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >>>
> >>> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
> >>>
> >>> I'm even more concerned about the 'nr' API here now.
> >>>
> >>> So this is now a user-calculated:
> >>>
> >>> min3(large_folio_pages, number of pte entries left in ptep,
> >>> 	number of pte entries left in old_pte)
> >>>
> >>> It really feels like something that should be calculated here, or at least
> >>> be broken out more clearly.
> >>>
> >>> You definitely _at the very least_ need to document it in a comment.
> >>>
> >>>> +{
> >>>> +	for (;;) {
> >>>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>>> +		if (--nr == 0)
> >>>> +			break;
> >>>
> >>> Why are you doing an infinite loop here with a break like this? Again feels
> >>> needlessly confusing.
> >>
> >> I agree it's not pretty to look at. But apparently it's the most efficient. This
> >> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
> >> definition of set_ptes()").
> >>
> >> For the record, I think all your comments make good sense, Lorenzo. But there is
> >> an established style, and personally I think at this point is it more confusing
> >> to break from that style.
> >
> > This isn't _quite_ style, I'd say it's implementation, we're kind of
> > crossing over into something a little more I'd say :) but I mean I get your
> > point, sure.
> >
> > I mean, fine, if (I presume you're referring _only_ to the for (;;) case
> > above) you are absolutely certain it is more performant in practice I
> > wouldn't want to stand in the way of that.
>
> No I'm not certain at all... I'm just saying that's been the argument in the
> past. I vaguely recall I even tried changing the loop style in batched helpers I
> implemented in the past and David asked me to stick with the established style.

I definitely defer to David's expertise, but I feel there's some room here
for improving things.

>
> >
> > I would at least like a comment in the commit message about propagating an
> > odd loop for performance though to explain the 'qualities'... :)
>
> Just to make it clear, I'm just trying to provide some historical context, I'm
> not arguing that all those decisions were perfect. How about we take these
> concrete steps:

Ack sure.

>
>   - Stick with the _ptes naming convention
>   - Add kerneldoc comments for the 2 new functions that are very clear about
>     what the function does and the requirements on the batch of ptes (just like
>     the other batched helpers)
>   - Rework the looping styles in the 2 new functions to be more "standard";
>     let's not micro-optimize unless we have real evidence that it is useful.
>   - Merge this patch with the one that uses these new functions
>
> How does that sound as a way forwards?

Sounds good to me!

Cheers, Lorenzo

>
> Thanks,
> Ryan
>
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >>>
> >>> I think it's ok to duplicate this single line for the sake of clarity,
> >>> also.
> >>>
> >>> Which gives us:
> >>>
> >>> unsigned long pg;
> >>>
> >>> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> for (pg = 1; pg < nr; pg++) {
> >>> 	ptep++;
> >>> 	addr += PAGE_SIZE;
> >>> 	old_pte = pte_next_pfn(old_pte);
> >>> 	pte = pte_next_pfn(pte);
> >>>
> >>> 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >>> }
> >>>
> >>> There are alternative approaches, but I think doing an infinite loop that
> >>> breaks and especially the confusing 'if (--foo) break;' stuff is much
> >>> harder to parse than a super simple ranged loop.
> >>>
> >>>> +		ptep++;
> >>>> +		addr += PAGE_SIZE;
> >>>> +		old_pte = pte_next_pfn(old_pte);
> >>>> +		pte = pte_next_pfn(pte);
> >>>> +	}
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>  /*
> >>>>   * On some architectures hardware does not set page access bit when accessing
> >>>>   * memory page, it is responsibility of software setting this bit. It brings
> >>>> --
> >>>> 2.30.2
> >>>>
> >>
>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by David Hildenbrand 9 months, 1 week ago

On 01.05.25 14:58, Lorenzo Stoakes wrote:
> On Thu, May 01, 2025 at 12:33:30PM +0100, Ryan Roberts wrote:
>> On 30/04/2025 15:34, Lorenzo Stoakes wrote:
>>> On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
>>>> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
>>>>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>>>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>>>> Architecture can override these helpers.
>>>>>>
>>>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>>>> ---
>>>>>>   include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>>>>>   1 file changed, 38 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>> index b50447ef1c92..ed287289335f 100644
>>>>>> --- a/include/linux/pgtable.h
>>>>>> +++ b/include/linux/pgtable.h
>>>>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>>   }
>>>>>>   #endif
>>>>>>
>>>>>> +/* See the comment for ptep_modify_prot_start */
>>>>>
>>>>> I feel like you really should add a little more here, perhaps point out
>>>>> that it's batched etc.
>>>>>
>>>>>> +#ifndef modify_prot_start_ptes
>>>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>>>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>>>>>
>>>>> This name is a bit confusing,
>>>>
>>>> On naming, the existing (modern) convention for single-pte helpers is to start
>>>> the function name with ptep_. When we started adding batched versions, we took
>>>> the approach of adding _ptes as a suffix. For example:
>>>>
>>>> set_pte_at()
>>>> ptep_get_and_clear_full()
>>>> ptep_set_wrprotect()
>>>>
>>>> set_ptes()
>>>> get_and_clear_full_ptes()
>>>> wrprotect_ptes()
>>>>
>>>> In this case, we already have ptep_modify_prot_start() and
>>>> ptep_modify_prot_commit() for the existing single-pte helper versions. So
>>>> according to the convention (or at least how I interpret the convention), the
>>>> proposed names seem reasonable.
>>>>
>>>
>>> Right, I'm fine with following convention (we should), I just find 'ptes'
>>> really ambiguous. It's not just a -set of PTE entries- it's very explicitly
>>> for a large folio. I'd interpret some 'ptes' case to mean 'any number of
>>> pte entries', though I suppose it'd not in practice be any different if
>>> that were the intended use.
>>>
>>> However, the proposed use case is large folio 'sub' PTEs and it'd be useful
>>> in callers to know this is explicitly what you're doing.
>>>
>>> I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
>>> entriess you're after (not just in effect multiple PTE entries).
>>
>> I don't mind _batched_ptes. _pte_batch would be shorter though - what do you think?
> 
> Sounds good!
> 
>>
>> But if we go with one of these, then we should consistently apply it to all the
>> existing helpers IMHO - perhaps with a preparatory patch at the start of the series.
>>
>>>
>>> However, I'm less insistent on this with a comment that explains what's
>>> going on.
>>
>> That would still get my vote :)
> 
> Awesome :)
> 
>>
>>>
>>> I don't want to hold this up with trivialities around naming...
>>
>> There are TWO hard things in computer science; cache invalidation, naming, and
>> off-by-one errors :)
> 
> Haha yes... I continue to be surprised at how bloody hard it is as my
> career goes on...
> 
>>
>>>
>>> ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
>>> 'PxD/PTE entries' and the fact we use both as a short-hand for each
>>> other. But that's not related to this series, just a pet peeve... :)
>>
>> I assume you are referring to the ambiguity between the *table* and the *entry*
>> (which just goes to show how ambiguous it is I guess)... I also hate this and
>> still trip over it all the time...
> 
> Yes. As do I, as does everybody I think... Sadly I think unavoidable :(
> 
>>
>>>
>>>>> it's not any ptes, it's those pte entries
>>>>> belonging to a large folio capped to the PTE table right that you are
>>>>> batching right?
>>>>
>>>> Yes, but again by convention, that is captured in the kerneldoc comment for the
>>>> functions. We are operating on a batch of *ptes* not on a folio or batch of
>>>> folios. But it is a requirement of the function that the batch of ptes all lie
>>>> within a single large folio (i.e. the pfns are sequential).
>>>
>>> Ack, yeah don't love this nr stuff but fine if it's convention...
>>>
>>>>   > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
>>>>> the name?
>>>>>
>>>>> We definitely need to mention in comment or name or _somewhere_ the intent
>>>>> and motivation for this.
>>>>
>>>> Agreed!
>>>
>>> ...and luckily we are aligned on this :)
>>>
>>>>
>>>>>
>>>>>> +{
>>>>>> +	pte_t pte, tmp_pte;
>>>>>> +
>>>>>
>>>>> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
>>>>> love this interface, where you require the user to know the number of
>>>>> remaining PTE entries in a PTE table.
>>>>
>>>> For better or worse, that's the established convention. See part of comment for
>>>> set_ptes() for example:
>>>>
>>>> """
>>>>   * Context: The caller holds the page table lock.  The pages all belong
>>>>   * to the same folio.  The PTEs are all in the same PMD.
>>>> """
>>>>
>>>>>
>>>>>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>> +	while (--nr) {
>>>>>
>>>>> This loop is a bit horrible. It seems needlessly confusing and you're in
>>>>> _dire_ need of comments to explain what's going on.
>>>>>
>>>>> So my understanding is, you have the user figure out:
>>>>>
>>>>> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
>>>>>
>>>>> Then, you want to return the pte entry belonging to the start of the large
>>>>> folio batch, but you want to adjust that pte value to propagate dirty and
>>>>> young page table flags if any page table entries within the range contain
>>>>> those page table flags, having called ptep_modify_prot_start() on all of
>>>>> them?
>>>>>
>>>>> This is quite a bit to a. put in a header like this and b. not
>>>>> comment/explain.
>>>>
>>>> This style is copied from get_and_clear_full_ptes(), which has this comment,
>>>> which explains all this complexity. My vote would be to have a simple comment
>>
>> Oops; I meant "similar" when my fingers somehow typed "simple"... This is not
>> simple :)
> 
> Ha, yeah indeed :P that makes more sense!
> 
>>
>>>> for this function:
>>>>
>>>> /**
>>>>   * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
>>>>   *			     the same folio, collecting dirty/accessed bits.
>>>>   * @mm: Address space the pages are mapped into.
>>>>   * @addr: Address the first page is mapped at.
>>>>   * @ptep: Page table pointer for the first entry.
>>>>   * @nr: Number of entries to clear.
>>>>   * @full: Whether we are clearing a full mm.
>>>>   *
>>>>   * May be overridden by the architecture; otherwise, implemented as a simple
>>>>   * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
>>>>   * returned PTE.
>>>>   *
>>>>   * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>>>>   * some PTEs might be write-protected.
>>>>   *
>>>>   * Context: The caller holds the page table lock.  The PTEs map consecutive
>>>>   * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>>>   */
>>>>
>>>
>>> OK I think the key bit here is 'consecutive pages of the same folio'.
>>>
>>> I'd like at least a paragraph about implementation, yes the original
>>> function doesn't have that (and should imo), something like:
>>>
>>> 	We perform the operation on the first PTE, then if any others
>>> 	follow, we invoke the ptep_modify_prot_start() for each and
>>> 	aggregate A/D bits.
>>>
>>> Something like this.
>>>
>>> Point taken on consistency though!
>>>
>>>>>
>>>>> So maybe something like:
>>>>>
>>>>> pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>
>>>>> /* Iterate through large folio tail PTEs. */
>>>>> for (pg = 1; pg < nr; pg++) {
>>>>> 	pte_t inner_pte;
>>>>>
>>>>> 	ptep++;
>>>>> 	addr += PAGE_SIZE;
>>>>>
>>>>> 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>
>>>>> 	/* We must propagate A/D state from tail PTEs. */
>>>>> 	if (pte_dirty(inner_pte))
>>>>> 		pte = pte_mkdirty(pte);
>>>>> 	if (pte_young(inner_pte))
>>>>> 		pte = pte_mkyoung(pte);
>>>>> }
>>>>>
>>>>> Would work better?
>>>>>
>>>>>
>>>>>
>>>>>> +		ptep++;
>>>>>> +		addr += PAGE_SIZE;
>>>>>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>>>>>
>>>>>
>>>>>
>>>>>> +		if (pte_dirty(tmp_pte))
>>>>>> +			pte = pte_mkdirty(pte);
>>>>>> +		if (pte_young(tmp_pte))
>>>>>> +			pte = pte_mkyoung(pte);
>>>>>
>>>>> Why are you propagating these?
>>>>>
>>>>>> +	}
>>>>>> +	return pte;
>>>>>> +}
>>>>>> +#endif
>>>>>> +
>>>>>> +/* See the comment for ptep_modify_prot_commit */
>>>>>
>>>>> Same comments as above, needs more meat on the bones!
>>>>>
>>>>>> +#ifndef modify_prot_commit_ptes
>>>>>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>>>>>
>>>>> Again need to reference large folio, batched or something relevant here,
>>>>> 'ptes' is super vague.
>>>>>
>>>>>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>>>>>
>>>>> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>>>>>
>>>>> I'm even more concerned about the 'nr' API here now.
>>>>>
>>>>> So this is now a user-calculated:
>>>>>
>>>>> min3(large_folio_pages, number of pte entries left in ptep,
>>>>> 	number of pte entries left in old_pte)
>>>>>
>>>>> It really feels like something that should be calculated here, or at least
>>>>> be broken out more clearly.
>>>>>
>>>>> You definitely _at the very least_ need to document it in a comment.
>>>>>
>>>>>> +{
>>>>>> +	for (;;) {
>>>>>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>>>>>> +		if (--nr == 0)
>>>>>> +			break;
>>>>>
>>>>> Why are you doing an infinite loop here with a break like this? Again feels
>>>>> needlessly confusing.
>>>>
>>>> I agree it's not pretty to look at. But apparently it's the most efficient. This
>>>> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
>>>> definition of set_ptes()").
>>>>
>>>> For the record, I think all your comments make good sense, Lorenzo. But there is
>>>> an established style, and personally I think at this point is it more confusing
>>>> to break from that style.
>>>
>>> This isn't _quite_ style, I'd say it's implementation, we're kind of
>>> crossing over into something a little more I'd say :) but I mean I get your
>>> point, sure.
>>>
>>> I mean, fine, if (I presume you're referring _only_ to the for (;;) case
>>> above) you are absolutely certain it is more performant in practice I
>>> wouldn't want to stand in the way of that.
>>
>> No I'm not certain at all... I'm just saying that's been the argument in the
>> past. I vaguely recall I even tried changing the loop style in batched helpers I
>> implemented in the past and David asked me to stick with the established style.
> 
> I definitely defer to David's expertise, but I feel there's some room here
> for improving things.

Yeah, I recall Willy introducing that scheme, arguing that it is the 
most efficient once. Can't argue with that :)


-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Dev Jain 9 months, 2 weeks ago


On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..ed287289335f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>   }
>>   #endif
>>
>> +/* See the comment for ptep_modify_prot_start */
> 
> I feel like you really should add a little more here, perhaps point out
> that it's batched etc.

Sure. I couldn't easily figure out a way to write the documentation 
nicely, I'll do it this time.

> 
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> 
> This name is a bit confusing, it's not any ptes, it's those pte entries
> belonging to a large folio capped to the PTE table right that you are
> batching right?

yes, but I am just following the convention. See wrprotect_ptes(), etc. 
I don't have a strong preference anyways.

> 
> Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> the name?

How about modify_prot_start_batched_ptes()?

> 
> We definitely need to mention in comment or name or _somewhere_ the intent
> and motivation for this.
> 
>> +{
>> +	pte_t pte, tmp_pte;
>> +
> 
> are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> love this interface, where you require the user to know the number of
> remaining PTE entries in a PTE table.

Shall I write in the comments that the range is supposed to be within a 
PTE table?

> 
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
> 
> This loop is a bit horrible. It seems needlessly confusing and you're in
> _dire_ need of comments to explain what's going on.

Again, following the pattern of get_and_clear_full_ptes :)
> 
> So my understanding is, you have the user figure out:
> 
> nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> 
> Then, you want to return the pte entry belonging to the start of the large
> folio batch, but you want to adjust that pte value to propagate dirty and
> young page table flags if any page table entries within the range contain
> those page table flags, having called ptep_modify_prot_start() on all of
> them?
> 
> This is quite a bit to a. put in a header like this and b. not
> comment/explain.
> 
> So maybe something like:
> 
> pte = ptep_modify_prot_start(vma, addr, ptep);
> 
> /* Iterate through large folio tail PTEs. */
> for (pg = 1; pg < nr; pg++) {
> 	pte_t inner_pte;
> 
> 	ptep++;
> 	addr += PAGE_SIZE;
> 
> 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> 
> 	/* We must propagate A/D state from tail PTEs. */
> 	if (pte_dirty(inner_pte))
> 		pte = pte_mkdirty(pte);
> 	if (pte_young(inner_pte))
> 		pte = pte_mkyoung(pte);
> }
> 
> Would work better?

No preference, I'll do this then.

> 
> 
> 
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> 
> 
> 
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
> 
> Why are you propagating these?

Because the a/d bits are per-folio; and, this will help us batch around 
can_change_pte_writable (return pte_dirty(pte)) and, batch around 
pte_needs_flush() for parisc.

> 
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/* See the comment for ptep_modify_prot_commit */
> 
> Same comments as above, needs more meat on the bones!
> 
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> 
> Again need to reference large folio, batched or something relevant here,
> 'ptes' is super vague.
> 
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> 
> Nit, but you put 'p' suffix on ptep but not on 'old_pte'?

Because ptep is a pointer, and old_pte isn't.

> 
> I'm even more concerned about the 'nr' API here now.
> 
> So this is now a user-calculated:
> 
> min3(large_folio_pages, number of pte entries left in ptep,
> 	number of pte entries left in old_pte)
> 
> It really feels like something that should be calculated here, or at least
> be broken out more clearly.
> 
> You definitely _at the very least_ need to document it in a comment.
> 
>> +{
>> +	for (;;) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		if (--nr == 0)
>> +			break;
> 
> Why are you doing an infinite loop here with a break like this? Again feels
> needlessly confusing.

Following wrprotect_ptes().
I agree that this is confusing, which is why I thought why it was done 
in the first place :) but I just followed what already is.
I'll change this to a simple for loop if that is your inclination.

> 
> I think it's ok to duplicate this single line for the sake of clarity,
> also.
> 
> Which gives us:
> 
> unsigned long pg;
> 
> ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> for (pg = 1; pg < nr; pg++) {
> 	ptep++;
> 	addr += PAGE_SIZE;
> 	old_pte = pte_next_pfn(old_pte);
> 	pte = pte_next_pfn(pte);
> 
> 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> }
> 
> There are alternative approaches, but I think doing an infinite loop that
> breaks and especially the confusing 'if (--foo) break;' stuff is much
> harder to parse than a super simple ranged loop.
> 
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> --
>> 2.30.2
>>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Lorenzo Stoakes 9 months, 1 week ago

On Wed, Apr 30, 2025 at 11:55:12AM +0530, Dev Jain wrote:
>
>
> On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
> > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> > > Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> > > Architecture can override these helpers.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 38 insertions(+)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index b50447ef1c92..ed287289335f 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > >   }
> > >   #endif
> > >
> > > +/* See the comment for ptep_modify_prot_start */
> >
> > I feel like you really should add a little more here, perhaps point out
> > that it's batched etc.
>
> Sure. I couldn't easily figure out a way to write the documentation nicely,
> I'll do it this time.

Thanks! Though see the discussion with Ryan also.

>
> >
> > > +#ifndef modify_prot_start_ptes
> > > +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> > > +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >
> > This name is a bit confusing, it's not any ptes, it's those pte entries
> > belonging to a large folio capped to the PTE table right that you are
> > batching right?
>
> yes, but I am just following the convention. See wrprotect_ptes(), etc. I
> don't have a strong preference anyways.
>
> >
> > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > the name?
>
> How about modify_prot_start_batched_ptes()?

I like this :) Ryan - that work for you, or do you feel _batched_ should be
dropped here?

>
> >
> > We definitely need to mention in comment or name or _somewhere_ the intent
> > and motivation for this.
> >
> > > +{
> > > +	pte_t pte, tmp_pte;
> > > +
> >
> > are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> > love this interface, where you require the user to know the number of
> > remaining PTE entries in a PTE table.
>
> Shall I write in the comments that the range is supposed to be within a PTE
> table?

Yeah that'd be helpful I think thanks!

>
> >
> > > +	pte = ptep_modify_prot_start(vma, addr, ptep);
> > > +	while (--nr) {
> >
> > This loop is a bit horrible. It seems needlessly confusing and you're in
> > _dire_ need of comments to explain what's going on.
>
> Again, following the pattern of get_and_clear_full_ptes :)

Yeah, see discussion with Ryan :>)

> >
> > So my understanding is, you have the user figure out:
> >
> > nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >
> > Then, you want to return the pte entry belonging to the start of the large
> > folio batch, but you want to adjust that pte value to propagate dirty and
> > young page table flags if any page table entries within the range contain
> > those page table flags, having called ptep_modify_prot_start() on all of
> > them?
> >
> > This is quite a bit to a. put in a header like this and b. not
> > comment/explain.
> >
> > So maybe something like:
> >
> > pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* Iterate through large folio tail PTEs. */
> > for (pg = 1; pg < nr; pg++) {
> > 	pte_t inner_pte;
> >
> > 	ptep++;
> > 	addr += PAGE_SIZE;
> >
> > 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > 	/* We must propagate A/D state from tail PTEs. */
> > 	if (pte_dirty(inner_pte))
> > 		pte = pte_mkdirty(pte);
> > 	if (pte_young(inner_pte))
> > 		pte = pte_mkyoung(pte);
> > }
> >
> > Would work better?
>
> No preference, I'll do this then.

Thanks!

>
> >
> >
> >
> > > +		ptep++;
> > > +		addr += PAGE_SIZE;
> > > +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> >
> >
> > > +		if (pte_dirty(tmp_pte))
> > > +			pte = pte_mkdirty(pte);
> > > +		if (pte_young(tmp_pte))
> > > +			pte = pte_mkyoung(pte);
> >
> > Why are you propagating these?
>
> Because the a/d bits are per-folio; and, this will help us batch around
> can_change_pte_writable (return pte_dirty(pte)) and, batch around
> pte_needs_flush() for parisc.

Understood, thanks!

>
> >
> > > +	}
> > > +	return pte;
> > > +}
> > > +#endif
> > > +
> > > +/* See the comment for ptep_modify_prot_commit */
> >
> > Same comments as above, needs more meat on the bones!
> >
> > > +#ifndef modify_prot_commit_ptes
> > > +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >
> > Again need to reference large folio, batched or something relevant here,
> > 'ptes' is super vague.
> >
> > > +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >
> > Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
>
> Because ptep is a pointer, and old_pte isn't.

Oops :P :) sorry, this is me being a little 'slow' here... I missed that. Carry
on then :P

>
> >
> > I'm even more concerned about the 'nr' API here now.
> >
> > So this is now a user-calculated:
> >
> > min3(large_folio_pages, number of pte entries left in ptep,
> > 	number of pte entries left in old_pte)
> >
> > It really feels like something that should be calculated here, or at least
> > be broken out more clearly.
> >
> > You definitely _at the very least_ need to document it in a comment.
> >
> > > +{
> > > +	for (;;) {
> > > +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > > +		if (--nr == 0)
> > > +			break;
> >
> > Why are you doing an infinite loop here with a break like this? Again feels
> > needlessly confusing.
>
> Following wrprotect_ptes().
> I agree that this is confusing, which is why I thought why it was done in
> the first place :) but I just followed what already is.
> I'll change this to a simple for loop if that is your inclination.

No, I guess let's keep it as-is, Ryan pointed out there are perf considerations
here. This one is a lot less egregious.

>
> >
> > I think it's ok to duplicate this single line for the sake of clarity,
> > also.
> >
> > Which gives us:
> >
> > unsigned long pg;
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > for (pg = 1; pg < nr; pg++) {
> > 	ptep++;
> > 	addr += PAGE_SIZE;
> > 	old_pte = pte_next_pfn(old_pte);
> > 	pte = pte_next_pfn(pte);
> >
> > 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > }
> >
> > There are alternative approaches, but I think doing an infinite loop that
> > breaks and especially the confusing 'if (--foo) break;' stuff is much
> > harder to parse than a super simple ranged loop.
> >
> > > +		ptep++;
> > > +		addr += PAGE_SIZE;
> > > +		old_pte = pte_next_pfn(old_pte);
> > > +		pte = pte_next_pfn(pte);
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >   /*
> > >    * On some architectures hardware does not set page access bit when accessing
> > >    * memory page, it is responsibility of software setting this bit. It brings
> > > --
> > > 2.30.2
> > >
>

Thanks, Lorenzo

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by David Hildenbrand 9 months, 1 week ago

On 30.04.25 16:37, Lorenzo Stoakes wrote:
> On Wed, Apr 30, 2025 at 11:55:12AM +0530, Dev Jain wrote:
>>
>>
>> On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
>>> On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
>>>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>>>> Architecture can override these helpers.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index b50447ef1c92..ed287289335f 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>    }
>>>>    #endif
>>>>
>>>> +/* See the comment for ptep_modify_prot_start */
>>>
>>> I feel like you really should add a little more here, perhaps point out
>>> that it's batched etc.
>>
>> Sure. I couldn't easily figure out a way to write the documentation nicely,
>> I'll do it this time.
> 
> Thanks! Though see the discussion with Ryan also.
> 
>>
>>>
>>>> +#ifndef modify_prot_start_ptes
>>>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>>>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>>>
>>> This name is a bit confusing, it's not any ptes, it's those pte entries
>>> belonging to a large folio capped to the PTE table right that you are
>>> batching right?
>>
>> yes, but I am just following the convention. See wrprotect_ptes(), etc. I
>> don't have a strong preference anyways.
>>
>>>
>>> Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
>>> the name?
>>
>> How about modify_prot_start_batched_ptes()?
> 
> I like this :) Ryan - that work for you, or do you feel _batched_ should be
> dropped here?


modify_prot_start_folio_ptes ?

But I would rather go with

modify_prot_folio_ptes_start

The "batched" is implicit, and "large folio" is not required if it's 
more than one pte ...

:)

-- 
Cheers,

David / dhildenb

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Lorenzo Stoakes 9 months, 1 week ago

On Tue, May 06, 2025 at 04:30:18PM +0200, David Hildenbrand wrote:
> On 30.04.25 16:37, Lorenzo Stoakes wrote:
> > On Wed, Apr 30, 2025 at 11:55:12AM +0530, Dev Jain wrote:
> > >
> > >
> > > On 29/04/25 7:22 pm, Lorenzo Stoakes wrote:
> > > > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> > > > > Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> > > > > Architecture can override these helpers.
> > > > >
> > > > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > > > ---
> > > > >    include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> > > > >    1 file changed, 38 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > > > index b50447ef1c92..ed287289335f 100644
> > > > > --- a/include/linux/pgtable.h
> > > > > +++ b/include/linux/pgtable.h
> > > > > @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > > >    }
> > > > >    #endif
> > > > >
> > > > > +/* See the comment for ptep_modify_prot_start */
> > > >
> > > > I feel like you really should add a little more here, perhaps point out
> > > > that it's batched etc.
> > >
> > > Sure. I couldn't easily figure out a way to write the documentation nicely,
> > > I'll do it this time.
> >
> > Thanks! Though see the discussion with Ryan also.
> >
> > >
> > > >
> > > > > +#ifndef modify_prot_start_ptes
> > > > > +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> > > > > +		unsigned long addr, pte_t *ptep, unsigned int nr)
> > > >
> > > > This name is a bit confusing, it's not any ptes, it's those pte entries
> > > > belonging to a large folio capped to the PTE table right that you are
> > > > batching right?
> > >
> > > yes, but I am just following the convention. See wrprotect_ptes(), etc. I
> > > don't have a strong preference anyways.
> > >
> > > >
> > > > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > > > the name?
> > >
> > > How about modify_prot_start_batched_ptes()?
> >
> > I like this :) Ryan - that work for you, or do you feel _batched_ should be
> > dropped here?
>
>
> modify_prot_start_folio_ptes ?
>
> But I would rather go with
>
> modify_prot_folio_ptes_start
>
> The "batched" is implicit, and "large folio" is not required if it's more
> than one pte ...

Yeah that works for me! The mention of folio with plural does neatly imply
the rest.

Naming is hard :P

>
> :)
>
> --
> Cheers,
>
> David / dhildenb
>

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Anshuman Khandual 9 months, 2 weeks ago


On 4/29/25 10:53, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>  
> +/* See the comment for ptep_modify_prot_start */
> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> +{
> +	pte_t pte, tmp_pte;
> +
> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);
> +	}
> +	return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */
> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> +{
> +	for (;;) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif

Is there any particular reason why the first loop here is while { }
based where as the second one is a for { } loop ?

> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings

These helpers should be added with at least a single caller using them.

Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit

Posted by Dev Jain 9 months, 2 weeks ago


On 29/04/25 2:09 pm, Anshuman Khandual wrote:
> 
> 
> On 4/29/25 10:53, Dev Jain wrote:
>> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
>> Architecture can override these helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 38 insertions(+)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index b50447ef1c92..ed287289335f 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>   }
>>   #endif
>>   
>> +/* See the comment for ptep_modify_prot_start */
>> +#ifndef modify_prot_start_ptes
>> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr)
>> +{
>> +	pte_t pte, tmp_pte;
>> +
>> +	pte = ptep_modify_prot_start(vma, addr, ptep);
>> +	while (--nr) {
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
>> +		if (pte_dirty(tmp_pte))
>> +			pte = pte_mkdirty(pte);
>> +		if (pte_young(tmp_pte))
>> +			pte = pte_mkyoung(pte);
>> +	}
>> +	return pte;
>> +}
>> +#endif
>> +
>> +/* See the comment for ptep_modify_prot_commit */
>> +#ifndef modify_prot_commit_ptes
>> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
>> +{
>> +	for (;;) {
>> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +		old_pte = pte_next_pfn(old_pte);
>> +		pte = pte_next_pfn(pte);
>> +	}
>> +}
>> +#endif
> 
> Is there any particular reason why the first loop here is while { }
> based where as the second one is a for { } loop ?

modify_prot_start_ptes follows the pattern of get_and_clear_full_ptes. 
modify_prot_commit_ptes follows the pattern of wrprotect_ptes.

> 
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
> 
> These helpers should be added with at least a single caller using them.

[PATCH v2 1/7] mm: Refactor code in mprotect
[PATCH v2 2/7] mm: Optimize mprotect() by batch-skipping PTEs
[PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot_start/commit
[PATCH v2 4/7] arm64: Add batched version of ptep_modify_prot_start
[PATCH v2 5/7] arm64: Add batched version of ptep_modify_prot_commit
[PATCH v2 6/7] mm: Batch around can_change_pte_writable()
[PATCH v2 7/7] mm: Optimize mprotect() through PTE-batching