This patchset optimizes the mprotect() system call for large folios
by PTE-batching. No issues were observed with mm-selftests, build
tested on x86_64.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages

Average execution time on arm64, Apple M3:
Before the patchset:
T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds

After the patchset:
T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds

Observing T1/T2 and T3 before the patchset, we also remove the regression
introduced by ptep_get() on a contpte block. And, for large folios we get
an almost 74% performance improvement, albeit the trade-off being a slight
degradation in the small folio case.

Here is the test program:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
        size_t offs;
        int ret = 0;

        /* PTE-map each THP by temporarily splitting the VMAs. */
        for (offs = 0; offs < size; offs += pmdsize) {
                ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
                ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
        }

        if (ret) {
                fprintf(stderr, "ERROR: madvise() failed\n");
                exit(1);
        }
}

int main(int argc, char *argv[])
{
        char *p;
        int ret = 0;

        p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != (void *)(1UL << 30)) {
                perror("mmap");
                return 1;
        }

        memset(p, 0, SIZE);
        if (madvise(p, SIZE, MADV_NOHUGEPAGE))
                perror("madvise");
        explicit_bzero(p, SIZE);
        pte_map_thps(p, SIZE);

        for (int loops = 0; loops < 40; loops++) {
                if (mprotect(p, SIZE, PROT_READ))
                        perror("mprotect"), exit(1);
                if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
                        perror("mprotect"), exit(1);
                explicit_bzero(p, SIZE);
        }
}

---
The patchset is rebased onto Saturday's mm-new.

v3->v4:
 - Refactor skipping logic into a new function, edit patch 1 subject
   to highlight it is only for MM_CP_PROT_NUMA case (David H)
 - Refactor the optimization logic, add more documentation to the generic
   batched functions, do not add clear_flush_ptes, squash patch 4
   and 5 (Ryan)

v2->v3:
 - Add comments for the new APIs (Ryan, Lorenzo)
 - Instead of refactoring, use a "skip_batch" label
 - Move arm64 patches at the end (Ryan)
 - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
 - Resolve implicit declaration; tested build on x86 (Lance Yang)

v1->v2:
 - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
 - Abridge the anon-exclusive condition (Lance Yang)

Dev Jain (4):
  mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  mm: Add batched versions of ptep_modify_prot_start/commit
  mm: Optimize mprotect() by PTE-batching
  arm64: Add batched versions of ptep_modify_prot_start/commit

 arch/arm64/include/asm/pgtable.h |  10 ++
 arch/arm64/mm/mmu.c              |  28 +++-
 include/linux/pgtable.h          |  83 +++++++++-
 mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
 4 files changed, 315 insertions(+), 75 deletions(-)

--
2.30.2
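[Note: the cover letter does not show how the averages above were collected.
The sketch below is a minimal, illustrative timing harness, not part of the
posted series or test program: it times only the 40 mprotect() RO/RW cycles
with clock_gettime() and omits the per-iteration explicit_bzero() that the
posted program does, so its numbers are not directly comparable to T1/T2/T3.
T2 and T3 presumably differ from T1 only in how the region is faulted in
(e.g. with only the 64K mTHP size enabled, or with THP disabled).]

/*
 * Illustrative timing harness (assumption, not code from the series):
 * fault in a 1G anonymous mapping, then report the average wall-clock
 * cost of an mprotect() read-only + read-write cycle over it.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define SIZE (1024UL * 1024 * 1024)
#define LOOPS 40

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        double start;

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0, SIZE);             /* fault the whole region in */

        start = now_sec();
        for (int i = 0; i < LOOPS; i++) {
                if (mprotect(p, SIZE, PROT_READ) ||
                    mprotect(p, SIZE, PROT_READ | PROT_WRITE)) {
                        perror("mprotect");
                        return 1;
                }
        }
        printf("avg per RO+RW cycle: %.3f ms\n",
               (now_sec() - start) * 1000 / LOOPS);
        return 0;
}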
On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:

> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching. No issues were observed with mm-selftests, build
> tested on x86_64.

um what. Seems to claim that "selftests still compiles after I messed
with stuff", which isn't very impressive ;) Please clarify?

> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>
> After the patchset:
> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds

Well that's tasty.

> Observing T1/T2 and T3 before the patchset, we also remove the regression
> introduced by ptep_get() on a contpte block. And, for large folios we get
> an almost 74% performance improvement, albeit the trade-off being a slight
> degradation in the small folio case.
>
On 30/06/25 4:35 am, Andrew Morton wrote:
> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>
>> This patchset optimizes the mprotect() system call for large folios
>> by PTE-batching. No issues were observed with mm-selftests, build
>> tested on x86_64.
> um what. Seems to claim that "selftests still compiles after I messed
> with stuff", which isn't very impressive ;) Please clarify?

Sorry I mean to say that the mm-selftests pass.

>
>> We use the following test cases to measure performance, mprotect()'ing
>> the mapped memory to read-only then read-write 40 times:
>>
>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>> pte-mapping those THPs
>> Test case 2: Mapping 1G of memory with 64K mTHPs
>> Test case 3: Mapping 1G of memory with 4K pages
>>
>> Average execution time on arm64, Apple M3:
>> Before the patchset:
>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>
>> After the patchset:
>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
> Well that's tasty.
>
>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>> introduced by ptep_get() on a contpte block. And, for large folios we get
>> an almost 74% performance improvement, albeit the trade-off being a slight
>> degradation in the small folio case.
>>
On 30/06/2025 04:33, Dev Jain wrote:
>
> On 30/06/25 4:35 am, Andrew Morton wrote:
>> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>>
>>> This patchset optimizes the mprotect() system call for large folios
>>> by PTE-batching. No issues were observed with mm-selftests, build
>>> tested on x86_64.
>> um what. Seems to claim that "selftests still compiles after I messed
>> with stuff", which isn't very impressive ;) Please clarify?
>
> Sorry I mean to say that the mm-selftests pass.

I think you're saying you both compiled and ran the mm selftests for arm64. And
additionally you compiled for x86_64? (Just trying to help clarify).

>
>>
>>> We use the following test cases to measure performance, mprotect()'ing
>>> the mapped memory to read-only then read-write 40 times:
>>>
>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>> pte-mapping those THPs
>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>> Test case 3: Mapping 1G of memory with 4K pages
>>>
>>> Average execution time on arm64, Apple M3:
>>> Before the patchset:
>>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>>
>>> After the patchset:
>>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>> Well that's tasty.
>>
>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>> an almost 74% performance improvement, albeit the trade-off being a slight
>>> degradation in the small folio case.
>>>
On 30/06/25 4:15 pm, Ryan Roberts wrote:
> On 30/06/2025 04:33, Dev Jain wrote:
>> On 30/06/25 4:35 am, Andrew Morton wrote:
>>> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>> This patchset optimizes the mprotect() system call for large folios
>>>> by PTE-batching. No issues were observed with mm-selftests, build
>>>> tested on x86_64.
>>> um what. Seems to claim that "selftests still compiles after I messed
>>> with stuff", which isn't very impressive ;) Please clarify?
>> Sorry I mean to say that the mm-selftests pass.
> I think you're saying you both compiled and ran the mm selftests for arm64. And
> additionally you compiled for x86_64? (Just trying to help clarify).

Yes, ran mm-selftests on arm64, and build-tested the patches for x86.

>
>>>> We use the following test cases to measure performance, mprotect()'ing
>>>> the mapped memory to read-only then read-write 40 times:
>>>>
>>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>>> pte-mapping those THPs
>>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>>> Test case 3: Mapping 1G of memory with 4K pages
>>>>
>>>> Average execution time on arm64, Apple M3:
>>>> Before the patchset:
>>>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>>>
>>>> After the patchset:
>>>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>>> Well that's tasty.
>>>
>>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>>> an almost 74% performance improvement, albeit the trade-off being a slight
>>>> degradation in the small folio case.
>>>>
>
On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching. No issues were observed with mm-selftests, build
> tested on x86_64.

Should also be tested on x86-64 not only build tested :)

You are still not really giving details here, so same comment as your mremap()
series, please explain why you're doing this, what for, what benefits you expect
to achieve, where etc.

E.g. 'this is designed to optimise mTHP cases on arm64, we expect to see
benefits on amd64 also and for intel there should be no impact'.

It's probably also worth actually going and checking to make sure that this is
the case re: other arches. See below on that...

>
> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>
> After the patchset:
> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>
> Observing T1/T2 and T3 before the patchset, we also remove the regression
> introduced by ptep_get() on a contpte block. And, for large folios we get
> an almost 74% performance improvement, albeit the trade-off being a slight
> degradation in the small folio case.

This is nice, though order-0 is probably going to be your bread and butter no?

Having said that, mprotect() is not a hot path, this delta is small enough to
quite possibly just be noise, and personally I'm not all that bothered.

But let's run this same test on x86-64 too please and get some before/after
numbers just to confirm no major impact.

Thanks for including code.

>
> Here is the test program:
>
> #define _GNU_SOURCE
> #include <sys/mman.h>
> #include <stdlib.h>
> #include <string.h>
> #include <stdio.h>
> #include <unistd.h>
>
> #define SIZE (1024*1024*1024)
>
> unsigned long pmdsize = (1UL << 21);
> unsigned long pagesize = (1UL << 12);
>
> static void pte_map_thps(char *mem, size_t size)
> {
>         size_t offs;
>         int ret = 0;
>
>         /* PTE-map each THP by temporarily splitting the VMAs. */
>         for (offs = 0; offs < size; offs += pmdsize) {
>                 ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
>                 ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
>         }
>
>         if (ret) {
>                 fprintf(stderr, "ERROR: madvise() failed\n");
>                 exit(1);
>         }
> }
>
> int main(int argc, char *argv[])
> {
>         char *p;
>         int ret = 0;
>
>         p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (p != (void *)(1UL << 30)) {
>                 perror("mmap");
>                 return 1;
>         }
>
>         memset(p, 0, SIZE);
>         if (madvise(p, SIZE, MADV_NOHUGEPAGE))
>                 perror("madvise");
>         explicit_bzero(p, SIZE);
>         pte_map_thps(p, SIZE);
>
>         for (int loops = 0; loops < 40; loops++) {
>                 if (mprotect(p, SIZE, PROT_READ))
>                         perror("mprotect"), exit(1);
>                 if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
>                         perror("mprotect"), exit(1);
>                 explicit_bzero(p, SIZE);
>         }
> }
>
> ---
> The patchset is rebased onto Saturday's mm-new.
>
> v3->v4:
>  - Refactor skipping logic into a new function, edit patch 1 subject
>    to highlight it is only for MM_CP_PROT_NUMA case (David H)
>  - Refactor the optimization logic, add more documentation to the generic
>    batched functions, do not add clear_flush_ptes, squash patch 4
>    and 5 (Ryan)
>
> v2->v3:
>  - Add comments for the new APIs (Ryan, Lorenzo)
>  - Instead of refactoring, use a "skip_batch" label
>  - Move arm64 patches at the end (Ryan)
>  - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
>  - Resolve implicit declaration; tested build on x86 (Lance Yang)
>
> v1->v2:
>  - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
>  - Abridge the anon-exclusive condition (Lance Yang)
>
> Dev Jain (4):
>   mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
>   mm: Add batched versions of ptep_modify_prot_start/commit
>   mm: Optimize mprotect() by PTE-batching
>   arm64: Add batched versions of ptep_modify_prot_start/commit
>
>  arch/arm64/include/asm/pgtable.h |  10 ++
>  arch/arm64/mm/mmu.c              |  28 +++-
>  include/linux/pgtable.h          |  83 +++++++++-
>  mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
>  4 files changed, 315 insertions(+), 75 deletions(-)
>
> --
> 2.30.2
>
On 30/06/25 4:47 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
>> This patchset optimizes the mprotect() system call for large folios
>> by PTE-batching. No issues were observed with mm-selftests, build
>> tested on x86_64.
> Should also be tested on x86-64 not only build tested :)
>
> You are still not really giving details here, so same comment as your mremap()
> series, please explain why you're doing this, what for, what benefits you expect
> to achieve, where etc.
>
> E.g. 'this is designed to optimise mTHP cases on arm64, we expect to see
> benefits on amd64 also and for intel there should be no impact'.

Okay.

>
> It's probably also worth actually going and checking to make sure that this is
> the case re: other arches. See below on that...
>
>> We use the following test cases to measure performance, mprotect()'ing
>> the mapped memory to read-only then read-write 40 times:
>>
>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>> pte-mapping those THPs
>> Test case 2: Mapping 1G of memory with 64K mTHPs
>> Test case 3: Mapping 1G of memory with 4K pages
>>
>> Average execution time on arm64, Apple M3:
>> Before the patchset:
>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>
>> After the patchset:
>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>>
>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>> introduced by ptep_get() on a contpte block. And, for large folios we get
>> an almost 74% performance improvement, albeit the trade-off being a slight
>> degradation in the small folio case.
> This is nice, though order-0 is probably going to be your bread and butter no?
>
> Having said that, mprotect() is not a hot path, this delta is small enough to
> quite possibly just be noise, and personally I'm not all that bothered.

It is only the vm_normal_folio() + folio_test_large() overhead. Trying to
avoid this by the horrible maybe_contiguous_pte_pfns() I introduced somewhere
else is not worth it : )

>
> But let's run this same test on x86-64 too please and get some before/after
> numbers just to confirm no major impact.
>
> Thanks for including code.
>
>> Here is the test program:
>>
>> #define _GNU_SOURCE
>> #include <sys/mman.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> #define SIZE (1024*1024*1024)
>>
>> unsigned long pmdsize = (1UL << 21);
>> unsigned long pagesize = (1UL << 12);
>>
>> static void pte_map_thps(char *mem, size_t size)
>> {
>>         size_t offs;
>>         int ret = 0;
>>
>>         /* PTE-map each THP by temporarily splitting the VMAs. */
>>         for (offs = 0; offs < size; offs += pmdsize) {
>>                 ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
>>                 ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
>>         }
>>
>>         if (ret) {
>>                 fprintf(stderr, "ERROR: madvise() failed\n");
>>                 exit(1);
>>         }
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>         char *p;
>>         int ret = 0;
>>
>>         p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>         if (p != (void *)(1UL << 30)) {
>>                 perror("mmap");
>>                 return 1;
>>         }
>>
>>         memset(p, 0, SIZE);
>>         if (madvise(p, SIZE, MADV_NOHUGEPAGE))
>>                 perror("madvise");
>>         explicit_bzero(p, SIZE);
>>         pte_map_thps(p, SIZE);
>>
>>         for (int loops = 0; loops < 40; loops++) {
>>                 if (mprotect(p, SIZE, PROT_READ))
>>                         perror("mprotect"), exit(1);
>>                 if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
>>                         perror("mprotect"), exit(1);
>>                 explicit_bzero(p, SIZE);
>>         }
>> }
>>
>> ---
>> The patchset is rebased onto Saturday's mm-new.
>>
>> v3->v4:
>>  - Refactor skipping logic into a new function, edit patch 1 subject
>>    to highlight it is only for MM_CP_PROT_NUMA case (David H)
>>  - Refactor the optimization logic, add more documentation to the generic
>>    batched functions, do not add clear_flush_ptes, squash patch 4
>>    and 5 (Ryan)
>>
>> v2->v3:
>>  - Add comments for the new APIs (Ryan, Lorenzo)
>>  - Instead of refactoring, use a "skip_batch" label
>>  - Move arm64 patches at the end (Ryan)
>>  - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
>>  - Resolve implicit declaration; tested build on x86 (Lance Yang)
>>
>> v1->v2:
>>  - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
>>  - Abridge the anon-exclusive condition (Lance Yang)
>>
>> Dev Jain (4):
>>   mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
>>   mm: Add batched versions of ptep_modify_prot_start/commit
>>   mm: Optimize mprotect() by PTE-batching
>>   arm64: Add batched versions of ptep_modify_prot_start/commit
>>
>>  arch/arm64/include/asm/pgtable.h |  10 ++
>>  arch/arm64/mm/mmu.c              |  28 +++-
>>  include/linux/pgtable.h          |  83 +++++++++-
>>  mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
>>  4 files changed, 315 insertions(+), 75 deletions(-)
>>
>> --
>> 2.30.2
>>
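[Note: the following is a purely illustrative userspace toy model of the
batching idea discussed above, not code from the series. "PTEs" are modelled
as pfn+prot pairs, and the loop rewrites a run of entries with consecutive
pfns and identical protections as one batch, the way consecutive PTEs mapping
one large folio can be handled together. The per-entry lookup that decides
whether such a run exists at all is the small extra cost for order-0 mappings
that Dev refers to (vm_normal_folio() + folio_test_large()).]

#include <stdio.h>

struct fake_pte {
        unsigned long pfn;
        unsigned int prot;      /* e.g. 1 = read-only, 3 = read-write */
};

/* Length of the run of consecutive-pfn, same-prot entries starting at i. */
static int batch_len(const struct fake_pte *ptes, int i, int nr)
{
        int len = 1;

        while (i + len < nr &&
               ptes[i + len].pfn == ptes[i].pfn + len &&
               ptes[i + len].prot == ptes[i].prot)
                len++;
        return len;
}

int main(void)
{
        struct fake_pte ptes[8] = {
                { 100, 3 }, { 101, 3 }, { 102, 3 }, { 103, 3 }, /* one "large folio" */
                { 500, 3 }, { 900, 3 }, { 901, 1 }, { 42, 3 },  /* order-0 / mixed */
        };
        int nr = 8;

        for (int i = 0; i < nr; ) {
                int len = batch_len(ptes, i, nr);

                /* One "protection change" covers the whole batch. */
                for (int j = 0; j < len; j++)
                        ptes[i + j].prot = 1;
                printf("changed %d entr%s starting at pfn %lu\n",
                       len, len == 1 ? "y" : "ies", ptes[i].pfn);
                i += len;
        }
        return 0;
}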
To reiterate what I said on 1/4 - overall since this series conflicts with David's changes - can we hold off on any respin please until David's settles and lands in mm-new at least? Thanks.
On 30/06/25 4:57 pm, Lorenzo Stoakes wrote:
> To reiterate what I said on 1/4 - overall since this series conflicts with
> David's changes - can we hold off on any respin please until David's
> settles and lands in mm-new at least?
>
> Thanks.

I agree, David's series should be stable by the time I get ready to post the
next version. @Andrew, could you remove this from mm-new please?