[PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Usama Arif 4 weeks, 1 day ago
On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
into a single iTLB entry, reducing iTLB pressure for large executable
mappings.

exec_folio_order() was introduced [1] to request readahead at an
arch-preferred folio order for executable memory, enabling contpte
mapping on the fault path.

However, several things prevent this from working optimally on 16K and
64K page configurations:

1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
   produces the optimal contpte order for 4K pages. For 16K pages it
   returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
   returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
   using ilog2(CONT_PTES) which evaluates to the optimal order for all
   page sizes.
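
   For reference, the order arithmetic can be sanity-checked in
   userspace (a standalone sketch, not kernel code; the CONT_PTE spans
   are the arm64 values quoted above, hard-coded here):

   ```c
   /* Reproduces the order arithmetic from point 1. The contpte span
    * on arm64 is 64K with 4K pages and 2M with 16K/64K pages. */
   #include <assert.h>
   #include <stdio.h>

   #define SZ_64K (64u << 10)
   #define SZ_2M  (2u << 20)

   static unsigned int ilog2(unsigned int x)
   {
   	return 31 - __builtin_clz(x);
   }

   int main(void)
   {
   	unsigned int page_size[] = { 4u << 10, 16u << 10, 64u << 10 };
   	unsigned int cont_span[] = { SZ_64K, SZ_2M, SZ_2M };

   	for (int i = 0; i < 3; i++) {
   		/* current code: always based on a 64K span */
   		unsigned int old_order = ilog2(SZ_64K / page_size[i]);
   		/* patch 1: based on CONT_PTES for this page size */
   		unsigned int cont_ptes = cont_span[i] / page_size[i];
   		unsigned int new_order = ilog2(cont_ptes);

   		printf("%2uK pages: old order %u (%4uK), new order %u (%4uK)\n",
   		       page_size[i] >> 10, old_order,
   		       (page_size[i] << old_order) >> 10,
   		       new_order, (page_size[i] << new_order) >> 10);
   	}
   	return 0;
   }
   ```

   which prints order 4/4 for 4K, 2/7 for 16K and 0/5 for 64K pages,
   matching the numbers above.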

2. Even with the optimal order, the mmap_miss heuristic in
   do_sync_mmap_readahead() silently disables exec readahead after 100
   page faults. The mmap_miss counter tracks whether readahead is useful
   for mmap'd file access:

   - Incremented by 1 in do_sync_mmap_readahead() on every page cache
     miss (page needed IO).

   - Decremented by N in filemap_map_pages() for N pages successfully
     mapped via fault-around (pages found in cache without faulting,
     evidence that readahead was useful). Only non-workingset pages
     count; recently evicted and re-read pages don't count as hits.

   - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
     marker page is found (indicates sequential consumption of readahead
     pages).

   When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
   disabled. On 64K pages, both decrement paths are inactive:

   - filemap_map_pages() is never called because fault_around_pages
     (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
     requires fault_around_pages > 1. With only 1 page in the
     fault-around window, there is nothing "around" to map.

   - do_async_mmap_readahead() never fires for exec mappings because
     exec readahead sets async_size = 0, so no PG_readahead markers
     are placed.

   With no decrements, mmap_miss monotonically increases past
   MMAP_LOTSAMISS after 100 faults, disabling exec readahead
   for the remainder of the mapping.
   Patch 2 fixes this by moving the VM_EXEC readahead block
   above the mmap_miss check, since exec readahead is targeted (one
   folio at the fault location, async_size=0), not speculative prefetch.
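
   A toy model of the accounting above (not the kernel implementation;
   it just encodes the increment and the two dead decrement paths as
   described):

   ```c
   /* Model of why exec readahead dies on 64K pages: the miss counter
    * is incremented on every fault, but both decrement paths are
    * inactive, so it crosses MMAP_LOTSAMISS and never comes back. */
   #include <assert.h>
   #include <stdio.h>

   #define MMAP_LOTSAMISS 100

   int main(void)
   {
   	unsigned int mmap_miss = 0;
   	int disabled_at = -1;

   	for (int fault = 1; fault <= 200; fault++) {
   		/* do_sync_mmap_readahead(): +1 per page cache miss */
   		mmap_miss++;
   		/* filemap_map_pages() decrement: never runs, since
   		 * fault_around_pages == 1 on 64K pages */
   		/* do_async_mmap_readahead() decrement: never runs,
   		 * since exec readahead sets async_size = 0 */
   		if (mmap_miss > MMAP_LOTSAMISS && disabled_at < 0)
   			disabled_at = fault;
   	}
   	/* readahead is disabled from fault 101 onward */
   	printf("exec readahead disabled from fault %d onward\n",
   	       disabled_at);
   	return 0;
   }
   ```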

3. Even with correct folio order and readahead, contpte mapping requires
   the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
   The readahead path aligns file offsets and the buddy allocator aligns
   physical memory, but the virtual address depends on the VMA start.
   For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
   granularity, giving only a 1/32 chance of 2M alignment. When
   misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
   any folio in the VMA, resulting in zero iTLB coalescing benefit.

   Patch 3 fixes this for the main binary by bumping the ELF loader's
   alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.

   Patch 4 fixes this for shared libraries by adding a contpte-size
   alignment fallback in thp_get_unmapped_area_vmflags(). The existing
   PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
   libraries, so this smaller fallback (2M) succeeds where PMD fails.
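
   The 1/32 figure in point 3 can be checked directly (standalone
   sketch, assuming 64K ASLR granularity and 2M CONT_PTE_SIZE as
   stated above):

   ```c
   /* Count how many 64K-granular load addresses within one 2M window
    * happen to be 2M-aligned: only the zero residue qualifies. */
   #include <assert.h>
   #include <stdio.h>

   #define SZ_64K (64ul << 10)
   #define SZ_2M  (2ul << 20)

   int main(void)
   {
   	unsigned long slots = SZ_2M / SZ_64K;	/* 32 possible residues */
   	unsigned long aligned = 0;

   	for (unsigned long off = 0; off < SZ_2M; off += SZ_64K)
   		if ((off & (SZ_2M - 1)) == 0)	/* CONT_PTE_SIZE aligned? */
   			aligned++;

   	printf("%lu of %lu 64K-aligned bases are 2M-aligned (1/%lu)\n",
   	       aligned, slots, slots / aligned);
   	return 0;
   }
   ```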

I created a benchmark that mmaps a large executable file and calls
RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
fault + readahead cost. "Random" first faults in all pages with a
sequential sweep (not measured), then measures time for calling random
offsets, isolating iTLB miss cost for scattered execution.

The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:

  Phase      | Baseline     | Patched      | Improvement
  -----------|--------------|--------------|------------------
  Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
  Random     | 76.0 ms      | 58.3 ms      | 23% faster

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Usama Arif (4):
  arm64: request contpte-sized folios for exec memory
  mm: bypass mmap_miss heuristic for VM_EXEC readahead
  elf: align ET_DYN base to exec folio order for contpte mapping
  mm: align file-backed mmap to exec folio order in
    thp_get_unmapped_area

 arch/arm64/include/asm/pgtable.h |  9 ++--
 fs/binfmt_elf.c                  | 15 +++++++
 mm/filemap.c                     | 72 +++++++++++++++++---------------
 mm/huge_memory.c                 | 17 ++++++++
 4 files changed, 75 insertions(+), 38 deletions(-)

-- 
2.47.3
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by WANG Rui 3 weeks, 4 days ago
I only just realized your focus was on 64K normal pages; what I was
referring to here is AArch64 with 4K normal pages.

Sorry about the earlier numbers. They were a bit low precision.
RK3399 has pretty limited PMU events, and it looks like it can’t
collect events from the A53 and A72 clusters at the same time, so
I reran the measurements on the A53.

Even though the A53 backend isn’t very wide, we can still see the
impact from iTLB pressure. With 4K pages, aligning the code to PMD
size (2M) performs slightly better than 64K.

Binutils: 2.46
GCC: 15.2.1 (--enable-host-pie)

Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.
Loop: 5

                Base                 Patchset [1]         Patchset [2]
instructions    1,994,512,163,037    1,994,528,896,322    1,994,536,148,574
cpu-cycles      6,890,054,789,351    6,870,685,379,047    6,720,442,248,967
                                              ~ -0.28%             ~ -2.46%
itlb-misses           579,692,117          455,848,211           43,814,795
                                             ~ -21.36%            ~ -92.44%
time elapsed            1331.15 s            1325.50 s            1296.35 s
                                              ~ -0.42%             ~ -2.61%

Maybe we could make exec_folio_order() choose a different folio size
depending on the configuration, conditionally in some way, for example
based on the size of the code segment?

[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc

Thanks,
Rui
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Usama Arif 3 weeks ago

On 14/03/2026 12:50, WANG Rui wrote:
> I only just realized your focus was on 64K normal pages, what I was
> referring to here is AArch64 with 4K normal pages.
> 
> Sorry about the earlier numbers. They were a bit low precision.
> RK3399 has pretty limited PMU events, and it looks like it can’t
> collect events from the A53 and A72 clusters at the same time, so
> I reran the measurements on the A53.
> 
> Even though the A53 backend isn’t very wide, we can still see the
> impact from iTLB pressure. With 4K pages, aligning the code to PMD
> size (2M) performs slightly better than 64K.
> 
> Binutils: 2.46
> GCC: 15.2.1 (--enable-host-pie)
> 
> Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.
> Loop: 5
> 
>                 Base                 Patchset [1]         Patchset [2]
> instructions    1,994,512,163,037    1,994,528,896,322    1,994,536,148,574
> cpu-cycles      6,890,054,789,351    6,870,685,379,047    6,720,442,248,967
>                                               ~ -0.28%             ~ -2.46%
> itlb-misses           579,692,117          455,848,211           43,814,795
>                                              ~ -21.36%            ~ -92.44%
> time elapsed            1331.15 s            1325.50 s            1296.35 s
>                                               ~ -0.42%             ~ -2.61%
> 


Thanks for running these! Just wanted to check: what is the base page
size of this experiment?

Of course PMD is going to perform better than TLB coalescing (the page
fault itself walks one less page table level). But it's a tradeoff
between memory pressure + reduced ASLR vs performance. As Ryan pointed
out in [1], even 2M for 16K base page size might introduce too much
memory pressure for Android phones, and the PMD size for 16K is 32M!

[1] https://lore.kernel.org/all/cfdfca9c-4752-4037-a289-03e6e7a00d47@arm.com/

> Maybe we could make exec_folio_order() choose a different folio size
> depending on the configuration, conditionally in some way, for example
> based on the size of the code segment?

Yeah, I think introducing a Kconfig knob might be an option.

> 
> [1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
> [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> 
> Thanks,
> Rui

Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by WANG Rui 3 weeks ago
> Thanks for running these! Just wanted to check what is the base page size
> of this experiment?

base page size: 4K

> Yeah, I think introducing a Kconfig knob might be an option.

I wonder if it would make sense for exec_folio_order() to vary the
order based on the code size, instead of always returning a fixed
value for a given architecture and base page size.

For example, on AArch64 with 4K base pages, in the load_elf_binary()
case: if exec_folio_order() only ever returns cont-PTE (64K), we may
miss the opportunity to use PMD mappings. On the other hand, if it
always returns PMD (2M), then for binaries smaller than 2M we end up
reducing ASLR entropy.

Maybe something along these lines would work better:

unsigned int exec_folio_order(size_t code_size)
{
#if PAGE_SIZE == 4096
    if (code_size >= PMD_SIZE)
        return ilog2(SZ_2M >> PAGE_SHIFT);
    else if (code_size >= SZ_64K)
        return ilog2(SZ_64K >> PAGE_SHIFT);
    else
        return 0;
#elif PAGE_SIZE == 16384
    ...
#elif PAGE_SIZE == ...
    /* let the arch cap the max order here, rather
       than hard-coding it at the use sites */
#endif
}

Thanks,
Rui
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by hev 3 weeks, 5 days ago
From: WANG Rui <r@hev.cc>

Hi,

I ran a quick bench on RK3399:

Binutils: 2.46
GCC: 15.2.1 (--enable-host-pie)

Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.

                Base                 Patchset [1]         Patchset [2]
instructions    3,115,852,636,773    3,194,533,947,809    3,235,417,205,947
cpu-cycles      8,374,429,970,450    8,457,398,871,141    8,323,881,987,768
itlb-misses         9,250,336,037        8,033,415,293        2,946,152,935
                                             ~ -13.16%            ~ -68.15%
time elapsed             610.51 s             605.12 s             593.83 s

[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc

Should we prefer PMD alignment here?

Thanks,
Rui
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Ryan Roberts 3 weeks, 5 days ago
On 10/03/2026 14:51, Usama Arif wrote:
> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> into a single iTLB entry, reducing iTLB pressure for large executable
> mappings.
> 
> exec_folio_order() was introduced [1] to request readahead at an
> arch-preferred folio order for executable memory, enabling contpte
> mapping on the fault path.
> 
> However, several things prevent this from working optimally on 16K and
> 64K page configurations:
> 
> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>    produces the optimal contpte order for 4K pages. For 16K pages it
>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>    returns order 0 (64K) instead of order 5 (2M). 

This was deliberate, although perhaps a bit conservative. I was concerned about
the possibility of read amplification; pointlessly reading in a load of memory
that never actually gets used. And that is independent of page size.

2M seems quite big as a default IMHO, I could imagine Android might complain
about memory pressure in their 16K config, for example.

Additionally, ELF files are normally only aligned to 64K and you can only get
the TLB benefits if the memory is aligned in physical and virtual memory.

> Patch 1 fixes this by
>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>    page sizes.
> 
> 2. Even with the optimal order, the mmap_miss heuristic in
>    do_sync_mmap_readahead() silently disables exec readahead after 100
>    page faults. The mmap_miss counter tracks whether readahead is useful
>    for mmap'd file access:
> 
>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>      miss (page needed IO).
> 
>    - Decremented by N in filemap_map_pages() for N pages successfully
>      mapped via fault-around (pages found in cache without faulting,
>      evidence that readahead was useful). Only non-workingset pages
>      count and recently evicted and re-read pages don't count as hits.
> 
>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>      marker page is found (indicates sequential consumption of readahead
>      pages).
> 
>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>    disabled. On 64K pages, both decrement paths are inactive:
> 
>    - filemap_map_pages() is never called because fault_around_pages
>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>      requires fault_around_pages > 1. With only 1 page in the
>      fault-around window, there is nothing "around" to map.
> 
>    - do_async_mmap_readahead() never fires for exec mappings because
>      exec readahead sets async_size = 0, so no PG_readahead markers
>      are placed.
> 
>    With no decrements, mmap_miss monotonically increases past
>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>    for the remainder of the mapping.
>    Patch 2 fixes this by moving the VM_EXEC readahead block
>    above the mmap_miss check, since exec readahead is targeted (one
>    folio at the fault location, async_size=0) not speculative prefetch.

Interesting!

> 
> 3. Even with correct folio order and readahead, contpte mapping requires
>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>    The readahead path aligns file offsets and the buddy allocator aligns
>    physical memory, but the virtual address depends on the VMA start.
>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>    granularity, giving only a 1/32 chance of 2M alignment. When
>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
> 
>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
> 
>    Patch 4 fixes this for shared libraries by adding a contpte-size
>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>    libraries, so this smaller fallback (2M) succeeds where PMD fails.

I don't see how you can reliably influence this from the kernel? The ELF file
alignment is, by default, 64K (16K on Android) and there is no guarantee that
the text section is the first section in the file. You need to align the start
of the text section to the 2M boundary and to do that, you'll need to align the
start of the file to some 64K boundary at a specific offset to the 2M boundary,
based on the size of any sections before the text section. That's a job for the
dynamic loader I think? Perhaps I've misunderstood what you're doing...

> 
> I created a benchmark that mmaps a large executable file and calls
> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> fault + readahead cost. "Random" first faults in all pages with a
> sequential sweep (not measured), then measures time for calling random
> offsets, isolating iTLB miss cost for scattered execution.
> 
> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> 512MB executable file on ext4, averaged over 3 runs:
> 
>   Phase      | Baseline     | Patched      | Improvement
>   -----------|--------------|--------------|------------------
>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>   Random     | 76.0 ms      | 58.3 ms      | 23% faster

I think the proper way to do this is to link the text section with 2M alignment
and have the dynamic linker mark the region with MADV_HUGEPAGE?

Thanks,
Ryan


> 
> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
> 
> Usama Arif (4):
>   arm64: request contpte-sized folios for exec memory
>   mm: bypass mmap_miss heuristic for VM_EXEC readahead
>   elf: align ET_DYN base to exec folio order for contpte mapping
>   mm: align file-backed mmap to exec folio order in
>     thp_get_unmapped_area
> 
>  arch/arm64/include/asm/pgtable.h |  9 ++--
>  fs/binfmt_elf.c                  | 15 +++++++
>  mm/filemap.c                     | 72 +++++++++++++++++---------------
>  mm/huge_memory.c                 | 17 ++++++++
>  4 files changed, 75 insertions(+), 38 deletions(-)
>
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by WANG Rui 3 weeks, 4 days ago
Hi Ryan,

> I don't see how you can reliably influence this from the kernel? The ELF file
> alignment is, by default, 64K (16K on Android) and there is no guarantee that
> the text section is the first section in the file. You need to align the start
> of the text section to the 2M boundary and to do that, you'll need to align the
> start of the file to some 64K boundary at a specific offset to the 2M boundary,
> based on the size of any sections before the text section. That's a job for the
> dynamic loader I think? Perhaps I've misunderstood what you're doing...

On Arch Linux for AArch64 and LoongArch64 I've observed that most
binaries place the executable segment in the first PT_LOAD. In that
case both the virtual address and file offset are 0, which happens to
satisfy the alignment requirements for PMD-sized or large folio
mappings.

x86 looks quite different. The executable segment is usually not the
first one.

After digging into this I realized this mostly comes from the linker
defaults. With GNU ld, -z noseparate-code merges the read-only and
read-only-executable segments into one, while -z separate-code
splits them, placing a non-executable read-only segment first. The
latter is the default on x86, partly to avoid making the ELF headers
executable when mappings start from the beginning of the file.

Other architectures tend to default to -z noseparate-code, which
makes it more likely that the text segment is the first PT_LOAD.

LLVM lld behaves differently again: --rosegment (equivalent to
-z separate-code) is enabled by default on all architectures, which
similarly tends to place the executable segment after an initial
read-only one. That default is clearly less friendly to large-page
text mappings.

Thanks,
Rui
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Usama Arif 3 weeks, 5 days ago
On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:

> On 10/03/2026 14:51, Usama Arif wrote:
> > On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> > into a single iTLB entry, reducing iTLB pressure for large executable
> > mappings.
> > 
> > exec_folio_order() was introduced [1] to request readahead at an
> > arch-preferred folio order for executable memory, enabling contpte
> > mapping on the fault path.
> > 
> > However, several things prevent this from working optimally on 16K and
> > 64K page configurations:
> > 
> > 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
> >    produces the optimal contpte order for 4K pages. For 16K pages it
> >    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
> >    returns order 0 (64K) instead of order 5 (2M). 
> 
> This was deliberate, although perhaps a bit conservative. I was concerned about
> the possibility of read amplification; pointlessly reading in a load of memory
> that never actually gets used. And that is independent of page size.
> 
> 2M seems quite big as a default IMHO, I could imagine Android might complain
> about memory pressure in their 16K config, for example.
> 

The force_thp_readahead path in do_sync_mmap_readahead() reads at
HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
non VM_RAND_READ mappings (ra->size *= 2), with async readahead
enabled. exec_folio_order() is more conservative: a single 2M folio
with async_size=0, no speculative prefetch. So I think the memory
pressure would not be worse than what x86 has?

For memory pressure on Android 16K: the readahead is clamped to VMA
boundaries, so a small shared library won't read 2M.
page_cache_ra_order() reduces folio order near EOF and on allocation
failure, so the 2M order is a preference, not a guarantee with the
current code?

> Additionally, ELF files are normally only aligned to 64K and you can only get
> the TLB benefits if the memory is aligned in physical and virtual memory.
> 
> > Patch 1 fixes this by
> >    using ilog2(CONT_PTES) which evaluates to the optimal order for all
> >    page sizes.
> > 
> > 2. Even with the optimal order, the mmap_miss heuristic in
> >    do_sync_mmap_readahead() silently disables exec readahead after 100
> >    page faults. The mmap_miss counter tracks whether readahead is useful
> >    for mmap'd file access:
> > 
> >    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
> >      miss (page needed IO).
> > 
> >    - Decremented by N in filemap_map_pages() for N pages successfully
> >      mapped via fault-around (pages found in cache without faulting,
> >      evidence that readahead was useful). Only non-workingset pages
> >      count and recently evicted and re-read pages don't count as hits.
> > 
> >    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
> >      marker page is found (indicates sequential consumption of readahead
> >      pages).
> > 
> >    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
> >    disabled. On 64K pages, both decrement paths are inactive:
> > 
> >    - filemap_map_pages() is never called because fault_around_pages
> >      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
> >      requires fault_around_pages > 1. With only 1 page in the
> >      fault-around window, there is nothing "around" to map.
> > 
> >    - do_async_mmap_readahead() never fires for exec mappings because
> >      exec readahead sets async_size = 0, so no PG_readahead markers
> >      are placed.
> > 
> >    With no decrements, mmap_miss monotonically increases past
> >    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
> >    for the remainder of the mapping.
> >    Patch 2 fixes this by moving the VM_EXEC readahead block
> >    above the mmap_miss check, since exec readahead is targeted (one
> >    folio at the fault location, async_size=0) not speculative prefetch.
> 
> Interesting!
> 
> > 
> > 3. Even with correct folio order and readahead, contpte mapping requires
> >    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
> >    The readahead path aligns file offsets and the buddy allocator aligns
> >    physical memory, but the virtual address depends on the VMA start.
> >    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
> >    granularity, giving only a 1/32 chance of 2M alignment. When
> >    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
> >    any folio in the VMA, resulting in zero iTLB coalescing benefit.
> > 
> >    Patch 3 fixes this for the main binary by bumping the ELF loader's
> >    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
> > 
> >    Patch 4 fixes this for shared libraries by adding a contpte-size
> >    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
> >    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
> >    libraries, so this smaller fallback (2M) succeeds where PMD fails.
> 
> I don't see how you can reliably influence this from the kernel? The ELF file
> alignment is, by default, 64K (16K on Android) and there is no guarantee that
> the text section is the first section in the file. You need to align the start
> of the text section to the 2M boundary and to do that, you'll need to align the
> start of the file to some 64K boundary at a specific offset to the 2M boundary,
> based on the size of any sections before the text section. That's a job for the
> dynamic loader I think? Perhaps I've misunderstood what you're doing...
>

I only started looking into how this works a few days before sending these
patches, so I could be wrong (please do correct me if that's the case!)

For the main binary (patch 3): load_elf_binary() controls load_bias.
Each PT_LOAD segment is mapped at load_bias + p_vaddr via elf_map().
The alignment variable feeds directly into load_bias calculation.
If p_vaddr=0 and p_offset=0, mapped_addr = load_bias + 0 = load_bias. By
ensuring load_bias is folio size aligned, the text segment's virtual address
is also folio size aligned.

For shared libraries (patch 4): ld.so loads these via mmap(), and the
kernel's get_unmapped_area callback (thp_get_unmapped_area for ext4,
xfs, btrfs) picks the virtual address. The existing code tries
PMD_SIZE alignment first (512M on 64K pages), which is too large for
typical shared libraries and always fails. Patch 4 adds a fallback
that tries folio-size alignment (2M), which is small enough to succeed
for most libraries.

> > 
> > I created a benchmark that mmaps a large executable file and calls
> > RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> > fault + readahead cost. "Random" first faults in all pages with a
> > sequential sweep (not measured), then measures time for calling random
> > offsets, isolating iTLB miss cost for scattered execution.
> > 
> > The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> > 512MB executable file on ext4, averaged over 3 runs:
> > 
> >   Phase      | Baseline     | Patched      | Improvement
> >   -----------|--------------|--------------|------------------
> >   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
> >   Random     | 76.0 ms      | 58.3 ms      | 23% faster
> 
> I think the proper way to do this is to link the text section with 2M alignment
> and have the dynamic linker mark the region with MADV_HUGEPAGE?
> 

On arm64 with 64K pages, the force_thp_readahead path triggered by
MADV_HUGEPAGE reads at HPAGE_PMD_ORDER (512M). Even with file and anon
collapse support added to khugepaged, the collapse won't happen from
the start.

Yes, I think the dynamic linker is also a good alternative approach, as
in Wang's patches [1]. But doing it in the kernel would be more
transparent?

[1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html

> Thanks,
> Ryan
> 
> 
> > 
> > [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
> > 
> > Usama Arif (4):
> >   arm64: request contpte-sized folios for exec memory
> >   mm: bypass mmap_miss heuristic for VM_EXEC readahead
> >   elf: align ET_DYN base to exec folio order for contpte mapping
> >   mm: align file-backed mmap to exec folio order in
> >     thp_get_unmapped_area
> > 
> >  arch/arm64/include/asm/pgtable.h |  9 ++--
> >  fs/binfmt_elf.c                  | 15 +++++++
> >  mm/filemap.c                     | 72 +++++++++++++++++---------------
> >  mm/huge_memory.c                 | 17 ++++++++
> >  4 files changed, 75 insertions(+), 38 deletions(-)
> > 
> 
>
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Usama Arif 3 weeks ago
On Fri, 13 Mar 2026 13:55:38 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> > On 10/03/2026 14:51, Usama Arif wrote:
> > > On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> > > into a single iTLB entry, reducing iTLB pressure for large executable
> > > mappings.
> > > 
> > > exec_folio_order() was introduced [1] to request readahead at an
> > > arch-preferred folio order for executable memory, enabling contpte
> > > mapping on the fault path.
> > > 
> > > However, several things prevent this from working optimally on 16K and
> > > 64K page configurations:
> > > 
> > > 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
> > >    produces the optimal contpte order for 4K pages. For 16K pages it
> > >    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
> > >    returns order 0 (64K) instead of order 5 (2M). 
> > 
> > This was deliberate, although perhaps a bit conservative. I was concerned about
> > the possibility of read amplification; pointlessly reading in a load of memory
> > that never actually gets used. And that is independent of page size.
> > 
> > 2M seems quite big as a default IMHO, I could imagine Android might complain
> > about memory pressure in their 16K config, for example.
> > 
> 
> The force_thp_readahead path in do_sync_mmap_readahead() reads at
> HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
> non VM_RAND_READ mappings (ra->size *= 2), with async readahead
> enabled. exec_folio_order() is more conservative: a single 2M folio
> with async_size=0, no speculative prefetch. So I think the memory
> pressure would not be worse than what x86 has?
> 
> For memory pressure on Android 16K: the readahead is clamped to VMA
> boundaries, so a small shared library won't read 2M.
> page_cache_ra_order() reduces folio order near EOF and on allocation
> failure, so the 2M order is a preference, not a guarantee with the
> current code?
> 

I am not a big fan of introducing Kconfig options, but would
CONFIG_EXEC_FOLIO_ORDER with the default value being 64K be a better
solution? Or maybe a default of 64K for 4K and 16K base page sizes,
but 2M for 64K pages, as the 64K base page size is mostly used on servers.

Using a default value of 64K would mean no change in behaviour.
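
Purely as a sketch of what I mean (the symbol name is hypothetical,
with defaults matching the 64K/64K/2M split above):

```
config EXEC_FOLIO_ORDER
	int "Folio order for executable file readahead"
	# order 4 = 64K with 4K pages
	default 4 if ARM64_4K_PAGES
	# order 2 = 64K with 16K pages
	default 2 if ARM64_16K_PAGES
	# order 5 = 2M with 64K pages
	default 5 if ARM64_64K_PAGES
```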
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by David Hildenbrand (Arm) 2 weeks, 6 days ago
On 3/18/26 11:52, Usama Arif wrote:
> On Fri, 13 Mar 2026 13:55:38 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> 
>> On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>>
>>> This was deliberate, although perhaps a bit conservative. I was concerned about
>>> the possibility of read amplification; pointlessly reading in a load of memory
>>> that never actually gets used. And that is independent of page size.
>>>
>>> 2M seems quite big as a default IMHO, I could imagine Android might complain
>>> about memory pressure in their 16K config, for example.
>>>
>>
>> The force_thp_readahead path in do_sync_mmap_readahead() reads at
>> HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
>> non VM_RAND_READ mappings (ra->size *= 2), with async readahead
>> enabled. exec_folio_order() is more conservative: a single 2M folio
>> with async_size=0, no speculative prefetch. So I think the memory
>> pressure would not be worse than what x86 has?
>>
>> For memory pressure on Android 16K: the readahead is clamped to VMA
>> boundaries, so a small shared library won't read 2M.
>> page_cache_ra_order() reduces folio order near EOF and on allocation
>> failure, so the 2M order is a preference, not a guarantee with the
>> current code?
>>
> 
> I am not a big fan of introducing Kconfig options, but would
> CONFIG_EXEC_FOLIO_ORDER with the default value being 64K be a better
> solution? Or maybe a default of 64K for 4K and 16K base page size,
> but 2M for 64K page size as 64K base page size is mostly for servers.
> 
> Using a default value of 64K would mean no change in behaviour.

I don't think such a tunable is the right approach. We should try to do
something smarter in the kernel.

We should have access to the mapping size and whether there is currently
real memory pressure. We might even know the estimated speed of the
device we're loading data from :)

-- 
Cheers,

David
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by David Hildenbrand (Arm) 3 weeks, 5 days ago
On 3/10/26 15:51, Usama Arif wrote:
> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> into a single iTLB entry, reducing iTLB pressure for large executable
> mappings.
> 
> exec_folio_order() was introduced [1] to request readahead at an
> arch-preferred folio order for executable memory, enabling contpte
> mapping on the fault path.
> 
> However, several things prevent this from working optimally on 16K and
> 64K page configurations:
> 
> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>    produces the optimal contpte order for 4K pages. For 16K pages it
>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>    returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>    page sizes.
> 
> 2. Even with the optimal order, the mmap_miss heuristic in
>    do_sync_mmap_readahead() silently disables exec readahead after 100
>    page faults. The mmap_miss counter tracks whether readahead is useful
>    for mmap'd file access:
> 
>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>      miss (page needed IO).
> 
>    - Decremented by N in filemap_map_pages() for N pages successfully
>      mapped via fault-around (pages found in cache without faulting,
>      evidence that readahead was useful). Only non-workingset pages
>      count; recently evicted and re-read pages don't count as hits.
> 
>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>      marker page is found (indicates sequential consumption of readahead
>      pages).
> 
>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>    disabled. On 64K pages, both decrement paths are inactive:
> 
>    - filemap_map_pages() is never called because fault_around_pages
>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>      requires fault_around_pages > 1. With only 1 page in the
>      fault-around window, there is nothing "around" to map.
> 
>    - do_async_mmap_readahead() never fires for exec mappings because
>      exec readahead sets async_size = 0, so no PG_readahead markers
>      are placed.
> 
>    With no decrements, mmap_miss monotonically increases past
>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>    for the remainder of the mapping.
>    Patch 2 fixes this by moving the VM_EXEC readahead block
>    above the mmap_miss check, since exec readahead is targeted (one
>    folio at the fault location, async_size=0) not speculative prefetch.
> 
> 3. Even with correct folio order and readahead, contpte mapping requires
>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>    The readahead path aligns file offsets and the buddy allocator aligns
>    physical memory, but the virtual address depends on the VMA start.
>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>    granularity, giving only a 1/32 chance of 2M alignment. When
>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
> 
>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
> 
>    Patch 4 fixes this for shared libraries by adding a contpte-size
>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>    libraries, so this smaller fallback (2M) succeeds where PMD fails.
> 
> I created a benchmark that mmaps a large executable file and calls
> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> fault + readahead cost. "Random" first faults in all pages with a
> sequential sweep (not measured), then measures time for calling random
> offsets, isolating iTLB miss cost for scattered execution.
> 
> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> 512MB executable file on ext4, averaged over 3 runs:
> 
>   Phase      | Baseline     | Patched      | Improvement
>   -----------|--------------|--------------|------------------
>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>   Random     | 76.0 ms      | 58.3 ms      | 23% faster

I'm curious: is a single order really what we want?

I'd instead assume that we might want to make decisions based on the
mapping size.

Assume you have a 128M mapping, wouldn't we want to use a different
alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?

-- 
Cheers,

David
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Usama Arif 3 weeks, 5 days ago

On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
> On 3/10/26 15:51, Usama Arif wrote:
>> [...]
> 
> I'm curious: is a single order really what we want?
> 
> I'd instead assume that we might want to make decisions based on the
> mapping size.
> 
> Assume you have a 128M mapping, wouldn't we want to use a different
> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
> 

So I see two benefits from this: faster page faults and better iTLB
coverage. IMHO page faults are not that big of a deal? If the text
section is hot, it won't be flushed out after faulting in. So the real
benefit comes from improved iTLB coverage.

For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
to something larger (say 128M) wouldn't give any additional TLB
coalescing, each 2M-aligned region independently qualifies for contpte.

Mappings smaller than 2M can't benefit from contpte regardless of
alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
hardware boundary and would add complexity without TLB benefit?
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by David Hildenbrand (Arm) 3 weeks, 2 days ago
On 3/13/26 20:59, Usama Arif wrote:
> 
> 
> On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
>> On 3/10/26 15:51, Usama Arif wrote:
>>> [...]
>>
>> I'm curious: is a single order really what we want?
>>
>> I'd instead assume that we might want to make decisions based on the
>> mapping size.
>>
>> Assume you have a 128M mapping, wouldn't we want to use a different
>> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
>>
> 
> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
> faults are not that big of a deal? If the text section is hot, it won't
> get flushed after faulting in. So the real benefit comes from improved
> iTLB coverage.
> 
> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
> to something larger (say 128M) wouldn't give any additional TLB
> coalescing, each 2M-aligned region independently qualifies for contpte.
> 
> Mappings smaller than 2M can't benefit from contpte regardless of
> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
> hardware boundary and adds complexity without TLB benefit?

I might be wrong, but I think you are mixing two things here:

(1) "Minimum" folio size (exec_folio_order())

(2) VMA alignment.


(2) should certainly be as large as (1), but assume we can get a 2M
folio on arm64 4k, why shouldn't we align it to 2M if the region is
reasonably sized, and use a PMD?


-- 
Cheers,

David
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by Usama Arif 3 weeks ago

On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
> On 3/13/26 20:59, Usama Arif wrote:
>>
>>
>> On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
>>> On 3/10/26 15:51, Usama Arif wrote:
>>>> [...]
>>>
>>> I'm curious: is a single order really what we want?
>>>
>>> I'd instead assume that we might want to make decisions based on the
>>> mapping size.
>>>
>>> Assume you have a 128M mapping, wouldn't we want to use a different
>>> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
>>>
>>
>> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
>> faults are not that big of a deal? If the text section is hot, it won't
>> get flushed after faulting in. So the real benefit comes from improved
>> iTLB coverage.
>>
>> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
>> to something larger (say 128M) wouldn't give any additional TLB
>> coalescing, each 2M-aligned region independently qualifies for contpte.
>>
>> Mappings smaller than 2M can't benefit from contpte regardless of
>> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
>> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
>> hardware boundary and adds complexity without TLB benefit?
> 
> I might be wrong, but I think you are mixing two things here:
> 
> (1) "Minimum" folio size (exec_folio_order())
> 
> (2) VMA alignment.
> 
> 
> (2) should certainly be as large as (1), but assume we can get a 2M
> folio on arm64 4k, why shouldn't we align it to 2M if the region is
> reasonably sized, and use a PMD?
> 
> 

So this series is tackling both (1) and (2). When I started making changes
to the code, what I wanted was 2M folios at fault with 64K base page size
to reduce iTLB misses. This is what patch 1 (and 2) will achieve.

Yes, completely agree, (2) should be as large as (1). I hadn't thought
about PMD size on 4K, which you pointed out. do_sync_mmap_readahead()
can give that with force_thp_readahead, so this should be supported.

But we shouldn't align to PMD size for all base page sizes. As Rui
pointed out, increasing the alignment size reduces ASLR entropy [1].
Should we cap the alignment at 2M?

[1] https://lore.kernel.org/all/20260313144213.95686-1-r@hev.cc/
Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Posted by David Hildenbrand (Arm) 3 weeks ago
On 3/18/26 11:41, Usama Arif wrote:
> 
> 
> On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
>> On 3/13/26 20:59, Usama Arif wrote:
>>>
>>>
>>>
>>> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
>>> faults are not that big of a deal? If the text section is hot, it wont
>>> get flushed after faulting in. So the real benefit comes from improved
>>> iTLB coverage.
>>>
>>> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
>>> to something larger (say 128M) wouldn't give any additional TLB
>>> coalescing, each 2M-aligned region independently qualifies for contpte.
>>>
>>> Mappings smaller than 2M can't benefit from contpte regardless of
>>> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
>>> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
>>> hardware boundary and adds complexity without TLB benefit?
>>
>> I might be wrong, but I think you are mixing two things here:
>>
>> (1) "Minimum" folio size (exec_folio_order())
>>
>> (2) VMA alignment.
>>
>>
>> (2) should certainly be as large as (1), but assume we can get a 2M
>> folio on arm64 4k, why shouldn't we align it to 2M if the region is
>> reasonably sized, and use a PMD?
>>
>>
> 
> So this series is tackling both (1) and (2). When I started making changes
> to the code, what I wanted was 2M folios at fault with 64K base page size
> to reduce iTLB misses. This is what patch 1 (and 2) will achieve.
> 
> Yes, completely agree, (2) should be as large as (1). I didn't think about
> PMD size on 4K which you pointed out. do_sync_mmap_readahead can give
> that with force_thp_readahead, so this should be supported.

In particular, imagine that hw starts optimizing transparently at other
granularities; then the "smallest granularity" (exec_folio_order())
decision will soon be wrong.

> 
> But we shouldn't align to PMD size for all base page sizes. As Rui pointed
> out, increasing alignment size reduces ASLR entropy [1]. Should we max
> alignment to 2M?

That's why I said that likely, as an input, we'd want to use the mapping
size or other heuristics.

We wouldn't want to align a 4k mapping to either 64k or 2M.

Long story short: the change in thp_get_unmapped_area_vmflags() needs
some thought IMHO.

-- 
Cheers,

David