On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
into a single iTLB entry, reducing iTLB pressure for large executable
mappings.
exec_folio_order() was introduced [1] to request readahead at an
arch-preferred folio order for executable memory, enabling contpte
mapping on the fault path.
However, several things prevent this from working optimally on 16K and
64K page configurations:
1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
produces the optimal contpte order for 4K pages. For 16K pages it
returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
using ilog2(CONT_PTES) which evaluates to the optimal order for all
page sizes.
2. Even with the optimal order, the mmap_miss heuristic in
do_sync_mmap_readahead() silently disables exec readahead after 100
page faults. The mmap_miss counter tracks whether readahead is useful
for mmap'd file access:
- Incremented by 1 in do_sync_mmap_readahead() on every page cache
miss (page needed IO).
- Decremented by N in filemap_map_pages() for N pages successfully
mapped via fault-around (pages found in cache without faulting,
evidence that readahead was useful). Only non-workingset pages
count as hits; recently evicted and re-read pages do not.
- Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
marker page is found (indicates sequential consumption of readahead
pages).
When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
disabled. On 64K pages, both decrement paths are inactive:
- filemap_map_pages() is never called because fault_around_pages
(65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
requires fault_around_pages > 1. With only 1 page in the
fault-around window, there is nothing "around" to map.
- do_async_mmap_readahead() never fires for exec mappings because
exec readahead sets async_size = 0, so no PG_readahead markers
are placed.
With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 faults, disabling exec readahead
for the remainder of the mapping.
Patch 2 fixes this by moving the VM_EXEC readahead block
above the mmap_miss check, since exec readahead is targeted (one
folio at the fault location, async_size = 0), not speculative prefetch.
3. Even with correct folio order and readahead, contpte mapping requires
the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
The readahead path aligns file offsets and the buddy allocator aligns
physical memory, but the virtual address depends on the VMA start.
For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
granularity, giving only a 1/32 chance of 2M alignment. When
misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
any folio in the VMA, resulting in zero iTLB coalescing benefit.
Patch 3 fixes this for the main binary by bumping the ELF loader's
alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
Patch 4 fixes this for shared libraries by adding a contpte-size
alignment fallback in thp_get_unmapped_area_vmflags(). The existing
PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
libraries, so this smaller fallback (2M) succeeds where PMD fails.
I created a benchmark that mmaps a large executable file and calls
RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
fault + readahead cost. "Random" first faults in all pages with a
sequential sweep (not measured), then measures time for calling random
offsets, isolating iTLB miss cost for scattered execution.
The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:
Phase | Baseline | Patched | Improvement
-----------|--------------|--------------|------------------
Cold fault | 83.4 ms | 41.3 ms | 50% faster
Random | 76.0 ms | 58.3 ms | 23% faster
[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
Usama Arif (4):
arm64: request contpte-sized folios for exec memory
mm: bypass mmap_miss heuristic for VM_EXEC readahead
elf: align ET_DYN base to exec folio order for contpte mapping
mm: align file-backed mmap to exec folio order in
thp_get_unmapped_area
arch/arm64/include/asm/pgtable.h | 9 ++--
fs/binfmt_elf.c | 15 +++++++
mm/filemap.c | 72 +++++++++++++++++---------------
mm/huge_memory.c | 17 ++++++++
4 files changed, 75 insertions(+), 38 deletions(-)
--
2.47.3
I only just realized your focus was on 64K normal pages; what I was
referring to here is AArch64 with 4K normal pages.
Sorry about the earlier numbers, they were a bit imprecise.
RK3399 has pretty limited PMU events, and it looks like it can’t
collect events from the A53 and A72 clusters at the same time, so
I reran the measurements on the A53.
Even though the A53 backend isn’t very wide, we can still see the
impact from iTLB pressure. With 4K pages, aligning the code to PMD
size (2M) performs slightly better than 64K.
Binutils: 2.46
GCC: 15.2.1 (--enable-host-pie)
Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.
Loop: 5
             Base               Patchset [1]       Patchset [2]
instructions 1,994,512,163,037  1,994,528,896,322  1,994,536,148,574
cpu-cycles   6,890,054,789,351  6,870,685,379,047  6,720,442,248,967
                                ~ -0.28%           ~ -2.46%
itlb-misses  579,692,117        455,848,211        43,814,795
                                ~ -21.36%          ~ -92.44%
time elapsed 1331.15 s          1325.50 s          1296.35 s
                                ~ -0.42%           ~ -2.61%
Maybe we could make exec_folio_order() choose a different folio size
depending on the configuration, and make it conditional in some way,
for example based on the size of the code segment?
[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
Thanks,
Rui
On 14/03/2026 12:50, WANG Rui wrote:
> [...]
>
> Even though the A53 backend isn’t very wide, we can still see the
> impact from iTLB pressure. With 4K pages, aligning the code to PMD
> size (2M) performs slightly better than 64K.
>
> [...]

Thanks for running these! Just wanted to check: what is the base page
size of this experiment?

Of course PMD is going to perform better than TLB coalescing (the page
fault itself will be one less page table level). But it's a tradeoff
between memory pressure + reduced ASLR vs performance. As Ryan pointed
out in [1], even 2M for 16K base page size might introduce too much
memory pressure for Android phones, and the PMD size for 16K is 32M!

[1] https://lore.kernel.org/all/cfdfca9c-4752-4037-a289-03e6e7a00d47@arm.com/

> Maybe we could make exec_folio_order() choose differently folio size
> depending on the configuration and conditional in some way, for example
> based on the size of the code segment?

Yeah, I think introducing Kconfig might be an option.
> Thanks for running these! Just wanted to check what is the base page size
> of this experiment?
base page size: 4K
> Yeah I think introducing Kconfig might be an option.
I wonder if it would make sense for exec_folio_order() to vary the
order based on the code size, instead of always returning a fixed
value for a given architecture and base page size.
For example, on AArch64 with 4K base pages, in the load_elf_binary()
case: if exec_folio_order() only ever returns cont-PTE (64K), we may
miss the opportunity to use PMD mappings. On the other hand, if it
always returns PMD (2M), then for binaries smaller than 2M we end up
reducing ASLR entropy.
Maybe something along these lines would work better:
unsigned int exec_folio_order(size_t code_size)
{
#if PAGE_SIZE == 4096
	if (code_size >= PMD_SIZE)
		return ilog2(SZ_2M >> PAGE_SHIFT);
	else if (code_size >= SZ_64K)
		return ilog2(SZ_64K >> PAGE_SHIFT);
	else
		return 0;
#elif PAGE_SIZE == 16384
	...
#elif PAGE_SIZE == ...
	/*
	 * let the arch cap the max order here, rather
	 * than hard-coding it at the use sites
	 */
#endif
}
Thanks,
Rui
From: WANG Rui <r@hev.cc>
Hi,
I ran a quick bench on RK3399:
Binutils: 2.46
GCC: 15.2.1 (--enable-host-pie)
Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.
             Base               Patchset [1]       Patchset [2]
instructions 3,115,852,636,773  3,194,533,947,809  3,235,417,205,947
cpu-cycles   8,374,429,970,450  8,457,398,871,141  8,323,881,987,768
itlb-misses  9,250,336,037      8,033,415,293      2,946,152,935
                                ~ -13.16%          ~ -68.15%
time elapsed 610.51 s           605.12 s           593.83 s
[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
Should we prefer PMD alignment here?
Thanks,
Rui
On 10/03/2026 14:51, Usama Arif wrote:
> [...]
> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>    produces the optimal contpte order for 4K pages. For 16K pages it
>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>    returns order 0 (64K) instead of order 5 (2M).

This was deliberate, although perhaps a bit conservative. I was concerned
about the possibility of read amplification; pointlessly reading in a load
of memory that never actually gets used. And that is independent of page
size.

2M seems quite big as a default IMHO, I could imagine Android might
complain about memory pressure in their 16K config, for example.

Additionally, ELF files are normally only aligned to 64K and you can only
get the TLB benefits if the memory is aligned in physical and virtual
memory.

> [...]
> Patch 2 fixes this by moving the VM_EXEC readahead block
> above the mmap_miss check, since exec readahead is targeted (one
> folio at the fault location, async_size=0) not speculative prefetch.

Interesting!

> [...]
> Patch 4 fixes this for shared libraries by adding a contpte-size
> alignment fallback in thp_get_unmapped_area_vmflags(). The existing
> PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
> libraries, so this smaller fallback (2M) succeeds where PMD fails.

I don't see how you can reliably influence this from the kernel? The ELF
file alignment is, by default, 64K (16K on Android) and there is no
guarantee that the text section is the first section in the file. You need
to align the start of the text section to the 2M boundary and to do that,
you'll need to align the start of the file to some 64K boundary at a
specific offset to the 2M boundary, based on the size of any sections
before the text section. That's a job for the dynamic loader I think?
Perhaps I've misunderstood what you're doing...

> [...]
> Phase      | Baseline     | Patched      | Improvement
> -----------|--------------|--------------|------------------
> Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
> Random     | 76.0 ms      | 58.3 ms      | 23% faster

I think the proper way to do this is to link the text section with 2M
alignment and have the dynamic linker mark the region with MADV_HUGEPAGE?

Thanks,
Ryan
Hi Ryan,

> I don't see how you can reliably influence this from the kernel? The ELF
> file alignment is, by default, 64K (16K on Android) and there is no
> guarantee that the text section is the first section in the file.
> [...] That's a job for the dynamic loader I think? Perhaps I've
> misunderstood what you're doing...

On Arch Linux for AArch64 and LoongArch64 I've observed that most
binaries place the executable segment in the first PT_LOAD. In that case
both the virtual address and file offset are 0, which happens to satisfy
the alignment requirements for PMD-sized or large folio mappings.

x86 looks quite different. The executable segment is usually not the
first one. After digging into this I realized this mostly comes from the
linker defaults. With GNU ld, -z noseparate-code merges the read-only
and read-only-executable segments into one, while -z separate-code
splits them, placing a non-executable read-only segment first. The
latter is the default on x86, partly to avoid making the ELF headers
executable when mappings start from the beginning of the file. Other
architectures tend to default to -z noseparate-code, which makes it more
likely that the text segment is the first PT_LOAD.

LLVM lld behaves differently again: --rosegment (equivalent to -z
separate-code) is enabled by default on all architectures, which
similarly tends to place the executable segment after an initial
read-only one. That default is clearly less friendly to large-page text
mappings.

Thanks,
Rui
On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> [...]
> This was deliberate, although perhaps a bit conservative. I was concerned
> about the possibility of read amplification; pointlessly reading in a load
> of memory that never actually gets used. And that is independent of page
> size.
>
> 2M seems quite big as a default IMHO, I could imagine Android might
> complain about memory pressure in their 16K config, for example.

The force_thp_readahead path in do_sync_mmap_readahead() reads at
HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
non VM_RAND_READ mappings (ra->size *= 2), with async readahead
enabled. exec_folio_order() is more conservative: a single 2M folio
with async_size = 0, no speculative prefetch. So I think the memory
pressure would not be worse than what x86 has?

For memory pressure on Android 16K: the readahead is clamped to VMA
boundaries, so a small shared library won't read 2M.
page_cache_ra_order() reduces folio order near EOF and on allocation
failure, so the 2M order is a preference, not a guarantee with the
current code?

> [...]
> I don't see how you can reliably influence this from the kernel?
> [...] That's a job for the dynamic loader I think? Perhaps I've
> misunderstood what you're doing...

I only started looking into how this works a few days before sending
these patches, so I could be wrong (please do correct me if that's the
case!)

For the main binary (patch 3): load_elf_binary() controls load_bias.
Each PT_LOAD segment is mapped at load_bias + p_vaddr via elf_map().
The alignment variable feeds directly into the load_bias calculation.
If p_vaddr=0 and p_offset=0, mapped_addr = load_bias + 0 = load_bias.
By ensuring load_bias is folio-size aligned, the text segment's virtual
address is also folio-size aligned.

For shared libraries (patch 4): ld.so loads these via mmap(), and the
kernel's get_unmapped_area callback (thp_get_unmapped_area for ext4,
xfs, btrfs) picks the virtual address. The existing code tries PMD_SIZE
alignment first (512M on 64K pages), which is too large for typical
shared libraries and always fails. Patch 4 adds a fallback that tries
folio-size alignment (2M), which is small enough to succeed for most
libraries.

> [...]
> I think the proper way to do this is to link the text section with 2M
> alignment and have the dynamic linker mark the region with
> MADV_HUGEPAGE?

On arm64 with 64K pages, the force_thp_readahead path triggered by
MADV_HUGEPAGE reads at HPAGE_PMD_ORDER (512M). Even with file and anon
khugepaged support added for khugepaged, the collapse won't happen from
the start.

Yes, I think the dynamic linker is also a good alternate approach, from
Wang's patches [1]. But doing it in the kernel would be more
transparent?

[1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
On Fri, 13 Mar 2026 13:55:38 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> > [...]
> > 2M seems quite big as a default IMHO, I could imagine Android might
> > complain about memory pressure in their 16K config, for example.
>
> The force_thp_readahead path in do_sync_mmap_readahead() reads at
> HPAGE_PMD_ORDER (2M on x86) [...]
>
> [...] page_cache_ra_order() reduces folio order near EOF and on
> allocation failure, so the 2M order is a preference, not a guarantee
> with the current code?

I am not a big fan of introducing Kconfig options, but would
CONFIG_EXEC_FOLIO_ORDER with the default value being 64K be a better
solution? Or maybe a default of 64K for 4K and 16K base page size, but
2M for 64K page size, as 64K base page size is mostly for servers.

Using a default value of 64K would mean no change in behaviour.
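For illustration only, such a fragment might look like the below; the symbol and its defaults are hypothetical, nothing like this exists upstream:

```
config EXEC_FOLIO_ORDER
	int "Preferred folio order for executable file readahead"
	# Hypothetical defaults for discussion: 2M on 64K base pages,
	# today's 64K behaviour preserved on 4K and 16K base pages.
	default 5 if ARM64_64K_PAGES
	default 2 if ARM64_16K_PAGES
	default 4
	help
	  Folio order used by exec_folio_order() when reading ahead
	  executable mappings.
```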
On 3/18/26 11:52, Usama Arif wrote:
> [...]
> I am not a big fan of introducing Kconfig options, but would
> CONFIG_EXEC_FOLIO_ORDER with the default value being 64K be a better
> solution? Or maybe a default of 64K for 4K and 16K base page size,
> but 2M for 64K page size as 64K base page size is mostly for servers.
>
> Using a default value of 64K would mean no change in behaviour.

I don't think such a tunable is the right approach. We should try to do
something smarter in the kernel. We should have access to the mapping
size and whether there is currently real memory pressure. We might even
know the estimated speed of the device we're loading data from :)

--
Cheers,

David
On 3/10/26 15:51, Usama Arif wrote:
> [...]
> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> 512MB executable file on ext4, averaged over 3 runs:
>
> Phase      | Baseline     | Patched      | Improvement
> -----------|--------------|--------------|------------------
> Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
> Random     | 76.0 ms      | 58.3 ms      | 23% faster

I'm curious: is a single order really what we want? I'd instead assume
that we might want to make decisions based on the mapping size.

Assume you have a 128M mapping, wouldn't we want to use a different
alignment than, say, for a 1M mapping, a 128K mapping or an 8K mapping?

--
Cheers,

David
On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
> On 3/10/26 15:51, Usama Arif wrote:
>> [...]
>
> I'm curious: is a single order really what we want?
>
> I'd instead assume that we might want to make decisions based on the
> mapping size.
>
> Assume you have a 128M mapping, wouldn't we want to use a different
> alignment than, say, for a 1M mapping, a 128K mapping or an 8K mapping?

So I see 2 benefits from this: page faults and iTLB coverage. IMHO page
faults are not that big of a deal? If the text section is hot, it won't
get flushed after faulting in. So the real benefit comes from improved
iTLB coverage.

For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
to something larger (say 128M) wouldn't give any additional TLB
coalescing; each 2M-aligned region independently qualifies for contpte.

Mappings smaller than 2M can't benefit from contpte regardless of
alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
hardware boundary and adds complexity without TLB benefit?
On 3/13/26 20:59, Usama Arif wrote:
> On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
>> [...]
>
> So I see 2 benefits from this: page faults and iTLB coverage. IMHO page
> faults are not that big of a deal? If the text section is hot, it won't
> get flushed after faulting in. So the real benefit comes from improved
> iTLB coverage.
>
> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
> to something larger (say 128M) wouldn't give any additional TLB
> coalescing; each 2M-aligned region independently qualifies for contpte.
>
> Mappings smaller than 2M can't benefit from contpte regardless of
> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
> hardware boundary and adds complexity without TLB benefit?

I might be wrong, but I think you are mixing two things here:

(1) "Minimum" folio size (exec_folio_order())

(2) VMA alignment.

(2) should certainly be as large as (1), but assume we can get a 2M
folio on arm64 4k, why shouldn't we align it to 2M if the region is
reasonably sized, and use a PMD?

--
Cheers,

David
On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
> On 3/13/26 20:59, Usama Arif wrote:
>> [...]
>
> I might be wrong, but I think you are mixing two things here:
>
> (1) "Minimum" folio size (exec_folio_order())
>
> (2) VMA alignment.
>
> (2) should certainly be as large as (1), but assume we can get a 2M
> folio on arm64 4k, why shouldn't we align it to 2M if the region is
> reasonably sized, and use a PMD?

So this series is tackling both (1) and (2). When I started making changes
to the code, what I wanted was 2M folios at fault with 64K base page size
to reduce iTLB misses. This is what patch 1 (and 2) will achieve.

Yes, completely agree, (2) should be as large as (1). I didn't think about
PMD size on 4K which you pointed out. do_sync_mmap_readahead() can give
that with force_thp_readahead, so this should be supported.

But we shouldn't align to PMD size for all base page sizes. As Rui pointed
out, increasing alignment size reduces ASLR entropy [1]. Should we cap the
alignment at 2M?

[1] https://lore.kernel.org/all/20260313144213.95686-1-r@hev.cc/
On 3/18/26 11:41, Usama Arif wrote:
> On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
>> [...]
>
> So this series is tackling both (1) and (2). When I started making changes
> to the code, what I wanted was 2M folios at fault with 64K base page size
> to reduce iTLB misses. This is what patch 1 (and 2) will achieve.
>
> Yes, completely agree, (2) should be as large as (1). I didn't think about
> PMD size on 4K which you pointed out. do_sync_mmap_readahead() can give
> that with force_thp_readahead, so this should be supported.

In particular, imagine if hw starts optimizing transparently on other
granularities, then the "smallest granularity" (exec_folio_order())
decision will soon be wrong.

> But we shouldn't align to PMD size for all base page sizes. As Rui pointed
> out, increasing alignment size reduces ASLR entropy [1]. Should we cap the
> alignment at 2M?

That's why I said that likely, as an input, we'd want to use the mapping
size or other heuristics. We wouldn't want to align a 4k mapping to
either 64k or 2M.

Long story short: the change in thp_get_unmapped_area_vmflags() needs
some thought IMHO.

--
Cheers,

David