[PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping

Usama Arif posted 4 patches 4 weeks, 1 day ago
For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.

Without proper virtual address alignment, the readahead patches that
allocate 2M folios with 2M-aligned file offsets and physical addresses
cannot benefit from contpte mapping. The contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned, and since the misalignment from vma->vm_start is constant
across all folios in the VMA, no folio gets the contiguous PTE bit
set, resulting in zero iTLB coalescing benefit.

Fix this by bumping the ELF alignment to PAGE_SIZE << exec_folio_order()
when the arch defines a non-zero exec_folio_order(). This ensures
load_bias is aligned to the folio size, so that file-offset-aligned
folios map to properly aligned virtual addresses.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 fs/binfmt_elf.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8e89cc5b28200..2d2b3e9fd474f 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -49,6 +49,7 @@
 #include <uapi/linux/rseq.h>
 #include <asm/param.h>
 #include <asm/page.h>
+#include <linux/pgtable.h>
 
 #ifndef ELF_COMPAT
 #define ELF_COMPAT 0
@@ -1106,6 +1107,20 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			/* Calculate any requested alignment. */
 			alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
 
+			/*
+			 * If the arch requested large folios for exec
+			 * memory via exec_folio_order(), ensure the
+			 * binary is mapped with sufficient alignment so
+			 * that virtual addresses of exec pages are
+			 * aligned to the folio boundary. Without this,
+			 * the hardware cannot coalesce PTEs (e.g. arm64
+			 * contpte) even though the physical memory and
+			 * file offset are correctly aligned.
+			 */
+			if (exec_folio_order())
+				alignment = max(alignment,
+					(unsigned long)PAGE_SIZE << exec_folio_order());
+
 			/**
 			 * DOC: PIE handling
 			 *
-- 
2.47.3
[PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
Posted by WANG Rui 3 weeks, 5 days ago
Hi Usama,

Glad to see you're pushing on this, I'm also following it. I first noticed this when rustc's perf regressed after a binutils upgrade. I'm trying to make ld.so THP-aware and adjust PT_LOAD alignment to increase the chances of shared libraries being mapped by THP [1]. As you've probably seen, I'm doing something similar in the kernel to improve it for executables [2].

> +			if (exec_folio_order())
> +				alignment = max(alignment,
> +					(unsigned long)PAGE_SIZE << exec_folio_order());

I’m curious, does it make sense to add some constraints here, like only increasing p_align when the segment length, virtual address, and file offset are all huge-aligned, as I did in my patch? This has come up several times in the glibc review, where increasing alignment was noted to reduce ASLR entropy.

[1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc

Thanks,
Rui
Re: [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
Posted by Usama Arif 3 weeks, 5 days ago

On 13/03/2026 17:42, WANG Rui wrote:
> Hi Usama,
> 

Hello!

> Glad to see you're pushing on this, I'm also following it. I first noticed this when rustc's perf regressed after a binutils upgrade. I'm trying to make ld.so THP-aware and adjust PT_LOAD alignment to increase the chances of shared libraries being mapped by THP [1]. As you've probably seen, I'm doing something similar in the kernel to improve it for executables [2].

For us it came about because we use 64K page size on ARM, and none of the
text sections were getting hugified (because the PMD size is 512M). I went with
exec_folio_order() = cont-pte size (2M) for 16K and 64K pages, as we get both the
page fault benefit (which might not be that important) and iTLB coverage (due to
cont-pte).
x86 already faults in at 2M (HPAGE_PMD_ORDER) via the force_thp_readahead path in
do_sync_mmap_readahead(), so the memory pressure introduced on ARM won't be worse
than what already exists on x86.

> 
>> +			if (exec_folio_order())
>> +				alignment = max(alignment,
>> +					(unsigned long)PAGE_SIZE << exec_folio_order());
> 
> I’m curious, does it make sense to add some constraints here, like only increasing p_align when the segment length, virtual address, and file offset are all huge-aligned, as I did in my patch? This has come up several times in the glibc review, where increasing alignment was noted to reduce ASLR entropy.
> 

Yes I think this makes sense!

Although maybe we should check all PT_LOAD segments. So maybe something
like the below on top of this series?

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 2d2b3e9fd474f..a0e83b541a7d8 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1116,10 +1116,30 @@ static int load_elf_binary(struct linux_binprm *bprm)
                         * the hardware cannot coalesce PTEs (e.g. arm64
                         * contpte) even though the physical memory and
                         * file offset are correctly aligned.
+                        *
+                        * Only increase alignment when at least one
+                        * PT_LOAD segment is large enough to contain a
+                        * full folio and has its file offset and virtual
+                        * address folio-aligned. This avoids reducing
+                        * ASLR entropy for small binaries that cannot
+                        * benefit from contpte mapping.
                         */
-                       if (exec_folio_order())
-                               alignment = max(alignment,
-                                       (unsigned long)PAGE_SIZE << exec_folio_order());
+                       if (exec_folio_order()) {
+                               unsigned long folio_sz = PAGE_SIZE << exec_folio_order();
+
+                               for (i = 0; i < elf_ex->e_phnum; i++) {
+                                       if (elf_phdata[i].p_type != PT_LOAD)
+                                               continue;
+                                       if (elf_phdata[i].p_filesz < folio_sz)
+                                               continue;
+                                       if (!IS_ALIGNED(elf_phdata[i].p_vaddr, folio_sz))
+                                               continue;
+                                       if (!IS_ALIGNED(elf_phdata[i].p_offset, folio_sz))
+                                               continue;
+                                       alignment = max(alignment, folio_sz);
+                                       break;
+                               }
+                       }
 
                        /**
                         * DOC: PIE handling

> [1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
> [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> 
> Thanks,
> Rui

Re: [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
Posted by hev 3 weeks, 5 days ago
On Sat, Mar 14, 2026 at 3:47 AM Usama Arif <usama.arif@linux.dev> wrote:
>
>
>
> On 13/03/2026 17:42, WANG Rui wrote:
> > Hi Usama,
> >
>
> Hello!
>
> > Glad to see you're pushing on this, I'm also following it. I first noticed this when rustc's perf regressed after a binutils upgrade. I'm trying to make ld.so THP-aware and adjust PT_LOAD alignment to increase the chances of shared libraries being mapped by THP [1]. As you've probably seen, I'm doing something similar in the kernel to improve it for executables [2].
>
> For us it came about because we use 64K page size on ARM, and none of the
> text sections were getting hugified (because the PMD size is 512M). I went with
> exec_folio_order() = cont-pte size (2M) for 16K and 64K pages, as we get both the
> page fault benefit (which might not be that important) and iTLB coverage (due to
> cont-pte).
> x86 already faults in at 2M (HPAGE_PMD_ORDER) via the force_thp_readahead path in
> do_sync_mmap_readahead(), so the memory pressure introduced on ARM won't be worse
> than what already exists on x86.
>
> >
> >> +                    if (exec_folio_order())
> >> +                            alignment = max(alignment,
> >> +                                    (unsigned long)PAGE_SIZE << exec_folio_order());
> >
> > I’m curious, does it make sense to add some constraints here, like only increasing p_align when the segment length, virtual address, and file offset are all huge-aligned, as I did in my patch? This has come up several times in the glibc review, where increasing alignment was noted to reduce ASLR entropy.
> >
>
> Yes I think this makes sense!
>
> Although maybe we should check all PT_LOAD segments. So maybe something
> like the below on top of this series?
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 2d2b3e9fd474f..a0e83b541a7d8 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1116,10 +1116,30 @@ static int load_elf_binary(struct linux_binprm *bprm)
>                          * the hardware cannot coalesce PTEs (e.g. arm64
>                          * contpte) even though the physical memory and
>                          * file offset are correctly aligned.
> +                        *
> +                        * Only increase alignment when at least one
> +                        * PT_LOAD segment is large enough to contain a
> +                        * full folio and has its file offset and virtual
> +                        * address folio-aligned. This avoids reducing
> +                        * ASLR entropy for small binaries that cannot
> +                        * benefit from contpte mapping.
>                          */
> -                       if (exec_folio_order())
> -                               alignment = max(alignment,
> -                                       (unsigned long)PAGE_SIZE << exec_folio_order());
> +                       if (exec_folio_order()) {
> +                               unsigned long folio_sz = PAGE_SIZE << exec_folio_order();
> +
> +                               for (i = 0; i < elf_ex->e_phnum; i++) {
> +                                       if (elf_phdata[i].p_type != PT_LOAD)
> +                                               continue;
> +                                       if (elf_phdata[i].p_filesz < folio_sz)
> +                                               continue;
> +                                       if (!IS_ALIGNED(elf_phdata[i].p_vaddr, folio_sz))
> +                                               continue;
> +                                       if (!IS_ALIGNED(elf_phdata[i].p_offset, folio_sz))
> +                                               continue;
> +                                       alignment = max(alignment, folio_sz);
> +                                       break;
> +                               }
> +                       }

I think this logic should live in maximum_alignment(), so we don't
have to walk the segments twice. It might be better to move it into a
separate helper, something like should_align_to_exec_folio()?

>
>                         /**
>                          * DOC: PIE handling
>
> > [1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
> > [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> >
> > Thanks,
> > Rui
>