[PATCH v4 0/2] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP

WANG Rui posted 2 patches 1 month ago
There is a newer version of this series
Posted by WANG Rui 1 month ago
Changes since [v3]:
* Fixed compilation failure under !CONFIG_TRANSPARENT_HUGEPAGE.
* No functional changes otherwise.

Changes since [v2]:
* Renamed align_to_pmd() to should_align_to_pmd().
* Added benchmark results to the commit message.

Changes since [v1]:
* Dropped the Kconfig option CONFIG_ELF_RO_LOAD_THP_ALIGNMENT.
* Moved the alignment logic into a helper align_to_pmd() for clarity.
* Improved the comment explaining why we skip the optimization
  when PMD_SIZE > 32MB.

When Transparent Huge Pages (THP) are enabled in "always" mode,
file-backed read-only mappings can be backed by PMD-sized huge pages
if they meet the alignment and size requirements.

For ELF executables loaded by the kernel ELF binary loader, PT_LOAD
segments are normally aligned according to p_align, which is often
only page-sized. As a result, large read-only segments that are
otherwise eligible may fail to be mapped using PMD-sized THP.

A segment is considered eligible if:

* THP is in "always" mode,
* it is not writable,
* both p_vaddr and p_offset are PMD-aligned,
* its file size is at least PMD_SIZE, and
* its existing p_align is smaller than PMD_SIZE.

To avoid excessive address space padding on systems with very large
PMD_SIZE values, this optimization is applied only when PMD_SIZE <= 32MB,
since requiring larger alignments would be unreasonable, especially on
32-bit systems with a much more limited virtual address space.
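
For readers following along, the eligibility rules and the 32MB cap above
can be modelled in user space roughly as below. This is only a sketch, not
the code from patch 2/2: the model_* names, struct model_phdr, and the
thp_always flag are made up for illustration, and the 2MB PMD size assumes
x86_64 with 4K pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel derives PMD_SIZE per arch. */
#define MODEL_PMD_SIZE   (UINT64_C(1) << 21)   /* 2MB: x86_64 with 4K pages */
#define MODEL_PMD_LIMIT  (UINT64_C(32) << 20)  /* 32MB cap from the cover letter */
#define MODEL_PF_W       0x2u                  /* same value as PF_W in <elf.h> */

/* Hypothetical stand-in for the program header fields the rules look at. */
struct model_phdr {
	uint32_t p_flags;
	uint64_t p_offset;
	uint64_t p_vaddr;
	uint64_t p_filesz;
	uint64_t p_align;
};

/* True if v is a multiple of a; a must be a non-zero power of two. */
static bool model_is_aligned(uint64_t v, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (v & (a - 1)) == 0;
}

/* Mirrors the bullet list: every condition must hold for the segment. */
static bool model_should_align_to_pmd(const struct model_phdr *ph,
				      bool thp_always, uint64_t pmd_size)
{
	if (!thp_always)
		return false;		/* THP must be in "always" mode */
	if (pmd_size > MODEL_PMD_LIMIT)
		return false;		/* avoid excessive address space padding */
	if (ph->p_flags & MODEL_PF_W)
		return false;		/* segment must not be writable */
	if (!model_is_aligned(ph->p_vaddr, pmd_size) ||
	    !model_is_aligned(ph->p_offset, pmd_size))
		return false;		/* vaddr and file offset both PMD-aligned */
	if (ph->p_filesz < pmd_size)
		return false;		/* at least one full PMD of file data */
	if (ph->p_align >= pmd_size)
		return false;		/* already aligned well enough */
	return true;
}
```

With these assumptions, a 6MB read-only text segment at p_vaddr 0x400000
and p_offset 0x200000 with p_align 0x1000 qualifies, while the same
segment is skipped when the PMD size is 512MB.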

This increases the likelihood that large text segments of ELF
executables are backed by PMD-sized THP, reducing TLB pressure and
improving performance for large binaries.

This only affects ELF executables loaded directly by the kernel
binary loader. Shared libraries loaded by user space (e.g. via the
dynamic linker) are not affected.

Benchmark

Machine: AMD Ryzen 9 7950X (x86_64)
Binutils: 2.46
GCC: 15.2.1 (built with -z,noseparate-code + --enable-host-pie)

Workload: building Linux v7.0-rc1 vmlinux with x86_64_defconfig.

                Without patch        With patch
instructions    8,246,133,611,932    8,246,025,137,750
cpu-cycles      8,001,028,142,928    7,565,925,107,502
itlb-misses     3,672,158,331        26,821,242
time elapsed    64.66 s              61.97 s

Instructions are basically unchanged. iTLB misses drop from ~3.67B to
~26.8M (~99.27% reduction), which results in a ~5.44% reduction in
cycles and ~4.18% shorter wall time for this workload.
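
The percentages can be re-derived from the counters in the table. A small
helper for that, with a hypothetical name, might look like:

```c
#include <assert.h>

/* Relative reduction from 'before' to 'after', in percent. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Feeding in the itlb-misses and cpu-cycles rows gives roughly 99.27% and
5.44%. The rounded times in the table (64.66 s and 61.97 s) give about
4.16%, so the ~4.18% figure was presumably computed from unrounded
timings.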

[v3]: https://lore.kernel.org/linux-fsdevel/20260310013958.103636-1-r@hev.cc
[v2]: https://lore.kernel.org/linux-fsdevel/20260304114727.384416-1-r@hev.cc
[v1]: https://lore.kernel.org/linux-fsdevel/20260302155046.286650-1-r@hev.cc

WANG Rui (2):
  huge_mm: add stubs for THP-disabled configs
  binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for
    THP

 fs/binfmt_elf.c         | 29 +++++++++++++++++++++++++++++
 include/linux/huge_mm.h | 10 ++++++++++
 2 files changed, 39 insertions(+)

-- 
2.53.0
Re: [PATCH v4 0/2] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
Posted by Baolin Wang 3 weeks, 5 days ago
CC Usama

On 3/10/26 11:11 AM, WANG Rui wrote:
> Changes since [v3]:
> * Fixed compilation failure under !CONFIG_TRANSPARENT_HUGEPAGE.
> * No functional changes otherwise.
> 
> Changes since [v2]:
> * Renamed align_to_pmd() to should_align_to_pmd().
> * Added benchmark results to the commit message.
> 
> Changes since [v1]:
> * Dropped the Kconfig option CONFIG_ELF_RO_LOAD_THP_ALIGNMENT.
> * Moved the alignment logic into a helper align_to_pmd() for clarity.
> * Improved the comment explaining why we skip the optimization
>    when PMD_SIZE > 32MB.
> 
> When Transparent Huge Pages (THP) are enabled in "always" mode,
> file-backed read-only mappings can be backed by PMD-sized huge pages
> if they meet the alignment and size requirements.
> 
> For ELF executables loaded by the kernel ELF binary loader, PT_LOAD
> segments are normally aligned according to p_align, which is often
> only page-sized. As a result, large read-only segments that are
> otherwise eligible may fail to be mapped using PMD-sized THP.
> 
> A segment is considered eligible if:
> 
> * THP is in "always" mode,
> * it is not writable,
> * both p_vaddr and p_offset are PMD-aligned,
> * its file size is at least PMD_SIZE, and
> * its existing p_align is smaller than PMD_SIZE.
> 
> To avoid excessive address space padding on systems with very large
> PMD_SIZE values, this optimization is applied only when PMD_SIZE <= 32MB,
> since requiring larger alignments would be unreasonable, especially on
> 32-bit systems with a much more limited virtual address space.
> 
> This increases the likelihood that large text segments of ELF
> executables are backed by PMD-sized THP, reducing TLB pressure and
> improving performance for large binaries.
> 
> This only affects ELF executables loaded directly by the kernel
> binary loader. Shared libraries loaded by user space (e.g. via the
> dynamic linker) are not affected.

Usama posted a similar patchset[1], and I think using exec_folio_order() 
for exec-segment alignment is reasonable. In your case, you can override 
exec_folio_order() to return a PMD‑sized order.

[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
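
To make the suggestion concrete: a folio order n covers 2^n base pages, so
a "PMD-sized order" is PMD_SHIFT - PAGE_SHIFT. The user-space sketch below
only models that arithmetic. The model_* helpers are hypothetical, and
whether the real exec_folio_order() hook accepts a PMD-sized order is not
established in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Order of a PMD-sized folio: log2(PMD_SIZE / PAGE_SIZE). */
static unsigned int model_pmd_order(unsigned int page_shift,
				    unsigned int pmd_shift)
{
	assert(pmd_shift >= page_shift);
	return pmd_shift - page_shift;
}

/* Alignment in bytes implied by a folio of the given order. */
static uint64_t model_exec_align(unsigned int page_shift, unsigned int order)
{
	assert(page_shift + order < 64);
	return UINT64_C(1) << (page_shift + order);
}
```

Assuming 4K pages and a 2MB PMD (page_shift 12, pmd_shift 21), the order
is 9 and the implied alignment is 0x200000, which matches the PMD_SIZE
alignment the cover letter asks for.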

> Benchmark
> 
> Machine: AMD Ryzen 9 7950X (x86_64)
> Binutils: 2.46
> GCC: 15.2.1 (built with -z,noseparate-code + --enable-host-pie)
> 
> Workload: building Linux v7.0-rc1 vmlinux with x86_64_defconfig.
> 
>                  Without patch        With patch
> instructions    8,246,133,611,932    8,246,025,137,750
> cpu-cycles      8,001,028,142,928    7,565,925,107,502
> itlb-misses     3,672,158,331        26,821,242
> time elapsed    64.66 s              61.97 s
> 
> Instructions are basically unchanged. iTLB misses drop from ~3.67B to
> ~26.8M (~99.27% reduction), which results in a ~5.44% reduction in
> cycles and ~4.18% shorter wall time for this workload.
> 
> [v3]: https://lore.kernel.org/linux-fsdevel/20260310013958.103636-1-r@hev.cc
> [v2]: https://lore.kernel.org/linux-fsdevel/20260304114727.384416-1-r@hev.cc
> [v1]: https://lore.kernel.org/linux-fsdevel/20260302155046.286650-1-r@hev.cc
> 
> WANG Rui (2):
>    huge_mm: add stubs for THP-disabled configs
>    binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for
>      THP
> 
>   fs/binfmt_elf.c         | 29 +++++++++++++++++++++++++++++
>   include/linux/huge_mm.h | 10 ++++++++++
>   2 files changed, 39 insertions(+)
> 

Re: [PATCH v4 0/2] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
Posted by hev 3 weeks, 5 days ago
On Fri, Mar 13, 2026 at 4:41 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> CC Usama
>
> Usama posted a similar patchset[1], and I think using exec_folio_order()
> for exec-segment alignment is reasonable. In your case, you can override
> exec_folio_order() to return a PMD‑sized order.
>
> [1]
> https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/

Thanks for the pointer!

Cheers,
Rui
Re: [PATCH v4 0/2] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
Posted by Usama Arif 3 weeks, 5 days ago

On 13/03/2026 11:41, Baolin Wang wrote:
> CC Usama
> 
> On 3/10/26 11:11 AM, WANG Rui wrote:
>> Changes since [v3]:
>> * Fixed compilation failure under !CONFIG_TRANSPARENT_HUGEPAGE.
>> * No functional changes otherwise.
>>
>> Changes since [v2]:
>> * Renamed align_to_pmd() to should_align_to_pmd().
>> * Added benchmark results to the commit message.
>>
>> Changes since [v1]:
>> * Dropped the Kconfig option CONFIG_ELF_RO_LOAD_THP_ALIGNMENT.
>> * Moved the alignment logic into a helper align_to_pmd() for clarity.
>> * Improved the comment explaining why we skip the optimization
>>    when PMD_SIZE > 32MB.
>>
>> When Transparent Huge Pages (THP) are enabled in "always" mode,
>> file-backed read-only mappings can be backed by PMD-sized huge pages
>> if they meet the alignment and size requirements.
>>
>> For ELF executables loaded by the kernel ELF binary loader, PT_LOAD
>> segments are normally aligned according to p_align, which is often
>> only page-sized. As a result, large read-only segments that are
>> otherwise eligible may fail to be mapped using PMD-sized THP.
>>
>> A segment is considered eligible if:
>>
>> * THP is in "always" mode,
>> * it is not writable,
>> * both p_vaddr and p_offset are PMD-aligned,
>> * its file size is at least PMD_SIZE, and
>> * its existing p_align is smaller than PMD_SIZE.
>>
>> To avoid excessive address space padding on systems with very large
>> PMD_SIZE values, this optimization is applied only when PMD_SIZE <= 32MB,
>> since requiring larger alignments would be unreasonable, especially on
>> 32-bit systems with a much more limited virtual address space.
>>
>> This increases the likelihood that large text segments of ELF
>> executables are backed by PMD-sized THP, reducing TLB pressure and
>> improving performance for large binaries.
>>
>> This only affects ELF executables loaded directly by the kernel
>> binary loader. Shared libraries loaded by user space (e.g. via the
>> dynamic linker) are not affected.
> 
> Usama posted a similar patchset[1], and I think using exec_folio_order() for exec-segment alignment is reasonable. In your case, you can override exec_folio_order() to return a PMD‑sized order.
> 
> [1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
> 

Thanks for the CC, Baolin! Happy to see someone else noticed the same issue!

Yeah, I agree. I think piggybacking off exec_folio_order() as done in [1] should be
the right approach.

I also think there may be a bug in do_sync_mmap_readahead() in how it handles
the mmap_miss counter, which needs to be fixed [2].


[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
[2] https://lore.kernel.org/all/20260310145406.3073394-3-usama.arif@linux.dev/

>> Benchmark
>>
>> Machine: AMD Ryzen 9 7950X (x86_64)
>> Binutils: 2.46
>> GCC: 15.2.1 (built with -z,noseparate-code + --enable-host-pie)
>>
>> Workload: building Linux v7.0-rc1 vmlinux with x86_64_defconfig.
>>
>>                  Without patch        With patch
>> instructions    8,246,133,611,932    8,246,025,137,750
>> cpu-cycles      8,001,028,142,928    7,565,925,107,502
>> itlb-misses     3,672,158,331        26,821,242
>> time elapsed    64.66 s              61.97 s
>>
>> Instructions are basically unchanged. iTLB misses drop from ~3.67B to
>> ~26.8M (~99.27% reduction), which results in a ~5.44% reduction in
>> cycles and ~4.18% shorter wall time for this workload.
>>
>> [v3]: https://lore.kernel.org/linux-fsdevel/20260310013958.103636-1-r@hev.cc
>> [v2]: https://lore.kernel.org/linux-fsdevel/20260304114727.384416-1-r@hev.cc
>> [v1]: https://lore.kernel.org/linux-fsdevel/20260302155046.286650-1-r@hev.cc
>>
>> WANG Rui (2):
>>    huge_mm: add stubs for THP-disabled configs
>>    binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for
>>      THP
>>
>>   fs/binfmt_elf.c         | 29 +++++++++++++++++++++++++++++
>>   include/linux/huge_mm.h | 10 ++++++++++
>>   2 files changed, 39 insertions(+)
>>
>