From nobody Thu Apr 9 05:46:21 2026
From: Usama Arif
To: Andrew Morton, ryan.roberts@arm.com, david@kernel.org
Cc: ajd@linux.ibm.com, anshuman.khandual@arm.com, apopple@nvidia.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com, brauner@kernel.org,
	catalin.marinas@arm.com, dev.jain@arm.com, jack@suse.cz, kees@kernel.org,
	kevin.brodsky@arm.com, lance.yang@linux.dev, Liam.Howlett@oracle.com,
	linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	lorenzo.stoakes@oracle.com, npache@redhat.com, rmclure@linux.ibm.com,
	Al Viro, will@kernel.org, willy@infradead.org, ziy@nvidia.com,
	hannes@cmpxchg.org, kas@kernel.org, shakeel.butt@linux.dev,
	kernel-team@meta.com, Usama Arif
Subject: [PATCH 1/4] arm64: request contpte-sized folios for exec memory
Date: Tue, 10 Mar 2026 07:51:14 -0700
Message-ID: <20260310145406.3073394-2-usama.arif@linux.dev>
In-Reply-To: <20260310145406.3073394-1-usama.arif@linux.dev>
References: <20260310145406.3073394-1-usama.arif@linux.dev>

exec_folio_order() was introduced [1] to request readahead of executable
file-backed pages at an arch-preferred folio order, so that the hardware
can coalesce contiguous PTEs into fewer iTLB entries (contpte).
The current implementation uses ilog2(SZ_64K >> PAGE_SHIFT), which
requests 64K folios. This is optimal for 4K base pages (where
CONT_PTES = 16, contpte size = 64K), but suboptimal for 16K and 64K
base pages:

Page size | Before (order) | After (order) | contpte
----------|----------------|---------------|----------------
4K        | 4 (64K)        | 4 (64K)       | Yes (unchanged)
16K       | 2 (64K)        | 7 (2M)        | Yes (new)
64K       | 0 (64K)        | 5 (2M)        | Yes (new)

For 16K pages, CONT_PTES = 128 and the contpte size is 2M (order 7).
For 64K pages, CONT_PTES = 32 and the contpte size is 2M (order 5).

Use ilog2(CONT_PTES) instead, which directly evaluates to the
contpte-aligned order for all page sizes. The worst-case waste is
bounded to one folio (up to 2MB - 64KB) at the end of the file, since
page_cache_ra_order() reduces the folio order near EOF to avoid
allocating past i_size.

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Signed-off-by: Usama Arif
---
 arch/arm64/include/asm/pgtable.h | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49bd..a1110a33acb35 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1600,12 +1600,11 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 /*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
+ * Request exec memory is read into pagecache in contpte-sized folios. The
+ * contpte size is the number of contiguous PTEs that the hardware can coalesce
+ * into a single iTLB entry: 64K for 4K pages, 2M for 16K and 64K pages.
  */
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+#define exec_folio_order() ilog2(CONT_PTES)
 
 static inline bool pud_sect_supported(void)
 {
-- 
2.47.3
From: Usama Arif
To: Andrew Morton, ryan.roberts@arm.com, david@kernel.org
Subject: [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
Date: Tue, 10 Mar 2026 07:51:15 -0700
Message-ID: <20260310145406.3073394-3-usama.arif@linux.dev>
In-Reply-To: <20260310145406.3073394-1-usama.arif@linux.dev>
References: <20260310145406.3073394-1-usama.arif@linux.dev>

The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access.
It is incremented by 1 on every page cache miss in
do_sync_mmap_readahead(), and decremented in two places:

- filemap_map_pages(): decremented by N for each of N pages successfully
  mapped via fault-around (pages found already in cache, evidence that
  readahead was useful). Only pages not in the workingset count as hits.

- do_async_mmap_readahead(): decremented by 1 when a page with
  PG_readahead is found in cache.

When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
arch-preferred folio orders for contpte mapping.

On arm64 with 64K base pages, both decrement paths are inactive:

1. filemap_map_pages() is never called because fault_around_pages
   (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
   requires fault_around_pages > 1. With only 1 page in the
   fault-around window, there is nothing "around" to map.

2. do_async_mmap_readahead() never fires for exec mappings because exec
   readahead sets async_size = 0, so no PG_readahead markers are placed.

With no decrements, mmap_miss increases monotonically past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent exec
readahead.

Fix this by moving the VM_EXEC readahead block above the mmap_miss
check. The exec readahead path is targeted: it reads a single folio at
the fault location with async_size = 0, not a speculative prefetch, so
the mmap_miss heuristic, designed to throttle wasteful speculative
readahead, should not gate it. The page would need to be faulted in
regardless; the only question is at what order.
[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Signed-off-by: Usama Arif
---
 mm/filemap.c | 72 ++++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4adab..c064f31ecec5a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3331,6 +3331,37 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		}
 	}
 
+	if (vm_flags & VM_EXEC) {
+		/*
+		 * Allow arch to request a preferred minimum folio order for
+		 * executable memory. This can often be beneficial to
+		 * performance if (e.g.) arm64 can contpte-map the folio.
+		 * Executable memory rarely benefits from readahead, due to its
+		 * random access nature, so set async_size to 0.
+		 *
+		 * Limit to the boundaries of the VMA to avoid reading in any
+		 * pad that might exist between sections, which would be a waste
+		 * of memory.
+		 *
+		 * This is targeted readahead (one folio at the fault location),
+		 * not speculative prefetch, so bypass the mmap_miss heuristic
+		 * which would otherwise disable it after MMAP_LOTSAMISS faults.
+		 */
+		struct vm_area_struct *vma = vmf->vma;
+		unsigned long start = vma->vm_pgoff;
+		unsigned long end = start + vma_pages(vma);
+		unsigned long ra_end;
+
+		ra->order = exec_folio_order();
+		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
+		ra->start = max(ra->start, start);
+		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
+		ra_end = min(ra_end, end);
+		ra->size = ra_end - ra->start;
+		ra->async_size = 0;
+		goto do_readahead;
+	}
+
 	if (!(vm_flags & VM_SEQ_READ)) {
 		/* Avoid banging the cache line if not needed */
 		mmap_miss = READ_ONCE(ra->mmap_miss);
@@ -3361,40 +3392,15 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		return fpin;
 	}
 
-	if (vm_flags & VM_EXEC) {
-		/*
-		 * Allow arch to request a preferred minimum folio order for
-		 * executable memory. This can often be beneficial to
-		 * performance if (e.g.) arm64 can contpte-map the folio.
-		 * Executable memory rarely benefits from readahead, due to its
-		 * random access nature, so set async_size to 0.
-		 *
-		 * Limit to the boundaries of the VMA to avoid reading in any
-		 * pad that might exist between sections, which would be a waste
-		 * of memory.
-		 */
-		struct vm_area_struct *vma = vmf->vma;
-		unsigned long start = vma->vm_pgoff;
-		unsigned long end = start + vma_pages(vma);
-		unsigned long ra_end;
-
-		ra->order = exec_folio_order();
-		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
-		ra->start = max(ra->start, start);
-		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
-		ra_end = min(ra_end, end);
-		ra->size = ra_end - ra->start;
-		ra->async_size = 0;
-	} else {
-		/*
-		 * mmap read-around
-		 */
-		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
-		ra->size = ra->ra_pages;
-		ra->async_size = ra->ra_pages / 4;
-		ra->order = 0;
-	}
+	/*
+	 * mmap read-around
+	 */
+	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+	ra->size = ra->ra_pages;
+	ra->async_size = ra->ra_pages / 4;
+	ra->order = 0;
 
+do_readahead:
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ractl._index = ra->start;
 	page_cache_ra_order(&ractl, ra);
-- 
2.47.3
From: Usama Arif
To: Andrew Morton, ryan.roberts@arm.com, david@kernel.org
Subject: [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
Date: Tue, 10 Mar 2026 07:51:16 -0700
Message-ID: <20260310145406.3073394-4-usama.arif@linux.dev>
In-Reply-To: <20260310145406.3073394-1-usama.arif@linux.dev>
References: <20260310145406.3073394-1-usama.arif@linux.dev>

For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.
Without proper virtual address alignment, the readahead patches that
allocate 2M folios with 2M-aligned file offsets and physical addresses
cannot benefit from contpte mapping. The contpte fold check in
contpte_set_ptes() requires the virtual address to be
CONT_PTE_SIZE-aligned, and since the misalignment from vma->vm_start is
constant across all folios in the VMA, no folio gets the contiguous PTE
bit set, resulting in zero iTLB coalescing benefit.

Fix this by bumping the ELF alignment to PAGE_SIZE << exec_folio_order()
when the arch defines a non-zero exec_folio_order(). This ensures
load_bias is aligned to the folio size, so that file-offset-aligned
folios map to properly aligned virtual addresses.

Signed-off-by: Usama Arif
---
 fs/binfmt_elf.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8e89cc5b28200..2d2b3e9fd474f 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 
 #ifndef ELF_COMPAT
 #define ELF_COMPAT 0
@@ -1106,6 +1107,20 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	/* Calculate any requested alignment. */
 	alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
 
+	/*
+	 * If the arch requested large folios for exec
+	 * memory via exec_folio_order(), ensure the
+	 * binary is mapped with sufficient alignment so
+	 * that virtual addresses of exec pages are
+	 * aligned to the folio boundary. Without this,
+	 * the hardware cannot coalesce PTEs (e.g. arm64
+	 * contpte) even though the physical memory and
+	 * file offset are correctly aligned.
+	 */
+	if (exec_folio_order())
+		alignment = max(alignment,
+				(unsigned long)PAGE_SIZE << exec_folio_order());
+
 	/**
 	 * DOC: PIE handling
 	 *
-- 
2.47.3
From: Usama Arif
To: Andrew Morton, ryan.roberts@arm.com, david@kernel.org
Subject: [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area
Date: Tue, 10 Mar 2026 07:51:17 -0700
Message-ID: <20260310145406.3073394-5-usama.arif@linux.dev>
In-Reply-To: <20260310145406.3073394-1-usama.arif@linux.dev>
References: <20260310145406.3073394-1-usama.arif@linux.dev>

thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs.
It attempts to align the virtual address for PMD_SIZE THP mappings, but
on arm64 with 64K base pages PMD_SIZE is 512M, which is too large for
typical shared library mappings, so the alignment always fails and falls
back to PAGE_SIZE. This means shared libraries loaded by ld.so via
mmap() get 64K-aligned virtual addresses, preventing contpte mapping
even when 2M folios are allocated with properly aligned file offsets and
physical addresses.

Add a fallback in thp_get_unmapped_area_vmflags() that tries
PAGE_SIZE << exec_folio_order() alignment (2M on arm64 with 64K pages)
when PMD_SIZE alignment fails. This is small enough that shared
libraries can qualify, enabling contpte mapping for their executable
segments.

This applies to all file-backed mappings (not just exec). Non-exec
file-backed mappings also benefit from contpte mapping when large folios
are used. Aligning all file-backed mappings ensures that any large folio
in the page cache can be contpte-mapped regardless of the mapping's
protection flags, reducing dTLB misses for read-heavy workloads.

The fallback is gated by exec_folio_order(), which returns 0 by default,
making this a no-op on architectures that don't define it.

Signed-off-by: Usama Arif
---
 mm/huge_memory.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74adf..1c9476a5ed51c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1242,6 +1242,23 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 	if (ret)
 		return ret;
 
+	/*
+	 * If the arch requested large folios for exec memory, try to align
+	 * to the folio size as a fallback. This is much smaller than PMD_SIZE
+	 * (e.g. 2M vs 512M on arm64 64K pages), so it succeeds for mappings
+	 * that are too small for PMD alignment. Proper alignment ensures that
+	 * the hardware can coalesce PTEs (e.g. arm64 contpte) when large
+	 * folios are mapped.
+	 */
+	if (exec_folio_order()) {
+		unsigned long folio_size = PAGE_SIZE << exec_folio_order();
+
+		ret = __thp_get_unmapped_area(filp, addr, len, off, flags,
+					      folio_size, vm_flags);
+		if (ret)
+			return ret;
+	}
+
 	return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
 					    vm_flags);
 }
-- 
2.47.3