From nobody Sun Jun 14 07:48:16 2026
From: Luka Bai
Date: Fri, 01 May 2026 13:55:42 +0800
Subject: [PATCH 1/5] mm: add basic madvise helpers and branch for THP setup
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20260501-thp_cow-v1-1-005377483738@tencent.com>
References: <20260501-thp_cow-v1-0-005377483738@tencent.com>
In-Reply-To: <20260501-thp_cow-v1-0-005377483738@tencent.com>
To: linux-mm@kvack.org
Cc: Jonathan Corbet, Shuah Khan, Andrew Morton, David Hildenbrand,
 Lorenzo Stoakes, Zi Yan, Baolin Wang, "Liam R. Howlett", Nico Pache,
 Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
 Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
 Arnd Bergmann, Kairui Song, linux-kernel@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, Luka Bai
X-Mailer: b4 0.15.2

From: Luka Bai

Transparent huge pages (THP) now work properly with most of the mm
framework and are well integrated with the folio concept, so they can be
reclaimed or allocated at a large order. However, their behavior is not
very predictable. For example, a THP is easily split on many paths, such
as partial mapping, swap-out, or fork + COW (for child processes). In
some cases, we want the outcome to be deterministic.
Some workloads expect a relatively "stable" THP, while others would
rather save memory than gain the performance benefits.

This patch adds basic helpers and a branch in the madvise path so that
madvise options can control, at the VMA level, what happens to a THP on
operations that may split it, such as COW or swap. The configuration
type is passed through the madvise behavior argument, decoded, and saved
in vma->vm_flags for later use.

Currently the only operation in the list is COW. It decides whether
huge pages are used for the child process when it writes to a shared
anonymous pmd, so that the THP is not split by the write.

This patch only adds the basic setup helpers; the real usage is added in
later patches.

Signed-off-by: Luka Bai
---
 include/linux/huge_mm.h                |  6 ++++++
 include/linux/mm.h                     | 19 +++++++++++++++++++
 include/uapi/asm-generic/mman-common.h |  9 +++++++++
 mm/madvise.c                           | 25 +++++++++++++++++++++++++
 4 files changed, 59 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 48496f09909b..a0ce8c0b81f5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@
 
 #include /* only for vma_is_dax() */
 #include
+#include
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -363,6 +364,11 @@ static inline bool thp_disabled_by_hw(void)
 	return transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED);
 }
 
+static inline bool madv_thp_cow(int behavior)
+{
+	return behavior & MADV_THP_COW;
+}
+
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1d76da6e0791..8a800819cfa2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -391,6 +391,10 @@ enum {
 #else
 	DECLARE_VMA_BIT_ALIAS(STACK, GROWSDOWN),
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	DECLARE_VMA_BIT(THP_SETUP_1, 43),
+	DECLARE_VMA_BIT_ALIAS(THP_COW, THP_SETUP_1),
+#endif
 };
 #undef DECLARE_VMA_BIT
 #undef DECLARE_VMA_BIT_ALIAS
@@ -510,6 +514,9 @@ enum {
 #define VM_DROPPABLE	VM_NONE
 #define VMA_DROPPABLE	EMPTY_VMA_FLAGS
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define VM_THP_COW	INIT_VM_FLAG(THP_COW)
+#endif
 
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ | VM_STACK_EARLY)
@@ -4128,6 +4135,18 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool madv_thp_behavior(int behavior)
+{
+	return behavior >= MADV_THP_SETUP_BASE && behavior < MADV_THP_SETUP_END;
+}
+#else
+static inline bool madv_thp_behavior(int behavior)
+{
+	return false;
+}
+#endif
+
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ef1c27fa3c57..1617ed374503 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -82,6 +82,15 @@
 #define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
 #define MADV_GUARD_REMOVE 103		/* unguard range */
 
+/* for THP setup */
+#define MADV_THP_SETUP_BASE	256
+enum {
+	MADV_THP_COW_BIT,
+	MADV_THP_SETUP_MAX_BIT,
+};
+#define MADV_THP_COW		(MADV_THP_SETUP_BASE + (1 << MADV_THP_COW_BIT))
+#define MADV_THP_SETUP_END	(MADV_THP_SETUP_BASE + (1 << MADV_THP_SETUP_MAX_BIT))
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..5dbfc89682d7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1331,6 +1331,25 @@ static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static vm_flags_t madvise_thp_setup(struct madvise_behavior *madv_behavior)
+{
+	int thp_behavior = madv_behavior->behavior - MADV_THP_SETUP_BASE;
+	struct vm_area_struct *vma = madv_behavior->vma;
+	vm_flags_t new_flags = vma->vm_flags;
+
+	if (madv_thp_cow(thp_behavior))
+		new_flags |= VM_THP_COW;
+
+	return new_flags;
+}
+#else
+static vm_flags_t madvise_thp_setup(struct madvise_behavior *madv_behavior)
+{
+	return madv_behavior->vma->vm_flags;
+}
+#endif
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1427,6 +1446,10 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 		break;
 	}
 
+	/* Handle THP behaviors */
+	if (madv_thp_behavior(behavior))
+		new_flags = madvise_thp_setup(madv_behavior);
+
 	/* This is a write operation.*/
 	VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
 
@@ -1555,6 +1578,8 @@ madvise_behavior_valid(int behavior)
 		return true;
 
 	default:
+		if (madv_thp_behavior(behavior))
+			return true;
 		return false;
 	}
 }
-- 
2.52.0

From nobody Sun Jun 14 07:48:16 2026
From: Luka Bai
Date: Fri, 01 May 2026 13:55:43 +0800
Subject: [PATCH 2/5] mm: add pmd level THP COW parameter in sysfs
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20260501-thp_cow-v1-2-005377483738@tencent.com>
References: <20260501-thp_cow-v1-0-005377483738@tencent.com>
In-Reply-To: <20260501-thp_cow-v1-0-005377483738@tencent.com>
To: linux-mm@kvack.org
Cc: Jonathan Corbet, Shuah Khan, Andrew Morton, David Hildenbrand,
 Lorenzo Stoakes, Zi Yan, Baolin Wang, "Liam R. Howlett", Nico Pache,
 Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
 Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
 Arnd Bergmann, Kairui Song, linux-kernel@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, Luka Bai
X-Mailer: b4 0.15.2

From: Luka Bai

THP COW follows the same scheme as anonymous huge pages and huge shmem
pages: the strategy is one of always, never, or madvise. With always,
THP COW is done for all existing THPs. With never, THP COW is never
done. With madvise, the per-VMA setup introduced in the previous commit
decides whether THP COW is done for each individual VMA.
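
As an illustration only, the three strategies can be modeled in user
space roughly as below. This is a minimal sketch, not kernel code: the
enum, thp_cow_parse() and thp_cow_wanted() are hypothetical names, and
plain strcmp() is a simplification that, unlike a sysfs store handler,
does not tolerate a trailing newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space model; these names are not kernel symbols. */
enum thp_cow_mode { THP_COW_NEVER, THP_COW_MADVISE, THP_COW_ALWAYS };

/* Parse a mode keyword; returns false for anything unrecognized. */
static bool thp_cow_parse(const char *s, enum thp_cow_mode *mode)
{
	assert(s != NULL && mode != NULL);

	if (strcmp(s, "always") == 0)
		*mode = THP_COW_ALWAYS;
	else if (strcmp(s, "madvise") == 0)
		*mode = THP_COW_MADVISE;
	else if (strcmp(s, "never") == 0)
		*mode = THP_COW_NEVER;
	else
		return false;

	return true;
}

/* Decide THP COW for one VMA; vma_opted_in models a prior madvise request. */
static bool thp_cow_wanted(enum thp_cow_mode mode, bool vma_opted_in)
{
	switch (mode) {
	case THP_COW_ALWAYS:
		return true;
	case THP_COW_MADVISE:
		return vma_opted_in;
	case THP_COW_NEVER:
	default:
		return false;
	}
}
```

In this model only the madvise mode consults the per-VMA opt-in; always
and never ignore it.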

Add TRANSPARENT_HUGEPAGE_COW_FLAG and
TRANSPARENT_HUGEPAGE_REQ_MADV_COW_FLAG, which closely mirror
TRANSPARENT_HUGEPAGE_FLAG and TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, the
flags that decide whether an anonymous huge page fault is attempted when
permitted. Also add the sysfs attribute thp_cow_attr as the interface
for choosing among the three strategies above.

Signed-off-by: Luka Bai
---
 .../testing/sysfs-kernel-mm-transparent-hugepage |  1 +
 Documentation/admin-guide/mm/transhuge.rst       | 27 +++++++++++++++
 include/linux/huge_mm.h                          |  2 ++
 mm/huge_memory.c                                 | 39 ++++++++++++++++++++++
 4 files changed, 69 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage b/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage
index 7bfbb9cc2c11..43a1af13efe0 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-transparent-hugepage
@@ -11,6 +11,7 @@ Description:
 			- khugepaged
 			- shmem_enabled
 			- use_zero_page
+			- thp_cow
 			- subdirectories of the form hugepages-kB, where
 			  is the page size of the hugepages supported by the kernel/CPU
 			  combination.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 0ef13c451ac8..0926651bad0d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -226,6 +226,33 @@ to "always" or "madvise"), and it'll be automatically shutdown when
 all THP sizes are disabled (when both the per-size anon control and the
 top-level control are "never")
 
+Some workloads may want to do copy on write on the pmd size to acquire the
+TLB benefit when they try to write on a shared anonymous pmd sized entry.
+They can do so by setting up the thp_cow control. The control is only enabled
+when the global THP controls are set to "always" or "madvise" for the
+specific memory region:
+
+::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/thp_cow
+	echo madvise >/sys/kernel/mm/transparent_hugepage/thp_cow
+	echo never >/sys/kernel/mm/transparent_hugepage/thp_cow
+
+always
+	means that the writing process will always do copy on write on
+	the pmd size. If there is no pmd sized folio available, it will
+	fall back to the pte size.
+
+madvise
+	will do things like ``always`` but only for regions that have
+	used madvise(MADV_THP_COW).
+
+never
+	will not do copy on write on the pmd size no matter what setup
+	is done using madvise. When a process writes on a shared anonymous
+	pmd sized entry, it will just allocate a pte sized page and do copy
+	on write on the pte size.
+
 process THP controls
 --------------------
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a0ce8c0b81f5..2a62f0f92f68 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -57,6 +57,8 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_COW_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_COW_FLAG,
 };
 
 struct kobject;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1f0d0b780943..babca060feca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -531,6 +531,44 @@ static ssize_t split_underused_thp_store(struct kobject *kobj,
 static struct kobj_attribute split_underused_thp_attr = __ATTR(
 	shrink_underused, 0644, split_underused_thp_show, split_underused_thp_store);
 
+static ssize_t thp_cow_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	const char *output;
+
+	if (test_bit(TRANSPARENT_HUGEPAGE_COW_FLAG, &transparent_hugepage_flags))
+		output = "[always] madvise never";
+	else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_COW_FLAG,
+			  &transparent_hugepage_flags))
+		output = "always [madvise] never";
+	else
+		output = "always madvise [never]";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thp_cow_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_COW_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_COW_FLAG, &transparent_hugepage_flags);
+	} else if (sysfs_streq(buf, "madvise")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_COW_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_COW_FLAG, &transparent_hugepage_flags);
+	} else if (sysfs_streq(buf, "never")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_COW_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_COW_FLAG, &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute thp_cow_attr = __ATTR_RW(thp_cow);
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
@@ -540,6 +578,7 @@ static struct attribute *hugepage_attr[] = {
 	&shmem_enabled_attr.attr,
 #endif
 	&split_underused_thp_attr.attr,
+	&thp_cow_attr.attr,
 	NULL,
 };
 
-- 
2.52.0

From nobody Sun Jun 14 07:48:16 2026
From: Luka Bai
Date: Fri, 01 May 2026 13:55:44 +0800
Subject: [PATCH 3/5] mm: add pmd level THP COW judgement helpers
X-Mailing-List: linux-kernel@vger.kernel.org
Message-Id: <20260501-thp_cow-v1-3-005377483738@tencent.com>
References: <20260501-thp_cow-v1-0-005377483738@tencent.com>
In-Reply-To: <20260501-thp_cow-v1-0-005377483738@tencent.com>
To: linux-mm@kvack.org
Cc: Jonathan Corbet, Shuah Khan, Andrew Morton, David Hildenbrand,
 Lorenzo Stoakes, Zi Yan, Baolin Wang, "Liam R. Howlett", Nico Pache,
 Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
 Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
 Arnd Bergmann, Kairui Song, linux-kernel@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, Luka Bai
X-Mailer: b4 0.15.2

From: Luka Bai

Add hugepage_cow_always and hugepage_cow_madvise as two convenience
helpers for deciding whether to do THP COW in each specific
circumstance. Also add the helper hugepage_cow_enabled, which makes the
effective setup easier to query. THP COW is only active when THP itself
is enabled, either globally or through madvise.

Signed-off-by: Luka Bai
---
 include/linux/huge_mm.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2a62f0f92f68..3e5c6da3905b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -203,6 +203,38 @@ static inline bool hugepage_global_always(void)
 (1<vm_flags;
+
+	/* anonymous THP need to be enabled first */
+	if (!hugepage_global_always() &&
+	    (!hugepage_global_enabled() || !(vm_flags & VM_HUGEPAGE)))
+		return false;
+
+	/* always enables all the THP COW */
+	if (hugepage_cow_always())
+		return true;
+
+	/* madvise enables THP cow only when vm_flags says so */
+	if (hugepage_cow_madvise() && (vm_flags & VM_THP_COW))
+		return true;
+
+	return false;
+}
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
-- 
2.52.0

From nobody Sun Jun 14 07:48:16 2026
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=uUSHoXHu; arc=none smtp.client-ip=57.103.75.83 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="uUSHoXHu" Received: from outbound.ms.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-3a-60-percent-8 (Postfix) with ESMTPS id 25775180013C; Fri, 01 May 2026 05:56:19 +0000 (UTC) X-ICL-Out-Info: HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhAA0MFWgFeAUEdXwFLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEFEYGVg1dE0wLcwRUB10FXVZQAlpLQBMESgZNXw5eHwQXRhlVBEceXVZeHhkCURxWDVdDVARfUEkMQVBsWgBHF0gdXRlZb1BdHA4EVAddBV1WUAJaS18ZXUUPXwdZBEAMSAJAQwNCL1oXREBBWh9EFEgDWARcBUQBSwReDytGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1777614980; x=1780206980; bh=NTc75O6iqIlRbMB1yKe5C3Mh2lstCHeNt4Mg/bpL3vw=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=uUSHoXHuJYBtBBLMFZ0mT71iSIQKAlT7OmyDAPpC/9pqXS5wGxAOnZTs5MWruPL/XTsLH88XtoXc4ypeefbWSKFEe9Gm4zQorK851OQ7FIJvKdEyEpuMT3L4hwa3hPpX2HmskD9Y38c43h3g+OVPQPlV55Bdlmil2r8eXpkWiTnQmGZCU0/IEe5U+BEch0IELcyIDlHF++IJfeitK+C0g0SiuI7um4MqcovRuqsZt7kkEU3Oct8fWofpbw+O1D0rnvgD7cm1RGaHk7oQ3CwCZxt7P3/RC+9/piZvQe5WNbP0dti3hYWOOfg5epZ6pQmcJYk68nGsrEcYhNuR55D9Xw== Received: from [127.0.0.1] (unknown [17.57.154.37]) by p00-icloudmta-asmtp-us-west-3a-60-percent-8 (Postfix) with ESMTPSA id 95CF7180013B; Fri, 01 May 2026 05:56:13 +0000 (UTC) From: Luka Bai Date: Fri, 01 May 2026 13:55:45 +0800 Subject: [PATCH 4/5] mm: enable map_anon_folio_pmd_nopf to handle unshare Precedence: bulk 
X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260501-thp_cow-v1-4-005377483738@tencent.com> References: <20260501-thp_cow-v1-0-005377483738@tencent.com> In-Reply-To: <20260501-thp_cow-v1-0-005377483738@tencent.com> To: linux-mm@kvack.org Cc: Jonathan Corbet , Shuah Khan , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Jann Horn , Arnd Bergmann , Kairui Song , linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1777614950; l=5934; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=+oyohCA+JyGfxecKQ+uLxgtNgUJn9qwmjDlmDdf3vA4=; b=o4S5WSyWmT9V8olRnoso6ocnjd/DjId8IpedIoTth013Ruy23KwTts2OKjnQPnmADSAcGkBiN zsQZBJhZ+OYAcm8T7fUoN1R6VaeYZUf1wNhB/0ly3aMkSCogw3ymlIz X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Proofpoint-GUID: rUdIpohAvfkWAFbvPlg88ZZnOitWks7N X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTAxMDA1MyBTYWx0ZWRfX+BMl8P4W0kcJ fhPM1yM5aQ9ptg08zmo3MwKy+HVVZj3LguMEemdbGKyevCga4TwGVEAxV3IwOVDlhrjaRja8Ytx zTj+TkKlXXJh5J+ulbqMUkXr3cdC0eLCodoqdLxi12JgZK8RhJ1Sbki9Yede/r11K+8fyolz0Jz nzzK/+GIFdtXWiVRF+r5KaeSrCKozVIin2hMHYQIquBziWURXPCSDvjXhxtQKeQo2rE5Vmyvnjv 8cb9UsCsSHBOiBoiSj5GGpmLD/BMrP//DlimIBhtCg+rgnik7CeY8aNVtH61yM0a3fDNSO8I8C+ HcHBi1yOQlb9FQsaaOLNyiqBNphHB4Y6ycQO6JhqZm5rARL8gA9A/bcnPNSpN4= X-Proofpoint-ORIG-GUID: rUdIpohAvfkWAFbvPlg88ZZnOitWks7N X-Authority-Info-Out: v=2.4 cv=aqm/yCZV c=1 sm=1 tr=0 ts=69f44083 cx=c_apl:c_pps:t_out a=qkKslKyYc0ctBTeLUVfTFg==:117 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 
a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=XO4sgXVVO50CB33MxqEA:9 a=QEXdDO2ut3YA:10 From: Luka Bai

map_anon_folio_pmd_nopf() maps a new anonymous folio at the PMD level. As in do_huge_pmd_anonymous_page(), it handles the mappings and statistics in one call. However, it does not support FAULT_FLAG_UNSHARE.

FAULT_FLAG_UNSHARE is set when we only want to break up a non-exclusive sharing. It follows the copy-on-write path because it makes the same decision COW does: either copy the memory or reuse the existing folio. The difference is that the fault is not caused by a write to a read-only pte/pmd that is actually permitted to be written; it exists purely for "unsharing". Hence, when duplicating, the new page table entry must carry the same permission and marker flags as the old one, without being made writable.

Currently, map_anon_folio_pmd_nopf() always tries to make the new pmd writable, which is not what unsharing wants.

Add unsharing support to map_anon_folio_pmd_nopf() by passing the vm_fault struct as a parameter and reading the unshare hint from it. In the unshare case, only copy the soft_dirty and uffd_wp flags into the new pmd instead of trying to make the new pmd writable.
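To illustrate the flag handling described above, here is a minimal userspace sketch in C. It is not kernel code: toy_pmd_t, the TOY_PMD_* bit values and toy_new_pmd_flags() are hypothetical names invented for this illustration, and real PMD bit layouts are architecture-specific. It assumes that maybe_pmd_mkwrite() sets the write bit only when the vma is writable, which is modelled here by the vma_writable argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy PMD entry (illustration only). */
#define TOY_PMD_WRITE      (1u << 0)
#define TOY_PMD_DIRTY      (1u << 1)
#define TOY_PMD_SOFT_DIRTY (1u << 2)
#define TOY_PMD_UFFD_WP    (1u << 3)

typedef uint32_t toy_pmd_t;

/*
 * Compute the flag bits of the new entry.
 *
 * orig         - flags of the entry being replaced
 * cow          - caller is on a copy-on-write path
 * unshare      - models FAULT_FLAG_UNSHARE being set
 * vma_writable - models the VM_WRITE check assumed for maybe_pmd_mkwrite()
 */
static toy_pmd_t toy_new_pmd_flags(toy_pmd_t orig, bool cow, bool unshare,
				   bool vma_writable)
{
	toy_pmd_t entry = 0;

	if (cow && unshare) {
		/* Unshare: carry the markers over, never grant write. */
		entry |= orig & (TOY_PMD_SOFT_DIRTY | TOY_PMD_UFFD_WP);
	} else {
		/* Regular path: mark dirty, writable only if the vma allows it. */
		entry |= TOY_PMD_DIRTY;
		if (vma_writable)
			entry |= TOY_PMD_WRITE;
	}
	return entry;
}
```

For example, toy_new_pmd_flags(TOY_PMD_SOFT_DIRTY, true, true, true) yields only TOY_PMD_SOFT_DIRTY: the marker survives and the write bit stays clear even though the vma is writable.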
Signed-off-by: Luka Bai --- include/linux/huge_mm.h | 5 ++--- mm/huge_memory.c | 34 +++++++++++++++++++++++----------- mm/khugepaged.c | 8 +++++++- 3 files changed, 32 insertions(+), 15 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 3e5c6da3905b..61f0e614ca52 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -610,9 +610,8 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, = unsigned long address, pmd_t *pmd, bool freeze); bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, struct folio *folio); -void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd, - struct vm_area_struct *vma, unsigned long haddr); - +void map_anon_folio_pmd_nopf(struct folio *folio, struct vm_fault *vmf, + bool cow); #else /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 static inline bool folio_test_pmd_mappable(struct folio *folio) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index babca060feca..1e661b411b2e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1423,13 +1423,26 @@ static struct folio *vma_alloc_anon_folio_pmd(struc= t vm_area_struct *vma, return folio; } =20 -void map_anon_folio_pmd_nopf(struct folio *folio, pmd_t *pmd, - struct vm_area_struct *vma, unsigned long haddr) +void map_anon_folio_pmd_nopf(struct folio *folio, struct vm_fault *vmf, + bool cow) { pmd_t entry; + struct vm_area_struct *vma =3D vmf->vma; + pmd_t *pmd =3D vmf->pmd; + pmd_t orig_pmd =3D vmf->orig_pmd; + unsigned long haddr =3D vmf->address & HPAGE_PMD_MASK; + const bool unshare =3D vmf->flags & FAULT_FLAG_UNSHARE; =20 entry =3D folio_mk_pmd(folio, vma->vm_page_prot); - entry =3D maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + if (unlikely(cow && unshare)) { + VM_WARN_ON(pmd_write(orig_pmd)); + if (pmd_soft_dirty(orig_pmd)) + entry =3D pmd_mksoft_dirty(entry); + if (pmd_uffd_wp(orig_pmd)) + entry =3D pmd_mkuffd_wp(entry); + } else { + entry =3D maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); + } 
folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE); folio_add_lru_vma(folio, vma); set_pmd_at(vma->vm_mm, haddr, pmd, entry); @@ -1437,19 +1450,18 @@ void map_anon_folio_pmd_nopf(struct folio *folio, p= md_t *pmd, deferred_split_folio(folio, false); } =20 -static void map_anon_folio_pmd_pf(struct folio *folio, pmd_t *pmd, - struct vm_area_struct *vma, unsigned long haddr) +static void map_anon_folio_pmd_pf(struct folio *folio, struct vm_fault *vm= f, + bool cow) { - map_anon_folio_pmd_nopf(folio, pmd, vma, haddr); - add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); + map_anon_folio_pmd_nopf(folio, vmf, cow); + add_mm_counter(vmf->vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); count_vm_event(THP_FAULT_ALLOC); count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC); - count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC); + count_memcg_event_mm(vmf->vma->vm_mm, THP_FAULT_ALLOC); } =20 static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) { - unsigned long haddr =3D vmf->address & HPAGE_PMD_MASK; struct vm_area_struct *vma =3D vmf->vma; struct folio *folio; pgtable_t pgtable; @@ -1483,7 +1495,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct= vm_fault *vmf) return ret; } pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); - map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr); + map_anon_folio_pmd_pf(folio, vmf, false); mm_inc_nr_ptes(vma->vm_mm); spin_unlock(vmf->ptl); } @@ -2174,7 +2186,7 @@ static vm_fault_t do_huge_zero_wp_pmd(struct vm_fault= *vmf) if (ret) goto release; (void)pmdp_huge_clear_flush(vma, haddr, vmf->pmd); - map_anon_folio_pmd_pf(folio, vmf->pmd, vma, haddr); + map_anon_folio_pmd_pf(folio, vmf, true); goto unlock; release: folio_put(folio); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 7d48d4fbd5f3..18d309b69d30 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1402,7 +1402,13 @@ static enum scan_result collapse_huge_page(struct mm= _struct *mm, unsigned long s if (is_pmd_order(order)) { /* PMD 
collapse */ pgtable =3D pmd_pgtable(_pmd); pgtable_trans_huge_deposit(mm, pmd, pgtable); - map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr); + struct vm_fault vmf =3D { + .vma =3D vma, + .flags =3D 0, + .address =3D pmd_addr, + .orig_pmd =3D pmdp_get(pmd), + }; + map_anon_folio_pmd_nopf(folio, &vmf, false); } else { /* mTHP collapse */ map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=3D*/ fals= e); smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */ --=20 2.52.0 From nobody Sun Jun 14 07:48:16 2026 Received: from outbound.ms.icloud.com (p-west3-cluster6-host11-snip4-3.eps.apple.com [57.103.75.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 89A392D6409 for ; Fri, 1 May 2026 05:56:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=57.103.75.26 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777614991; cv=none; b=LCRufHnMkGMs5s7VyUQsKJN956ZbXSkfZUKc7PMxxrXu7If3nFSvBkVN4i0TWUJW7stVggVGxgz/rXSCbNMLyUUfRvPxGfPc5VT8kR13RKXdi+OeUETRgbuniBX8NQksp4G5tkE4XNr4TZfjga+D/xARCOU7VP+yYAammh45YHY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777614991; c=relaxed/simple; bh=bQLkhy/GMOtgXMDL30CjqTGiS0lt2sy/G/NQIeeXbwM=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=KHuBB+H3Fj40b6p9U+koyYwO7KVpfr3H7nffazWAMwtdv1WSQXuffzB4t7PF1g/iTi3u8RJR7NTkc/AxDdSgGhNqncM/zplY7yE0b3g/m8+cVCHeNJ4pe8UvUuDWC/ksvd3rdr/hcAJRFz+73JKGBwDReWUb5dVrwSFd8pl+lLI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=rYMz/l11; arc=none smtp.client-ip=57.103.75.26 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) 
header.from=icloud.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="rYMz/l11" Received: from outbound.ms.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-3a-60-percent-8 (Postfix) with ESMTPS id 2C8B71800400; Fri, 01 May 2026 05:56:25 +0000 (UTC) X-ICL-Out-Info: HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhAA0MFWgFeAUEdXwFLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEFEYGVg1dE0wLcwRUB10FXVZQAlpLQBMESgZNXw5eHwQXRhlVBEceXVZeHhkCURxWDVdDVARfUEkMQVBsWgBHF0gdXRlZb1BdHA4EVAddBV1WUAJaS18ZXUUPXwdZBEAMSAJAQwNCL1oXREBBWh9FFEgDWARcBUQBSwReDytGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1777614989; x=1780206989; bh=7N+z8NTKNzTfnk4CNz4j1nHwx7t72Y+uiy3ixa6NunI=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=rYMz/l11sQIIMMiwoXLPBByVSm5dpjYPWZRJSvWpAIgolZazq6euFxPI+Jy6cZGj3YTX09Yj8Y+tGNUnIv4zGRR/mtHIw1VW5zRqwkKgK8reqJCqQRX1l5UreyKBT1KrekXN/KYn4/xDT4BdwVL43ajxYaqplk61mh+j7z/5JkarfUV+f+Tc+FTArQfIJ6WtMlMgbdJlobgyEz2yCdJaMs0QZBCxyGCWoxWwOk9ivTWwxyQZFMzrRM3bmy1BbW18Kf3C1i4eCdelH6Szz4SHaeSGfv7o2NvahqSexmAj1olEXz/ZPIhQGA5mXoQng5gkwjhB5ZGNpx9x7qwD0I0Cag== Received: from [127.0.0.1] (unknown [17.57.154.37]) by p00-icloudmta-asmtp-us-west-3a-60-percent-8 (Postfix) with ESMTPSA id 576791800401; Fri, 01 May 2026 05:56:19 +0000 (UTC) From: Luka Bai Date: Fri, 01 May 2026 13:55:46 +0800 Subject: [PATCH 5/5] mm: support choosing to do THP COW for anonymous pmd entry. 
Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260501-thp_cow-v1-5-005377483738@tencent.com> References: <20260501-thp_cow-v1-0-005377483738@tencent.com> In-Reply-To: <20260501-thp_cow-v1-0-005377483738@tencent.com> To: linux-mm@kvack.org Cc: Jonathan Corbet , Shuah Khan , Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Jann Horn , Arnd Bergmann , Kairui Song , linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1777614950; l=7099; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=PFGHhjK+TnZW9Z6ghtvzoKwlLu7cEjIWo9wxf6XDzvc=; b=t+G004ry0cf1bDvr6aUyhg1633TW+1DsSigwn6pJTC2qqMq4foBJlk0IbzzyT6hS4AnxrRphz X3f1dK9FyNIBzqzvDm8d74YSHMcxPBEcSGypXJBywwXXV2qBONCbLTI X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTAxMDA1NCBTYWx0ZWRfX0akgOBOh0fYa Q+E2GaHK68Txx9IL647qqsmdwS/up6JF9dDw2h1c4yqw0OEEwue545bPA67GdZ9svbu0bDB/emT rbXu4+/7p0yCRh/QRYJw26RwRrN9POINQnnACAhfqv6SGeKqavTwQpI5e7UfHhY31Vt9hqZnWqt xzQKwEbSe4JXtnOvH+v4FmHc84tPsSQSSiNmswPSwuddoyPGc//cJUcD77cT8sFl6yYe7rgTz1K Tq/9iJQXV66RLvaeYK12FWDNKL9WNQY1WveOXcsiA8ii9vv1TRMHTANtUU8+jrKPMRIlHYvih5A CB30DllqzbyAkv9dzL3GQukV5w6KCcgQpacjqX0YYY2eRecBbJgkejRJs3/wtI= X-Authority-Info-Out: v=2.4 cv=F9Bat6hN c=1 sm=1 tr=0 ts=69f4408b cx=c_apl:c_pps:t_out a=qkKslKyYc0ctBTeLUVfTFg==:117 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=bJ5UqtxypIKHvNYtotEA:9 a=QEXdDO2ut3YA:10 X-Proofpoint-ORIG-GUID: 
KfOtRRhL_HN1WoZjM0WIgLjaKfm-9Rka X-Proofpoint-GUID: KfOtRRhL_HN1WoZjM0WIgLjaKfm-9Rka From: Luka Bai

For PMD-mapped anonymous folios, we currently do not do COW for the whole region, because we don't want to copy and unshare the full PMD range on the first write fault. That holds for most workloads. However, it also means the pmd entry is split into 512 4K ptes in the child process after a write to part of the folio. For example, if processes A and B share a PMD-sized folio and B writes to a small region, B's pmd mapping is split into 511 4K ptes that still point to the original PMD-sized folio, plus one 4K pte that points to the new 4K page.

This is good for memory utilization, but it also makes the TLB gain from the pmd entry suddenly "vanish" after a single write, which causes an observable performance decrease in some workloads. It also adds some "uncertainty" to THP, since the split happens transparently in the COW scenario, which can cause trouble for users that need stable hugepages.

This patch adds support for PMD-sized COW of anonymous pages, controlled by a switch. The switch exists because in some scenarios performance matters more, while for other workloads the memory waste is less tolerable. The THP setup controls this configuration, either at the vma level or at the global level.

The patch is relatively simple. Add wp_huge_pmd_page_copy() to do the hugepage copy-on-write, performing the allocation, accounting and cache flushing just as the 4K path does. Use the reworked map_anon_folio_pmd_pf() to do the mapping, since it now properly supports FAULT_FLAG_UNSHARE.

Remove the early refcount check in do_huge_pmd_wp_page(): now that copying the PMD folio is supported, the subsequent folio_ref_count() check decides whether the folio can be used exclusively.
If not, we can always do copy on write for this folio just like in do_wp_page when THP COW is enabled. Signed-off-by: Luka Bai --- mm/huge_memory.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++++= ---- 1 file changed, 116 insertions(+), 9 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1e661b411b2e..a05a4456e5a2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -40,6 +40,7 @@ #include #include #include +#include =20 #include #include "internal.h" @@ -2196,6 +2197,94 @@ static vm_fault_t do_huge_zero_wp_pmd(struct vm_faul= t *vmf) return ret; } =20 +static vm_fault_t wp_huge_pmd_page_copy(struct vm_fault *vmf, struct folio= *old_folio) +{ + struct vm_area_struct *vma =3D vmf->vma; + struct mm_struct *mm =3D vma->vm_mm; + struct folio *new_folio =3D NULL; + struct page *new_page, *old_page; + unsigned long pmd_address =3D vmf->address & HPAGE_PMD_MASK; + struct mmu_notifier_range range; + vm_fault_t ret =3D 0; + int i; + + delayacct_wpcopy_start(); + + old_page =3D folio_page(old_folio, 0); + ret =3D vmf_anon_prepare(vmf); + if (unlikely(ret)) { + if (ret !=3D VM_FAULT_RETRY) + ret =3D VM_FAULT_FALLBACK; + goto out; + } + + new_folio =3D vma_alloc_anon_folio_pmd(vma, vmf->address); + if (unlikely(!new_folio)) { + ret =3D VM_FAULT_FALLBACK; + goto out; + } + + if (copy_user_large_folio(new_folio, old_folio, + pmd_address, vma)) { + ret =3D VM_FAULT_HWPOISON; + goto out; + } + + new_page =3D folio_page(new_folio, 0); + for (i =3D 0; i < HPAGE_PMD_NR; i++) + kmsan_copy_page_meta(new_page + i, old_page + i); + + __folio_mark_uptodate(new_folio); + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, + pmd_address, pmd_address + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + + spin_lock(vmf->ptl); + if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) { + update_mmu_cache_pmd(vma, pmd_address, vmf->pmd); + ret =3D 0; + goto out_unlock; + } + + flush_cache_range(vma, pmd_address, pmd_address + 
HPAGE_PMD_SIZE); + /* + * Clear the pmd entry and flush it first, before updating the + * pmd with the new entry, to keep TLBs on different CPUs in + * sync. + */ + (void)pmdp_huge_clear_flush(vma, pmd_address, vmf->pmd); + /* + * We just temporarily decrement the mm_counter here, and it will be adde= d back in + * map_anon_folio_pmd_pf below. + */ + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR); + map_anon_folio_pmd_pf(new_folio, vmf, true); + folio_remove_rmap_pmd(old_folio, old_page, vma); + + spin_unlock(vmf->ptl); + + mmu_notifier_invalidate_range_end(&range); + /* This put is for the folio_get() in the caller */ + folio_put(old_folio); + free_swap_cache(old_folio); + + /* This put is for decrementing refcount after we switch page table mappi= ng */ + folio_put(old_folio); + + delayacct_wpcopy_end(); + return 0; +out_unlock: + spin_unlock(vmf->ptl); + mmu_notifier_invalidate_range_end(&range); +out: + folio_put(old_folio); + if (new_folio) + folio_put(new_folio); + + delayacct_wpcopy_end(); + return ret; +} + vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) { const bool unshare =3D vmf->flags & FAULT_FLAG_UNSHARE; @@ -2204,12 +2293,13 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) struct page *page; unsigned long haddr =3D vmf->address & HPAGE_PMD_MASK; pmd_t orig_pmd =3D vmf->orig_pmd; + vm_fault_t ret; =20 vmf->ptl =3D pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); =20 if (is_huge_zero_pmd(orig_pmd)) { - vm_fault_t ret =3D do_huge_zero_wp_pmd(vmf); + ret =3D do_huge_zero_wp_pmd(vmf); =20 if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -2253,14 +2343,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) goto reuse; } =20 - /* - * See do_wp_page(): we can only reuse the folio exclusively if - * there are no additional references. Note that we always drain - * the LRU cache immediately after adding a THP. 
- */ - if (folio_ref_count(folio) > - 1 + folio_test_swapcache(folio) * folio_nr_pages(folio)) - goto unlock_fallback; if (folio_test_swapcache(folio)) folio_free_swap(folio); if (folio_ref_count(folio) =3D=3D 1) { @@ -2282,6 +2364,31 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) return 0; } =20 + /* + * Only do hugepage copy on write if the parameter setup supports it. + */ + if (!hugepage_cow_enabled(vma)) + goto unlock_fallback; + + /* + * For vma without a vm_ops(anonymous vma), there should not be VM_SHARED= or + * VM_MAYSHARE types. + */ + VM_WARN_ON_ONCE_VMA(vma->vm_flags & (VM_SHARED | VM_MAYSHARE), vma); + + folio_unlock(folio); + /* + * Copy on write branch here. + * We are about to unlock the ptl here, so we need to get folio before th= at + * in case the folio gets freed in the meantime. + */ + folio_get(folio); + spin_unlock(vmf->ptl); + ret =3D wp_huge_pmd_page_copy(vmf, folio); + if (ret & VM_FAULT_FALLBACK) + goto fallback; + return ret; + unlock_fallback: folio_unlock(folio); spin_unlock(vmf->ptl); --=20 2.52.0