From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
 kirill.shutemov@linux.intel.com
Cc: npache@redhat.com, ryan.roberts@arm.com, anshuman.khandual@arm.com,
 catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
 apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
 baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
 haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 01/17] khugepaged: Generalize alloc_charge_folio()
Date: Tue, 11 Feb 2025 16:43:10 +0530
Message-Id: <20250211111326.14295-2-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>
References: <20250211111326.14295-1-dev.jain@arm.com>

Pass order to alloc_charge_folio() and update mTHP statistics.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/huge_mm.h |  2 ++
 mm/huge_memory.c        |  4 ++++
 mm/khugepaged.c         | 17 +++++++++++------
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93e509b6c00e..ffe47785854a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..996e802543f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -680,6 +682,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5f0be134141e..4342003b1c33 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,21 +1074,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 }

 static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-			      struct collapse_control *cc)
+			      int order, struct collapse_control *cc)
 {
 	gfp_t gfp = (cc->is_khugepaged ?
		     alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE);
 	int node = hpage_collapse_find_target_node(cc);
 	struct folio *folio;

-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (order == HPAGE_PMD_ORDER)
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}

-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (order == HPAGE_PMD_ORDER)
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1125,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);

-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, HPAGE_PMD_ORDER, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;

@@ -1851,7 +1856,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));

-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, HPAGE_PMD_ORDER, cc);
 	if (result != SCAN_SUCCEED)
 		goto out;

-- 
2.30.2

From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 02/17] khugepaged: Generalize hugepage_vma_revalidate()
Date: Tue, 11 Feb 2025 16:43:11 +0530
Message-Id: <20250211111326.14295-3-dev.jain@arm.com>
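Conceptually, the accounting rule this patch introduces is: bump the
per-order mTHP counter always, and the legacy PMD-only vmstat event only
at PMD order. A small userspace model of that rule (the counter array and
the sample order values are this sketch's assumptions, not kernel types):

#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* assumption: 4K pages, 2M PMD */

static unsigned long thp_collapse_alloc;	/* legacy global event */
static unsigned long mthp_collapse_alloc[HPAGE_PMD_ORDER + 1];

/* Every success bumps the per-order counter; only a PMD-order
 * success also bumps the legacy THP_COLLAPSE_ALLOC event. */
static void account_collapse_alloc(int order)
{
	if (order == HPAGE_PMD_ORDER)
		thp_collapse_alloc++;
	mthp_collapse_alloc[order]++;
}

int main(void)
{
	account_collapse_alloc(9);	/* PMD-sized collapse */
	account_collapse_alloc(4);	/* 64K mTHP collapse */
	printf("legacy THP_COLLAPSE_ALLOC = %lu\n", thp_collapse_alloc);
	printf("order-4 collapse_alloc   = %lu\n", mthp_collapse_alloc[4]);
	return 0;
}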
After retaking the lock, we must check that the VMA is still suitable for
our scan order. Hence, generalize hugepage_vma_revalidate().

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4342003b1c33..3d105cacf855 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -919,7 +919,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)

 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   bool expect_anon,
-				   struct vm_area_struct **vmap,
+				   struct vm_area_struct **vmap, int order,
 				   struct collapse_control *cc)
 {
 	struct vm_area_struct *vma;
@@ -932,9 +932,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;

-	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
+	if (!thp_vma_suitable_order(vma, address, order))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1135,7 +1135,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;

 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1169,7 +1169,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2779,7 +2779,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		mmap_read_lock(mm);
 		mmap_locked = true;
 		result = hugepage_vma_revalidate(mm, addr, false, &vma,
-						 cc);
+						 HPAGE_PMD_ORDER, cc);
 		if (result != SCAN_SUCCEED) {
 			last_fail = result;
 			goto out_nolock;

-- 
2.30.2

From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 03/17] khugepaged: Generalize __collapse_huge_page_swapin()
Date: Tue, 11 Feb 2025 16:43:12 +0530
Message-Id: <20250211111326.14295-4-dev.jain@arm.com>
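The revalidation now re-checks suitability at the caller's order rather
than a hard-coded PMD_ORDER. A simplified userspace sketch of what an
order-suitability test of this kind checks; the struct, its field names,
and the bounds logic are illustrative assumptions, not the kernel's
thp_vma_suitable_order():

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Illustrative stand-in for a VMA: just its mapped address range. */
struct vma_range {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* An order fits if the naturally aligned block containing @addr lies
 * entirely inside the VMA (a simplified view of the real check). */
static bool order_suitable(struct vma_range *vma, unsigned long addr, int order)
{
	unsigned long size = 1UL << (PAGE_SHIFT + order);
	unsigned long haddr = addr & ~(size - 1);

	return haddr >= vma->vm_start && haddr + size <= vma->vm_end;
}

int main(void)
{
	struct vma_range vma = { 0x400000, 0x500000 };	/* 1M VMA */

	printf("order 9 (2M) fits: %d\n", order_suitable(&vma, 0x450000, 9));
	printf("order 4 (64K) fits: %d\n", order_suitable(&vma, 0x450000, 4));
	return 0;
}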
If any PTE in our scan range is a swap entry, then use do_swap_page() to
swap-in the corresponding folio.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3d105cacf855..221823c0d95f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -999,17 +999,17 @@ static int check_pmd_still_valid(struct mm_struct *mm,
  */
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
 				       struct vm_area_struct *vma,
-				       unsigned long haddr, pmd_t *pmd,
-				       int referenced)
+				       unsigned long addr, pmd_t *pmd,
+				       int referenced, int order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long address, end = addr + (PAGE_SIZE << order);
 	int result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;

-	for (address = haddr; address < end; address += PAGE_SIZE) {
+	for (address = addr; address < end; address += PAGE_SIZE) {
 		struct vm_fault vmf = {
 			.vma = vma,
 			.address = address,
@@ -1154,7 +1154,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * that case. Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+						     referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}

-- 
2.30.2
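The generalization is purely in the range bound: haddr + (HPAGE_PMD_NR *
PAGE_SIZE) becomes addr + (PAGE_SIZE << order), and the per-PTE fault
loop is unchanged. A quick sanity check of that bound, assuming 4K pages
and a demo start address:

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	unsigned long addr = 0x400000;	/* assumed scan start */
	int order = 4;			/* 64K worth of PTEs */
	unsigned long end = addr + (PAGE_SIZE << order);
	unsigned long address;
	int ptes = 0;

	/* Same shape as the swap-in walk: one step per PTE in the range. */
	for (address = addr; address < end; address += PAGE_SIZE)
		ptes++;
	printf("order %d: [%#lx, %#lx), %d PTEs\n", order, addr, end, ptes);
	return 0;
}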
From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 04/17] khugepaged: Generalize __collapse_huge_page_isolate()
Date: Tue, 11 Feb 2025 16:43:13 +0530
Message-Id: <20250211111326.14295-5-dev.jain@arm.com>

Scale down the scan range and the sysfs tunables (to be changed in
subsequent patches) according to the scan order, and isolate the folios.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 221823c0d95f..0ea99df115cb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
 					struct collapse_control *cc,
-					struct list_head *compound_pagelist)
+					struct list_head *compound_pagelist,
+					int order)
 {
-	struct page *page = NULL;
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	bool writable = false;
+	unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+	unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1UL << order);
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none(pteval) || (pte_present(pteval) &&
@@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -597,20 +599,19 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_PTE_UFFD_WP;
 			goto out;
 		}
-		page = vm_normal_page(vma, address, pteval);
-		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+		folio = vm_normal_folio(vma, address, pteval);
+		if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
 			result = SCAN_PAGE_NULL;
 			goto out;
 		}

-		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);

 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
 			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			    shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -1201,7 +1202,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      &compound_pagelist);
+						      &compound_pagelist, HPAGE_PMD_ORDER);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;

-- 
2.30.2
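Each tunable is scaled by 2^(HPAGE_PMD_ORDER - order). A table of what
that shift yields, assuming the usual upstream defaults for a 512-entry
PMD range (max_ptes_none=511, max_ptes_shared=256, max_ptes_swap=64;
treat these values as assumptions of the sketch):

#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* assumption: 4K pages, 2M PMD */

int main(void)
{
	/* Assumed defaults for a 512-PTE PMD range. */
	unsigned int max_ptes_none = 511, max_ptes_shared = 256,
		     max_ptes_swap = 64;
	int order;

	/* Print the per-order limits the scaling rule produces. */
	for (order = HPAGE_PMD_ORDER; order >= 2; order--)
		printf("order %d: none<=%u shared<=%u swap<=%u\n", order,
		       max_ptes_none >> (HPAGE_PMD_ORDER - order),
		       max_ptes_shared >> (HPAGE_PMD_ORDER - order),
		       max_ptes_swap >> (HPAGE_PMD_ORDER - order));
	return 0;
}

Note how the scaled limits round down to zero at low orders; the commit
message's "to be changed in subsequent patches" refers to refining
exactly this.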
From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 05/17] khugepaged: Generalize __collapse_huge_page_copy()
Date: Tue, 11 Feb 2025 16:43:14 +0530
Message-Id: <20250211111326.14295-6-dev.jain@arm.com>

Generalize folio copying, PTE clearing and the failure path.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0ea99df115cb..99eb1f72a508 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -712,13 +712,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist,
+						int order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;

-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1UL << order);
 	     _pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
@@ -765,7 +766,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 					     pmd_t *pmd,
 					     pmd_t orig_pmd,
 					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+					     struct list_head *compound_pagelist,
+					     int order)
 {
 	spinlock_t *pmd_ptl;

@@ -782,7 +784,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + (1UL << order), compound_pagelist);
 }

 /*
@@ -803,7 +805,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, int order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;

 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -830,10 +832,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,

 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 compound_pagelist, order);

 	return result;
 }
@@ -1232,7 +1234,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,

 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;

-- 
2.30.2
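The copy stage now walks 2^order pages and bails out on the first failed
page copy, after which the failure path reinstalls the old PTEs. A
userspace model of that loop shape (copy_ok() is a purely illustrative
stand-in for the kernel's poison-aware copy, not a real kernel helper):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for the poison-aware page copy; here it always succeeds,
 * but a real machine check would make it return false. */
static bool copy_ok(char *dst, const char *src)
{
	memcpy(dst, src, PAGE_SIZE);
	return true;
}

static int collapse_copy(char *dst, const char *src, int order)
{
	int i;

	for (i = 0; i < (1 << order); i++)
		if (!copy_ok(dst + i * PAGE_SIZE, src + i * PAGE_SIZE))
			return -1;	/* caller restores the old PTEs */
	return 0;
}

int main(void)
{
	static char src[16 * PAGE_SIZE], dst[16 * PAGE_SIZE];

	printf("order-4 copy %s\n",
	       collapse_copy(dst, src, 4) ? "failed" : "ok");
	return 0;
}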
From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 06/17] khugepaged: Abstract PMD-THP collapse
Date: Tue, 11 Feb 2025 16:43:15 +0530
Message-Id: <20250211111326.14295-7-dev.jain@arm.com>

Abstract away copying page contents, and setting the PMD, into
vma_collapse_anon_folio_pmd().

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 140 +++++++++++++++++++++++++++---------------------
 1 file changed, 78 insertions(+), 62 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99eb1f72a508..498cb5ad9ff1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1109,76 +1109,27 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 	return SCAN_SUCCEED;
 }

-static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
+		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+		struct folio *folio)
 {
 	LIST_HEAD(compound_pagelist);
-	pmd_t *pmd, _pmd;
-	pte_t *pte;
 	pgtable_t pgtable;
-	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
-	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	pmd_t _pmd;
+	pte_t *pte;

 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

-	/*
-	 * Before allocating the hugepage, release the mmap_lock read lock.
-	 * The allocation can take potentially a long time if it involves
-	 * sync compaction, and we do not need to hold the mmap_lock during
-	 * that. We will recheck the vma after taking it again in write mode.
-	 */
-	mmap_read_unlock(mm);
-
-	result = alloc_charge_folio(&folio, mm, HPAGE_PMD_ORDER, cc);
-	if (result != SCAN_SUCCEED)
-		goto out_nolock;
-
-	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
-	if (result != SCAN_SUCCEED) {
-		mmap_read_unlock(mm);
-		goto out_nolock;
-	}
-
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
-	if (result != SCAN_SUCCEED) {
-		mmap_read_unlock(mm);
-		goto out_nolock;
-	}
-
-	if (unmapped) {
-		/*
-		 * __collapse_huge_page_swapin will return with mmap_lock
-		 * released when it fails. So we jump out_nolock directly in
-		 * that case. Continuing to collapse causes inconsistency.
-		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced, HPAGE_PMD_ORDER);
-		if (result != SCAN_SUCCEED)
-			goto out_nolock;
-	}
-
-	mmap_read_unlock(mm);
-	/*
-	 * Prevent all access to pagetables with the exception of
-	 * gup_fast later handled by the ptep_clear_flush and the VM
-	 * handled by the anon_vma lock + PG_lock.
-	 *
-	 * UFFDIO_MOVE is prevented to race as well thanks to the
-	 * mmap_lock.
-	 */
-	mmap_write_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
 	if (result != SCAN_SUCCEED)
-		goto out_up_write;
+		goto out;
 	/* check if the pmd is still valid */
 	result = check_pmd_still_valid(mm, address, pmd);
 	if (result != SCAN_SUCCEED)
-		goto out_up_write;
+		goto out;

 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
@@ -1223,7 +1174,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
 		anon_vma_unlock_write(vma->anon_vma);
-		goto out_up_write;
+		goto out;
 	}

 	/*
@@ -1237,7 +1188,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
-		goto out_up_write;
+		goto out;

 	/*
 	 * The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1260,11 +1211,76 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	deferred_split_folio(folio, false);
 	spin_unlock(pmd_ptl);

-	folio = NULL;
-	result = SCAN_SUCCEED;
-out_up_write:
+out:
+	return result;
+}
+
+static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
+			      int referenced, int unmapped, int order,
+			      struct collapse_control *cc)
+{
+	struct vm_area_struct *vma;
+	int result = SCAN_FAIL;
+	struct folio *folio;
+	pmd_t *pmd;
+
+	/*
+	 * Before allocating the hugepage, release the mmap_lock read lock.
+	 * The allocation can take potentially a long time if it involves
+	 * sync compaction, and we do not need to hold the mmap_lock during
+	 * that. We will recheck the vma after taking it again in write mode.
+	 */
+	mmap_read_unlock(mm);
+
+	result = alloc_charge_folio(&folio, mm, order, cc);
+	if (result != SCAN_SUCCEED)
+		goto out_nolock;
+
+	mmap_read_lock(mm);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
+	if (result != SCAN_SUCCEED) {
+		mmap_read_unlock(mm);
+		goto out_nolock;
+	}
+
+	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	if (result != SCAN_SUCCEED) {
+		mmap_read_unlock(mm);
+		goto out_nolock;
+	}
+
+	if (unmapped) {
+		/*
+		 * __collapse_huge_page_swapin will return with mmap_lock
+		 * released when it fails. So we jump out_nolock directly in
+		 * that case. Continuing to collapse causes inconsistency.
+		 */
+		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
+						     referenced, order);
+		if (result != SCAN_SUCCEED)
+			goto out_nolock;
+	}
+
+	mmap_read_unlock(mm);
+	/*
+	 * Prevent all access to pagetables with the exception of
+	 * gup_fast later handled by the ptep_clear_flush and the VM
+	 * handled by the anon_vma lock + PG_lock.
+	 *
+	 * UFFDIO_MOVE is prevented to race as well thanks to the
+	 * mmap_lock.
+	 */
+	mmap_write_lock(mm);
+
+	if (order == HPAGE_PMD_ORDER)
+		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+
+	mmap_write_unlock(mm);
+
+	if (result == SCAN_SUCCEED)
+		folio = NULL;
+
 out_nolock:
 	if (folio)
 		folio_put(folio);
@@ -1440,7 +1456,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc);
+					    unmapped, HPAGE_PMD_ORDER, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		*mmap_locked = false;
 	}

-- 
2.30.2

From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 07/17] khugepaged: Scan PTEs order-wise
Date: Tue, 11 Feb 2025 16:43:16 +0530
Message-Id: <20250211111326.14295-8-dev.jain@arm.com>
Scan the PTEs order-wise, using the mask of suitable orders for this VMA
derived in conjunction with the sysfs THP settings. Scale down the
tunables (to be changed in subsequent patches); on collapse failure, drop
down to the next order. Otherwise, try to jump to the highest possible
order and then start a fresh scan. Note that madvise(MADV_COLLAPSE) has
not been generalized.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 97 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 83 insertions(+), 14 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 498cb5ad9ff1..fbfd8a78ef51 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -1295,36 +1296,57 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int result = SCAN_FAIL, referenced = 0;
-	int none_or_zero = 0, shared = 0;
-	struct page *page = NULL;
 	struct folio *folio = NULL;
-	unsigned long _address;
+	int result = SCAN_FAIL;
 	spinlock_t *ptl;
-	int node = NUMA_NO_NODE, unmapped = 0;
+	unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
+	int referenced, shared, none_or_zero, unmapped;
+	unsigned long _address, orig_address = address;
+	int node = NUMA_NO_NODE;
 	bool writable = false;
+	unsigned long orders, orig_orders;
+	int order, prev_order;

 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON);
+	orders = thp_vma_suitable_orders(vma, address, orders);
+	orig_orders = orders;
+	order = highest_order(orders);
+
+	/* MADV_COLLAPSE needs to work irrespective of sysfs setting */
+	if (!cc->is_khugepaged)
+		order = HPAGE_PMD_ORDER;
+
+scan_pte_range:
+
+	max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+	max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+	max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
+	referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
+
+	/* Check pmd after taking mmap lock */
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;

 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte) {
 		result = SCAN_PMD_NULL;
 		goto out;
 	}

-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_address = address, _pte = pte; _pte < pte + (1UL << order);
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
 			if (!cc->is_khugepaged ||
-			    unmapped <= khugepaged_max_ptes_swap) {
+			    unmapped <= max_ptes_swap) {
 				/*
 				 * Always be strict with uffd-wp
 				 * enabled swap entries. Please see
@@ -1345,7 +1367,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			     none_or_zero <= max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1369,12 +1391,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		if (pte_write(pteval))
 			writable = true;

-		page = vm_normal_page(vma, _address, pteval);
-		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+		folio = vm_normal_folio(vma, _address, pteval);
+		if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
 			result = SCAN_PAGE_NULL;
 			goto out_unmap;
 		}
-		folio = page_folio(page);

 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
@@ -1390,7 +1411,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
 			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			    shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out_unmap;
@@ -1447,7 +1468,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		result = SCAN_PAGE_RO;
 	} else if (cc->is_khugepaged &&
 		   (!referenced ||
-		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
+		    (unmapped && referenced < (1UL << order) / 2))) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1456,10 +1477,58 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, HPAGE_PMD_ORDER, cc);
+					    unmapped, order, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
 		*mmap_locked = false;
+		/* Skip over this range and decide order */
+		if (result == SCAN_SUCCEED)
+			goto decide_order;
+	}
+	if (result != SCAN_SUCCEED) {
+
+		/* Go to the next order */
+		prev_order = order;
+		order = next_order(&orders, order);
+		if (order < 2) {
+			/* Skip over this range, and decide order */
+			_address = address + (PAGE_SIZE << prev_order);
+			_pte = pte + (1UL << prev_order);
+			goto decide_order;
+		}
+		goto maybe_mmap_lock;
 	}
+
+decide_order:
+	/* Immediately exit on exhaustion of range */
+	if (_address == orig_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
+		goto out;
+
+	/* Get highest order possible starting from address */
+	order = count_trailing_zeros(_address >> PAGE_SHIFT);
+
+	orders = orig_orders & ((1UL << (order + 1)) - 1);
+	if (!(orders & (1UL << order)))
+		order = next_order(&orders, order);
+
+	/* This should never happen, since we are on an aligned address */
+	BUG_ON(cc->is_khugepaged && order < 2);
+
+	address = _address;
+	pte = _pte;
+
+maybe_mmap_lock:
+	if (!(*mmap_locked)) {
+		mmap_read_lock(mm);
+		*mmap_locked = true;
+		/* Validate VMA after retaking mmap_lock */
+		result = hugepage_vma_revalidate(mm, address, true, &vma,
+						 order, cc);
+		if (result != SCAN_SUCCEED) {
+			mmap_read_unlock(mm);
+			goto out;
+		}
+	}
+	goto scan_pte_range;
 out:
 	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
 				     none_or_zero, result, unmapped);

-- 
2.30.2
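The decide_order step restarts the scan at the largest order whose
natural alignment the next address satisfies, then masks that against
the enabled orders. A compact model of that computation, using the
compiler builtin in place of the kernel's count_trailing_zeros() and
assuming 4K pages (nonzero page-frame addresses, since ctz of zero is
undefined):

#include <stdio.h>

#define PAGE_SHIFT 12
#define HPAGE_PMD_ORDER 9

/* Highest order both aligned at @addr and still enabled in @orders
 * (a bitmask with bit N set when order N is allowed). */
static int pick_order(unsigned long addr, unsigned long orders)
{
	int order = __builtin_ctzl(addr >> PAGE_SHIFT);

	if (order > HPAGE_PMD_ORDER)
		order = HPAGE_PMD_ORDER;
	while (order >= 0 && !(orders & (1UL << order)))
		order--;	/* fall back to the next enabled order */
	return order;
}

int main(void)
{
	unsigned long orders = (1UL << 9) | (1UL << 4);	/* 2M and 64K */

	printf("at 0x200000: order %d\n", pick_order(0x200000, orders));
	printf("at 0x210000: order %d\n", pick_order(0x210000, orders));
	return 0;
}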
From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 08/17] khugepaged: Introduce vma_collapse_anon_folio()
Date: Tue, 11 Feb 2025 16:43:17 +0530
Message-Id: <20250211111326.14295-9-dev.jain@arm.com>

Similar to PMD collapse, take the write locks to stop pagetable walking.
Copy page contents, clear the PTEs, remove folio pins, and (try to)
unmap the old folios. Set the PTEs to the new folio using the set_ptes()
API.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fbfd8a78ef51..a674014b6563 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1217,6 +1217,96 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
 	return result;
 }

+/* Similar to the PMD case except we have to batch set the PTEs */
+static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
+		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+		struct folio *folio, int order)
+{
+	LIST_HEAD(compound_pagelist);
+	spinlock_t *pmd_ptl, *pte_ptl;
+	int result = SCAN_FAIL;
+	struct mmu_notifier_range range;
+	pmd_t _pmd;
+	pte_t *pte;
+	pte_t entry;
+	int nr_pages = folio_nr_pages(folio);
+	unsigned long haddress = address & HPAGE_PMD_MASK;
+
+	VM_BUG_ON(address & ((PAGE_SIZE << order) - 1));
+
+	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
+	if (result != SCAN_SUCCEED)
+		goto out;
+	result = check_pmd_still_valid(mm, address, pmd);
+	if (result != SCAN_SUCCEED)
+		goto out;
+
+	vma_start_write(vma);
+	anon_vma_lock_write(vma->anon_vma);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddress,
+				haddress + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	pmd_ptl = pmd_lock(mm, pmd);
+	_pmd = pmdp_collapse_flush(vma, haddress, pmd);
+	spin_unlock(pmd_ptl);
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_remove_table_sync_one();
+
+	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	if (pte) {
+		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      &compound_pagelist, order);
+		spin_unlock(pte_ptl);
+	} else {
+		result = SCAN_PMD_NULL;
+	}
+
+	if (unlikely(result != SCAN_SUCCEED)) {
+		if (pte)
+			pte_unmap(pte);
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		spin_unlock(pmd_ptl);
+		anon_vma_unlock_write(vma->anon_vma);
+		goto out;
+	}
+
+	anon_vma_unlock_write(vma->anon_vma);
+
+	result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
+					   vma, address, pte_ptl,
+					   &compound_pagelist, order);
+	pte_unmap(pte);
+	if (unlikely(result != SCAN_SUCCEED))
+		goto out;
+
+	__folio_mark_uptodate(folio);
+	entry = mk_pte(&folio->page, vma->vm_page_prot);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+	spin_lock(pte_ptl);
+	folio_ref_add(folio, nr_pages - 1);
+	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
+	folio_add_lru_vma(folio, vma);
+	set_ptes(mm, address, pte, entry, nr_pages);
+	spin_unlock(pte_ptl);
+	spin_lock(pmd_ptl);
+
+	/* See pmd_install() */
+	smp_wmb();
+	BUG_ON(!pmd_none(*pmd));
+	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+	update_mmu_cache_pmd(vma, haddress, pmd);
+	spin_unlock(pmd_ptl);
+
+	result = SCAN_SUCCEED;
+out:
+	return result;
+}
+
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped, int order,
 			      struct collapse_control *cc)
@@ -1276,6 +1366,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,

 	if (order == HPAGE_PMD_ORDER)
 		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+	else
+		result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);

 	mmap_write_unlock(mm);

-- 
2.30.2
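Where the PMD path installs a single huge PMD, this path batch-installs
2^order PTEs with set_ptes(), after taking the extra nr_pages - 1
references in one folio_ref_add() call. A userspace model of the batched
fill (the pfn-only pte encoding is this demo's assumption, not the
kernel's pte format):

#include <stdio.h>

typedef unsigned long demo_pte_t;	/* demo encoding: just the PFN */

/* Model of set_ptes(): write @nr consecutive entries, advancing the
 * PFN by one page each time, like the batched kernel helper. */
static void set_ptes_model(demo_pte_t *ptep, demo_pte_t first,
			   unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		ptep[i] = first + i;
}

int main(void)
{
	demo_pte_t ptes[16] = { 0 };

	set_ptes_model(ptes, 0x1000, 16);	/* order-4 folio at PFN 0x1000 */
	printf("pte[0]=%#lx pte[15]=%#lx\n", ptes[0], ptes[15]);
	return 0;
}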
From: Dev Jain <dev.jain@arm.com>
Subject: [PATCH v2 09/17] khugepaged: Define collapse policy if a larger
 folio is already mapped
Date: Tue, 11 Feb 2025 16:43:18 +0530
Message-Id: <20250211111326.14295-10-dev.jain@arm.com>

As noted in [1], khugepaged's goal must be to collapse memory to the
highest aligned order possible. Suppose khugepaged is scanning for 64K,
and we have a 128K folio whose first 64K half is VA-PA aligned and fully
mapped. In such a case, it does not make sense to break this down into
two 64K folios. On the other hand, if the first half is not aligned, or
is only partially mapped, it does make sense for khugepaged to collapse
that portion into a VA-PA aligned, fully mapped 64K folio.

[1] https://lore.kernel.org/all/aa647830-cf55-48f0-98c2-8230796e35b3@arm.com/
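The policy keys off three properties of the scanned range: every PTE
present, PFNs consecutive, and the first PFN naturally aligned to the
scan order (plus a same-folio check, omitted here). A userspace model of
that classification over an array of PFNs, with 0 standing in for a
non-present entry (an encoding assumed by this sketch):

#include <stdbool.h>
#include <stdio.h>

/* Classify a run of 2^order PFNs the way the scan does: skip the range
 * (it is already a well-placed mapping) only if every entry is present,
 * the run is contiguous, and the first PFN is aligned to the order. */
static bool already_well_placed(const unsigned long *pfn, int order)
{
	unsigned long n = 1UL << order;
	unsigned long i;

	if (pfn[0] == 0 || (pfn[0] & (n - 1)))
		return false;		/* not present or not aligned */
	for (i = 1; i < n; i++)
		if (pfn[i] != pfn[0] + i)
			return false;	/* hole or non-contiguous PFN */
	return true;
}

int main(void)
{
	unsigned long aligned[4] = { 0x100, 0x101, 0x102, 0x103 };
	unsigned long holey[4]   = { 0x100, 0x101, 0, 0x103 };

	printf("aligned run: skip=%d\n", already_well_placed(aligned, 2));
	printf("holey run:   skip=%d\n", already_well_placed(holey, 2));
	return 0;
}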
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a674014b6563..0d0d8f415a2e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -34,6 +34,7 @@ enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_PMD_NONE,
 	SCAN_PMD_MAPPED,
+	SCAN_PTE_MAPPED_THP,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -562,6 +563,14 @@ static bool is_refcount_suitable(struct folio *folio)
 	return folio_ref_count(folio) == expected_refcount;
 }

+/* Assumes an embedded PFN */
+static bool is_same_folio(pte_t *first_pte, pte_t *last_pte)
+{
+	struct folio *folio1 = page_folio(pte_page(ptep_get(first_pte)));
+	struct folio *folio2 = page_folio(pte_page(ptep_get(last_pte)));
+	return folio1 == folio2;
+}
+
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
@@ -575,13 +584,22 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	bool writable = false;
 	unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
 	unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+	bool all_pfns_present = true;
+	bool all_pfns_contig = true;
+	bool first_pfn_aligned = true;
+	pte_t prev_pteval;

 	for (_pte = pte; _pte < pte + (1UL << order);
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
+		if (_pte == pte) {
+			if (!IS_ALIGNED(pte_pfn(pteval), (1UL << order)))
+				first_pfn_aligned = false;
+		}
 		if (pte_none(pteval) || (pte_present(pteval) &&
 				is_zero_pfn(pte_pfn(pteval)))) {
 			++none_or_zero;
+			all_pfns_present = false;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
 			     none_or_zero <= max_ptes_none)) {
@@ -660,6 +678,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}

+		if (all_pfns_contig && (pte != _pte) && !(all_pfns_present &&
+		    (pte_pfn(pteval) == pte_pfn(prev_pteval) + 1)))
+			all_pfns_contig = false;
+
+		prev_pteval = pteval;
+
 		/*
 		 * Isolate the page to avoid collapsing an hugepage
 		 * currently in use by the VM.
@@ -696,6 +720,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
                 result = SCAN_PAGE_RO;
         } else if (unlikely(cc->is_khugepaged && !referenced)) {
                 result = SCAN_LACK_REFERENCED_PAGE;
+        } else if ((result == SCAN_SUCCEED) && (order != HPAGE_PMD_ORDER) && all_pfns_present &&
+                   all_pfns_contig && first_pfn_aligned &&
+                   is_same_folio(pte, pte + (1UL << order) - 1)) {
+                result = SCAN_PTE_MAPPED_THP;
         } else {
                 result = SCAN_SUCCEED;
                 trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
@@ -1398,6 +1426,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         bool writable = false;
         unsigned long orders, orig_orders;
         int order, prev_order;
+        bool all_pfns_present, all_pfns_contig, first_pfn_aligned;
+        pte_t prev_pteval;
 
         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1417,6 +1447,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
         max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
         referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
+        all_pfns_present = true, all_pfns_contig = true, first_pfn_aligned = true;
 
         /* Check pmd after taking mmap lock */
         result = find_pmd_or_thp_or_none(mm, address, &pmd);
@@ -1435,8 +1466,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         for (_address = address, _pte = pte; _pte < pte + (1UL << order);
              _pte++, _address += PAGE_SIZE) {
                 pte_t pteval = ptep_get(_pte);
+                if (_pte == pte) {
+                        if (!IS_ALIGNED(pte_pfn(pteval), (1UL << order)))
+                                first_pfn_aligned = false;
+                }
+
                 if (is_swap_pte(pteval)) {
                         ++unmapped;
+                        all_pfns_present = false;
                         if (!cc->is_khugepaged ||
                             unmapped <= max_ptes_swap) {
                                 /*
@@ -1457,6 +1494,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                 }
                 if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
                         ++none_or_zero;
+                        all_pfns_present = false;
                         if (!userfaultfd_armed(vma) &&
                             (!cc->is_khugepaged ||
                              none_or_zero <= max_ptes_none)) {
@@ -1546,6 +1584,17 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                         goto out_unmap;
                 }
 
+
+                /*
+                 * PFNs not contig, if either at least one PFN not present, or the previous
+                 * and this PFN are not contig
+                 */
+                if (all_pfns_contig && (pte != _pte) && !(all_pfns_present &&
+                    (pte_pfn(pteval) == pte_pfn(prev_pteval) + 1)))
+                        all_pfns_contig = false;
+
+                prev_pteval = pteval;
+
                 /*
                  * If collapse was initiated by khugepaged, check that there is
                  * enough young pte to justify collapsing the page
@@ -1567,15 +1616,30 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         }
 out_unmap:
         pte_unmap_unlock(pte, ptl);
+
+        /*
+         * We skip if the following conditions are true:
+         * 1) All PTEs point to consecutive PFNs
+         * 2) All PFNs belong to the same folio
+         * 3) The PFNs are PA-aligned to the order we are scanning for
+         */
+        if ((result == SCAN_SUCCEED) && (order != HPAGE_PMD_ORDER) && all_pfns_present &&
+            all_pfns_contig && first_pfn_aligned &&
+            is_same_folio(pte, pte + (1UL << order) - 1)) {
+                result = SCAN_PTE_MAPPED_THP;
+                goto decide_order;
+        }
+
         if (result == SCAN_SUCCEED) {
                 result = collapse_huge_page(mm, address, referenced,
                                             unmapped, order, cc);
                 /* collapse_huge_page will return with the mmap_lock released */
                 *mmap_locked = false;
                 /* Skip over this range and decide order */
-                if (result == SCAN_SUCCEED)
+                if (result == SCAN_SUCCEED || result == SCAN_PTE_MAPPED_THP)
                         goto decide_order;
         }
+        if (result != SCAN_SUCCEED) {
 
                 /* Go to the next order */
-- 
2.30.2

From: Dev Jain
Subject: [PATCH v2 10/17] khugepaged: Exit early on fully-mapped aligned mTHP
Date: Tue, 11 Feb 2025 16:43:19 +0530
Message-Id: <20250211111326.14295-11-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

Since the mTHP orders under consideration by khugepaged are also
candidates for the fault handler, the case we hit frequently is that
khugepaged scans a region for order-x when an order-x folio has already
been installed there by the fault handler. Exit early in that case; this
prevents a timeout in the khugepaged selftest. Earlier this was not a
problem, because a PMD-sized hugepage is caught by
find_pmd_or_thp_or_none(); the previous patch does not solve it either,
since it still performs the entire PTE scan before exiting.
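For illustration (not part of the patch): a toy model of the early exit,
with invented types. The point is that when the folio mapped by the first
PTE already has the scan order (and is fully mapped), the whole
1 << order range is skipped without touching the remaining PTEs.

  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_SIZE 4096UL

  struct folio { int order; bool partially_mapped; };

  /* Returns the address at which the scan should resume. */
  static unsigned long scan_range(unsigned long addr, int order,
                                  const struct folio *first_folio)
  {
          if (first_folio && first_folio->order == order &&
              !first_folio->partially_mapped) {
                  /* the fault handler beat us to it: jump past the range */
                  return addr + (PAGE_SIZE << order);
          }
          /* ... otherwise fall through to the full per-PTE scan ... */
          return addr;
  }

  int main(void)
  {
          struct folio f = { .order = 4, .partially_mapped = false };

          printf("resume at +%lu bytes\n", scan_range(0, 4, &f));
          return 0;
  }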
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0d0d8f415a2e..baa5b44968ac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -626,6 +626,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
                 VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+                if (_pte == pte && (order != HPAGE_PMD_ORDER) && (folio_order(folio) == order) &&
+                    test_bit(PG_head, &folio->page.flags) && !folio_test_partially_mapped(folio)) {
+                        result = SCAN_PTE_MAPPED_THP;
+                        goto out;
+                }
                 /* See hpage_collapse_scan_pmd(). */
                 if (folio_likely_mapped_shared(folio)) {
                         ++shared;
@@ -1532,6 +1537,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                         goto out_unmap;
                 }
 
+                /* Exit early: there is a high chance of this due to faulting */
+                if (_pte == pte && (order != HPAGE_PMD_ORDER) && (folio_order(folio) == order) &&
+                    test_bit(PG_head, &folio->page.flags) && !folio_test_partially_mapped(folio)) {
+                        pte_unmap_unlock(pte, ptl);
+                        _address = address + (PAGE_SIZE << order);
+                        _pte = pte + (1UL << order);
+                        result = SCAN_PTE_MAPPED_THP;
+                        goto decide_order;
+                }
+
                 /*
                  * We treat a single page as shared if any part of the THP
                  * is shared. "False negatives" from
-- 
2.30.2

From: Dev Jain
Subject: [PATCH v2 11/17] khugepaged: Enable sysfs to control order of collapse
Date: Tue, 11 Feb 2025 16:43:20 +0530
Message-Id: <20250211111326.14295-12-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

Activate khugepaged for anonymous collapse if even a single mTHP order is
enabled. This condition will be refined by subsequent patches.
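For illustration (not part of the patch): the gist of the new check is
that khugepaged should start when *any* anonymous order is enabled,
rather than testing PMD_ORDER alone. A sketch with invented bitmask
variables standing in for the kernel's huge_anon_orders_* masks:

  #include <stdbool.h>
  #include <stdio.h>

  /* per-size sysfs controls, one bit per order (stand-ins) */
  static unsigned long anon_orders_always, anon_orders_madvise,
                       anon_orders_inherit;
  static bool global_enabled;

  static bool anon_collapse_enabled(void)
  {
          /* old logic: test_bit(PMD_ORDER, ...); new logic: any order counts */
          return anon_orders_always || anon_orders_madvise ||
                 (anon_orders_inherit && global_enabled);
  }

  int main(void)
  {
          anon_orders_always = 1UL << 4;  /* only a 64K (order-4) size enabled */
          printf("khugepaged should run: %d\n", anon_collapse_enabled());
          return 0;
  }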
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index baa5b44968ac..37cfa7beba3d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -415,24 +415,20 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
                test_bit(MMF_DISABLE_THP, &mm->flags);
 }
 
-static bool hugepage_pmd_enabled(void)
+static bool thp_enabled(void)
 {
         /*
          * We cover the anon, shmem and the file-backed case here; file-backed
          * hugepages, when configured in, are determined by the global control.
-         * Anon pmd-sized hugepages are determined by the pmd-size control.
+         * Anon mTHPs are determined by the per-size control.
          * Shmem pmd-sized hugepages are also determined by its pmd-size control,
          * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
          */
         if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
             hugepage_global_enabled())
                 return true;
-        if (test_bit(PMD_ORDER, &huge_anon_orders_always))
-                return true;
-        if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
-                return true;
-        if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
-            hugepage_global_enabled())
+        if (huge_anon_orders_always || huge_anon_orders_madvise ||
+            (huge_anon_orders_inherit && hugepage_global_enabled()))
                 return true;
         if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
                 return true;
@@ -475,9 +471,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
                           unsigned long vm_flags)
 {
         if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
-            hugepage_pmd_enabled()) {
-                if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
-                                            PMD_ORDER))
+            thp_enabled()) {
+                if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
+                                             THP_ORDERS_ALL_ANON))
                         __khugepaged_enter(vma->vm_mm);
         }
 }
@@ -2679,8 +2675,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                         progress++;
                         break;
                 }
-                if (!thp_vma_allowable_order(vma, vma->vm_flags,
-                                        TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+                if (!thp_vma_allowable_orders(vma, vma->vm_flags,
+                                        TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON)) {
 skip:
                         progress++;
                         continue;
@@ -2704,6 +2700,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                           khugepaged_scan.address + HPAGE_PMD_SIZE > hend);
                 if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
+                        if (!thp_vma_allowable_order(vma, vma->vm_flags,
+                                        TVA_ENFORCE_SYSFS, PMD_ORDER))
+                                break;
+
                         struct file *file = get_file(vma->vm_file);
                         pgoff_t pgoff = linear_page_index(vma,
                                         khugepaged_scan.address);
@@ -2782,7 +2782,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 
 static int khugepaged_has_work(void)
 {
-        return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+        return !list_empty(&khugepaged_scan.mm_head) && thp_enabled();
 }
 
 static int khugepaged_wait_event(void)
@@ -2855,7 +2855,7 @@ static void khugepaged_wait_work(void)
                 return;
         }
 
-        if (hugepage_pmd_enabled())
+        if (thp_enabled())
                 wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
 }
 
@@ -2886,7 +2886,7 @@ static void set_recommended_min_free_kbytes(void)
         int nr_zones = 0;
         unsigned long recommended_min;
 
-        if (!hugepage_pmd_enabled()) {
+        if (!thp_enabled()) {
                 calculate_min_free_kbytes();
                 goto update_wmarks;
         }
@@ -2936,7 +2936,7 @@ int start_stop_khugepaged(void)
         int err = 0;
 
         mutex_lock(&khugepaged_mutex);
-        if (hugepage_pmd_enabled()) {
+        if (thp_enabled()) {
                 if (!khugepaged_thread)
                         khugepaged_thread = kthread_run(khugepaged, NULL,
                                                         "khugepaged");
@@ -2962,7 +2962,7 @@ int start_stop_khugepaged(void)
 void khugepaged_min_free_kbytes_update(void)
 {
         mutex_lock(&khugepaged_mutex);
-        if (hugepage_pmd_enabled() && khugepaged_thread)
+        if (thp_enabled() && khugepaged_thread)
                 set_recommended_min_free_kbytes();
         mutex_unlock(&khugepaged_mutex);
 }
-- 
2.30.2
From: Dev Jain
Subject: [PATCH v2 12/17] khugepaged: Enable variable-sized VMA collapse
Date: Tue, 11 Feb 2025 16:43:21 +0530
Message-Id: <20250211111326.14295-13-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

Applications in general may have many VMAs smaller than PMD size.
Therefore it is essential that khugepaged be able to collapse these VMAs
as well.
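For illustration (not part of the patch): a userspace model of the
per-VMA order pick done below in khugepaged_scan_mm_slot() — walk the
enabled orders from highest to lowest until a naturally aligned block of
that order still fits between the scan address and the VMA end. The
helper names are invented; __builtin_clzl is a gcc/clang builtin.

  #include <stdio.h>

  #define PAGE_SIZE 4096UL

  static unsigned long round_down_to(unsigned long x, unsigned long a)
  {
          return x & ~(a - 1);
  }

  /* orders: bitmask of enabled orders; returns the chosen order or -1 */
  static int pick_order(unsigned long *orders, unsigned long addr,
                        unsigned long vm_end)
  {
          while (*orders) {
                  int order = 63 - __builtin_clzl(*orders);  /* highest set bit */
                  unsigned long hend = round_down_to(vm_end, PAGE_SIZE << order);

                  if (addr <= hend)
                          return order;
                  *orders &= ~(1UL << order);  /* drop it, try the next lower */
          }
          return -1;
  }

  int main(void)
  {
          unsigned long orders = (1UL << 9) | (1UL << 4);  /* 2M and 64K */

          /* scan address past the last 2M boundary in the VMA: fall back to 64K */
          printf("picked order %d\n", pick_order(&orders, 0x210000, 0x300000));
          return 0;
  }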
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 68 +++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 37cfa7beba3d..048f990d8507 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1413,7 +1413,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                                    struct vm_area_struct *vma,
                                    unsigned long address, bool *mmap_locked,
-                                   struct collapse_control *cc)
+                                   unsigned long orders, struct collapse_control *cc)
 {
         pmd_t *pmd;
         pte_t *pte, *_pte;
@@ -1425,22 +1425,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         unsigned long _address, orig_address = address;
         int node = NUMA_NO_NODE;
         bool writable = false;
-        unsigned long orders, orig_orders;
+        unsigned long orig_orders;
         int order, prev_order;
         bool all_pfns_present, all_pfns_contig, first_pfn_aligned;
         pte_t prev_pteval;
 
-        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-                        TVA_IN_PF | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON);
-        orders = thp_vma_suitable_orders(vma, address, orders);
         orig_orders = orders;
         order = highest_order(orders);
-
-        /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
-        if (!cc->is_khugepaged)
-                order = HPAGE_PMD_ORDER;
+        VM_BUG_ON(address & ((PAGE_SIZE << order) - 1));
 
 scan_pte_range:
 
@@ -1667,7 +1659,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 
 decide_order:
         /* Immediately exit on exhaustion of range */
-        if (_address == orig_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
+        if (_address == orig_address + (PAGE_SIZE << (highest_order(orig_orders))))
                 goto out;
 
         /* Get highest order possible starting from address */
@@ -2636,6 +2628,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
         struct mm_struct *mm;
         struct vm_area_struct *vma;
         int progress = 0;
+        unsigned long orders;
+        int order;
+        bool is_file_vma;
 
         VM_BUG_ON(!pages);
         lockdep_assert_held(&khugepaged_mm_lock);
@@ -2675,19 +2670,40 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                         progress++;
                         break;
                 }
-                if (!thp_vma_allowable_orders(vma, vma->vm_flags,
-                                        TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON)) {
+                orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+                                        TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON);
+                if (!orders) {
 skip:
                         progress++;
                         continue;
                 }
-                hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
-                hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
+
+                /* We can collapse anonymous VMAs less than PMD_SIZE */
+                is_file_vma = IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma);
+                if (is_file_vma) {
+                        order = HPAGE_PMD_ORDER;
+                        if (!(orders & (1UL << order)))
+                                goto skip;
+                        hend = round_down(vma->vm_end, PAGE_SIZE << order);
+                } else {
+                        /* select the highest possible order for the VMA */
+                        order = highest_order(orders);
+                        while (orders) {
+                                hend = round_down(vma->vm_end, PAGE_SIZE << order);
+                                if (khugepaged_scan.address <= hend)
+                                        break;
+                                order = next_order(&orders, order);
+                        }
+                }
+                if (!orders)
+                        goto skip;
                 if (khugepaged_scan.address > hend)
                         goto skip;
+                hstart = round_up(vma->vm_start, PAGE_SIZE << order);
                 if (khugepaged_scan.address < hstart)
                         khugepaged_scan.address = hstart;
-                VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+                VM_BUG_ON(khugepaged_scan.address & ((PAGE_SIZE << order) - 1));
 
                 while (khugepaged_scan.address < hend) {
                         bool mmap_locked = true;
@@ -2697,13 +2713,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                                 goto breakouterloop;
 
                         VM_BUG_ON(khugepaged_scan.address < hstart ||
-                                  khugepaged_scan.address + HPAGE_PMD_SIZE >
+                                  khugepaged_scan.address + (PAGE_SIZE << order) >
                                   hend);
-                        if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
-                                if (!thp_vma_allowable_order(vma, vma->vm_flags,
-                                                TVA_ENFORCE_SYSFS, PMD_ORDER))
-                                        break;
-
+                        if (is_file_vma) {
                                 struct file *file = get_file(vma->vm_file);
                                 pgoff_t pgoff = linear_page_index(vma,
                                                 khugepaged_scan.address);
@@ -2725,15 +2737,15 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                                 }
                         } else {
                                 *result = hpage_collapse_scan_pmd(mm, vma,
-                                                khugepaged_scan.address, &mmap_locked, cc);
+                                                khugepaged_scan.address, &mmap_locked, orders, cc);
                         }
 
                         if (*result == SCAN_SUCCEED)
                                 ++khugepaged_pages_collapsed;
 
                         /* move to next address */
-                        khugepaged_scan.address += HPAGE_PMD_SIZE;
-                        progress += HPAGE_PMD_NR;
+                        khugepaged_scan.address += (PAGE_SIZE << order);
+                        progress += (1UL << order);
                         if (!mmap_locked)
                                 /*
                                  * We released mmap_lock so break loop.  Note
@@ -3060,7 +3072,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
                 fput(file);
         } else {
                 result = hpage_collapse_scan_pmd(mm, vma, addr,
-                                                 &mmap_locked, cc);
+                                                 &mmap_locked,
+                                                 BIT(HPAGE_PMD_ORDER),
+                                                 cc);
         }
         if (!mmap_locked)
                 *prev = NULL;  /* Tell caller we dropped mmap_lock */
-- 
2.30.2

From: Dev Jain
Subject: [PATCH v2 13/17] khugepaged: Lock all VMAs mapping the PTE table
Date: Tue, 11 Feb 2025 16:43:22 +0530
Message-Id: <20250211111326.14295-14-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

Now that khugepaged handles VMAs of any size, it may happen that the
process faults on a VMA other than the one under collapse, with both VMAs
spanning the same PTE table. The fault handler would then install a new
PTE table after khugepaged has isolated the old one. Therefore, scan the
PTE table's range, retrieve all VMAs mapping it, and write-lock them.
Note that rmap can still reach the PTE table from folios not under
collapse; this is fine, since it interferes neither with the PTEs nor
the folios under collapse, nor can rmap fill the PMD.
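For illustration (not part of the patch): the locking walk added below,
modelled in userspace over a flat VMA array; every VMA intersecting the
PMD-sized range covered by the PTE table gets write-locked, not just the
VMA under collapse. Types and the lock field are invented stand-ins.

  #include <stdio.h>

  #define PAGE_SIZE      4096UL
  #define HPAGE_PMD_SIZE (4096UL << 9)

  struct vma { unsigned long start, end; int write_locked; };

  /* stand-in for vma_lookup(): VMA containing addr, or NULL */
  static struct vma *vma_lookup(struct vma *vmas, int n, unsigned long addr)
  {
          for (int i = 0; i < n; i++)
                  if (vmas[i].start <= addr && addr < vmas[i].end)
                          return &vmas[i];
          return NULL;
  }

  static void take_vma_locks_per_pte(struct vma *vmas, int n, unsigned long haddr)
  {
          unsigned long start = haddr, end = haddr + HPAGE_PMD_SIZE;

          while (start < end) {
                  struct vma *vma = vma_lookup(vmas, n, start);
                  if (!vma) {              /* hole: step one page and retry */
                          start += PAGE_SIZE;
                          continue;
                  }
                  vma->write_locked = 1;   /* stand-in for vma_start_write() */
                  start = vma->end;        /* jump to the end of this VMA */
          }
  }

  int main(void)
  {
          struct vma vmas[] = {            /* two VMAs sharing one PTE table */
                  { 0x200000, 0x280000, 0 },
                  { 0x290000, 0x400000, 0 },
          };

          take_vma_locks_per_pte(vmas, 2, 0x200000);
          printf("locked: %d %d\n", vmas[0].write_locked, vmas[1].write_locked);
          return 0;
  }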
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 048f990d8507..e1c2c5b89f6d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1139,6 +1139,23 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
         return SCAN_SUCCEED;
 }
 
+static void take_vma_locks_per_pte(struct mm_struct *mm, unsigned long haddress)
+{
+        struct vm_area_struct *vma;
+        unsigned long start = haddress;
+        unsigned long end = haddress + HPAGE_PMD_SIZE;
+
+        while (start < end) {
+                vma = vma_lookup(mm, start);
+                if (!vma) {
+                        start += PAGE_SIZE;
+                        continue;
+                }
+                vma_start_write(vma);
+                start = vma->vm_end;
+        }
+}
+
 static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
                 struct vm_area_struct *vma, struct collapse_control *cc,
                 pmd_t *pmd, struct folio *folio)
@@ -1270,7 +1287,9 @@ static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
         if (result != SCAN_SUCCEED)
                 goto out;
 
-        vma_start_write(vma);
+        /* Faulting may fill the PMD after flush; lock all VMAs mapping this PTE */
+        take_vma_locks_per_pte(mm, haddress);
+
         anon_vma_lock_write(vma->anon_vma);
 
         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddress,
-- 
2.30.2
From: Dev Jain
Subject: [PATCH v2 14/17] khugepaged: Reset scan address to correct alignment
Date: Tue, 11 Feb 2025 16:43:23 +0530
Message-Id: <20250211111326.14295-15-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

There are two situations:

1) After retaking the mmap lock, the next VMA expands downwards.

2) After khugepaged sleeps and starts again, it picks up the starting
   address from the global struct khugepaged_scan, and hence picks up
   the same VMA as in the previous cycle.

In both cases, khugepaged_scan.address > hstart. Therefore, explicitly
align the address to the order we are scanning for. Previously this was
not a problem, since the address was always PMD-aligned.
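For illustration (not part of the patch): the effect of the added
else-branch, shown as plain arithmetic — a carried-over scan address that
is no longer aligned to the chosen order is rounded down before scanning
resumes. The function name is invented for this sketch.

  #include <stdio.h>

  #define PAGE_SIZE 4096UL

  static unsigned long align_scan_address(unsigned long addr,
                                          unsigned long hstart, int order)
  {
          unsigned long size = PAGE_SIZE << order;

          if (addr < hstart)
                  return hstart;
          return addr & ~(size - 1);  /* round_down(addr, PAGE_SIZE << order) */
  }

  int main(void)
  {
          /* address left over from a previous cycle, not 64K (order-4) aligned */
          unsigned long addr = 0x253000;

          printf("scan resumes at 0x%lx\n",
                 align_scan_address(addr, 0x200000, 4));  /* -> 0x250000 */
          return 0;
  }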
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e1c2c5b89f6d..7c9a758f6817 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2722,6 +2722,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                 hstart = round_up(vma->vm_start, PAGE_SIZE << order);
                 if (khugepaged_scan.address < hstart)
                         khugepaged_scan.address = hstart;
+                else
+                        khugepaged_scan.address = round_down(khugepaged_scan.address, PAGE_SIZE << order);
+
                 VM_BUG_ON(khugepaged_scan.address & ((PAGE_SIZE << order) - 1));
 
                 while (khugepaged_scan.address < hend) {
-- 
2.30.2

From: Dev Jain
Subject: [PATCH v2 15/17] khugepaged: Delay cond_resched()
Date: Tue, 11 Feb 2025 16:43:24 +0530
Message-Id: <20250211111326.14295-16-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

After scanning VMAs smaller than PMD size, cond_resched() may be called
as often as once per 1 << order PTEs scanned, whereas earlier it was
called once per PMD-worth of scanning. Therefore, manually enforce the
previous behaviour; not doing so causes the khugepaged selftest to time
out.
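For illustration (not part of the patch): the throttling pattern in
miniature — yield only after another PMD-worth (512 pages) of progress
has accumulated, regardless of how small the per-iteration step is.

  #include <stdio.h>

  #define HPAGE_PMD_NR 512

  int main(void)
  {
          int progress = 0, prev_progress = 0, yields = 0;

          /* 32 iterations of an order-4 scan: 16 pages of progress each */
          for (int i = 0; i < 32; i++) {
                  progress += 16;
                  if (progress - prev_progress >= HPAGE_PMD_NR) {
                          yields++;  /* stand-in for cond_resched() */
                          prev_progress = progress;
                  }
          }
          printf("%d pages scanned, %d yields\n", progress, yields);
          return 0;
  }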
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7c9a758f6817..d2bb008b95e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2650,6 +2650,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
         unsigned long orders;
         int order;
         bool is_file_vma;
+        int prev_progress = 0;
 
         VM_BUG_ON(!pages);
         lockdep_assert_held(&khugepaged_mm_lock);
@@ -2730,7 +2731,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                 while (khugepaged_scan.address < hend) {
                         bool mmap_locked = true;
 
-                        cond_resched();
+                        if (progress - prev_progress >= HPAGE_PMD_NR) {
+                                cond_resched();
+                                prev_progress = progress;
+                        }
                         if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
                                 goto breakouterloop;
 
-- 
2.30.2

From: Dev Jain
Subject: [PATCH v2 16/17] khugepaged: Implement strict policy for mTHP collapse
Date: Tue, 11 Feb 2025 16:43:25 +0530
Message-Id: <20250211111326.14295-17-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

As noted in the discussion thread ending at [1], avoid the creep problem
by collapsing to mTHPs only if max_ptes_none is zero or 511. Along with
this, make the mTHP collapse conditions stricter by removing the scaling
of max_ptes_shared and max_ptes_swap, and consider collapse only if there
are no shared or swap PTEs in the range.

[1] https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
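For illustration (not part of the patch): the policy predicate in two
lines, plus the per-order scaling it guards. 0 and 511 are the only
values for which the scaled-down max_ptes_none stays self-consistent
across orders (0 stays 0; 511 scales to "all but one PTE" at every
order). The function name is invented for this sketch.

  #include <stdbool.h>
  #include <stdio.h>

  #define HPAGE_PMD_ORDER 9
  #define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

  static bool mthp_collapse_allowed(unsigned int max_ptes_none)
  {
          return max_ptes_none == 0 || max_ptes_none == HPAGE_PMD_NR - 1;
  }

  int main(void)
  {
          unsigned int max_ptes_none = 511;

          for (int order = 2; order <= 9; order++)
                  printf("order %d: allow up to %u none PTEs of %d\n", order,
                         max_ptes_none >> (HPAGE_PMD_ORDER - order), 1 << order);
          printf("mTHP collapse allowed: %d\n",
                 mthp_collapse_allowed(max_ptes_none));
          return 0;
  }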
Signed-off-by: Dev Jain
---
 mm/khugepaged.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d2bb008b95e7..b589f889bb5a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -417,6 +417,17 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
 
 static bool thp_enabled(void)
 {
+        bool anon_pmd_enabled = (test_bit(PMD_ORDER, &huge_anon_orders_always) ||
+                                 test_bit(PMD_ORDER, &huge_anon_orders_madvise) ||
+                                 (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+                                  hugepage_global_enabled()));
+
+        /*
+         * If PMD_ORDER is ineligible for collapse, check if the mTHP collapse
+         * policy is obeyed; see Documentation/admin-guide/transhuge.rst
+         */
+        bool anon_collapse_mthp = (khugepaged_max_ptes_none == 0 ||
+                                   khugepaged_max_ptes_none == HPAGE_PMD_NR - 1);
         /*
          * We cover the anon, shmem and the file-backed case here; file-backed
          * hugepages, when configured in, are determined by the global control.
@@ -427,8 +438,9 @@ static bool thp_enabled(void)
         if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
             hugepage_global_enabled())
                 return true;
-        if (huge_anon_orders_always || huge_anon_orders_madvise ||
-            (huge_anon_orders_inherit && hugepage_global_enabled()))
+        if ((huge_anon_orders_always || huge_anon_orders_madvise ||
+             (huge_anon_orders_inherit && hugepage_global_enabled())) &&
+            (anon_pmd_enabled || anon_collapse_mthp))
                 return true;
         if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
                 return true;
@@ -578,13 +590,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
         pte_t *_pte;
         int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
         bool writable = false;
-        unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+        unsigned int max_ptes_shared = khugepaged_max_ptes_shared;
         unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
         bool all_pfns_present = true;
         bool all_pfns_contig = true;
         bool first_pfn_aligned = true;
         pte_t prev_pteval;
 
+        if (order != HPAGE_PMD_ORDER)
+                max_ptes_shared = 0;
+
         for (_pte = pte; _pte < pte + (1UL << order);
              _pte++, address += PAGE_SIZE) {
                 pte_t pteval = ptep_get(_pte);
@@ -1453,11 +1468,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         order = highest_order(orders);
         VM_BUG_ON(address & ((PAGE_SIZE << order) - 1));
 
+        max_ptes_none = khugepaged_max_ptes_none;
+        max_ptes_shared = khugepaged_max_ptes_shared;
+        max_ptes_swap = khugepaged_max_ptes_swap;
+
 scan_pte_range:
 
-        max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+        if (order != HPAGE_PMD_ORDER)
+                max_ptes_shared = max_ptes_swap = 0;
+
         max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
-        max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
         referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
         all_pfns_present = true, all_pfns_contig = true, first_pfn_aligned = true;
 
@@ -2651,6 +2671,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
         int order;
         bool is_file_vma;
         int prev_progress = 0;
+        bool collapse_mthp = true;
+
+        /* Avoid the creep problem; see Documentation/admin-guide/transhuge.rst */
+        if (khugepaged_max_ptes_none && khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
+                collapse_mthp = false;
 
         VM_BUG_ON(!pages);
         lockdep_assert_held(&khugepaged_mm_lock);
@@ -2710,6 +2735,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                         /* select the highest possible order for the VMA */
                         order = highest_order(orders);
                         while (orders) {
+                                if (order != HPAGE_PMD_ORDER && !collapse_mthp)
+                                        goto skip;
                                 hend = round_down(vma->vm_end, PAGE_SIZE << order);
                                 if (khugepaged_scan.address <= hend)
                                         break;
                                 order = next_order(&orders, order);
-- 
2.30.2
From: Dev Jain
Subject: [PATCH v2 17/17] Documentation: transhuge: Define khugepaged mTHP collapse policy
Date: Tue, 11 Feb 2025 16:43:26 +0530
Message-Id: <20250211111326.14295-18-dev.jain@arm.com>
In-Reply-To: <20250211111326.14295-1-dev.jain@arm.com>

Update the documentation to reflect the mTHP-specific changes to
khugepaged.

Signed-off-by: Dev Jain
---
 Documentation/admin-guide/mm/transhuge.rst | 49 +++++++++++++++++-----
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..6a513fa81005 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs `
 interface and using madvise(2) and prctl(2) system calls.
@@ -212,20 +212,16 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
         echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
         echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
 
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when THP is enabled
 (either of the per-size anon control or the top-level control are set
 to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
-top-level control are "never")
+THP is disabled (when all of the per-size anon controls and the
+top-level control are "never"). mTHP collapse is supported only for
+private-anonymous memory.
 
 Khugepaged controls
 -------------------
 
-.. note::
-   khugepaged currently only searches for opportunities to collapse to
-   PMD-sized THP and no attempt is made to collapse to other THP
-   sizes.
-
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -254,8 +250,9 @@ The khugepaged progress can be seen in the number of pages collapsed (note
 that this counter may not be an exact count of the number of pages
 collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
 being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
+one 2M hugepage, or (3) A portion of the PTE mapping 4K pages replaced by
+a mapping to an mTHP. Each may happen independently, or together, depending
+on the type of memory and the failures that occur. As such, this value should
 be interpreted roughly as a sign of progress, and counters in /proc/vmstat
 consulted for more accurate accounting)::
 
@@ -294,6 +291,36 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+Khugepaged specifics for anon-mTHP collapse
+------------------------------------------
+
+The objective of khugepaged is to collapse memory to the highest aligned
+order possible. If it fails at PMD order, it greedily tries the lower
+orders.
+
+The tunables max_ptes_shared and max_ptes_swap are treated as zero for
+mTHP collapse; i.e. the memory range must not have any shared or swap PTE
+for it to be eligible for mTHP collapse.
+
+The tunable max_ptes_none is scaled down according to the order of the
+collapse. For example, if max_ptes_none = 511 and khugepaged tries to
+collapse to order 4, then the memory range under consideration becomes a
+candidate for collapse only when the number of none PTEs (out of the 16
+PTEs) does not exceed 511 >> (9 - 4) = 15.
+
+mTHP collapse is supported only if max_ptes_none is either zero or 511
+(one less than the number of entries in the PTE table). Any other value,
+given the scaling logic above, produces what we call the "creep" problem;
+let the bitmask 00110000 denote a memory range mapped by 8 consecutive
+page table entries, where 0 denotes an empty PTE and 1 a PTE mapping a
+physical folio. Let max_ptes_none = 50% (i.e. max_ptes_none = 256, which
+for an order-3 scan implies a threshold of 256 >> (9 - 3) = 4 none PTEs).
+If order-2 and order-3 are enabled, khugepaged may do the following: it
+scans the range for order-3, but since 6 of the 8 PTEs are none (75%,
+exceeding the threshold of 4), it drops down to order 2. The order-2
+threshold is 256 >> (9 - 2) = 2, so it successfully collapses the first
+4 PTEs to order 2, and the memory range becomes:
+11110000
+Now, from the order-3 point of view, the range has only 4 none PTEs out
+of 8, so it has suddenly become eligible for order-3 collapse. In this
+way we can creep into large-order collapses in a very inefficient manner.
+
 Boot parameters
 ===============
 
-- 
2.30.2