From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2280029AAF3 for ; Fri, 5 Jun 2026 16:14:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676050; cv=none; b=VMXqz0UbabxaynFiBktg97lOwbvHziY8d7owefRNq9ekJoVKyeMTDfrI2D4FgYqFHkuXGASTC48oSQWGFoD/w0uxoSxZ7HvETyHEoFrV2hEP+M+FvTzB4iO9BMhkgdTFqBW7jKMCrRzmDAY+2SFqKRsInB1BnEdZrwjtdRPpZkQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676050; c=relaxed/simple; bh=MdpoajxXA6hKuAeLC6nQfBS+RgZBrvOYaIgoF+Vty6s=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=PSg9C2C8Bm6I8R3xxyKAMW15Wro1cnRYx8ijSiCEU/v5MzzLE65ahmV9z5DG/12OqoRiZNOk6PjYuiT8DWXBGYxlKCPInVkZCI/W1HF2U2fYqMyCwHGO9P7SNqfJVyWvOZ2UdrZpdOKEGiUHtYlMq5MBJmCCzncySyphAoinAG8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Wg5yrFaU; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Wg5yrFaU" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676047; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=W10LZnS31yePGPe4mtnmYpxQr94H9xJjZCEKApsJr5Q=; b=Wg5yrFaUDY5+UUr1Oc78UJczKVebQTEGTORMsCQb/nM+u09GoRIeNtz8U8TYq8vbcKqjsV 48l1YwBxRirO9x6aHk950yjPyGtkhXo9nURllTTGD4Ut6AWL/mdWbMVON2+Zk4WGTQV6m/ 9u2CU1pFdExhpezqybRlTEAqBk2pv18= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-575-wJVdCErdPf-Q5PRP4ZNHfw-1; Fri, 05 Jun 2026 12:14:04 -0400 X-MC-Unique: wJVdCErdPf-Q5PRP4ZNHfw-1 X-Mimecast-MFC-AGG-ID: wJVdCErdPf-Q5PRP4ZNHfw_1780676042 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 9FA8D1956089; Fri, 5 Jun 2026 16:14:02 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 9B0531800351; Fri, 5 Jun 2026 16:13:42 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, 
matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif
Subject: [PATCH mm-unstable v19 01/14] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
Date: Fri, 5 Jun 2026 10:14:08 -0600
Message-ID: <20260605161422.213817-2-npache@redhat.com>
In-Reply-To: <20260605161422.213817-1-npache@redhat.com>
References: <20260605161422.213817-1-npache@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: List-Subscribe: List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

For khugepaged to support different mTHP orders, we must generalize hugepage_vma_revalidate() to check that the PMD is not shared by another VMA and that the requested order is enabled.

We cannot collapse VMA regions that do not span the full PMD. The PMD could be shared by another VMA, which leaves us vulnerable to race conditions if neighboring VMAs are resized. Always check the PMD order here to ensure it's not shared by another VMA. Supporting such regions would require locking all VMAs in the PMD range, which may lead to increased lock contention and code complexity.

No functional change in this patch. Also correct a comment about the functionality of the revalidation and fix a double-space issue.
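The two checks described above can be sketched in userspace C. This is only a simplified model, not the kernel code: every model_* name, the 4 KiB page size and the PMD order of 9 are assumptions made for illustration, and the real thp_vma_suitable_order() and thp_vma_allowable_orders() helpers take more conditions into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for the model: 4 KiB pages, 2 MiB PMD (order 9). */
#define MODEL_PAGE_SHIFT 12u
#define MODEL_PMD_ORDER   9u

/* Hypothetical stand-in for a VMA: the half-open range [start, end). */
struct model_vma {
	uint64_t start;
	uint64_t end;
};

enum model_result {
	MODEL_SUCCEED,
	MODEL_ADDRESS_RANGE,
	MODEL_VMA_CHECK,
};

/* True if the naturally aligned 2^order-page block around addr fits in vma. */
static bool model_range_fits(const struct model_vma *vma, uint64_t addr,
			     unsigned int order)
{
	uint64_t size = (uint64_t)1 << (MODEL_PAGE_SHIFT + order);
	uint64_t base = addr & ~(size - 1);

	return base >= vma->start && base + size <= vma->end;
}

/*
 * The range check always uses the PMD order, so a VMA that covers only
 * part of the PMD is rejected even for a smaller collapse order. The
 * "is it enabled" check uses the requested order as a single-bit mask.
 */
static enum model_result model_revalidate(const struct model_vma *vma,
					  uint64_t addr,
					  unsigned long enabled_orders,
					  unsigned int order)
{
	if (!model_range_fits(vma, addr, MODEL_PMD_ORDER))
		return MODEL_ADDRESS_RANGE;
	if (!(enabled_orders & (1ul << order)))
		return MODEL_VMA_CHECK;
	return MODEL_SUCCEED;
}
```

In this model, a VMA that ends one page short of the PMD boundary fails the range check for every order, which mirrors the reasoning in the text: the remainder of the PMD could belong to a different VMA.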
Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Acked-by: Usama Arif Acked-by: David Hildenbrand (Arm) Co-developed-by: Dev Jain Signed-off-by: Dev Jain Signed-off-by: Nico Pache --- mm/khugepaged.c | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a4b97ec8ce56..b3910042bbf7 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -905,12 +905,13 @@ static int collapse_find_target_node(struct collapse_= control *cc) =20 /* * If mmap_lock temporarily dropped, revalidate vma - * before taking mmap_lock. + * after taking the mmap_lock again. * Returns enum scan_result value. */ =20 static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsi= gned long address, - bool expect_anon, struct vm_area_struct **vmap, struct collapse_control = *cc) + bool expect_anon, struct vm_area_struct **vmap, + struct collapse_control *cc, unsigned int order) { struct vm_area_struct *vma; enum tva_type type =3D cc->is_khugepaged ? TVA_KHUGEPAGED : @@ -923,15 +924,22 @@ static enum scan_result hugepage_vma_revalidate(struc= t mm_struct *mm, unsigned l if (!vma) return SCAN_VMA_NULL; =20 + /* + * We cannot collapse VMA regions that do not span the full PMD. This is + * due to the potential of the PMD being shared by another VMA leaving + * us vulnerable to a race condition. Always check the PMD order here to + * ensure its not shared by another VMA. We'd need to lock all VMAs in + * the PMD range to support this. + */ if (!thp_vma_suitable_order(vma, address, PMD_ORDER)) return SCAN_ADDRESS_RANGE; - if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER)) + if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order))) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then * remapped to file after khugepaged reaquired the mmap_lock. 
* - * thp_vma_allowable_order may return true for qualified file + * thp_vma_allowable_orders may return true for qualified file * vmas. */ if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap))) @@ -1124,7 +1132,8 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a goto out_nolock; =20 mmap_read_lock(mm); - result =3D hugepage_vma_revalidate(mm, address, true, &vma, cc); + result =3D hugepage_vma_revalidate(mm, address, true, &vma, cc, + HPAGE_PMD_ORDER); if (result !=3D SCAN_SUCCEED) { mmap_read_unlock(mm); goto out_nolock; @@ -1158,7 +1167,8 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a * mmap_lock. */ mmap_write_lock(mm); - result =3D hugepage_vma_revalidate(mm, address, true, &vma, cc); + result =3D hugepage_vma_revalidate(mm, address, true, &vma, cc, + HPAGE_PMD_ORDER); if (result !=3D SCAN_SUCCEED) goto out_up_write; /* check if the pmd is still valid */ @@ -2861,8 +2871,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsi= gned long start, mmap_unlocked =3D false; *lock_dropped =3D true; result =3D hugepage_vma_revalidate(mm, addr, false, &vma, - cc); - if (result !=3D SCAN_SUCCEED) { + cc, HPAGE_PMD_ORDER); + if (result !=3D SCAN_SUCCEED) { last_fail =3D result; goto out_nolock; } --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4B47A4EA37C for ; Fri, 5 Jun 2026 16:14:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676071; cv=none; 
b=WvAPZ/Y4ceKDKdro4luj5DaJhqI3cBvv8g5pRzk2RwresPt/jZfoZA+v9trZSpOWriN2LfFy0+EGrh55X2OGfpma6PE7uCNsNLrU3EfHNbps8m2HahDEE5jOwpHayItnIjEaQggynKmYqio4NrKwPY/5qAXTpOU2pO0AtpowLWk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676071; c=relaxed/simple; bh=tuavxpwLgvCxfcb16bH9ZbLlq4B8LDZlo4dK/7Js9yE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=OAH2RMcmXq7AHDyDDzGOkiERC69egitIJAhi5GW85lCSzvZuHYW9XYQBcQIZXr9VP5PSlcJQNDVWGMzEqU26s0UYg5J4XMLOMg0efViKKj7SJLNMGT++rwNIvOcIPPDGJajlkunQw2txQm2IOWd92G7h8mR8NlIPjyvAHnC6aCc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=fGHoLSBU; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="fGHoLSBU" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676068; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xlR8fhzh8q63CKgSojQvD7fR1bfQEPbHYVxE9z/YBVA=; b=fGHoLSBUC6lHaMheU63kUUU2u6mbRD1z3yTNkpJL7l2YTBcFgGTdj1fGOQPd5Vk6F/8EhW ev7D+g8Owgqn9CpYEoXc6Of3NyxZ/HLPtbBTvPcwa4Dewq4rRGRAsi5zVCjHe5MeRwM2Gn W4eIFuQi/hBb0W2nlOzYz/wnxNkFBuA= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id 
us-mta-440-5lOMKw1SMtGdkD8YEUDwuQ-1; Fri, 05 Jun 2026 12:14:24 -0400 X-MC-Unique: 5lOMKw1SMtGdkD8YEUDwuQ-1 X-Mimecast-MFC-AGG-ID: 5lOMKw1SMtGdkD8YEUDwuQ_1780676063 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id F320519560A2; Fri, 5 Jun 2026 16:14:22 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 306B1180049F; Fri, 5 Jun 2026 16:14:02 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, 
willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif Subject: [PATCH mm-unstable v19 02/14] mm/khugepaged: generalize alloc_charge_folio() Date: Fri, 5 Jun 2026 10:14:09 -0600 Message-ID: <20260605161422.213817-3-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" From: Dev Jain Pass order to alloc_charge_folio() and update mTHP statistics. Reviewed-by: Wei Yang Reviewed-by: Lance Yang Reviewed-by: Baolin Wang Reviewed-by: Lorenzo Stoakes Reviewed-by: Zi Yan Acked-by: Usama Arif Acked-by: David Hildenbrand (Arm) Signed-off-by: Dev Jain Co-developed-by: Nico Pache Signed-off-by: Nico Pache --- Documentation/admin-guide/mm/transhuge.rst | 8 ++++++++ include/linux/huge_mm.h | 2 ++ mm/huge_memory.c | 4 ++++ mm/khugepaged.c | 20 +++++++++++++------- 4 files changed, 27 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/adm= in-guide/mm/transhuge.rst index 76f4eb14e262..a74844e01f1e 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -639,6 +639,14 @@ anon_fault_fallback_charge instead falls back to using huge pages with lower orders or small pages even though the allocation was successful. =20 +collapse_alloc + is incremented every time a huge page is successfully allocated for a + khugepaged collapse. + +collapse_alloc_failed + is incremented every time a huge page allocation fails during a + khugepaged collapse. + zswpout is incremented every time a huge page is swapped out to zswap in one piece without splitting. 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 58382e97a66d..443852423790 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -128,6 +128,8 @@ enum mthp_stat_item { MTHP_STAT_ANON_FAULT_ALLOC, MTHP_STAT_ANON_FAULT_FALLBACK, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, + MTHP_STAT_COLLAPSE_ALLOC, + MTHP_STAT_COLLAPSE_ALLOC_FAILED, MTHP_STAT_ZSWPOUT, MTHP_STAT_SWPIN, MTHP_STAT_SWPIN_FALLBACK, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1f14c5c48b4a..eea83da9114a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -699,6 +699,8 @@ static struct kobj_attribute _name##_attr =3D __ATTR_RO= (_name) DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FAL= LBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC); +DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAIL= ED); DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT); DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN); DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK); @@ -764,6 +766,8 @@ static struct attribute *any_stats_attrs[] =3D { #endif &split_attr.attr, &split_failed_attr.attr, + &collapse_alloc_attr.attr, + &collapse_alloc_failed_attr.attr, NULL, }; =20 diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b3910042bbf7..44564c179636 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1077,28 +1077,34 @@ static enum scan_result __collapse_huge_page_swapin= (struct mm_struct *mm, } =20 static enum scan_result alloc_charge_folio(struct folio **foliop, struct m= m_struct *mm, - struct collapse_control *cc) + struct collapse_control *cc, unsigned int order) { gfp_t gfp =3D (cc->is_khugepaged ? 
alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE); int node =3D collapse_find_target_node(cc); struct folio *folio; =20 - folio =3D __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask); + folio =3D __folio_alloc(gfp, order, node, &cc->alloc_nmask); if (!folio) { *foliop =3D NULL; - count_vm_event(THP_COLLAPSE_ALLOC_FAILED); + if (is_pmd_order(order)) + count_vm_event(THP_COLLAPSE_ALLOC_FAILED); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED); return SCAN_ALLOC_HUGE_PAGE_FAIL; } =20 - count_vm_event(THP_COLLAPSE_ALLOC); + if (is_pmd_order(order)) + count_vm_event(THP_COLLAPSE_ALLOC); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC); + if (unlikely(mem_cgroup_charge(folio, mm, gfp))) { folio_put(folio); *foliop =3D NULL; return SCAN_CGROUP_CHARGE_FAIL; } =20 - count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1); + if (is_pmd_order(order)) + count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1); =20 *foliop =3D folio; return SCAN_SUCCEED; @@ -1127,7 +1133,7 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a */ mmap_read_unlock(mm); =20 - result =3D alloc_charge_folio(&folio, mm, cc); + result =3D alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER); if (result !=3D SCAN_SUCCEED) goto out_nolock; =20 @@ -1908,7 +1914,7 @@ static enum scan_result collapse_file(struct mm_struc= t *mm, unsigned long addr, VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); =20 - result =3D alloc_charge_folio(&new_folio, mm, cc); + result =3D alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER); if (result !=3D SCAN_SUCCEED) goto out; =20 --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4AF4E404BC5 for ; Fri, 5 Jun 2026 
16:14:47 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676089; cv=none; b=oWTwedoh0jCw7380ljpktGJxkEyPw4tMvKsyHRLYqrzvIKcJw8UnHcc87FyDcRoemAgdljYREaOUhkcbDx/iy4f4W1nlCEHcyH/H+/eKKCTV1Y4v4B02kjYdMn60006ZRmgsct/cHphRdOoEQBdYqZa6SHfz84jj9X4ETQ7zpH8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676089; c=relaxed/simple; bh=6BDBX/mFaIdEj3iSeo2sneTZKGNE15o9rtXccUDYSbQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=a9EARV9rMvJfM6Ct5loG/aB9qmQKT9lEOz/Nth4ootvaa0z175n27nK6Y5MBLWzfaPba+i3yT6oCq7bALo8vwLbQ69rt6VDjYDr8oTRCCDM9CELRI4lVvjibMPZriVqZJvE50yDkRIGBoqBmklHWrB5XeWzUb9YDWTGzlOGR3MI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=EpnhFVa3; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="EpnhFVa3" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676086; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EvMkx3xZvi80w/qjjymYzAQzqCCVHeJf6pT43rxuucw=; b=EpnhFVa3wXgd8iQOZbBJK2BulZ87PlccHdnr78UhNzEuI3awM1mjJd211j2mGQJZZEY/Xj IjjORLPJ3kczefVah9CJvAI4BfrcRWuT0rQSledZ6hxxJIR677aVJyX5eSMMmBesnpFvL3 u6R1T/SI/I7llWJaaqFg4nfphqqDxco= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com 
(ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-605-aos35wI1OuC2SnUMVWc42w-1; Fri, 05 Jun 2026 12:14:43 -0400 X-MC-Unique: aos35wI1OuC2SnUMVWc42w-1 X-Mimecast-MFC-AGG-ID: aos35wI1OuC2SnUMVWc42w_1780676082 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 0DBA3195608D; Fri, 5 Jun 2026 16:14:42 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 8200F1800351; Fri, 5 Jun 2026 16:14:23 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, 
thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif
Subject: [PATCH mm-unstable v19 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
Date: Fri, 5 Jun 2026 10:14:10 -0600
Message-ID: <20260605161422.213817-4-npache@redhat.com>
In-Reply-To: <20260605161422.213817-1-npache@redhat.com>
References: <20260605161422.213817-1-npache@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: List-Subscribe: List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

This cleanup reworks all the max_ptes_* handling into helper functions. This improves readability, and the helpers will later be used to implement the mTHP handling of these variables.

With these changes we abstract all the madvise_collapse() special casing (which does not respect the sysfs tunables) away from the functions that use them. The helpers will also be used later in this series to cleanly restrict the mTHP collapse behavior.

No functional change is intended; however, the sysfs variables are now read only once per scan, whereas before they were read on each loop iteration.
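To illustrate why folding the special cases into one limit is not expected to change behavior, here is a minimal userspace sketch for the max_ptes_none case. It is a model under stated assumptions, not the kernel code: all model_* names are hypothetical, HPAGE_PMD_NR is assumed to be 512, and the userfaultfd state is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed number of PTEs per PMD in this model. */
#define MODEL_HPAGE_PMD_NR 512u

/* Hypothetical stand-in for struct collapse_control. */
struct model_cc {
	bool is_khugepaged;
};

/* Stands in for the khugepaged max_ptes_none tunable. */
static unsigned int model_tunable_max_ptes_none = 511;

/* Helper style: fold every special case into a single limit. */
static unsigned int model_max_ptes_none(const struct model_cc *cc,
					bool uffd_armed)
{
	if (uffd_armed)
		return 0;
	if (!cc->is_khugepaged)		/* MADV_COLLAPSE ignores the tunable */
		return MODEL_HPAGE_PMD_NR;
	return model_tunable_max_ptes_none;
}

/* Open-coded condition as it looked before the rework. */
static bool model_old_may_continue(const struct model_cc *cc, bool uffd_armed,
				   unsigned int none_or_zero)
{
	return !uffd_armed &&
	       (!cc->is_khugepaged ||
		none_or_zero <= model_tunable_max_ptes_none);
}

/* Condition after the rework: one comparison against the limit. */
static bool model_new_may_continue(const struct model_cc *cc, bool uffd_armed,
				   unsigned int none_or_zero)
{
	return none_or_zero <= model_max_ptes_none(cc, uffd_armed);
}

/* Compare both forms for every count a single PMD scan can reach. */
static bool model_checks_agree(void)
{
	const bool flags[2] = { false, true };

	for (int k = 0; k < 2; k++) {
		for (int u = 0; u < 2; u++) {
			const struct model_cc cc = { .is_khugepaged = flags[k] };

			for (unsigned int n = 1; n <= MODEL_HPAGE_PMD_NR; n++) {
				if (model_old_may_continue(&cc, flags[u], n) !=
				    model_new_may_continue(&cc, flags[u], n))
					return false;
			}
		}
	}
	return true;
}
```

The counter is incremented before it is compared, so it is at least 1 when checked. That is why a limit of 0 is enough to model the userfaultfd case, and a limit of HPAGE_PMD_NR is enough to model "no restriction" for MADV_COLLAPSE.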
Reviewed-by: Zi Yan Reviewed-by: Lorenzo Stoakes Reviewed-by: Lance Yang Suggested-by: David Hildenbrand Acked-by: David Hildenbrand (Arm) Acked-by: Usama Arif Signed-off-by: Nico Pache --- mm/khugepaged.c | 120 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 84 insertions(+), 36 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 44564c179636..f56ab049a6c4 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -348,6 +348,64 @@ static bool pte_none_or_zero(pte_t pte) return pte_present(pte) && is_zero_pfn(pte_pfn(pte)); } =20 +/** + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs or PTEs m= apping + * the shared zeropage for the given collapse operation. + * @cc: The collapse control struct + * @vma: The vma to check for userfaultfd + * + * Return: Maximum number of empty/shared zeropage PTEs for the collapse o= peration + */ +static unsigned int collapse_max_ptes_none(struct collapse_control *cc, + struct vm_area_struct *vma) +{ + if (vma && userfaultfd_armed(vma)) + return 0; + /* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */ + if (!cc->is_khugepaged) + return HPAGE_PMD_NR; + /* For all other cases respect the user defined maximum */ + return khugepaged_max_ptes_none; +} + +/** + * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shar= ed + * anonymous pages for the given collapse operation. + * @cc: The collapse control struct + * + * Return: Maximum number of PTEs that map shared anonymous pages for the + * collapse operation + */ +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc) +{ + /* + * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared + * anonymous pages. + */ + if (!cc->is_khugepaged) + return HPAGE_PMD_NR; + return khugepaged_max_ptes_shared; +} + +/** + * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs= or the + * maximum allowed non-present pagecache entries for the given collapse op= eration. 
+ * @cc: The collapse control struct + * + * Return: Maximum number of non-present PTEs or the maximum allowed non-p= resent + * pagecache entries for the collapse operation. + */ +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc) +{ + /* + * For MADV_COLLAPSE, do not restrict the number PTEs entries or + * pagecache entries that are non-present. + */ + if (!cc->is_khugepaged) + return HPAGE_PMD_NR; + return khugepaged_max_ptes_swap; +} + int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags, int advice) { @@ -543,6 +601,8 @@ static enum scan_result __collapse_huge_page_isolate(st= ruct vm_area_struct *vma, unsigned long start_addr, pte_t *pte, struct collapse_control *cc, struct list_head *compound_pagelist) { + const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma); + const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc); struct page *page =3D NULL; struct folio *folio =3D NULL; unsigned long addr =3D start_addr; @@ -554,16 +614,12 @@ static enum scan_result __collapse_huge_page_isolate(= struct vm_area_struct *vma, _pte++, addr +=3D PAGE_SIZE) { pte_t pteval =3D ptep_get(_pte); if (pte_none_or_zero(pteval)) { - ++none_or_zero; - if (!userfaultfd_armed(vma) && - (!cc->is_khugepaged || - none_or_zero <=3D khugepaged_max_ptes_none)) { - continue; - } else { + if (++none_or_zero > max_ptes_none) { result =3D SCAN_EXCEED_NONE_PTE; count_vm_event(THP_SCAN_EXCEED_NONE_PTE); goto out; } + continue; } if (!pte_present(pteval)) { result =3D SCAN_PTE_NON_PRESENT; @@ -594,9 +650,7 @@ static enum scan_result __collapse_huge_page_isolate(st= ruct vm_area_struct *vma, =20 /* See collapse_scan_pmd(). 
*/ if (folio_maybe_mapped_shared(folio)) { - ++shared; - if (cc->is_khugepaged && - shared > khugepaged_max_ptes_shared) { + if (++shared > max_ptes_shared) { result =3D SCAN_EXCEED_SHARED_PTE; count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); goto out; @@ -1271,6 +1325,9 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, struct vm_area_struct *vma, unsigned long start_addr, bool *lock_dropped, struct collapse_control *cc) { + const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma); + const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc); + const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc); pmd_t *pmd; pte_t *pte, *_pte; int none_or_zero =3D 0, shared =3D 0, referenced =3D 0; @@ -1304,36 +1361,29 @@ static enum scan_result collapse_scan_pmd(struct mm= _struct *mm, =20 pte_t pteval =3D ptep_get(_pte); if (pte_none_or_zero(pteval)) { - ++none_or_zero; - if (!userfaultfd_armed(vma) && - (!cc->is_khugepaged || - none_or_zero <=3D khugepaged_max_ptes_none)) { - continue; - } else { + if (++none_or_zero > max_ptes_none) { result =3D SCAN_EXCEED_NONE_PTE; count_vm_event(THP_SCAN_EXCEED_NONE_PTE); goto out_unmap; } + continue; } if (!pte_present(pteval)) { - ++unmapped; - if (!cc->is_khugepaged || - unmapped <=3D khugepaged_max_ptes_swap) { - /* - * Always be strict with uffd-wp - * enabled swap entries. Please see - * comment below for pte_uffd_wp(). - */ - if (pte_swp_uffd_wp_any(pteval)) { - result =3D SCAN_PTE_UFFD_WP; - goto out_unmap; - } - continue; - } else { + if (++unmapped > max_ptes_swap) { result =3D SCAN_EXCEED_SWAP_PTE; count_vm_event(THP_SCAN_EXCEED_SWAP_PTE); goto out_unmap; } + /* + * Always be strict with uffd-wp + * enabled swap entries. Please see + * comment below for pte_uffd_wp(). 
+ */ + if (pte_swp_uffd_wp_any(pteval)) { + result =3D SCAN_PTE_UFFD_WP; + goto out_unmap; + } + continue; } if (pte_uffd_wp(pteval)) { /* @@ -1376,9 +1426,7 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, * is shared. */ if (folio_maybe_mapped_shared(folio)) { - ++shared; - if (cc->is_khugepaged && - shared > khugepaged_max_ptes_shared) { + if (++shared > max_ptes_shared) { result =3D SCAN_EXCEED_SHARED_PTE; count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); goto out_unmap; @@ -2333,6 +2381,8 @@ static enum scan_result collapse_scan_file(struct mm_= struct *mm, unsigned long addr, struct file *file, pgoff_t start, struct collapse_control *cc) { + const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, NULL); + const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc); struct folio *folio =3D NULL; struct address_space *mapping =3D file->f_mapping; XA_STATE(xas, &mapping->i_pages, start); @@ -2351,8 +2401,7 @@ static enum scan_result collapse_scan_file(struct mm_= struct *mm, =20 if (xa_is_value(folio)) { swap +=3D 1 << xas_get_order(&xas); - if (cc->is_khugepaged && - swap > khugepaged_max_ptes_swap) { + if (swap > max_ptes_swap) { result =3D SCAN_EXCEED_SWAP_PTE; count_vm_event(THP_SCAN_EXCEED_SWAP_PTE); break; @@ -2423,8 +2472,7 @@ static enum scan_result collapse_scan_file(struct mm_= struct *mm, cc->progress +=3D HPAGE_PMD_NR; =20 if (result =3D=3D SCAN_SUCCEED) { - if (cc->is_khugepaged && - present < HPAGE_PMD_NR - khugepaged_max_ptes_none) { + if (present < HPAGE_PMD_NR - max_ptes_none) { result =3D SCAN_EXCEED_NONE_PTE; count_vm_event(THP_SCAN_EXCEED_NONE_PTE); } else { --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EBDF840BCDD for ; Fri, 5 Jun 2026 16:15:05 +0000 (UTC) 
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676108; cv=none; b=W/D4VuOcB4/v4GEOjauQ3/eNUhAiKZLGaPqEGpgyyIzeLc8r2PYitX4fNsl93YUldRV2HkEHKniEOSW8YMQalt2L/75PGw/4b8r4uxP8+HMfSV0txumYE4muW2ufaFAtbDLl7yjnOwTcuqO3d+b1mKjsSPETTSNhwbvIC7wK/7k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676108; c=relaxed/simple; bh=kv5eLWE99kBnytB/+lR2s4rXEMPg9Ib6T+GmpdwfIQ4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Yqusbms89HlncPQ6tQtGZTXUc8A0P85bb9EtjMWGhYxpcmrLh7NqFd/fRcy8iddAev4N68TEN1qDH4tVxXiDiB54VSvJpgZyo1CjL3H2ymg6hISb6LBSrGprPFVUXHnT/naB0lYvk42WsVOyGkeozPPOE9xNV0ZTUAPZ8QrYPHI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=eArfSn8c; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="eArfSn8c" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676105; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=sM2ghSC4v58MaQLggKCtDyraE3Qu9nrQx3/9hZTQe8Y=; b=eArfSn8c2qFMMgmK5FhvWd6G6iPAR6hEPqlujdROMWm1LENhdNGifh94IXGmkMwza5dKJZ sMxznWJh5qWEVd54gOLoSTZNcuywB/Ch10kgFbEzbOvRcF2j1x36ahtbhpmf//QixUvljR TDbNW1rAVb0ZrbXGNsWk5uqM9cCNQn4= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com 
(ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-249-maJH_NK2Pbql8v_fKvGCtA-1; Fri, 05 Jun 2026 12:15:01 -0400 X-MC-Unique: maJH_NK2Pbql8v_fKvGCtA-1 X-Mimecast-MFC-AGG-ID: maJH_NK2Pbql8v_fKvGCtA_1780676100 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id C685819560A1; Fri, 5 Jun 2026 16:15:00 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 9E6F818005B5; Fri, 5 Jun 2026 16:14:42 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, 
thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support Date: Fri, 5 Jun 2026 10:14:11 -0600 Message-ID: <20260605161422.213817-5-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Generalize the order of the __collapse_huge_page_* and collapse_max_* functions to support future mTHP collapse. The current mechanism for determining collapse with the khugepaged_max_ptes_none value is not designed with mTHP in mind. This raises a key design issue: if we support user-defined max_ptes_none values (even those scaled by order), a collapse of a lower order can introduce a feedback loop, or "creep", when max_ptes_none is set to a value greater than HPAGE_PMD_NR / 2. [1] With this configuration, a successful collapse to order N will populate enough pages to satisfy the collapse condition on order N+1 on the next scan. This leads to unnecessary work and memory churn. To fix this issue, introduce a helper function that will limit mTHP collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1. This effectively supports two modes: [2] - max_ptes_none=0: Never collapse if it encounters an empty PTE or a PTE that maps the shared zeropage. Consequently, there is no memory bloat. - max_ptes_none=511 (with 4K pages): Always collapse to the highest available mTHP order.
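The creep described above can be sketched in a few lines of userspace C. This is only an illustration, not kernel code: it assumes 4K pages (512 PTEs per PMD) and a hypothetical rule that scales the limit to lower orders as max_ptes_none >> (PMD_ORDER - order). The names PMD_ORDER, scaled_max_none() and simulate_creep() are made up for this example.

```c
#define PMD_ORDER 9	/* assumption: 4K pages, 512 PTEs per PMD */

/* Hypothetical scaling rule: shrink the PMD-level limit to a lower order. */
static unsigned int scaled_max_none(unsigned int max_ptes_none,
				    unsigned int order)
{
	return max_ptes_none >> (PMD_ORDER - order);
}

/*
 * Model successive scans of one PMD range. `present` PTEs are populated at
 * the start (present must not exceed 1 << start_order). Each loop iteration
 * is one scan attempting the next order up. Returns the highest order that
 * was collapsed, or 0 if not even start_order succeeded.
 */
static unsigned int simulate_creep(unsigned int max_ptes_none,
				   unsigned int start_order,
				   unsigned int present)
{
	unsigned int order, reached = 0;

	for (order = start_order; order <= PMD_ORDER; order++) {
		unsigned int nr = 1u << order;
		unsigned int none = nr - present;

		if (none > scaled_max_none(max_ptes_none, order))
			break;
		/* A successful collapse populates the whole range. */
		present = nr;
		reached = order;
	}
	return reached;
}
```

Under these assumptions, simulate_creep(300, 2, 2) returns 9: two populated PTEs are enough to snowball into a full PMD collapse over successive scans. With a limit below half, simulate_creep(255, 2, 3) returns 2: the order-2 collapse succeeds, but the order-3 attempt fails and the creep stops.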
Restricting mTHP collapse to these two modes removes the possibility of "creep", and a warning will be emitted if any unsupported max_ptes_none value is configured with mTHP enabled. Any intermediate value will default mTHP collapse to max_ptes_none=0. mTHP collapse will not honor the khugepaged_max_ptes_shared or khugepaged_max_ptes_swap parameters, and will fail if it encounters a shared or swapped entry. This patch makes no functional changes; however, it defines future behavior for mTHP collapse. [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand (arm) Reviewed-by: Lance Yang Co-developed-by: Dev Jain Signed-off-by: Dev Jain Signed-off-by: Nico Pache Reviewed-by: Zi Yan --- mm/khugepaged.c | 126 +++++++++++++++++++++++++++++++++++------------- 1 file changed, 93 insertions(+), 33 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index f56ab049a6c4..474ee97c54ba 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -353,30 +353,51 @@ static bool pte_none_or_zero(pte_t pte) * the shared zeropage for the given collapse operation.
* @cc: The collapse control struct * @vma: The vma to check for userfaultfd + * @order: The folio order being collapsed to * * Return: Maximum number of empty/shared zeropage PTEs for the collapse o= peration */ static unsigned int collapse_max_ptes_none(struct collapse_control *cc, - struct vm_area_struct *vma) + struct vm_area_struct *vma, unsigned int order) { + const unsigned int max_ptes_none =3D khugepaged_max_ptes_none; + if (vma && userfaultfd_armed(vma)) return 0; /* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */ if (!cc->is_khugepaged) return HPAGE_PMD_NR; - /* For all other cases respect the user defined maximum */ - return khugepaged_max_ptes_none; + /* for PMD collapse, respect the user defined maximum */ + if (is_pmd_order(order)) + return max_ptes_none; + /* + * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIM= IT, + * scale the maximum number of PTEs to the order of the collapse. + */ + if (max_ptes_none =3D=3D KHUGEPAGED_MAX_PTES_LIMIT) + return (1 << order) - 1; + /* + * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT, + * emit a warning and return 0. + */ + if (max_ptes_none) + pr_warn_once("mTHP collapse does not support max_ptes_none" + " values other than 0 or %u, defaulting to 0.\n", + KHUGEPAGED_MAX_PTES_LIMIT); + return 0; } =20 /** * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shar= ed * anonymous pages for the given collapse operation. 
* @cc: The collapse control struct + * @order: The folio order being collapsed to * * Return: Maximum number of PTEs that map shared anonymous pages for the * collapse operation */ -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc) +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc, + unsigned int order) { /* * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared @@ -384,6 +405,13 @@ static unsigned int collapse_max_ptes_shared(struct co= llapse_control *cc) */ if (!cc->is_khugepaged) return HPAGE_PMD_NR; + /* + * for mTHP collapse do not allow collapsing anonymous memory pages that + * are shared between processes. + */ + if (!is_pmd_order(order)) + return 0; + /* for PMD collapse, respect the user defined maximum */ return khugepaged_max_ptes_shared; } =20 @@ -391,11 +419,13 @@ static unsigned int collapse_max_ptes_shared(struct c= ollapse_control *cc) * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs= or the * maximum allowed non-present pagecache entries for the given collapse op= eration. * @cc: The collapse control struct + * @order: The folio order being collapsed to * * Return: Maximum number of non-present PTEs or the maximum allowed non-p= resent * pagecache entries for the collapse operation. 
*/ -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc) +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc, + unsigned int order) { /* * For MADV_COLLAPSE, do not restrict the number PTEs entries or @@ -403,6 +433,10 @@ static unsigned int collapse_max_ptes_swap(struct coll= apse_control *cc) */ if (!cc->is_khugepaged) return HPAGE_PMD_NR; + /* for mTHP collapse do not allow any non-present PTEs or pagecache entri= es */ + if (!is_pmd_order(order)) + return 0; + /* for PMD collapse, respect the user defined maximum */ return khugepaged_max_ptes_swap; } =20 @@ -599,10 +633,11 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte, =20 static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct= *vma, unsigned long start_addr, pte_t *pte, struct collapse_control *cc, - struct list_head *compound_pagelist) + unsigned int order, struct list_head *compound_pagelist) { - const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma); - const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc); + const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma, orde= r); + const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc, order= ); + const unsigned long nr_pages =3D 1UL << order; struct page *page =3D NULL; struct folio *folio =3D NULL; unsigned long addr =3D start_addr; @@ -610,7 +645,7 @@ static enum scan_result __collapse_huge_page_isolate(st= ruct vm_area_struct *vma, int none_or_zero =3D 0, shared =3D 0, referenced =3D 0; enum scan_result result =3D SCAN_FAIL; =20 - for (_pte =3D pte; _pte < pte + HPAGE_PMD_NR; + for (_pte =3D pte; _pte < pte + nr_pages; _pte++, addr +=3D PAGE_SIZE) { pte_t pteval =3D ptep_get(_pte); if (pte_none_or_zero(pteval)) { @@ -650,6 +685,12 @@ static enum scan_result __collapse_huge_page_isolate(s= truct vm_area_struct *vma, =20 /* See collapse_scan_pmd(). 
*/ if (folio_maybe_mapped_shared(folio)) { + /* + * TODO: Support shared pages without leading to further + * mTHP collapses. Currently bringing in new pages via + * shared may cause a future higher order collapse on a + * rescan of the same range. + */ if (++shared > max_ptes_shared) { result =3D SCAN_EXCEED_SHARED_PTE; count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); @@ -743,18 +784,18 @@ static enum scan_result __collapse_huge_page_isolate(= struct vm_area_struct *vma, } =20 static void __collapse_huge_page_copy_succeeded(pte_t *pte, - struct vm_area_struct *vma, - unsigned long address, - spinlock_t *ptl, - struct list_head *compound_pagelist) + struct vm_area_struct *vma, unsigned long address, + spinlock_t *ptl, unsigned int order, + struct list_head *compound_pagelist) { - unsigned long end =3D address + HPAGE_PMD_SIZE; + const unsigned long nr_pages =3D 1UL << order; + unsigned long end =3D address + (PAGE_SIZE * nr_pages); struct folio *src, *tmp; pte_t pteval; pte_t *_pte; unsigned int nr_ptes; =20 - for (_pte =3D pte; _pte < pte + HPAGE_PMD_NR; _pte +=3D nr_ptes, + for (_pte =3D pte; _pte < pte + nr_pages; _pte +=3D nr_ptes, address +=3D nr_ptes * PAGE_SIZE) { nr_ptes =3D 1; pteval =3D ptep_get(_pte); @@ -807,11 +848,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t= *pte, } =20 static void __collapse_huge_page_copy_failed(pte_t *pte, - pmd_t *pmd, - pmd_t orig_pmd, - struct vm_area_struct *vma, - struct list_head *compound_pagelist) + pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma, + unsigned int order, struct list_head *compound_pagelist) { + const unsigned long nr_pages =3D 1UL << order; spinlock_t *pmd_ptl; =20 /* @@ -827,7 +867,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte, * Release both raw and compound pages isolated * in __collapse_huge_page_isolate. 
*/ - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist); + release_pte_pages(pte, pte + nr_pages, compound_pagelist); } =20 /* @@ -847,16 +887,17 @@ static void __collapse_huge_page_copy_failed(pte_t *p= te, */ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio= *folio, pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma, - unsigned long address, spinlock_t *ptl, + unsigned long address, spinlock_t *ptl, unsigned int order, struct list_head *compound_pagelist) { + const unsigned long nr_pages =3D 1UL << order; unsigned int i; enum scan_result result =3D SCAN_SUCCEED; =20 /* * Copying pages' contents is subject to memory poison at any iteration. */ - for (i =3D 0; i < HPAGE_PMD_NR; i++) { + for (i =3D 0; i < nr_pages; i++) { pte_t pteval =3D ptep_get(pte + i); struct page *page =3D folio_page(folio, i); unsigned long src_addr =3D address + i * PAGE_SIZE; @@ -875,10 +916,10 @@ static enum scan_result __collapse_huge_page_copy(pte= _t *pte, struct folio *foli =20 if (likely(result =3D=3D SCAN_SUCCEED)) __collapse_huge_page_copy_succeeded(pte, vma, address, ptl, - compound_pagelist); + order, compound_pagelist); else __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma, - compound_pagelist); + order, compound_pagelist); =20 return result; } @@ -1051,16 +1092,20 @@ static enum scan_result check_pmd_still_valid(struc= t mm_struct *mm, * Bring missing pages in from swap, to complete THP collapse. * Only done if khugepaged_scan_pmd believes it is worthwhile. * + * For mTHP orders the function bails on the first swap entry, because + * faulting pages back in during collapse could re-populate PTEs that + * push a later scan over the threshold for a higher-order collapse. + * * Called and returns without pte mapped or spinlocks held. * Returns result: if not SCAN_SUCCEED, mmap_lock has been released. 
*/ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd, - int referenced) + struct vm_area_struct *vma, unsigned long start_addr, + pmd_t *pmd, int referenced, unsigned int order) { int swapped_in =3D 0; vm_fault_t ret =3D 0; - unsigned long addr, end =3D start_addr + (HPAGE_PMD_NR * PAGE_SIZE); + unsigned long addr, end =3D start_addr + (PAGE_SIZE << order); enum scan_result result; pte_t *pte =3D NULL; spinlock_t *ptl; @@ -1092,6 +1137,19 @@ static enum scan_result __collapse_huge_page_swapin(= struct mm_struct *mm, pte_present(vmf.orig_pte)) continue; =20 + /* + * TODO: Support swapin without leading to further mTHP + * collapses. Currently bringing in new pages via swapin may + * cause a future higher order collapse on a rescan of the same + * range. + */ + if (!is_pmd_order(order)) { + pte_unmap(pte); + mmap_read_unlock(mm); + result =3D SCAN_EXCEED_SWAP_PTE; + goto out; + } + vmf.pte =3D pte; vmf.ptl =3D ptl; ret =3D do_swap_page(&vmf); @@ -1212,7 +1270,7 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a * that case. Continuing to collapse causes inconsistency. 
*/ result =3D __collapse_huge_page_swapin(mm, vma, address, pmd, - referenced); + referenced, HPAGE_PMD_ORDER); if (result !=3D SCAN_SUCCEED) goto out_nolock; } @@ -1260,6 +1318,7 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a pte =3D pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); if (pte) { result =3D __collapse_huge_page_isolate(vma, address, pte, cc, + HPAGE_PMD_ORDER, &compound_pagelist); spin_unlock(pte_ptl); } else { @@ -1290,6 +1349,7 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a =20 result =3D __collapse_huge_page_copy(pte, folio, pmd, _pmd, vma, address, pte_ptl, + HPAGE_PMD_ORDER, &compound_pagelist); pte_unmap(pte); if (unlikely(result !=3D SCAN_SUCCEED)) @@ -1325,9 +1385,9 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, struct vm_area_struct *vma, unsigned long start_addr, bool *lock_dropped, struct collapse_control *cc) { - const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma); - const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc); - const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc); + const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma, HPAG= E_PMD_ORDER); + const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc, HPAGE= _PMD_ORDER); + const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc, HPAGE_PMD= _ORDER); pmd_t *pmd; pte_t *pte, *_pte; int none_or_zero =3D 0, shared =3D 0, referenced =3D 0; @@ -2381,8 +2441,8 @@ static enum scan_result collapse_scan_file(struct mm_= struct *mm, unsigned long addr, struct file *file, pgoff_t start, struct collapse_control *cc) { - const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, NULL); - const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc); + const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, NULL, HPA= GE_PMD_ORDER); + const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc, HPAGE_PMD= 
_ORDER); struct folio *folio =3D NULL; struct address_space *mapping =3D file->f_mapping; XA_STATE(xas, &mapping->i_pages, start); --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3A38A3FF89F for ; Fri, 5 Jun 2026 16:15:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676127; cv=none; b=jwOovvz00qPwBM1fyvTtmgyYKf2+PQO/HVnMdiht2naslAKCBMnhMSDEPhVKmMk1zvdbDvBjM4v9WQSk6KYLfpJv52ooWztnucFN3Xhegh1bo0W2/SZn8Y291MZbnB3KlNliU4mB8AIfeLnl0n0l4+NKxi0WzwjcFozkeK6X21w= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676127; c=relaxed/simple; bh=+xneohTX9Lz11Een8uTd/GS0joO/VMqHCU4nitpbTY0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=DAWWntGDV4RLd/7/ZLdj32O18J/NXhpib57GJlbNW46763GV02HhSfyQEhYBatoUSKGFNbkqndAMNuV4G66rH7hXUvv6gVrAUGLJQq/DolBxmHgBYQRZTL6pOwylHdUt5172sxRirFXV+0SB3qgSrv7/AaYDjMvb31YNKiz2N4w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=Xw8YGkqC; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Xw8YGkqC" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676125; 
h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fzDHpbqIBXdBWPrEX6e4jd9D7S6Wp53dvbcajiiuilc=; b=Xw8YGkqCl70zFVUhAK/oH294iaU/2UfhyJ+34qoxQ8qX0D1cKQCEXTr3vPpV5gnpaM1d5G vs0oEXZre4kB1zylALla72KICK55UHEdZwTtUWaC21s2H4p7NLbVFeerrkJ5RcG5XFm/CF 1YeTNBHErfb4BkSJA6dvm6CvpnacUig= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-472-UXKfE2kGPJqHwFwDQg_wwA-1; Fri, 05 Jun 2026 12:15:20 -0400 X-MC-Unique: UXKfE2kGPJqHwFwDQg_wwA-1 X-Mimecast-MFC-AGG-ID: UXKfE2kGPJqHwFwDQg_wwA_1780676119 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id A5844195608E; Fri, 5 Jun 2026 16:15:19 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 52CF6180049F; Fri, 5 Jun 2026 16:15:01 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, 
jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped Date: Fri, 5 Jun 2026 10:14:12 -0600 Message-ID: <20260605161422.213817-6-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Currently, collapse_huge_page() must be entered with the mmap read lock held, and it returns with the lock dropped. This patch moves the unlock into the caller, so that collapse_huge_page() is always entered and exited with the lock dropped. Future patches rely on this: in mTHP collapse the lock may already have been dropped, and we do not want to check for that conditionally by passing the lock_dropped variable through. No functional change is expected, as one of the first things collapse_huge_page() currently does is drop this lock before allocating the hugepage.
Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand (Arm) Signed-off-by: Nico Pache Reviewed-by: Lance Yang Reviewed-by: Zi Yan --- mm/khugepaged.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 474ee97c54ba..e4b2ca77ecf6 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1222,6 +1222,12 @@ static enum scan_result alloc_charge_folio(struct fo= lio **foliop, struct mm_stru return SCAN_SUCCEED; } =20 +/* + * collapse_huge_page expects the mmap_lock to be unlocked before entering= and + * will always return with the lock unlocked, to avoid holding the mmap_lo= ck + * while allocating a THP, as that could trigger direct reclaim/compaction. + * Note that the VMA must be rechecked after grabbing the mmap_lock again. + */ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned = long address, int referenced, int unmapped, struct collapse_control *cc) { @@ -1237,14 +1243,6 @@ static enum scan_result collapse_huge_page(struct mm= _struct *mm, unsigned long a =20 VM_BUG_ON(address & ~HPAGE_PMD_MASK); =20 - /* - * Before allocating the hugepage, release the mmap_lock read lock. - * The allocation can take potentially a long time if it involves - * sync compaction, and we do not need to hold the mmap_lock during - * that. We will recheck the vma after taking it again in write mode. 
- */ - mmap_read_unlock(mm); - result =3D alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER); if (result !=3D SCAN_SUCCEED) goto out_nolock; @@ -1549,6 +1547,8 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, out_unmap: pte_unmap_unlock(pte, ptl); if (result =3D=3D SCAN_SUCCEED) { + /* collapse_huge_page expects the lock to be dropped before calling */ + mmap_read_unlock(mm); result =3D collapse_huge_page(mm, start_addr, referenced, unmapped, cc); /* collapse_huge_page will return with the mmap_lock released */ --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D03FA322DB7 for ; Fri, 5 Jun 2026 16:15:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676143; cv=none; b=VFQCUQPBZZXvDufjyHd5WKb9Y9O1cMHSgo7sT1uVRCRt3MaLC6hhTnRmJpMEtJgD4oqtrd1z6psBNHTbGWUecwM7NZw3aC3n+Dj3NRH7ya1HbcxdXLBVetiwYeMiaHVbCJkqNVMVhS+KeNkaRPx63Ru5hecO3yY8CIt3jkS/sTU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676143; c=relaxed/simple; bh=bgYyrGeQ0EG7KfgQSFwqySBqxq67GkB4xd8N/kemOKg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=GF9qUL9RZYjwyfJjnONSmUM3X/y7KNlwwoR9MVkuuHdO6WWuB8gM8zSnuY4z/4zXsuaXZj9chvMOqWx41WDyfSG3vvVgRnjxKs9/CWMQWkvWrxODZp9kQ0IeQUFgYpRJiZ9RZT2AJKWGBJ2WCPdVtHBsWLjXSHmFRowZY27dJcU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=UDBmtgeJ; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; 
dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="UDBmtgeJ" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676140; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AiyvVIBTfeyftNjdamEQ07U4KZCyjVtuDDP8zoffdS8=; b=UDBmtgeJq6QYowXp589vFuZojc3MBEVA/g6YvM0pQIAdsOX6n7pybsspESC4Kr+LYVbsrU XPSARSlMlKSiKCuxv7Bgnkh96CjS1fzz8gzOY1u09EINHEgRM3TIIXDuywx1Z3JlhaFbAm bo5CxQ1rCDW7AEnMUVuOnUMDGwi2yhM= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-557-qNqoqX5gPEO7ECGyy8gtkQ-1; Fri, 05 Jun 2026 12:15:39 -0400 X-MC-Unique: qNqoqX5gPEO7ECGyy8gtkQ-1 X-Mimecast-MFC-AGG-ID: qNqoqX5gPEO7ECGyy8gtkQ_1780676138 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 807DC1800581; Fri, 5 Jun 2026 16:15:38 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 36015180049F; Fri, 5 Jun 2026 16:15:19 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: 
aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse Date: Fri, 5 Jun 2026 10:14:13 -0600 Message-ID: <20260605161422.213817-7-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Pass an order to collapse_huge_page to support collapsing anon memory to arbitrary orders within a PMD. order indicates what mTHP size we are attempting to collapse to. 
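As a rough userspace illustration of what passing an order means for the addresses involved, the sketch below mirrors the pmd_addr and end_addr computations that the diff adds to collapse_huge_page(). It assumes 4K pages and a PMD order of 9, and defines its own local copies of the kernel macro names. struct collapse_range and collapse_range_for() are hypothetical helpers that do not exist in the kernel.

```c
/* Local stand-ins for the kernel macros; assumption: 4K pages, PMD order 9. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_SIZE	(PAGE_SIZE << HPAGE_PMD_ORDER)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

/* Hypothetical container for the three addresses the collapse works with. */
struct collapse_range {
	unsigned long pmd_addr;		/* start of the enclosing PMD range */
	unsigned long start_addr;	/* first address of the mTHP range */
	unsigned long end_addr;		/* one past the last address collapsed */
};

static struct collapse_range collapse_range_for(unsigned long start_addr,
						unsigned int order)
{
	struct collapse_range r = {
		.pmd_addr = start_addr & HPAGE_PMD_MASK,
		.start_addr = start_addr,
		.end_addr = start_addr + (PAGE_SIZE << order),
	};

	return r;
}
```

For example, collapse_range_for(0x40210000UL, 4) describes an order-4 (64K) collapse inside the PMD range starting at 0x40200000 and covering 0x40210000 up to 0x40220000. With order == HPAGE_PMD_ORDER and a PMD-aligned start_addr, pmd_addr equals start_addr and the range spans the whole PMD.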
For non-PMD collapse we must leave the anon VMA write locked until after we collapse the mTHP-- in the PMD case all the pages are isolated, but in the mTHP case this is not true, and we must keep the lock to prevent access/changes to the page tables. This can happen if the rmap walkers hit a pmd_none while the PMD entry is currently unavailable due to being temporarily removed during the collapse phase. To properly establish the page table hierarchy without violating any expectations from certain architectures (e.g. MIPS), we must make sure to have the PMD reinstalled before the PTEs, and hold both PTE/PMD locks before calling update_mmu_cache_range() (if they are distinct locks). Signed-off-by: Nico Pache Acked-by: David Hildenbrand (Arm) Reviewed-by: Lance Yang Reviewed-by: Lorenzo Stoakes --- mm/khugepaged.c | 105 ++++++++++++++++++++++++++++++------------------ 1 file changed, 67 insertions(+), 38 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e4b2ca77ecf6..c2769d82a719 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1228,34 +1228,36 @@ static enum scan_result alloc_charge_folio(struct f= olio **foliop, struct mm_stru * while allocating a THP, as that could trigger direct reclaim/compaction. * Note that the VMA must be rechecked after grabbing the mmap_lock again. 
*/ -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned = long address, - int referenced, int unmapped, struct collapse_control *cc) +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned = long start_addr, + int referenced, int unmapped, struct collapse_control *cc, + unsigned int order) { + const unsigned long pmd_addr =3D start_addr & HPAGE_PMD_MASK; + const unsigned long end_addr =3D start_addr + (PAGE_SIZE << order); LIST_HEAD(compound_pagelist); pmd_t *pmd, _pmd; - pte_t *pte; + pte_t *pte =3D NULL; pgtable_t pgtable; struct folio *folio; spinlock_t *pmd_ptl, *pte_ptl; enum scan_result result =3D SCAN_FAIL; struct vm_area_struct *vma; struct mmu_notifier_range range; + bool anon_vma_locked =3D false; =20 - VM_BUG_ON(address & ~HPAGE_PMD_MASK); - - result =3D alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER); + result =3D alloc_charge_folio(&folio, mm, cc, order); if (result !=3D SCAN_SUCCEED) goto out_nolock; =20 mmap_read_lock(mm); - result =3D hugepage_vma_revalidate(mm, address, true, &vma, cc, - HPAGE_PMD_ORDER); + result =3D hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=3D*/ true, + &vma, cc, order); if (result !=3D SCAN_SUCCEED) { mmap_read_unlock(mm); goto out_nolock; } =20 - result =3D find_pmd_or_thp_or_none(mm, address, &pmd); + result =3D find_pmd_or_thp_or_none(mm, pmd_addr, &pmd); if (result !=3D SCAN_SUCCEED) { mmap_read_unlock(mm); goto out_nolock; @@ -1267,8 +1269,8 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long a * released when it fails. So we jump out_nolock directly in * that case. Continuing to collapse causes inconsistency. 
 */ - result =3D __collapse_huge_page_swapin(mm, vma, address, pmd, - referenced, HPAGE_PMD_ORDER); + result =3D __collapse_huge_page_swapin(mm, vma, start_addr, pmd, + referenced, order); if (result !=3D SCAN_SUCCEED) goto out_nolock; } @@ -1283,20 +1285,28 @@ static enum scan_result collapse_huge_page(struct m= m_struct *mm, unsigned long a * mmap_lock. */ mmap_write_lock(mm); - result =3D hugepage_vma_revalidate(mm, address, true, &vma, cc, - HPAGE_PMD_ORDER); + result =3D hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=3D*/ true, + &vma, cc, order); if (result !=3D SCAN_SUCCEED) goto out_up_write; /* check if the pmd is still valid */ vma_start_write(vma); - result =3D check_pmd_still_valid(mm, address, pmd); + result =3D check_pmd_still_valid(mm, pmd_addr, pmd); if (result !=3D SCAN_SUCCEED) goto out_up_write; =20 anon_vma_lock_write(vma->anon_vma); + anon_vma_locked =3D true; =20 - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, - address + HPAGE_PMD_SIZE); + /* + * Only notify about the PTE range we will actually modify. While we + * temporarily unmap the whole PTE table for mTHP collapse, we'll remap + * it later, leaving other PTEs effectively unmodified. The locks we + * hold prevent anybody from stumbling over such temporarily unmapped + * PTE tables. + */ + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr, + end_addr); mmu_notifier_invalidate_range_start(&range); =20 pmd_ptl =3D pmd_lock(mm, pmd); /* probably unnecessary */ @@ -1308,26 +1318,23 @@ static enum scan_result collapse_huge_page(struct m= m_struct *mm, unsigned long a * Parallel GUP-fast is fine since GUP-fast will back off when * it detects PMD is changed. 
 */ - _pmd =3D pmdp_collapse_flush(vma, address, pmd); + _pmd =3D pmdp_collapse_flush(vma, pmd_addr, pmd); spin_unlock(pmd_ptl); mmu_notifier_invalidate_range_end(&range); tlb_remove_table_sync_one(); =20 - pte =3D pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); + pte =3D pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl); if (pte) { - result =3D __collapse_huge_page_isolate(vma, address, pte, cc, - HPAGE_PMD_ORDER, - &compound_pagelist); + result =3D __collapse_huge_page_isolate(vma, start_addr, pte, cc, + order, &compound_pagelist); spin_unlock(pte_ptl); } else { result =3D SCAN_NO_PTE_TABLE; } =20 if (unlikely(result !=3D SCAN_SUCCEED)) { - if (pte) - pte_unmap(pte); spin_lock(pmd_ptl); - BUG_ON(!pmd_none(*pmd)); + VM_WARN_ON_ONCE(!pmd_none(*pmd)); /* * We can only use set_pmd_at when establishing * hugepmds and never for establishing regular pmds that @@ -1335,21 +1342,24 @@ static enum scan_result collapse_huge_page(struct m= m_struct *mm, unsigned long a */ pmd_populate(mm, pmd, pmd_pgtable(_pmd)); spin_unlock(pmd_ptl); - anon_vma_unlock_write(vma->anon_vma); goto out_up_write; } =20 /* - * All pages are isolated and locked so anon_vma rmap - * can't run anymore. + * For PMD collapse all pages are isolated and locked so anon_vma + * rmap can't run anymore. For mTHP collapse the PMD entry has been + * removed and not all pages are isolated and locked, so we must hold + * the lock to prevent neighboring folios from attempting to access + * this PMD until it is reinstalled. 
*/ - anon_vma_unlock_write(vma->anon_vma); + if (is_pmd_order(order)) { + anon_vma_unlock_write(vma->anon_vma); + anon_vma_locked =3D false; + } =20 result =3D __collapse_huge_page_copy(pte, folio, pmd, _pmd, - vma, address, pte_ptl, - HPAGE_PMD_ORDER, - &compound_pagelist); - pte_unmap(pte); + vma, start_addr, pte_ptl, + order, &compound_pagelist); if (unlikely(result !=3D SCAN_SUCCEED)) goto out_up_write; =20 @@ -1359,18 +1369,37 @@ static enum scan_result collapse_huge_page(struct m= m_struct *mm, unsigned long a * write. */ __folio_mark_uptodate(folio); - pgtable =3D pmd_pgtable(_pmd); - spin_lock(pmd_ptl); - BUG_ON(!pmd_none(*pmd)); - pgtable_trans_huge_deposit(mm, pmd, pgtable); - map_anon_folio_pmd_nopf(folio, pmd, vma, address); + VM_WARN_ON_ONCE(!pmd_none(*pmd)); + if (is_pmd_order(order)) { + pgtable =3D pmd_pgtable(_pmd); + pgtable_trans_huge_deposit(mm, pmd, pgtable); + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr); + } else { + /* + * Some architectures (e.g. MIPS) walk the live page table in + * their implementation. update_mmu_cache_range() must be called + * with a valid page table hierarchy and the PTE lock held. + * Acquire it nested inside pmd_ptl when they are distinct locks. 
+ */ + if (pte_ptl !=3D pmd_ptl) + spin_lock_nested(pte_ptl, SINGLE_DEPTH_NESTING); + pmd_populate(mm, pmd, pmd_pgtable(_pmd)); + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, + /*uffd_wp=3D*/ false); + if (pte_ptl !=3D pmd_ptl) + spin_unlock(pte_ptl); + } spin_unlock(pmd_ptl); =20 folio =3D NULL; =20 result =3D SCAN_SUCCEED; out_up_write: + if (anon_vma_locked) + anon_vma_unlock_write(vma->anon_vma); + if (pte) + pte_unmap(pte); mmap_write_unlock(mm); out_nolock: if (folio) @@ -1550,7 +1579,7 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, /* collapse_huge_page expects the lock to be dropped before calling */ mmap_read_unlock(mm); result =3D collapse_huge_page(mm, start_addr, referenced, - unmapped, cc); + unmapped, cc, HPAGE_PMD_ORDER); /* collapse_huge_page will return with the mmap_lock released */ *lock_dropped =3D true; } --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CB2C1287510 for ; Fri, 5 Jun 2026 16:16:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676164; cv=none; b=mUaYin6HOUaAlguHJwb2BJoERW/0RlcXcpxn6fZE+re/kUWfQaQdeNGOJWBGNhST6H0rQaR6RdtNTTjj8ZN4BxiKdvtNkBnAwxWj7t+9qDVEv+vFxQjjpcBLetew3A651YPRFfJ+L/E+PJS9nF8bR5W0KUoE+CAJGFqbGPOrYMk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676164; c=relaxed/simple; bh=Lp3mQgrnxE8TKr/is0dIE/RSXUoVI5uSXFFWliib+U8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=h4MVvk4WmbZJZtFnLIiLpBSx0lNZ22GWIj/ehjm55wc1G7juO37CQT+r6kpfG/8RMkWefbORwk5/ViQxFmtsLyaN90DbZqjlN2U4sG1W2iy5rNYtl1YnAV9yu2kZZp/K2+kkolgTps1PcA9GtbRPOpUBMH+8xw0DTRwBFacR7JI= 
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=P4mPLH9N; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="P4mPLH9N" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676161; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=TyZARYAQdS2v/9DKdV4yV9thBDI4Aut4joTjAdzEV0M=; b=P4mPLH9NIEexm9sdOC0fo+YSLsAITlbn80GUAnJCHyQJb+E4zmfoRRUmAidmupCnOb4wxr 5YYgBK1L6l5x6yaa0MiZfRxcJnxVKheqfcX7XfhBsBvDLn3tEcQG57klK+G+n9uHGUze6p 0dY77jHjgy0M8tWGXwnZs7tq900loZs= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-253-jbZplnonMFOwnSfF1WAemA-1; Fri, 05 Jun 2026 12:15:58 -0400 X-MC-Unique: jbZplnonMFOwnSfF1WAemA-1 X-Mimecast-MFC-AGG-ID: jbZplnonMFOwnSfF1WAemA_1780676157 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 3F7E71956094; Fri, 5 Jun 2026 16:15:57 +0000 (UTC) 
Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 11B64180049F; Fri, 5 Jun 2026 16:15:38 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif Subject: [PATCH mm-unstable v19 07/14] mm/khugepaged: skip collapsing mTHP to smaller orders Date: Fri, 5 Jun 2026 10:14:14 -0600 Message-ID: <20260605161422.213817-8-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 
10.30.177.111 Content-Type: text/plain; charset="utf-8" khugepaged may try to collapse a mTHP to a folio of equal or smaller size, possibly resulting in a partially mapped source folio, which is undesired. Skip these cases until we have a way to check if it's OK to collapse to a smaller mTHP size (like in the case of a partially mapped folio). This check is not done during the scan phase as the current collapse order is unknown at that time. This patch is inspired by Dev Jain's work on khugepaged mTHP support [1]. [1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/ Reviewed-by: Lorenzo Stoakes Reviewed-by: Baolin Wang Acked-by: David Hildenbrand (Arm) Acked-by: Usama Arif Co-developed-by: Dev Jain Signed-off-by: Dev Jain Signed-off-by: Nico Pache --- mm/khugepaged.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index c2769d82a719..191e529c185c 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -697,6 +697,14 @@ static enum scan_result __collapse_huge_page_isolate(s= truct vm_area_struct *vma, goto out; } } + /* + * TODO: In some cases of partially-mapped folios, we'd actually + * want to collapse. 
+ */ + if (!is_pmd_order(order) && folio_order(folio) >=3D order) { + result =3D SCAN_PTE_MAPPED_HUGEPAGE; + goto out; + } =20 if (folio_test_large(folio)) { struct folio *f; --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2537F207A32 for ; Fri, 5 Jun 2026 16:16:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676185; cv=none; b=kP8ll1R4kgbsjLn7reahYaW90pTIU/D+rKP9mVP7Fm4UyURynlQ/KOSaFRlvqvdQs9sA1rOxMrjrRHaytxIQV+LrhAh0wC4LNRtmFc7UHDeGiCXp+kXYCjHHLmbTSRpnqCnaEmrbUcVl144yYQJjLrz1/JbSWX4pPaafZPHF0Q0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676185; c=relaxed/simple; bh=ieSX9EKVyoCs/4FVCtF1wJ4WQnMnbkudqZxWU7fQtLM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mbycNNkyLIn3v8cwEUmqbGYIiBTtqML5BNuhPu8PQStwy3c7jqbe0ZiQQxFqn28cqWEDHQBxEtg43GR59QrcExVjFgXaZs9EKq0DnKWJLZZ37dpEr6B4K80WXXPPH9V9uQWl34iXLPExYjpeEGP+vR+eQspEMlCBehr1NDN9AfU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=WUj6Go6h; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="WUj6Go6h" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; 
t=1780676182; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=c41C+aUCjryZaKfFd1hJe5PThl4zuDF1MGYxzScNCi0=; b=WUj6Go6he2/4zCOK9cO5qjn4caWQkLtPUiYFL5km6kjaHWNuralO90+qx4Ta/sYRzWGxXH 2zFLWh3qy+U/HoumxAA6vVXn5Gwi3aaOR8fAIC90FHmYas5MRB8R5JK0uoTbGG0NOi5jpq UNsNFAEX16Kg78WEPVaHIsLSzebIUyU= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-651-FPwph6UcNQyD_L1sHVgsEg-1; Fri, 05 Jun 2026 12:16:17 -0400 X-MC-Unique: FPwph6UcNQyD_L1sHVgsEg-1 X-Mimecast-MFC-AGG-ID: FPwph6UcNQyD_L1sHVgsEg_1780676176 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 42D9D18002C1; Fri, 5 Jun 2026 16:16:16 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id CD408180049F; Fri, 5 Jun 2026 16:15:57 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, 
jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 08/14] mm/khugepaged: add per-order mTHP collapse failure statistics Date: Fri, 5 Jun 2026 10:14:15 -0600 Message-ID: <20260605161422.213817-9-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Add three new mTHP statistics to track collapse failures for different orders when encountering swap PTEs, excessive none PTEs, and shared PTEs: - collapse_exceed_swap_pte: Incremented when mTHP collapse fails due to encountering a swap PTE. - collapse_exceed_none_pte: Incremented when mTHP collapse fails due to exceeding the none PTE threshold for the given order. - collapse_exceed_shared_pte: Incremented when mTHP collapse fails due to encountering a shared PTE. These statistics complement the existing THP_SCAN_EXCEED_* events by providing per-order granularity for mTHP collapse attempts. 
The stats are exposed via sysfs under `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each supported hugepage size. As we currently do not support collapsing mTHPs that contain a swap or shared entry, those statistics keep track of how often we are encountering failed mTHP collapses due to these restrictions. We will add support for mTHP collapse for anonymous pages next; let's also track when this happens at the PMD level within the per-mTHP stats. Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand (Arm) Signed-off-by: Nico Pache --- Documentation/admin-guide/mm/transhuge.rst | 14 ++++++++++++++ include/linux/huge_mm.h | 3 +++ mm/huge_memory.c | 7 +++++++ mm/khugepaged.c | 15 +++++++++++++-- 4 files changed, 37 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/adm= in-guide/mm/transhuge.rst index a74844e01f1e..b98e18c80185 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -714,6 +714,20 @@ nr_anon_partially_mapped an anonymous THP as "partially mapped" and count it here, even thou= gh it is not actually partially mapped anymore. =20 +collapse_exceed_none_pte + The number of collapse attempts that failed due to exceeding the + max_ptes_none threshold. + +collapse_exceed_swap_pte + The number of collapse attempts that failed due to exceeding the + max_ptes_swap threshold. For non-PMD orders this occurs if a mTHP r= ange + contains at least one swap PTE. + +collapse_exceed_shared_pte + The number of collapse attempts that failed due to exceeding the + max_ptes_shared threshold. For non-PMD orders this occurs if a mTHP= range + contains at least one shared PTE. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. 
There are some counters in ``/proc/vmstat`` to help diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 443852423790..148109ebd08a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -144,6 +144,9 @@ enum mthp_stat_item { MTHP_STAT_SPLIT_DEFERRED, MTHP_STAT_NR_ANON, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, + MTHP_STAT_COLLAPSE_EXCEED_SWAP, + MTHP_STAT_COLLAPSE_EXCEED_NONE, + MTHP_STAT_COLLAPSE_EXCEED_SHARED, __MTHP_STAT_COUNT }; =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eea83da9114a..222e421d9e8e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -717,6 +717,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FA= ILED); DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED); DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON); DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALL= Y_MAPPED); +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_= SWAP); +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_= NONE); +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEE= D_SHARED); + =20 static struct attribute *anon_stats_attrs[] =3D { &anon_fault_alloc_attr.attr, @@ -733,6 +737,9 @@ static struct attribute *anon_stats_attrs[] =3D { &split_deferred_attr.attr, &nr_anon_attr.attr, &nr_anon_partially_mapped_attr.attr, + &collapse_exceed_swap_pte_attr.attr, + &collapse_exceed_none_pte_attr.attr, + &collapse_exceed_shared_pte_attr.attr, NULL, }; =20 diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 191e529c185c..ac4731addafa 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -651,7 +651,9 @@ static enum scan_result __collapse_huge_page_isolate(st= ruct vm_area_struct *vma, if (pte_none_or_zero(pteval)) { if (++none_or_zero > max_ptes_none) { result =3D SCAN_EXCEED_NONE_PTE; - count_vm_event(THP_SCAN_EXCEED_NONE_PTE); + if (is_pmd_order(order)) + count_vm_event(THP_SCAN_EXCEED_NONE_PTE); + count_mthp_stat(order, 
MTHP_STAT_COLLAPSE_EXCEED_NONE); goto out; } continue; @@ -693,7 +695,9 @@ static enum scan_result __collapse_huge_page_isolate(st= ruct vm_area_struct *vma, */ if (++shared > max_ptes_shared) { result =3D SCAN_EXCEED_SHARED_PTE; - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); + if (is_pmd_order(order)) + count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED); goto out; } } @@ -1152,6 +1156,7 @@ static enum scan_result __collapse_huge_page_swapin(s= truct mm_struct *mm, * range. */ if (!is_pmd_order(order)) { + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP); pte_unmap(pte); mmap_read_unlock(mm); result =3D SCAN_EXCEED_SWAP_PTE; @@ -1459,6 +1464,8 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, if (++none_or_zero > max_ptes_none) { result =3D SCAN_EXCEED_NONE_PTE; count_vm_event(THP_SCAN_EXCEED_NONE_PTE); + count_mthp_stat(HPAGE_PMD_ORDER, + MTHP_STAT_COLLAPSE_EXCEED_NONE); goto out_unmap; } continue; @@ -1467,6 +1474,8 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, if (++unmapped > max_ptes_swap) { result =3D SCAN_EXCEED_SWAP_PTE; count_vm_event(THP_SCAN_EXCEED_SWAP_PTE); + count_mthp_stat(HPAGE_PMD_ORDER, + MTHP_STAT_COLLAPSE_EXCEED_SWAP); goto out_unmap; } /* @@ -1524,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, if (++shared > max_ptes_shared) { result =3D SCAN_EXCEED_SHARED_PTE; count_vm_event(THP_SCAN_EXCEED_SHARED_PTE); + count_mthp_stat(HPAGE_PMD_ORDER, + MTHP_STAT_COLLAPSE_EXCEED_SHARED); goto out_unmap; } } --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 08A964B8DFD for ; Fri, 5 Jun 2026 16:16:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none 
smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676202; cv=none; b=urt1UDuPUp922BGtGldIQz1a7S31L01JZs2vINPzaxsu7pGoBhA0Ql3tdKBIyg60++m7aBjsCNwyFL+htxbhsvG8vBRoOO0b3mZduSC1ahqOMvLelP7uRCb21ZfjfsVb051KuT4QkBmQ/dnmDSKa7HsSEyfYioAMXDrZF+BxGno= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676202; c=relaxed/simple; bh=OtQ2Ildtuda+Mitraxx9CKhbaOESPQvJzopBVJckRMM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=jqUMGZaz7NZS4dlTSiavp2TjHkk2e1O6EXmyKXZeMmAo3Z721fgc1w99d5fcTUBF791+4kyCs9fmEzBq7Guhf+vQR9MrO3ohCaqUzZMhkGKm7Os0ARHwnHTLlG68/K3+UZpgfYYY8iQjPlv7NPqLNh3YgUXhYWMotFns4bRQU2w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=IKDtq/Xz; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="IKDtq/Xz" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676199; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=P6L9066owL61TKgEjmCW2MU6W4Dj6LYvPAJCJjXhicQ=; b=IKDtq/XzzTFuRGDBJtI8Tx7xAToFO6Ba+VbMpJJzSmvUEKjXQIOQddVfR9JXObq17Yu/xB g3KPCbdz+0y7AiXJvkHWVHZal2ZafR8SXZ75siR71XXUQ3+xlwlT4di7pI3ngdKo6uON0Y Q/zIlEx1cwhsekmt84UcdDep7fuOjKc= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by 
relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-670-vKnwmfzgPLCgt9Z8eqhyow-1; Fri, 05 Jun 2026 12:16:36 -0400 X-MC-Unique: vKnwmfzgPLCgt9Z8eqhyow-1 X-Mimecast-MFC-AGG-ID: vKnwmfzgPLCgt9Z8eqhyow_1780676194 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id D3A101956059; Fri, 5 Jun 2026 16:16:34 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id F0CC0180049F; Fri, 5 Jun 2026 16:16:16 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, 
vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 09/14] mm/khugepaged: improve tracepoints for mTHP orders Date: Fri, 5 Jun 2026 10:14:16 -0600 Message-ID: <20260605161422.213817-10-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to give better insight into which order each collapse operates on. Reviewed-by: Lorenzo Stoakes Reviewed-by: Baolin Wang Acked-by: David Hildenbrand (Arm) Signed-off-by: Nico Pache --- include/trace/events/huge_memory.h | 34 +++++++++++++++++++----------- mm/khugepaged.c | 9 ++++---- 2 files changed, 27 insertions(+), 16 deletions(-) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge= _memory.h index bcdc57eea270..291fae364c62 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -89,40 +89,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd, =20 TRACE_EVENT(mm_collapse_huge_page, =20 - TP_PROTO(struct mm_struct *mm, int isolated, int status), + TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int ord= er), =20 - TP_ARGS(mm, isolated, status), + TP_ARGS(mm, isolated, status, order), =20 TP_STRUCT__entry( __field(struct mm_struct *, mm) __field(int, isolated) __field(int, status) + __field(unsigned int, order) ), =20 TP_fast_assign( __entry->mm =3D mm; __entry->isolated =3D isolated; __entry->status =3D status; + __entry->order =3D order; ), =20 - TP_printk("mm=3D%p, isolated=3D%d, 
status=3D%s", + TP_printk("mm=3D%p, isolated=3D%d, status=3D%s, order=3D%u", __entry->mm, __entry->isolated, - __print_symbolic(__entry->status, SCAN_STATUS)) + __print_symbolic(__entry->status, SCAN_STATUS), + __entry->order) ); =20 TRACE_EVENT(mm_collapse_huge_page_isolate, =20 TP_PROTO(struct folio *folio, int none_or_zero, - int referenced, int status), + int referenced, int status, unsigned int order), =20 - TP_ARGS(folio, none_or_zero, referenced, status), + TP_ARGS(folio, none_or_zero, referenced, status, order), =20 TP_STRUCT__entry( __field(unsigned long, pfn) __field(int, none_or_zero) __field(int, referenced) __field(int, status) + __field(unsigned int, order) ), =20 TP_fast_assign( @@ -130,26 +134,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate, __entry->none_or_zero =3D none_or_zero; __entry->referenced =3D referenced; __entry->status =3D status; + __entry->order =3D order; ), =20 - TP_printk("scan_pfn=3D0x%lx, none_or_zero=3D%d, referenced=3D%d, status= =3D%s", + TP_printk("scan_pfn=3D0x%lx, none_or_zero=3D%d, referenced=3D%d, status= =3D%s, order=3D%u", __entry->pfn, __entry->none_or_zero, __entry->referenced, - __print_symbolic(__entry->status, SCAN_STATUS)) + __print_symbolic(__entry->status, SCAN_STATUS), + __entry->order) ); =20 TRACE_EVENT(mm_collapse_huge_page_swapin, =20 - TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret), + TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret, + unsigned int order), =20 - TP_ARGS(mm, swapped_in, referenced, ret), + TP_ARGS(mm, swapped_in, referenced, ret, order), =20 TP_STRUCT__entry( __field(struct mm_struct *, mm) __field(int, swapped_in) __field(int, referenced) __field(int, ret) + __field(unsigned int, order) ), =20 TP_fast_assign( @@ -157,13 +165,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin, __entry->swapped_in =3D swapped_in; __entry->referenced =3D referenced; __entry->ret =3D ret; + __entry->order =3D order; ), =20 - TP_printk("mm=3D%p, swapped_in=3D%d, 
referenced=3D%d, ret=3D%d", + TP_printk("mm=3D%p, swapped_in=3D%d, referenced=3D%d, ret=3D%d, order=3D%= u", __entry->mm, __entry->swapped_in, __entry->referenced, - __entry->ret) + __entry->ret, + __entry->order) ); =20 TRACE_EVENT(mm_khugepaged_scan_file, diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ac4731addafa..26c343a6fa3d 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -785,13 +785,13 @@ static enum scan_result __collapse_huge_page_isolate(= struct vm_area_struct *vma, } else { result =3D SCAN_SUCCEED; trace_mm_collapse_huge_page_isolate(folio, none_or_zero, - referenced, result); + referenced, result, order); return result; } out: release_pte_pages(pte, _pte, compound_pagelist); trace_mm_collapse_huge_page_isolate(folio, none_or_zero, - referenced, result); + referenced, result, order); return result; } =20 @@ -1197,7 +1197,8 @@ static enum scan_result __collapse_huge_page_swapin(s= truct mm_struct *mm, =20 result =3D SCAN_SUCCEED; out: - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result); + trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result, + order); return result; } =20 @@ -1417,7 +1418,7 @@ static enum scan_result collapse_huge_page(struct mm_= struct *mm, unsigned long s out_nolock: if (folio) folio_put(folio); - trace_mm_collapse_huge_page(mm, result =3D=3D SCAN_SUCCEED, result); + trace_mm_collapse_huge_page(mm, result =3D=3D SCAN_SUCCEED, result, order= ); return result; } =20 --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 02D663EF0C8 for ; Fri, 5 Jun 2026 16:16:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; 
t=1780676221; cv=none; b=h2z8bhLNv8weLPbYm1HgQQ3zwnLkHH0XOkuu63Q5opHfv8myk9IL18h3wNEiLb9kOATXcUcH62fbsEHGaLz6+boSzPDIuqVK3Yb4hlRIs2OSfNrhuzCP2PZJetiGdbLqAJaLLq0zbd8YYfAiXQuqEwf9725HJiimiVpmrmMoq8w= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676221; c=relaxed/simple; bh=e3CdKyeCZeG9Uqnv8QCy+dDz3AoIPMhZbF4uV2FZaSc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=p1H7u3/z57+qRxJ5fP36VgBjenluM/eER7I5fHOHfCrq9fWjPu6JsltEhyls/6wdDk1tCls6VWaPagR6JzbEwDW1uIFYIDjSf/URNL9EWtrobs6OP/SadItjYu9CTBJXxJb46/poCyI87JhhflpXk0fujO2Y4Qd4fvrPOPSkJNg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=UeokzzW7; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="UeokzzW7" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676218; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UfNLaMiKvOW75JhtX0mzA+ezODy491wa3o3Fr06jBN8=; b=UeokzzW7E0sf0S5YX5WLJXK++7wnwiMbIpVZISUYsGY5a/cR1j0cbcHzafgkLdEZE/EXn3 DR6dIKvta6BpNPjhqDSogVLdPbFCJN0VbnHYfi+twDgYFcUWBX6xa8cAIa7Af9Y4ga0fvD 7eJxYvh6QvtBKGuEEqIYzobYK1PXNeg= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id 
us-mta-586-3LJpmfNZNGKK4h1Lcz-wFQ-1; Fri, 05 Jun 2026 12:16:54 -0400 X-MC-Unique: 3LJpmfNZNGKK4h1Lcz-wFQ-1 X-Mimecast-MFC-AGG-ID: 3LJpmfNZNGKK4h1Lcz-wFQ_1780676213 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 01FCE18004A9; Fri, 5 Jun 2026 16:16:53 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 79A13180049F; Fri, 5 Jun 2026 16:16:35 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, 
willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 10/14] mm/khugepaged: introduce collapse_possible_orders helper functions Date: Fri, 5 Jun 2026 10:14:17 -0600 Message-ID: <20260605161422.213817-11-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Add collapse_possible_orders() to generalize THP order eligibility. The function determines which THP orders are permitted based on collapse context (khugepaged vs madv_collapse). We also add collapse_possible() as a thin wrapper around collapse_possible_orders() that returns a bool rather than the whole bitmap. This consolidates collapse configuration logic and provides a clean interface for future mTHP collapse support where the orders may be different. Acked-by: David Hildenbrand (Arm) Reviewed-by: Baolin Wang Signed-off-by: Nico Pache Reviewed-by: Lorenzo Stoakes --- mm/khugepaged.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 26c343a6fa3d..ec886a031952 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -554,12 +554,30 @@ void __khugepaged_enter(struct mm_struct *mm) wake_up_interruptible(&khugepaged_wait); } =20 +/* + * Check what orders are possible based on the vma and collapse type. + * This is used to determine if mTHP collapse is a viable option. 
+ */ +static unsigned long collapse_possible_orders(struct vm_area_struct *vma, + vm_flags_t vm_flags, enum tva_type tva_flags) +{ + const unsigned long orders =3D BIT(HPAGE_PMD_ORDER); + + return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); +} + +static bool collapse_possible(struct vm_area_struct *vma, + vm_flags_t vm_flags, enum tva_type tva_flags) +{ + return collapse_possible_orders(vma, vm_flags, tva_flags); +} + void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags) { if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && hugepage_pmd_enabled()) { - if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) + if (collapse_possible(vma, vm_flags, TVA_KHUGEPAGED)) __khugepaged_enter(vma->vm_mm); } } @@ -2700,7 +2718,7 @@ static void collapse_scan_mm_slot(unsigned int progre= ss_max, cc->progress++; break; } - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORD= ER)) { + if (!collapse_possible(vma, vma->vm_flags, TVA_KHUGEPAGED)) { cc->progress++; continue; } @@ -3010,7 +3028,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsi= gned long start, BUG_ON(vma->vm_start > start); BUG_ON(vma->vm_end < end); =20 - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD= _ORDER)) + if (!collapse_possible(vma, vma->vm_flags, TVA_FORCED_COLLAPSE)) return -EINVAL; =20 cc =3D kmalloc_obj(*cc); --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CF8513B42C5 for ; Fri, 5 Jun 2026 16:17:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676238; cv=none; 
b=c7aUevhikyznkH5MkgcEy8tYJ2dqQ7+fDx8chOAOh4TTRYXNuJnOlNp6bC3vtPAoWQJoB9Fu9NlRdX9AIn66dlR0N97uGJsFLxFthWmsYf+N6Sy3KZU0m2oRuGVnnqQcbBX6DMgqR2/gdc5qRzSuNAF/j9NoEI8GwZmesXbtbus= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676238; c=relaxed/simple; bh=i9ehEy6BW1vsrPqsal/CI886LpCz8qjmnWanNBRJl60=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=eXLVt3F3oq8NX68QHYPF4Jr7TVvRiONoHHsSyGlob0nztIrM3Xs7H1PK4eAVSZiWb1sWcoG9aMTBaEDgtvlN37zBO+o/T9l4+sLABEld9NHuTJRBjUG2QAGDjKra6q1ZeLBYrN1/v89XZG8L62COCR736eaBQrdKGbXx6Cxh6Nk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=DqkG3emy; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="DqkG3emy" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676236; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=TJ4Ysg15fHDbPXcijjVYNLyDQf/OR2xqlGkIjwxP6zo=; b=DqkG3emy96MautilB4HOuK4Kbp/ltBg1SdlW/GHtdpYNP+EJK5in39U79d3x9Z0Wqsx1LR 9vWSxvTaH7lVCihlkPYRxja2JdmqJSPBbve+Q/tw+nHDndJjtBlD3tV+kkf7MxmYFbN7wb LHcvXzYxUKrY9CnrC3mQxJ4tMFDNUgY= Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id 
us-mta-683-yaD7IghJMmaeHrz4Xz1y5A-1; Fri, 05 Jun 2026 12:17:13 -0400 X-MC-Unique: yaD7IghJMmaeHrz4Xz1y5A-1 X-Mimecast-MFC-AGG-ID: yaD7IghJMmaeHrz4Xz1y5A_1780676231 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id A4E2E1944A97; Fri, 5 Jun 2026 16:17:11 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id AA84C180049F; Fri, 5 Jun 2026 16:16:53 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, 
willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com Subject: [PATCH mm-unstable v19 11/14] mm/khugepaged: Introduce mTHP collapse support Date: Fri, 5 Jun 2026 10:14:18 -0600 Message-ID: <20260605161422.213817-12-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8"

Enable khugepaged to collapse to mTHP orders. This patch implements the main scanning logic using a bitmap to track occupied pages and the algorithm to find optimal collapse sizes.

Prior to this patch, PMD collapse had 3 main phases: a lightweight scanning phase (mmap_read_lock) that determines a potential PMD collapse, an alloc phase (mmap unlocked), and finally a heavier collapse phase (mmap_write_lock).

To enable mTHP collapse we make the following changes:

During the PMD scan phase, track occupied pages in a bitmap. When mTHP orders are enabled, we remove the restriction of max_ptes_none during the scan phase to avoid missing potential mTHP collapse candidates. Once we have scanned the full PMD range and updated the bitmap to track occupied pages, we use the bitmap to find the optimal mTHP size.

Implement mthp_collapse() to walk forward through the bitmap and determine the best eligible order for each naturally-aligned region. The algorithm starts at the beginning of the PMD range and, for each offset, tries the highest order that fits the alignment. If the number of occupied PTEs in that region satisfies the max_ptes_none threshold for that order, a collapse is attempted. On failure, the order is decremented and the same offset is retried at the next smaller size.
Once the smallest enabled order is exhausted (or a collapse succeeds), the offset advances past the region just processed, and the next attempt starts at the highest order permitted by the new offset's natural alignment.

The algorithm works as follows:
1) set offset=0 and order=HPAGE_PMD_ORDER
2) if the order is not enabled, go to step (5)
3) count occupied PTEs in the (offset, order) range using bitmap_weight_from()
4) if the count satisfies the max_ptes_none threshold, attempt collapse; on success, advance to step (6)
5) if a smaller enabled order exists, decrement order and retry from step (2) at the same offset; otherwise continue with step (6)
6) advance offset past the current region and compute the next order from the new offset's natural alignment via __ffs(offset), capped at HPAGE_PMD_ORDER
7) repeat from step (2) until the full PMD range is covered

mTHP collapses reject regions containing swapped out or shared pages. This is because adding new entries can lead to new none pages, and these may lead to constant promotion into a higher order mTHP. A similar issue can occur with "max_ptes_none > HPAGE_PMD_NR/2", because such a collapse introduces at least 2x the number of pages, so a future scan will satisfy the promotion condition once again. This issue is prevented via the collapse_max_ptes_none() function, which restricts the max_ptes_none values honored for mTHP collapse.

We currently only support mTHP collapse for max_ptes_none values of 0 and HPAGE_PMD_NR - 1, resulting in the following behavior:

- max_ptes_none=0: Never introduce new empty pages during collapse
- max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest available mTHP order

Any other max_ptes_none value will emit a warning and default mTHP collapse to max_ptes_none=0. There should be no behavior change for PMD collapse.

Once we determine which mTHP size fits best in that PMD range, a collapse is attempted.
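For readers who want to trace the walk by hand, the steps listed above can be approximated in a small userspace sketch. This is not the kernel code. It assumes a 512-entry PMD (order 9), models only the max_ptes_none=0 case (a region qualifies only when every PTE in it is occupied), and treats every attempted collapse as successful. The names sketch_walk() and SKETCH_* are hypothetical, and a plain bool array stands in for cc->mthp_present_ptes.

```c
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_PMD_ORDER 9                       /* assumption: 512 PTEs per PMD */
#define SKETCH_PMD_NR    (1u << SKETCH_PMD_ORDER)
#define SKETCH_MIN_ORDER 2                       /* mirrors KHUGEPAGED_MIN_MTHP_ORDER */

/* Highest naturally aligned order that fits at @offset within the PMD. */
static unsigned int max_order_from_offset(unsigned int offset)
{
	unsigned int order = 0;

	if (offset == 0)
		return SKETCH_PMD_ORDER;
	/* Portable stand-in for __ffs(): count trailing zero bits. */
	while (!(offset & 1)) {
		offset >>= 1;
		order++;
	}
	return order < SKETCH_PMD_ORDER ? order : SKETCH_PMD_ORDER;
}

/* Stand-in for bitmap_weight_from(): occupied PTEs in [start, end). */
static unsigned int count_present(const bool *present, unsigned int start,
				  unsigned int end)
{
	unsigned int i, n = 0;

	for (i = start; i < end; i++)
		n += present[i];
	return n;
}

/*
 * Walk the PMD range and return how many PTEs would be covered by
 * collapses, assuming max_ptes_none=0 and that every attempt succeeds.
 */
static unsigned int sketch_walk(const bool *present, unsigned long enabled_orders)
{
	unsigned int offset = 0, order = SKETCH_PMD_ORDER, collapsed = 0;

	while (offset < SKETCH_PMD_NR) {
		unsigned int nr_ptes = 1u << order;

		/* Steps 2-4: only enabled orders are counted and attempted. */
		if (((enabled_orders >> order) & 1) &&
		    count_present(present, offset, offset + nr_ptes) == nr_ptes) {
			printf("collapse order %u at offset %u\n", order, offset);
			collapsed += nr_ptes;
			goto next_offset;
		}

		/* Step 5: retry the same offset at the next smaller order. */
		if (order > SKETCH_MIN_ORDER &&
		    (enabled_orders & ((1ul << order) - 1))) {
			order--;
			continue;
		}
next_offset:
		/* Step 6: skip the region and realign the starting order. */
		offset += nr_ptes;
		order = max_order_from_offset(offset);
	}
	return collapsed;
}
```

With the first 20 PTEs occupied and only orders 4 and 2 enabled, the sketch collapses an order-4 region at offset 0 and an order-2 region at offset 16, then skips the empty remainder in order-2 steps whenever the offset alignment allows nothing larger that is enabled.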
A minimum collapse order of 2 is used as this is the lowest order supported by anon memory as defined by THP_ORDERS_ALL_ANON. Currently madv_collapse is not supported and will only attempt PMD collapse. We can also remove the check for is_khugepaged inside the PMD scan as the collapse_max_ptes_none() function handles this logic now. Signed-off-by: Nico Pache Acked-by: David Hildenbrand (Arm) Reviewed-by: Lorenzo Stoakes --- mm/khugepaged.c | 146 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 138 insertions(+), 8 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ec886a031952..430047316f43 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLO= TS_HASH_BITS); =20 static struct kmem_cache *mm_slot_cache __ro_after_init; =20 +#define KHUGEPAGED_MIN_MTHP_ORDER 2 + struct collapse_control { bool is_khugepaged; =20 @@ -110,6 +112,9 @@ struct collapse_control { =20 /* nodemask for allocation fallback */ nodemask_t alloc_nmask; + + /* Each bit represents a single occupied (!none/zero) page. */ + DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE); }; =20 /** @@ -1440,20 +1445,130 @@ static enum scan_result collapse_huge_page(struct = mm_struct *mm, unsigned long s return result; } =20 +/* Return the highest naturally aligned order that fits at @offset within = a PMD. */ +static unsigned int max_order_from_offset(unsigned int offset) +{ + if (offset =3D=3D 0) + return HPAGE_PMD_ORDER; + + return min_t(unsigned int, __ffs(offset), HPAGE_PMD_ORDER); +} + +/* + * mthp_collapse() consumes the bitmap that is generated during + * collapse_scan_pmd() to determine what regions and mTHP orders fit best. + * + * Each bit in cc->mthp_present_ptes represents a single occupied (!none/z= ero) + * page. We start at the PMD order and check if it is eligible for collaps= e; + * if not, we check the left and right halves of the PTE page table we are + * examining at a lower order. 
+ * + * For each of these, we determine how many PTE entries are occupied in the + * range of PTE entries we propose to collapse, then we compare this to a + * threshold number of PTE entries which would need to be occupied for a + * collapse to be permitted at that order (accounting for max_ptes_none). + * + * If a collapse is permitted, we attempt to collapse the PTE range into a + * mTHP. + */ +static enum scan_result mthp_collapse(struct mm_struct *mm, + unsigned long address, int referenced, int unmapped, + struct collapse_control *cc, unsigned long enabled_orders) +{ + unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none; + enum scan_result last_result =3D SCAN_FAIL; + int collapsed =3D 0; + bool alloc_failed =3D false; + unsigned long collapse_address; + unsigned int offset =3D 0; + unsigned int order =3D HPAGE_PMD_ORDER; + + while (offset < HPAGE_PMD_NR) { + nr_ptes =3D 1UL << order; + + if (!test_bit(order, &enabled_orders)) + goto next_order; + + max_ptes_none =3D collapse_max_ptes_none(cc, NULL, order); + nr_occupied_ptes =3D bitmap_weight_from(cc->mthp_present_ptes, offset, + offset + nr_ptes); + + if (nr_occupied_ptes >=3D nr_ptes - max_ptes_none) { + enum scan_result ret; + + collapse_address =3D address + offset * PAGE_SIZE; + ret =3D collapse_huge_page(mm, collapse_address, referenced, + unmapped, cc, order); + switch (ret) { + /* Cases where we continue to next collapse candidate */ + case SCAN_SUCCEED: + collapsed +=3D nr_ptes; + fallthrough; + case SCAN_PTE_MAPPED_HUGEPAGE: + goto next_offset; + /* Cases where lower orders might still succeed */ + case SCAN_ALLOC_HUGE_PAGE_FAIL: + alloc_failed =3D true; + last_result =3D ret; + goto next_order; + /* Cases where no further collapse is possible */ + case SCAN_PMD_MAPPED: + fallthrough; + default: + last_result =3D ret; + goto done; + } + } + +next_order: + /* + * Continue with the next smaller order if there is still + * any smaller order enabled. 
When at the smallest order + * we must always move to the next offset. + */ + if (order > KHUGEPAGED_MIN_MTHP_ORDER && + (enabled_orders & GENMASK(order - 1, 0))) { + order--; + continue; + } +next_offset: + /* + * Advance past the region we just processed and determine the + * highest order we can attempt next. Since huge pages must be + * naturally aligned, the max order we can attempt next is + * limited by the alignment of the new offset. + * E.g. if we collapsed a order-2 mTHP at offset 0, offset + * becomes 4 and __ffs(4) =3D=3D 2, so the next attempt starts at + * order 2. + */ + offset +=3D nr_ptes; + order =3D max_order_from_offset(offset); + } +done: + if (collapsed) + return SCAN_SUCCEED; + if (alloc_failed) + return SCAN_ALLOC_HUGE_PAGE_FAIL; + return last_result; +} + static enum scan_result collapse_scan_pmd(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long start_addr, bool *lock_dropped, struct collapse_control *cc) { - const unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma, HPAG= E_PMD_ORDER); const unsigned int max_ptes_shared =3D collapse_max_ptes_shared(cc, HPAGE= _PMD_ORDER); const unsigned int max_ptes_swap =3D collapse_max_ptes_swap(cc, HPAGE_PMD= _ORDER); + unsigned int max_ptes_none =3D collapse_max_ptes_none(cc, vma, HPAGE_PMD_= ORDER); + enum tva_type tva_flags =3D cc->is_khugepaged ? 
TVA_KHUGEPAGED : TVA_FORC= ED_COLLAPSE; pmd_t *pmd; - pte_t *pte, *_pte; + pte_t *pte, *_pte, pteval; + int i; int none_or_zero =3D 0, shared =3D 0, referenced =3D 0; enum scan_result result =3D SCAN_FAIL; struct page *page =3D NULL; struct folio *folio =3D NULL; unsigned long addr; + unsigned long enabled_orders; spinlock_t *ptl; int node =3D NUMA_NO_NODE, unmapped =3D 0; =20 @@ -1465,8 +1580,19 @@ static enum scan_result collapse_scan_pmd(struct mm_= struct *mm, goto out; } =20 + bitmap_zero(cc->mthp_present_ptes, MAX_PTRS_PER_PTE); memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); + + enabled_orders =3D collapse_possible_orders(vma, vma->vm_flags, tva_flags= ); + + /* + * If PMD is the only enabled order, enforce max_ptes_none, otherwise + * scan all pages to populate the bitmap for mTHP collapse. + */ + if (enabled_orders !=3D BIT(HPAGE_PMD_ORDER)) + max_ptes_none =3D KHUGEPAGED_MAX_PTES_LIMIT; + pte =3D pte_offset_map_lock(mm, pmd, start_addr, &ptl); if (!pte) { cc->progress++; @@ -1474,11 +1600,13 @@ static enum scan_result collapse_scan_pmd(struct mm= _struct *mm, goto out; } =20 - for (addr =3D start_addr, _pte =3D pte; _pte < pte + HPAGE_PMD_NR; - _pte++, addr +=3D PAGE_SIZE) { + for (i =3D 0; i < HPAGE_PMD_NR; i++) { + _pte =3D pte + i; + addr =3D start_addr + i * PAGE_SIZE; + pteval =3D ptep_get(_pte); + cc->progress++; =20 - pte_t pteval =3D ptep_get(_pte); if (pte_none_or_zero(pteval)) { if (++none_or_zero > max_ptes_none) { result =3D SCAN_EXCEED_NONE_PTE; @@ -1558,6 +1686,8 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, } } =20 + /* Set bit for occupied pages */ + __set_bit(i, cc->mthp_present_ptes); /* * Record which node the original page is from and save this * information to cc->node_load[]. 
@@ -1616,9 +1746,9 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, if (result =3D=3D SCAN_SUCCEED) { /* collapse_huge_page expects the lock to be dropped before calling */ mmap_read_unlock(mm); - result =3D collapse_huge_page(mm, start_addr, referenced, - unmapped, cc, HPAGE_PMD_ORDER); - /* collapse_huge_page will return with the mmap_lock released */ + result =3D mthp_collapse(mm, start_addr, referenced, + unmapped, cc, enabled_orders); + /* mmap_lock was released above, set lock_dropped */ *lock_dropped =3D true; } out: --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4B4ED3195F9 for ; Fri, 5 Jun 2026 16:17:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676258; cv=none; b=CshV0Wj0Q+K8IcjIg16ViTnGS/gh7tFukEtM41olMNeoGq7NjenQOQwBj1plQHKKKo2U450tyhtVDfH6t0AVCxkn6ElRK34whuSfY64elmxJDJdJyMem0BydRf9momkV4Gf5n0kZ4FTOWUHFDiTMlCxyjvCMj71cGd49m/XIZx0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676258; c=relaxed/simple; bh=Jbsd3RFvzZia6KfmmcZB95qCmS885EPyGk6VJN12EZg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=axkB1dIHkVZbuvTjLFqnCW5X9TQVNoR2AdI/RuJIZOb9CxeZ/JI7AyP54GvZrdTwW2qntt9ErwA63rIM/4HrTD88M85pAoyZjrpmJSzuVKHXf40Uz3t3xfT2bS7gM2jZUxJ359Sfa0qVo/GD0/ShHw5IBrw4ONFqfjNX5CemoWQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=E/LAjxgN; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="E/LAjxgN" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676256; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=duJTZJog/IvaQNcYHxOllCKMt0JMc5PBsQeEuw3a1uk=; b=E/LAjxgN2R/kAOuOnlyat6PpQuTta/lMU/kwqlofCRWNRjSlLp/WQJuCzPLq7o0ym/u/JB 2T14J6bUMXcXA8FxX97olxR28UPEMklmIDUQe387Zl5pVXPS/6O5Rd//QyJQlNJC+FCWV9 y12hACXaw2993xsWZAK0WKoRTxmCOW0= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-449-LIpes0QPPVuIxpV-J_iW3A-1; Fri, 05 Jun 2026 12:17:31 -0400 X-MC-Unique: LIpes0QPPVuIxpV-J_iW3A-1 X-Mimecast-MFC-AGG-ID: LIpes0QPPVuIxpV-J_iW3A_1780676250 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id F21B9180061E; Fri, 5 Jun 2026 16:17:29 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 36F2D1800351; Fri, 5 Jun 2026 16:17:11 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif Subject: [PATCH mm-unstable v19 12/14] mm/khugepaged: avoid unnecessary mTHP collapse attempts Date: Fri, 5 Jun 2026 10:14:19 -0600 Message-ID: <20260605161422.213817-13-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" There are cases where, if an attempted collapse fails, all subsequent orders are guaranteed to also fail. Avoid these collapse attempts by bailing out early. 
Reviewed-by: Lorenzo Stoakes Acked-by: Usama Arif Acked-by: David Hildenbrand (Arm) Signed-off-by: Nico Pache --- mm/khugepaged.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 430047316f43..7de92b28dd30 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1499,6 +1499,7 @@ static enum scan_result mthp_collapse(struct mm_struc= t *mm, collapse_address =3D address + offset * PAGE_SIZE; ret =3D collapse_huge_page(mm, collapse_address, referenced, unmapped, cc, order); + switch (ret) { /* Cases where we continue to next collapse candidate */ case SCAN_SUCCEED: @@ -1509,6 +1510,18 @@ static enum scan_result mthp_collapse(struct mm_stru= ct *mm, /* Cases where lower orders might still succeed */ case SCAN_ALLOC_HUGE_PAGE_FAIL: alloc_failed =3D true; + fallthrough; + case SCAN_LACK_REFERENCED_PAGE: + case SCAN_EXCEED_NONE_PTE: + case SCAN_EXCEED_SWAP_PTE: + case SCAN_EXCEED_SHARED_PTE: + case SCAN_PAGE_LOCK: + case SCAN_PAGE_COUNT: + case SCAN_PAGE_NULL: + case SCAN_DEL_PAGE_LRU: + case SCAN_PTE_NON_PRESENT: + case SCAN_PTE_UFFD_WP: + case SCAN_PAGE_LAZYFREE: last_result =3D ret; goto next_order; /* Cases where no further collapse is possible */ --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9CB433195F9 for ; Fri, 5 Jun 2026 16:17:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676279; cv=none; b=MldCLlKJcL9UCWsA5Fn8hJmV5/AFJfUXUb4h8vuUNaW2+4X7963qhig1qyEQQbDrC9EVCuylHu4vXeayfEz6cK0GfaUW6POvAdepuGyENjwLdsdWqrKws1ifA9uROh2goU9RS8AeYjSV7i0WcxkDL4hNzbwpxfHsyLp+/kSpPA0= ARC-Message-Signature: i=1; a=rsa-sha256; 
d=subspace.kernel.org; s=arc-20240116; t=1780676279; c=relaxed/simple; bh=ra/VrWd9o0gUvLC+0SGOZCZn7k3sSo3ee0MiNiZfG5g=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=XwC2bKSVgiz9YFpbAlJhm+XHOvv9x6Gias4+5NOCerb20s54VTeJZXzLlJn+CcBscajJqrrPcX5WJ+nmlBhLR/4dotCR1cw9xJIq4IdWV5epfqJbrIOF8QcbhTL8X6punUxXBO5sm+VsJz3xsNV04wzezPgdXXRPY7+5Ud8Cxc4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=cw4V4ynL; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="cw4V4ynL" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676276; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lKoUumy80dG3dvE69RmnQuAtKZgih6bAKNwqy+L8dfs=; b=cw4V4ynLmc//Kgu1bCsFq+fe1EIAdVhTI3LrBjb6Xnu1sO23PZN/VYj/JieDr4ZLQS/Fiw b01YGcQOrt4HGc2uZpkeSYiL4wQon0+Jh8CMMhA3qmnHDQFBXN/WRpoTCZv0od2e2grqqS rG13TrFSXEqfEgJEnaxEbTa8oP8H7Ko= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-284-mS2lKC1tNlGijYWR2apfUg-1; Fri, 05 Jun 2026 12:17:52 -0400 X-MC-Unique: mS2lKC1tNlGijYWR2apfUg-1 X-Mimecast-MFC-AGG-ID: mS2lKC1tNlGijYWR2apfUg_1780676271 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com 
(mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 43B99195C26C; Fri, 5 Jun 2026 16:17:50 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 94CF1180049F; Fri, 5 Jun 2026 16:17:30 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Usama Arif Subject: [PATCH mm-unstable v19 13/14] mm/khugepaged: run khugepaged for all orders Date: Fri, 5 Jun 2026 10:14:20 
-0600 Message-ID: <20260605161422.213817-14-npache@redhat.com> In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" From: Baolin Wang If any (m)THP order is enabled, we should allow khugepaged to run and attempt to scan and collapse mTHPs. For khugepaged to operate when only mTHP sizes are specified in sysfs, we must modify the predicate function that determines whether it ought to run. This function is currently called hugepage_pmd_enabled(); this patch renames it to hugepage_enabled() and updates the logic to determine whether any valid orders exist that would justify running khugepaged. We must also update collapse_possible_orders() to check all orders if the vma is anonymous and the collapse is performed by khugepaged. After this patch, khugepaged mTHP collapse is fully enabled. Reviewed-by: Lorenzo Stoakes Reviewed-by: Lance Yang Acked-by: Usama Arif Acked-by: David Hildenbrand (Arm) Signed-off-by: Baolin Wang Signed-off-by: Nico Pache --- mm/khugepaged.c | 36 ++++++++++++++++++++---------------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 7de92b28dd30..996e014a03d3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -503,23 +503,23 @@ static inline int collapse_test_exit_or_disable(struc= t mm_struct *mm) mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm); } =20 -static bool hugepage_pmd_enabled(void) +static bool hugepage_enabled(void) { /* * We cover the anon, shmem and the file-backed case here; file-backed * hugepages, when configured in, are determined by the global control. - * Anon pmd-sized hugepages are determined by the pmd-size control.
+ * Anon hugepages are determined by its per-size mTHP control. * Shmem pmd-sized hugepages are also determined by its pmd-size control, * except when the global shmem_huge is set to SHMEM_HUGE_DENY. */ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && hugepage_global_enabled()) return true; - if (test_bit(PMD_ORDER, &huge_anon_orders_always)) + if (READ_ONCE(huge_anon_orders_always)) return true; - if (test_bit(PMD_ORDER, &huge_anon_orders_madvise)) + if (READ_ONCE(huge_anon_orders_madvise)) return true; - if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) && + if (READ_ONCE(huge_anon_orders_inherit) && hugepage_global_enabled()) return true; if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled()) @@ -566,7 +566,13 @@ void __khugepaged_enter(struct mm_struct *mm) static unsigned long collapse_possible_orders(struct vm_area_struct *vma, vm_flags_t vm_flags, enum tva_type tva_flags) { - const unsigned long orders =3D BIT(HPAGE_PMD_ORDER); + unsigned long orders; + + /* If khugepaged is scanning an anonymous vma, allow mTHP collapse */ + if ((tva_flags =3D=3D TVA_KHUGEPAGED) && vma_is_anonymous(vma)) + orders =3D THP_ORDERS_ALL_ANON; + else + orders =3D BIT(HPAGE_PMD_ORDER); =20 return thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders); } @@ -580,11 +586,9 @@ static bool collapse_possible(struct vm_area_struct *v= ma, void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags) { - if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && - hugepage_pmd_enabled()) { - if (collapse_possible(vma, vm_flags, TVA_KHUGEPAGED)) - __khugepaged_enter(vma->vm_mm); - } + if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && hugepage_enabled() + && collapse_possible(vma, vm_flags, TVA_KHUGEPAGED)) + __khugepaged_enter(vma->vm_mm); } =20 void __khugepaged_exit(struct mm_struct *mm) @@ -2936,7 +2940,7 @@ static void collapse_scan_mm_slot(unsigned int progre= ss_max, =20 static int khugepaged_has_work(void) { - return !list_empty(&khugepaged_scan.mm_head) && 
hugepage_pmd_enabled(); + return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled(); } =20 static int khugepaged_wait_event(void) @@ -3009,7 +3013,7 @@ static void khugepaged_wait_work(void) return; } =20 - if (hugepage_pmd_enabled()) + if (hugepage_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event()); } =20 @@ -3040,7 +3044,7 @@ void set_recommended_min_free_kbytes(void) int nr_zones =3D 0; unsigned long recommended_min; =20 - if (!hugepage_pmd_enabled()) { + if (!hugepage_enabled()) { calculate_min_free_kbytes(); goto update_wmarks; } @@ -3090,7 +3094,7 @@ int start_stop_khugepaged(void) int err =3D 0; =20 mutex_lock(&khugepaged_mutex); - if (hugepage_pmd_enabled()) { + if (hugepage_enabled()) { if (!khugepaged_thread) khugepaged_thread =3D kthread_run(khugepaged, NULL, "khugepaged"); @@ -3116,7 +3120,7 @@ int start_stop_khugepaged(void) void khugepaged_min_free_kbytes_update(void) { mutex_lock(&khugepaged_mutex); - if (hugepage_pmd_enabled() && khugepaged_thread) + if (hugepage_enabled() && khugepaged_thread) set_recommended_min_free_kbytes(); mutex_unlock(&khugepaged_mutex); } --=20 2.54.0 From nobody Mon Jun 8 05:25:26 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 42F673EC2C4 for ; Fri, 5 Jun 2026 16:18:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676297; cv=none; b=qt01ZfrlwJpkbGuTT2n6pAcghjH16/pJZ2/+0Vui4Xi/KgkekfeS734/VLRerCO8QME2t/UnegUR/Qsg6re/uKH+4TORzwWJnoL/J+To8uMZApKkOKIbhr0QjEVva35Z8mVo38fr5LfAq4c0DgdrMbricJr+e2YIONSjrCYR8vc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780676297; c=relaxed/simple; 
bh=tR1fDlAzMJvhK4HCy+DckQWp+RczZqh4BBMTVFhKyEs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=QXGEKj0nO5tphKXuNUdxG6XP3KG+575SbwpB6WdHJ72cSldrYIjUQGRmji2l++dCvfLHXgkzfvqCw/ED5BEpT76hXxkF3ss3eVgPbgmH2j1N39dhuYL9wtn4LRq434I2OivK+wlEIhvOz6lt42u+WG2jNpPZFQuLwfrn7bVopQo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=PzCWHnMy; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="PzCWHnMy" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1780676295; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2bBQWaXR8oRLz4Hk62HvwsKqlzdG1BUydAb8xounnNo=; b=PzCWHnMyjXAIU9ZrDxOIBm4AmtOn8VlvxsNRZrgvlOkLW/M/qlWkPhTxAMnko8RnkMyyS0 d+43wKCU5VBvME4qxB+Un3qm9/3ON1Yu+dbmlCQvCrhkO9gpFNf1rCr9DK3Mkx+MKgDY5R 38x7YHsKEQsNc0PvXGXanceqMLKjR2o= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-446-w7H7jxseNwuNJCLxnjVZpg-1; Fri, 05 Jun 2026 12:18:11 -0400 X-MC-Unique: w7H7jxseNwuNJCLxnjVZpg-1 X-Mimecast-MFC-AGG-ID: w7H7jxseNwuNJCLxnjVZpg_1780676290 Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111]) (using TLSv1.3 
with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 97BED18003FC; Fri, 5 Jun 2026 16:18:10 +0000 (UTC) Received: from p1.redhat.com (unknown [10.44.22.9]) by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id B35A918005AE; Fri, 5 Jun 2026 16:17:50 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, liam@infradead.org, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Bagas Sanjaya Subject: [PATCH mm-unstable v19 14/14] Documentation: mm: update the admin guide for mTHP collapse Date: Fri, 5 Jun 2026 10:14:21 -0600 Message-ID: <20260605161422.213817-15-npache@redhat.com> 
In-Reply-To: <20260605161422.213817-1-npache@redhat.com> References: <20260605161422.213817-1-npache@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Content-Type: text/plain; charset="utf-8" Now that we can collapse to mTHPs, let's update the admin guide to reflect these changes and provide proper guidance on how to use mTHP collapse. Reviewed-by: Lorenzo Stoakes Reviewed-by: Bagas Sanjaya Signed-off-by: Nico Pache Acked-by: David Hildenbrand (Arm) --- Documentation/admin-guide/mm/transhuge.rst | 49 ++++++++++++++-------- 1 file changed, 32 insertions(+), 17 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/adm= in-guide/mm/transhuge.rst index b98e18c80185..23f8d13c2629 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -63,7 +63,8 @@ often. THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into PMD-sized huge pages. +collapses sequences of basic pages into huge pages of either PMD size +or mTHP sizes, if the system is configured to do so. =20 The THP behaviour is controlled via :ref:`sysfs ` interface and using madvise(2) and prctl(2) system calls.
@@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and = enable it by writing echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused =20 -khugepaged will be automatically started when PMD-sized THP is enabled +khugepaged will be automatically started when any THP size is enabled (either of the per-size anon control or the top-level control are set to "always" or "madvise"), and it'll be automatically shutdown when -PMD-sized THP is disabled (when both the per-size anon control and the +all THP sizes are disabled (when both the per-size anon control and the top-level control are "never") =20 process THP controls @@ -265,8 +266,8 @@ Khugepaged controls ------------------- =20 .. note:: - khugepaged currently only searches for opportunities to collapse to - PMD-sized THP and no attempt is made to collapse to other THP + khugepaged currently only searches for opportunities to collapse file/s= hmem + to PMD-sized THP. Only anonymous memory will attempt to collapse to oth= er THP sizes. =20 khugepaged runs usually at low frequency so while one may not want to @@ -296,11 +297,11 @@ allocation failure to throttle the next allocation at= tempt:: The khugepaged progress can be seen in the number of pages collapsed (note that this counter may not be an exact count of the number of pages collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by -one 2M hugepage. Each may happen independently, or together, depending on -the type of memory and the failures that occur. As such, this value should -be interpreted roughly as a sign of progress, and counters in /proc/vmstat -consulted for more accurate accounting):: +being replaced by a PMD mapping, or (2) physical pages replaced by one +hugepage of various sizes (PMD-sized or mTHP). 
Each may happen independent= ly, +or together, depending on the type of memory and the failures that occur. +As such, this value should be interpreted roughly as a sign of progress, +and counters in /proc/vmstat consulted for more accurate accounting):: =20 /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed =20 @@ -308,16 +309,21 @@ for each pass:: =20 /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans =20 -``max_ptes_none`` specifies how many extra small pages (that are -not already mapped) can be allocated when collapsing a group -of small pages into one large page:: +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed +when collapsing a group of small pages into one large page:: =20 /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none =20 -A higher value leads to use additional memory for programs. -A lower value leads to gain less thp performance. Value of -max_ptes_none can waste cpu time very little, you can -ignore it. +For PMD-sized THP collapse, this directly limits the number of empty pages +allowed in the 2MB region. + +For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At +HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermedi= ate +value will emit a warning and mTHP collapse will default to max_ptes_none= =3D0. + +A higher value allows more empty pages, potentially leading to more memory +usage but better THP performance. A lower value is more conservative and +may result in fewer THP collapses. =20 ``max_ptes_swap`` specifies how many pages can be brought in from swap when collapsing a group of pages into a transparent huge page:: @@ -337,6 +343,15 @@ that THP is shared. Exceeding the number would block t= he collapse:: =20 A higher value may increase memory footprint for some workloads. =20 +.. 
note:: + For mTHP collapse, khugepaged does not support collapsing regions that + contain shared or swapped out pages, as this could lead to continuous + promotion to higher orders. The collapse will fail if any shared or + swapped PTEs are encountered during the scan. + + Currently, madvise_collapse only supports collapsing to PMD-sized THPs + and does not attempt mTHP collapses. + Boot parameters =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 --=20 2.54.0
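The max_ptes_none rule described in the documentation hunk above can be modelled in a few lines of userspace C. This is only an illustrative sketch, not code from the series: effective_max_ptes_none() is a hypothetical helper name, the constants assume 4K base pages with 2M PMD-sized THP (512 PTEs per PMD), and any per-order scaling of the limit is deliberately left out because the text above does not spell it out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4K base pages and 2M PMD-sized THP, i.e. 512 PTEs per PMD. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR (1U << HPAGE_PMD_ORDER)

static_assert(HPAGE_PMD_NR == 512, "sketch assumes 512 PTEs per PMD");

/*
 * Hypothetical helper (not a kernel function): return the max_ptes_none
 * setting that a collapse at the given order would honour, following the
 * rule in the admin guide text. *warn is set when an unsupported
 * intermediate value is ignored for an mTHP collapse.
 */
unsigned int effective_max_ptes_none(unsigned int order,
				     unsigned int max_ptes_none,
				     bool *warn)
{
	*warn = false;

	/* PMD-sized collapse uses the sysfs value directly. */
	if (order == HPAGE_PMD_ORDER)
		return max_ptes_none;

	/* mTHP collapse: only 0 and HPAGE_PMD_NR - 1 are supported. */
	if (max_ptes_none == 0 || max_ptes_none == HPAGE_PMD_NR - 1)
		return max_ptes_none;

	/* Any intermediate value warns and is treated as 0. */
	*warn = true;
	return 0;
}
```

With these assumptions, a setting of 100 is honoured as-is for a PMD-order collapse, while the same setting for an order-4 collapse is reported as unsupported and treated as 0.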