From nobody Sun Apr 26 08:12:48 2026
From: Nico Pache <npache@redhat.com>
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, Liam.Howlett@oracle.com, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com
Subject: [PATCH 7.2 v16 01/13] mm/khugepaged: generalize hugepage_vma_revalidate for mTHP support
Date: Sun, 19 Apr 2026 12:57:38 -0600
Message-ID: <20260419185750.260784-2-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

For khugepaged to support different mTHP orders, we must generalize
hugepage_vma_revalidate() to check that the PMD is not shared by
another VMA and that the requested order is enabled. No functional
change in this patch.

Also correct a comment about the functionality of the revalidation and
fix a double-space issue.
Reviewed-by: Wei Yang
Reviewed-by: Lance Yang
Reviewed-by: Baolin Wang
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Zi Yan
Acked-by: David Hildenbrand (Arm)
Co-developed-by: Dev Jain
Signed-off-by: Dev Jain
Signed-off-by: Nico Pache
Acked-by: Usama Arif
---
 mm/khugepaged.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b8452dbdb043..53e7e4be172d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -902,12 +902,13 @@ static int collapse_find_target_node(struct collapse_control *cc)
 
 /*
  * If mmap_lock temporarily dropped, revalidate vma
- * before taking mmap_lock.
+ * after taking the mmap_lock again.
  * Returns enum scan_result value.
  */
 
 static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
-		bool expect_anon, struct vm_area_struct **vmap, struct collapse_control *cc)
+		bool expect_anon, struct vm_area_struct **vmap,
+		struct collapse_control *cc, unsigned int order)
 {
 	struct vm_area_struct *vma;
 	enum tva_type type = cc->is_khugepaged ? TVA_KHUGEPAGED :
@@ -920,15 +921,16 @@ static enum scan_result hugepage_vma_revalidate(struct mm_struct *mm, unsigned l
 	if (!vma)
 		return SCAN_VMA_NULL;
 
+	/* Always check the PMD order to ensure its not shared by another VMA */
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+	if (!thp_vma_allowable_orders(vma, vma->vm_flags, type, BIT(order)))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
 	 *
-	 * thp_vma_allowable_order may return true for qualified file
+	 * thp_vma_allowable_orders may return true for qualified file
 	 * vmas.
 	 */
 	if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
@@ -1121,7 +1123,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1155,7 +1158,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
+					 HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2857,8 +2861,8 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_unlocked = false;
 			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
-			if  (result != SCAN_SUCCEED) {
+							 cc, HPAGE_PMD_ORDER);
+			if (result != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
 			}
-- 
2.53.0
From: Nico Pache <npache@redhat.com>
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 02/13] mm/khugepaged: generalize alloc_charge_folio()
Date: Sun, 19 Apr 2026 12:57:39 -0600
Message-ID: <20260419185750.260784-3-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

From: Dev Jain

Pass order to alloc_charge_folio() and update mTHP statistics.

Reviewed-by: Wei Yang
Reviewed-by: Lance Yang
Reviewed-by: Baolin Wang
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Zi Yan
Acked-by: David Hildenbrand (Arm)
Co-developed-by: Nico Pache
Signed-off-by: Nico Pache
Signed-off-by: Dev Jain
Acked-by: Usama Arif
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
 include/linux/huge_mm.h                    |  2 ++
 mm/huge_memory.c                           |  4 ++++
 mm/khugepaged.c                            | 17 +++++++++++------
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 5fbc3d89bb07..c51932e6275d 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -639,6 +639,14 @@ anon_fault_fallback_charge
 	instead falls back to using huge pages with lower orders or
 	small pages even though the allocation was successful.
 
+collapse_alloc
+	is incremented every time a huge page is successfully allocated for a
+	khugepaged collapse.
+
+collapse_alloc_failed
+	is incremented every time a huge page allocation fails during a
+	khugepaged collapse.
+
 zswpout
 	is incremented every time a huge page is swapped out to zswap in one
 	piece without splitting.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..ba7ae6808544 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,6 +128,8 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_COLLAPSE_ALLOC,
+	MTHP_STAT_COLLAPSE_ALLOC_FAILED,
 	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPIN,
 	MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 970e077019b7..345c54133c83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -685,6 +685,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
 DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -750,6 +752,8 @@ static struct attribute *any_stats_attrs[] = {
 #endif
 	&split_attr.attr,
 	&split_failed_attr.attr,
+	&collapse_alloc_attr.attr,
+	&collapse_alloc_failed_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 53e7e4be172d..afac6bc4e76d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1068,21 +1068,26 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-		struct collapse_control *cc)
+		struct collapse_control *cc, unsigned int order)
 {
 	gfp_t gfp = (cc->is_khugepaged ?
		     alloc_hugepage_khugepaged_gfpmask() : GFP_TRANSHUGE);
 	int node = collapse_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		if (is_pmd_order(order))
+			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
 	}
 
-	count_vm_event(THP_COLLAPSE_ALLOC);
+	if (is_pmd_order(order))
+		count_vm_event(THP_COLLAPSE_ALLOC);
+	count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1118,7 +1123,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1899,7 +1904,7 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.53.0
From: Nico Pache <npache@redhat.com>
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 03/13] mm/khugepaged: rework max_ptes_* handling with helper functions
Date: Sun, 19 Apr 2026 12:57:40 -0600
Message-ID: <20260419185750.260784-4-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

The following cleanup reworks all the max_ptes_* handling into helper
functions. This improves code readability and will later be used to
implement the mTHP handling of these variables.

With these changes we abstract all the madvise_collapse() special
casing (don't respect the sysctls) away from the functions that use
them, and will later in this series cleanly restrict mTHP collapse
behavior.

Suggested-by: David Hildenbrand
Signed-off-by: Nico Pache
Acked-by: Usama Arif
---
 mm/khugepaged.c | 114 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 78 insertions(+), 36 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index afac6bc4e76d..f42b55421191 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -348,6 +348,58 @@ static bool pte_none_or_zero(pte_t pte)
 	return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
 }
 
+/**
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
+ * @cc: The collapse control struct
+ * @vma: The vma to check for userfaultfd
+ *
+ * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
+ * empty page.
+ *
+ * Return: Maximum number of empty PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
+					   struct vm_area_struct *vma)
+{
+	if (vma && userfaultfd_armed(vma))
+		return 0;
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_none;
+}
+
+/**
+ * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
+ * @cc: The collapse control struct
+ *
+ * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
+ * shared page.
+ *
+ * Return: Maximum number of shared PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
+{
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_shared;
+}
+
+/**
+ * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
+ * @cc: The collapse control struct
+ *
+ * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
+ * swap page.
+ *
+ * Return: Maximum number of swap PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
+{
+	if (!cc->is_khugepaged)
+		return HPAGE_PMD_NR;
+	return khugepaged_max_ptes_swap;
+}
+
 int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
 		     int advice)
 {
@@ -546,21 +598,19 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
+	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
+	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
 
 	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out;
 			}
+			continue;
 		}
 		if (!pte_present(pteval)) {
 			result = SCAN_PTE_NON_PRESENT;
@@ -591,9 +641,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		/* See collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out;
@@ -1270,6 +1318,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 	unsigned long addr;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
+	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
+	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
+	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
 
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
@@ -1294,36 +1345,29 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
+			continue;
 		}
 		if (!pte_present(pteval)) {
-			++unmapped;
-			if (!cc->is_khugepaged ||
-			    unmapped <= khugepaged_max_ptes_swap) {
-				/*
-				 * Always be strict with uffd-wp
-				 * enabled swap entries. Please see
-				 * comment below for pte_uffd_wp().
-				 */
-				if (pte_swp_uffd_wp_any(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
-					goto out_unmap;
-				}
-				continue;
-			} else {
+			if (++unmapped > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
+			/*
+			 * Always be strict with uffd-wp
+			 * enabled swap entries. Please see
+			 * comment below for pte_uffd_wp().
+			 */
+			if (pte_swp_uffd_wp_any(pteval)) {
+				result = SCAN_PTE_UFFD_WP;
+				goto out_unmap;
+			}
+			continue;
 		}
 		if (pte_uffd_wp(pteval)) {
 			/*
@@ -1366,9 +1410,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		 * is shared.
 		 */
 		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
+			if (++shared > max_ptes_shared) {
 				result = SCAN_EXCEED_SHARED_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
 				goto out_unmap;
@@ -2329,6 +2371,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 	int present, swap;
 	int node = NUMA_NO_NODE;
 	enum scan_result result = SCAN_SUCCEED;
+	unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
+	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
 
 	present = 0;
 	swap = 0;
@@ -2341,8 +2385,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 
 		if (xa_is_value(folio)) {
 			swap += 1 << xas_get_order(&xas);
-			if (cc->is_khugepaged &&
-			    swap > khugepaged_max_ptes_swap) {
+			if (swap > max_ptes_swap) {
 				result = SCAN_EXCEED_SWAP_PTE;
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				break;
@@ -2413,8 +2456,7 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 	cc->progress += HPAGE_PMD_NR;
 
 	if (result == SCAN_SUCCEED) {
-		if (cc->is_khugepaged &&
-		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (present < HPAGE_PMD_NR - max_ptes_none) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
-- 
2.53.0
b=laaIImCJPCLsKd8YjS3RywgKXDTR3Lss/IrQrka1Ful9SZsPCFrvD7xyUaSnd5FwrXWXRJ2FQwLSlxoDg+yipFNpgcmFTKc4ZxBACMxPI1ovqBwyKIs0E153KNBNQIhu2X27JGmE2bN/w0RneBl6kRsg5LgY9JBfJnthheyFPhc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776625205; c=relaxed/simple; bh=t3Q6ODkUzyUrpawyjPBCySG9QyBhFccdmUWCcPXos0U=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=taoS4CmD9T5shZdyCbnIO9GhVTbJkAnLupomCQ0mXNxMTW9nSunc1/vO5HT8ndUCf2Z4sBUaZfKVxESCaLk0oarBK6OD7q1xcEIc13pO/Fn+RQXuPvi93nYgQi6OsuDUEDP32AX/tpFoGto43RtMF5mKwR+XiP85O+sU7cPO2r4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=H69Z+yK5; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="H69Z+yK5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1776625197; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=X3ELmtG1D0kWfzKPqqA8nVVpkTNegL6V2K5I0oBSsQs=; b=H69Z+yK5BgCD6CTNAIVSjXmvy1WUaIPYJu/yP0E696DWwItN11jv+vHGQXHshJT64b+suc 9kwdBaztJX0ypVcrXZV6lx/8yeFfQxPLdpmE16ZHqvXkVZDfW1CV/IbuiKawps/r0K4y2o fI+QcqF87hnH9NLq7He2YzFlzCPGrL8= Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id 
From: Nico Pache
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 04/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
Date: Sun, 19 Apr 2026 12:57:41 -0600
Message-ID: <20260419185750.260784-5-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

Generalize the order of the __collapse_huge_page_* and collapse_max_*
functions to support future mTHP collapse.

The current mechanism for determining collapse with the
khugepaged_max_ptes_none value is not designed with mTHP in mind. This
raises a key design issue: if we support user-defined max_ptes_none
values (even those scaled by order), a collapse of a lower order can
introduce a feedback loop, or "creep", when max_ptes_none is set to a
value greater than HPAGE_PMD_NR / 2. With this configuration, a
successful collapse to order N will populate enough pages to satisfy
the collapse condition on order N+1 on the next scan. This leads to
unnecessary work and memory churn.

To fix this issue, introduce a helper function that limits mTHP
collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
This effectively supports two modes:

- max_ptes_none=0: never introduce new none-pages for mTHP collapse.
- max_ptes_none=511 (on 4K page size): always collapse to the highest
  available mTHP order.

This removes the possibility of "creep" while not modifying any uAPI
expectations. A warning will be emitted if any non-supported
max_ptes_none value is configured with mTHP enabled.
mTHP collapse will not honor the khugepaged_max_ptes_shared or
khugepaged_max_ptes_swap parameters, and will fail if it encounters a
shared or swapped entry.

No functional changes in this patch; however, it defines future
behavior for mTHP collapse.

Co-developed-by: Dev Jain
Signed-off-by: Dev Jain
Signed-off-by: Nico Pache
---
 mm/khugepaged.c | 124 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 88 insertions(+), 36 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f42b55421191..283bb63854a5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -352,51 +352,86 @@ static bool pte_none_or_zero(pte_t pte)
  * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
  * @cc: The collapse control struct
  * @vma: The vma to check for userfaultfd
+ * @order: The folio order being collapsed to
  *
  * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
- * empty page.
+ * empty page. For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the
+ * configured khugepaged_max_ptes_none value.
+ *
+ * For mTHP collapses, we currently only support khugepaged_max_ptes_none
+ * values of 0 or KHUGEPAGED_MAX_PTES_LIMIT. Any other value will emit a
+ * warning and no mTHP collapse will be attempted.
  *
  * Return: Maximum number of empty PTEs allowed for the collapse operation
  */
-static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
-		struct vm_area_struct *vma)
+static int collapse_max_ptes_none(struct collapse_control *cc,
+		struct vm_area_struct *vma, unsigned int order)
 {
 	if (vma && userfaultfd_armed(vma))
 		return 0;
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
-	return khugepaged_max_ptes_none;
+	if (is_pmd_order(order))
+		return khugepaged_max_ptes_none;
+	/* Zero/non-present collapse disabled. */
+	if (!khugepaged_max_ptes_none)
+		return 0;
+	if (khugepaged_max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
+		return (1 << order) - 1;
+
+	pr_warn_once("mTHP collapse only supports max_ptes_none values of 0 or %u\n",
+		     KHUGEPAGED_MAX_PTES_LIMIT);
+	return -EINVAL;
 }
 
 /**
  * collapse_max_ptes_shared - Calculate maximum allowed shared PTEs for collapse
  * @cc: The collapse control struct
+ * @order: The folio order being collapsed to
  *
  * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
  * shared page.
  *
+ * For mTHP collapses, we currently don't support collapsing memory with
+ * shared memory.
+ *
  * Return: Maximum number of shared PTEs allowed for the collapse operation
  */
-static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
+static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
+		unsigned int order)
 {
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
+	if (!is_pmd_order(order))
+		return 0;
+
 	return khugepaged_max_ptes_shared;
 }
 
 /**
  * collapse_max_ptes_swap - Calculate maximum allowed swap PTEs for collapse
  * @cc: The collapse control struct
+ * @order: The folio order being collapsed to
  *
  * If we are not in khugepaged mode use HPAGE_PMD_NR to allow any
  * swap page.
  *
+ * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
+ * khugepaged_max_ptes_swap value.
+ *
+ * For mTHP collapses, we currently don't support collapsing memory with
+ * swapped out memory.
+ *
  * Return: Maximum number of swap PTEs allowed for the collapse operation
  */
-static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
+static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
+		unsigned int order)
 {
 	if (!cc->is_khugepaged)
 		return HPAGE_PMD_NR;
+	if (!is_pmd_order(order))
+		return 0;
+
 	return khugepaged_max_ptes_swap;
 }
 
@@ -590,18 +625,22 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
 
 static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
-		struct list_head *compound_pagelist)
+		unsigned int order, struct list_head *compound_pagelist)
 {
+	const unsigned long nr_pages = 1UL << order;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr = start_addr;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
-	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
-	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
+	int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
+	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
+
+	if (max_ptes_none < 0)
+		return result;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
@@ -734,18 +773,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 }
 
 static void __collapse_huge_page_copy_succeeded(pte_t *pte,
-					struct vm_area_struct *vma,
-					unsigned long address,
-					spinlock_t *ptl,
-					struct list_head *compound_pagelist)
+		struct vm_area_struct *vma, unsigned long address,
+		spinlock_t *ptl, unsigned int order,
+		struct list_head *compound_pagelist)
 {
-	unsigned long end = address + HPAGE_PMD_SIZE;
+	const unsigned long nr_pages = 1UL << order;
+	unsigned long end = address + (PAGE_SIZE << order);
 	struct folio *src, *tmp;
 	pte_t pteval;
 	pte_t *_pte;
 	unsigned int nr_ptes;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
+	for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
 	     address += nr_ptes * PAGE_SIZE) {
 		nr_ptes = 1;
 		pteval = ptep_get(_pte);
@@ -798,13 +837,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 }
 
 static void __collapse_huge_page_copy_failed(pte_t *pte,
-					pmd_t *pmd,
-					pmd_t orig_pmd,
-					struct vm_area_struct *vma,
-					struct list_head *compound_pagelist)
+		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
+		unsigned int order, struct list_head *compound_pagelist)
 {
+	const unsigned long nr_pages = 1UL << order;
 	spinlock_t *pmd_ptl;
-
 	/*
 	 * Re-establish the PMD to point to the original page table
 	 * entry. Restoring PMD needs to be done prior to releasing
@@ -818,7 +855,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + nr_pages, compound_pagelist);
 }
 
 /*
@@ -838,16 +875,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
  */
 static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
-		unsigned long address, spinlock_t *ptl,
+		unsigned long address, spinlock_t *ptl, unsigned int order,
 		struct list_head *compound_pagelist)
 {
+	const unsigned long nr_pages = 1UL << order;
 	unsigned int i;
 	enum scan_result result = SCAN_SUCCEED;
-
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -866,10 +903,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    order, compound_pagelist);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 order, compound_pagelist);
 
 	return result;
 }
@@ -1040,12 +1077,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
  */
 static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
-		int referenced)
+		struct vm_area_struct *vma, unsigned long start_addr,
+		pmd_t *pmd, int referenced, unsigned int order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long addr, end = start_addr + (PAGE_SIZE << order);
 	enum scan_result result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1077,6 +1114,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 		    pte_present(vmf.orig_pte))
 			continue;
 
+		/*
+		 * TODO: Support swapin without leading to further mTHP
+		 * collapses. Currently bringing in new pages via swapin may
+		 * cause a future higher order collapse on a rescan of the same
+		 * range.
+		 */
+		if (!is_pmd_order(order)) {
+			pte_unmap(pte);
+			mmap_read_unlock(mm);
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1196,7 +1246,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 * that case. Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+						     referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1244,6 +1294,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      HPAGE_PMD_ORDER,
 						      &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
@@ -1274,6 +1325,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
+					   HPAGE_PMD_ORDER,
 					   &compound_pagelist);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
@@ -1318,9 +1370,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 	unsigned long addr;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
-	unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
-	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
-	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
+	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
+	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
+	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
 
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
@@ -2371,8 +2423,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 	int present, swap;
 	int node = NUMA_NO_NODE;
 	enum scan_result result = SCAN_SUCCEED;
-	unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
-	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
+	int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
+	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
 
 	present = 0;
 	swap = 0;
-- 
2.53.0
From nobody Sun Apr 26 08:12:48 2026
From: Nico Pache
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 05/13] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
Date: Sun, 19 Apr 2026 12:57:42 -0600
Message-ID: <20260419185750.260784-6-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

Pass an order and offset to collapse_huge_page to support collapsing
anon memory to arbitrary orders within a PMD. order indicates what mTHP
size we are attempting to collapse to, and offset indicates where in
the PMD to start the collapse attempt.

For non-PMD collapse we must leave the anon VMA write-locked until
after we collapse the mTHP: in the PMD case all the pages are isolated,
but in the mTHP case this is not true, and we must keep the lock to
prevent access/changes to the page tables. This can happen if the rmap
walkers hit a pmd_none while the PMD entry is currently unavailable due
to being temporarily removed during the collapse phase.
Signed-off-by: Nico Pache
---
 mm/khugepaged.c | 103 +++++++++++++++++++++++++++---------------------
 1 file changed, 57 insertions(+), 46 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 283bb63854a5..ff6f9f1883ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1198,42 +1198,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
 	return SCAN_SUCCEED;
 }
 
-static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
-		int referenced, int unmapped, struct collapse_control *cc)
+static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
+		int referenced, int unmapped, struct collapse_control *cc,
+		unsigned int order)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte = NULL;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	enum scan_result result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	bool anon_vma_locked = false;
+	const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
+	const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
 
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-	/*
-	 * Before allocating the hugepage, release the mmap_lock read lock.
-	 * The allocation can take potentially a long time if it involves
-	 * sync compaction, and we do not need to hold the mmap_lock during
-	 * that. We will recheck the vma after taking it again in write mode.
-	 */
-	mmap_read_unlock(mm);
-
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
+					 &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1245,8 +1239,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case. Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
+						     referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1261,20 +1255,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
+					 &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	vma_start_write(vma);
-	result = check_pmd_still_valid(mm, address, pmd);
+	result = check_pmd_still_valid(mm, pmd_addr, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
 	anon_vma_lock_write(vma->anon_vma);
+	anon_vma_locked = true;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
+				end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
@@ -1286,26 +1281,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * Parallel GUP-fast is fine since GUP-fast will back off when
 	 * it detects PMD is changed.
 	 */
-	_pmd = pmdp_collapse_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      HPAGE_PMD_ORDER,
-						      &compound_pagelist);
+		result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
						      order, &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_NO_PTE_TABLE;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
-		BUG_ON(!pmd_none(*pmd));
+		WARN_ON_ONCE(!pmd_none(*pmd));
 		/*
 		 * We can only use set_pmd_at when establishing
 		 * hugepmds and never for establishing regular pmds that
@@ -1313,21 +1305,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 		 */
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
-		anon_vma_unlock_write(vma->anon_vma);
 		goto out_up_write;
 	}
 
 	/*
-	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
+	 * For PMD collapse all pages are isolated and locked so anon_vma
+	 * rmap can't run anymore. For mTHP collapse the PMD entry has been
+	 * removed and not all pages are isolated and locked, so we must hold
+	 * the lock to prevent neighboring folios from attempting to access
+	 * this PMD until it's reinstalled.
 	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	if (is_pmd_order(order)) {
+		anon_vma_unlock_write(vma->anon_vma);
+		anon_vma_locked = false;
+	}
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   HPAGE_PMD_ORDER,
-					   &compound_pagelist);
-	pte_unmap(pte);
+					   vma, start_addr, pte_ptl,
+					   order, &compound_pagelist);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
 
@@ -1337,18 +1332,27 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
 	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
+	WARN_ON_ONCE(!pmd_none(*pmd));
+	if (is_pmd_order(order)) { /* PMD collapse */
+		pgtable = pmd_pgtable(_pmd);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
+	} else { /* mTHP collapse */
+		map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+	}
 	spin_unlock(pmd_ptl);
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
 out_up_write:
+	if (anon_vma_locked)
+		anon_vma_unlock_write(vma->anon_vma);
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
 out_nolock:
 	if (folio)
@@ -1525,8 +1529,15 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
+		/*
+		 * Before allocating the hugepage, release the mmap_lock read lock.
+		 * The allocation can take potentially a long time if it involves
+		 * sync compaction, and we do not need to hold the mmap_lock during
+		 * that. We will recheck the vma after taking it again in write mode.
+		 */
+		mmap_read_unlock(mm);
 		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc);
+					    unmapped, cc, HPAGE_PMD_ORDER);
 		/* collapse_huge_page will return with the mmap_lock released */
 		*lock_dropped = true;
 	}
-- 
2.53.0
From nobody Sun Apr 26 08:12:48 2026
From: Nico Pache
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 06/13] mm/khugepaged: skip collapsing mTHP to smaller orders
Date: Sun, 19 Apr 2026 12:57:43 -0600
Message-ID: <20260419185750.260784-7-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

khugepaged may try to collapse an mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to
check if it's OK to collapse to a smaller mTHP size (like in the case
of a partially mapped folio). This check is also not done during the
scan phase, as the current collapse order is unknown at that time.

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].
[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Reviewed-by: Lorenzo Stoakes
Reviewed-by: Baolin Wang
Co-developed-by: Dev Jain
Signed-off-by: Dev Jain
Signed-off-by: Nico Pache
Acked-by: Usama Arif
---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ff6f9f1883ed..8740d379882e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -686,6 +686,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 	}
+	/*
+	 * TODO: In some cases of partially-mapped folios, we'd actually
+	 * want to collapse.
+	 */
+	if (!is_pmd_order(order) && folio_order(folio) >= order) {
+		result = SCAN_PTE_MAPPED_HUGEPAGE;
+		goto out;
+	}
 
 	if (folio_test_large(folio)) {
 		struct folio *f;
-- 
2.53.0

From nobody Sun Apr 26 08:12:48 2026
From: Nico Pache
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics
Date: Sun, 19 Apr 2026 12:57:44 -0600
Message-ID: <20260419185750.260784-8-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain;
charset="utf-8"

Add three new mTHP statistics to track collapse failures for different orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:

- collapse_exceed_swap_pte: counts when mTHP collapse fails due to swap PTEs
- collapse_exceed_none_pte: counts when mTHP collapse fails due to exceeding the none PTE threshold for the given order
- collapse_exceed_shared_pte: counts when mTHP collapse fails due to shared PTEs

These statistics complement the existing THP_SCAN_EXCEED_* events by providing per-order granularity for mTHP collapse attempts. The stats are exposed via sysfs under /sys/kernel/mm/transparent_hugepage/hugepages-*/stats/ for each supported hugepage size.

As we currently do not support collapsing mTHPs that contain a swap or shared entry, these statistics track how often mTHP collapses fail due to those restrictions. Now that we plan to support mTHP collapse for anon pages, let's also track when this happens at the PMD level within the per-mTHP stats.

Signed-off-by: Nico Pache
---
 Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
 include/linux/huge_mm.h                    |  3 +++
 mm/huge_memory.c                           |  7 +++++++
 mm/khugepaged.c                            | 21 +++++++++++++++++--
 4 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c51932e6275d..eebb1f6bbc6c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -714,6 +714,30 @@ nr_anon_partially_mapped
 	an anonymous THP as "partially mapped" and count it here, even though it
 	is not actually partially mapped anymore.
 
+collapse_exceed_none_pte
+	The number of collapse attempts that failed due to exceeding the
+	max_ptes_none threshold. For mTHP collapse, currently only max_ptes_none
+	values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
+	emit a warning and no mTHP collapse will be attempted. khugepaged will
+	try to collapse to the largest enabled (m)THP size; if it fails, it will
+	try the next lower enabled mTHP size. This counter records the number of
+	times a collapse attempt was skipped for exceeding the max_ptes_none
+	threshold, and khugepaged will move on to the next available mTHP size.
+
+collapse_exceed_swap_pte
+	The number of anonymous mTHP PTE ranges which were unable to collapse due
+	to containing at least one swap PTE. Currently khugepaged does not
+	support collapsing mTHP regions that contain a swap PTE. This counter can
+	be used to monitor the number of khugepaged mTHP collapses that failed
+	due to the presence of a swap PTE.
+
+collapse_exceed_shared_pte
+	The number of anonymous mTHP PTE ranges which were unable to collapse due
+	to containing at least one shared PTE. Currently khugepaged does not
+	support collapsing mTHP PTE ranges that contain a shared PTE. This
+	counter can be used to monitor the number of khugepaged mTHP collapses
+	that failed due to the presence of a shared PTE.
+
 As the system ages, allocating huge pages may be expensive as the system
 uses memory compaction to copy data around memory to free a huge page for
 use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ba7ae6808544..48496f09909b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -144,6 +144,9 @@ enum mthp_stat_item {
 	MTHP_STAT_SPLIT_DEFERRED,
 	MTHP_STAT_NR_ANON,
 	MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
+	MTHP_STAT_COLLAPSE_EXCEED_SWAP,
+	MTHP_STAT_COLLAPSE_EXCEED_NONE,
+	MTHP_STAT_COLLAPSE_EXCEED_SHARED,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 345c54133c83..5c128cdec810 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -703,6 +703,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
 DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
 DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
+DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
+
 
 static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -719,6 +723,9 @@ static struct attribute *anon_stats_attrs[] = {
 	&split_deferred_attr.attr,
 	&nr_anon_attr.attr,
 	&nr_anon_partially_mapped_attr.attr,
+	&collapse_exceed_swap_pte_attr.attr,
+	&collapse_exceed_none_pte_attr.attr,
+	&collapse_exceed_shared_pte_attr.attr,
 	NULL,
 };
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8740d379882e..0a1c7cc20c0e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -646,7 +646,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		if (pte_none_or_zero(pteval)) {
 			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				if (is_pmd_order(order))
+					count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				count_mthp_stat(order,
						MTHP_STAT_COLLAPSE_EXCEED_NONE);
 				goto out;
 			}
 			continue;
@@ -680,9 +682,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 	/* See collapse_scan_pmd(). */
 	if (folio_maybe_mapped_shared(folio)) {
+		/*
+		 * TODO: Support shared pages without leading to further
+		 * mTHP collapses. Currently bringing in new pages via
+		 * shared may cause a future higher order collapse on a
+		 * rescan of the same range.
+		 */
 		if (++shared > max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
-			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+			if (is_pmd_order(order))
+				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+			count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
 			goto out;
 		}
 	}
@@ -1129,6 +1139,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 	 * range.
 	 */
 	if (!is_pmd_order(order)) {
+		count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 		pte_unmap(pte);
 		mmap_read_unlock(mm);
 		result = SCAN_EXCEED_SWAP_PTE;
@@ -1412,6 +1423,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		if (++none_or_zero > max_ptes_none) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+			count_mthp_stat(HPAGE_PMD_ORDER,
+					MTHP_STAT_COLLAPSE_EXCEED_NONE);
 			goto out_unmap;
 		}
 		continue;
@@ -1420,6 +1433,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		if (++unmapped > max_ptes_swap) {
 			result = SCAN_EXCEED_SWAP_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+			count_mthp_stat(HPAGE_PMD_ORDER,
+					MTHP_STAT_COLLAPSE_EXCEED_SWAP);
 			goto out_unmap;
 		}
 		/*
@@ -1477,6 +1492,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		if (++shared > max_ptes_shared) {
 			result = SCAN_EXCEED_SHARED_PTE;
 			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+			count_mthp_stat(HPAGE_PMD_ORDER,
+					MTHP_STAT_COLLAPSE_EXCEED_SHARED);
 			goto out_unmap;
 		}
 	}
-- 
2.53.0

From nobody Sun Apr 26 08:12:48 2026
From: Nico Pache
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 08/13] mm/khugepaged: improve tracepoints for mTHP orders
Date: Sun, 19 Apr 2026 12:57:45 -0600
Message-ID: <20260419185750.260784-9-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to give better insight into the order being operated on.
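As a rough illustration (not the exact kernel output), the extended TP_printk format for mm_collapse_huge_page ("mm=%p, isolated=%d, status=%s, order=%u") would render a trace line as modeled by this hypothetical helper:

```python
# Hypothetical rendering of the mm_collapse_huge_page trace format after
# this patch; the kernel itself emits this via TP_printk, not this helper.
def render_collapse_event(mm: int, isolated: int, status: str, order: int) -> str:
    return f"mm={mm:#x}, isolated={isolated}, status={status}, order={order}"

# e.g. a successful PMD-order (9) collapse:
line = render_collapse_event(0xffff888100000000, 1, "SCAN_SUCCEED", 9)
```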
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Baolin Wang
Acked-by: David Hildenbrand (Arm)
Signed-off-by: Nico Pache
---
 include/trace/events/huge_memory.h | 34 +++++++++++++++++++-----------
 mm/khugepaged.c                    |  9 ++++----
 2 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index bcdc57eea270..291fae364c62 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -89,40 +89,44 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 
 TRACE_EVENT(mm_collapse_huge_page,
 
-	TP_PROTO(struct mm_struct *mm, int isolated, int status),
+	TP_PROTO(struct mm_struct *mm, int isolated, int status, unsigned int order),
 
-	TP_ARGS(mm, isolated, status),
+	TP_ARGS(mm, isolated, status, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, isolated)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->isolated = isolated;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, isolated=%d, status=%s",
+	TP_printk("mm=%p, isolated=%d, status=%s, order=%u",
 		__entry->mm,
 		__entry->isolated,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct folio *folio, int none_or_zero,
-		 int referenced, int status),
+		 int referenced, int status, unsigned int order),
 
-	TP_ARGS(folio, none_or_zero, referenced, status),
+	TP_ARGS(folio, none_or_zero, referenced, status, order),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
 		__field(int, none_or_zero)
 		__field(int, referenced)
 		__field(int, status)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -130,26 +134,30 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->none_or_zero = none_or_zero;
 		__entry->referenced = referenced;
 		__entry->status = status;
+		__entry->order = order;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s, order=%u",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
-		__print_symbolic(__entry->status, SCAN_STATUS))
+		__print_symbolic(__entry->status, SCAN_STATUS),
+		__entry->order)
 );
 
 TRACE_EVENT(mm_collapse_huge_page_swapin,
 
-	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret),
+	TP_PROTO(struct mm_struct *mm, int swapped_in, int referenced, int ret,
+		 unsigned int order),
 
-	TP_ARGS(mm, swapped_in, referenced, ret),
+	TP_ARGS(mm, swapped_in, referenced, ret, order),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(int, swapped_in)
 		__field(int, referenced)
 		__field(int, ret)
+		__field(unsigned int, order)
 	),
 
 	TP_fast_assign(
@@ -157,13 +165,15 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
 		__entry->swapped_in = swapped_in;
 		__entry->referenced = referenced;
 		__entry->ret = ret;
+		__entry->order = order;
 	),
 
-	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d",
+	TP_printk("mm=%p, swapped_in=%d, referenced=%d, ret=%d, order=%u",
 		__entry->mm,
 		__entry->swapped_in,
 		__entry->referenced,
-		__entry->ret)
+		__entry->ret,
+		__entry->order)
 );
 
 TRACE_EVENT(mm_khugepaged_scan_file,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0a1c7cc20c0e..a4f1c570b69b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -780,13 +780,13 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-						    referenced, result);
+						    referenced, result, order);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-					    referenced, result);
+					    referenced, result, order);
 	return result;
 }
 
@@ -1180,7 +1180,8 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
 
 	result = SCAN_SUCCEED;
 out:
-	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result,
+					   order);
 	return result;
 }
 
@@ -1376,7 +1377,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 out_nolock:
 	if (folio)
 		folio_put(folio);
-	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
+	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result, order);
 	return result;
 }
 
-- 
2.53.0

From nobody Sun Apr 26 08:12:48 2026
From: Nico Pache
To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH 7.2 v16 09/13] mm/khugepaged: introduce collapse_allowable_orders helper function
Date: Sun, 19 Apr 2026 12:57:46 -0600
Message-ID: <20260419185750.260784-10-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>
References: <20260419185750.260784-1-npache@redhat.com>

Add collapse_allowable_orders() to generalize THP order eligibility. The function determines which THP orders are permitted based on collapse context (khugepaged vs madv_collapse).
This consolidates collapse configuration logic and provides a clean interface for future mTHP collapse support where the orders may be different. Reviewed-by: Baolin Wang Signed-off-by: Nico Pache --- include/linux/khugepaged.h | 6 ++---- mm/huge_memory.c | 2 +- mm/khugepaged.c | 20 ++++++++++++++------ mm/vma.c | 6 +++--- tools/testing/vma/include/stubs.h | 3 +-- 5 files changed, 21 insertions(+), 16 deletions(-) diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index d7a9053ff4fe..e87df2fa6931 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -13,8 +13,7 @@ extern void khugepaged_destroy(void); extern int start_stop_khugepaged(void); extern void __khugepaged_enter(struct mm_struct *mm); extern void __khugepaged_exit(struct mm_struct *mm); -extern void khugepaged_enter_vma(struct vm_area_struct *vma, - vm_flags_t vm_flags); +extern void khugepaged_enter_vma(struct vm_area_struct *vma); extern void khugepaged_min_free_kbytes_update(void); extern bool current_is_khugepaged(void); void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, @@ -38,8 +37,7 @@ static inline void khugepaged_fork(struct mm_struct *mm, = struct mm_struct *oldmm static inline void khugepaged_exit(struct mm_struct *mm) { } -static inline void khugepaged_enter_vma(struct vm_area_struct *vma, - vm_flags_t vm_flags) +static inline void khugepaged_enter_vma(struct vm_area_struct *vma) { } static inline void collapse_pte_mapped_thp(struct mm_struct *mm, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5c128cdec810..1023698a8b96 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1557,7 +1557,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault= *vmf) ret =3D vmf_anon_prepare(vmf); if (ret) return ret; - khugepaged_enter_vma(vma, vma->vm_flags); + khugepaged_enter_vma(vma); =20 if (!(vmf->flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(vma->vm_mm) && diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 
a4f1c570b69b..fdbdc1a1cdd9 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -447,7 +447,7 @@ int hugepage_madvise(struct vm_area_struct *vma, * register it here without waiting a page fault that * may not happen any time soon. */ - khugepaged_enter_vma(vma, *vm_flags); + khugepaged_enter_vma(vma); break; case MADV_NOHUGEPAGE: *vm_flags &=3D ~VM_HUGEPAGE; @@ -546,12 +546,20 @@ void __khugepaged_enter(struct mm_struct *mm) wake_up_interruptible(&khugepaged_wait); } =20 -void khugepaged_enter_vma(struct vm_area_struct *vma, - vm_flags_t vm_flags) +/* Check what orders are allowed based on the vma and collapse type */ +static unsigned long collapse_allowable_orders(struct vm_area_struct *vma, + enum tva_type tva_flags) +{ + unsigned long orders =3D BIT(HPAGE_PMD_ORDER); + + return thp_vma_allowable_orders(vma, vma->vm_flags, tva_flags, orders); +} + +void khugepaged_enter_vma(struct vm_area_struct *vma) { if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) && hugepage_pmd_enabled()) { - if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) + if (collapse_allowable_orders(vma, TVA_KHUGEPAGED)) __khugepaged_enter(vma->vm_mm); } } @@ -2664,7 +2672,7 @@ static void collapse_scan_mm_slot(unsigned int progre= ss_max, cc->progress++; break; } - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORD= ER)) { + if (!collapse_allowable_orders(vma, TVA_KHUGEPAGED)) { cc->progress++; continue; } @@ -2973,7 +2981,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsi= gned long start, BUG_ON(vma->vm_start > start); BUG_ON(vma->vm_end < end); =20 - if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD= _ORDER)) + if (!collapse_allowable_orders(vma, TVA_FORCED_COLLAPSE)) return -EINVAL; =20 cc =3D kmalloc_obj(*cc); diff --git a/mm/vma.c b/mm/vma.c index 377321b48734..c0398fb597b3 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -989,7 +989,7 @@ static __must_check struct vm_area_struct *vma_merge_ex= isting_range( goto abort; =20 
 	vma_set_flags_mask(vmg->target, sticky_flags);
-	khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+	khugepaged_enter_vma(vmg->target);
 	vmg->state = VMA_MERGE_SUCCESS;
 	return vmg->target;
 
@@ -1110,7 +1110,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg)
 	 * following VMA if we have VMAs on both sides.
 	 */
 	if (vmg->target && !vma_expand(vmg)) {
-		khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+		khugepaged_enter_vma(vmg->target);
 		vmg->state = VMA_MERGE_SUCCESS;
 		return vmg->target;
 	}
@@ -2589,7 +2589,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap,
 	 * call covers the non-merge case.
 	 */
 	if (!vma_is_anonymous(vma))
-		khugepaged_enter_vma(vma, map->vm_flags);
+		khugepaged_enter_vma(vma);
 	*vmap = vma;
 	return 0;
 
diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
index a30b8bc84955..3d9a2daa2712 100644
--- a/tools/testing/vma/include/stubs.h
+++ b/tools/testing/vma/include/stubs.h
@@ -182,8 +182,7 @@ static inline bool mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	return true;
 }
 
-static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
-					vm_flags_t vm_flags)
+static inline void khugepaged_enter_vma(struct vm_area_struct *vma)
 {
 }
 
-- 
2.53.0
From: Nico Pache
Subject: [PATCH 7.2 v16 10/13] mm/khugepaged: Introduce mTHP collapse support
Date: Sun, 19 Apr 2026 12:57:47 -0600
Message-ID: <20260419185750.260784-11-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>

Enable khugepaged to collapse to mTHP orders. This patch implements the
main scanning logic, using a bitmap to track occupied pages and a stack
structure that allows us to find optimal collapse sizes.

Prior to this patch, PMD collapse had three main phases: a lightweight
scanning phase (mmap_read_lock) that identifies a potential PMD
collapse, an allocation phase (mmap unlocked), and finally the heavier
collapse phase (mmap_write_lock).

To enable mTHP collapse we make the following changes: during the PMD
scan phase, track occupied pages in a bitmap. When mTHP orders are
enabled, we remove the max_ptes_none restriction during the scan phase
to avoid missing potential mTHP collapse candidates.

Once we have scanned the full PMD range and updated the bitmap of
occupied pages, we use the bitmap to find the optimal mTHP size.
Implement mthp_collapse() to perform binary recursion on the bitmap and
determine the best eligible order for the collapse. An explicit stack
structure is used in place of traditional recursion to manage the
search; this avoids deep recursive calls, which matters because kernel
stack space is limited. The algorithm recursively splits the bitmap
into smaller chunks to find the highest-order mTHPs that satisfy the
collapse criteria.
We start by attempting the PMD order, then move on to consecutively
lower orders (mTHP collapse). The stack maintains a pair of variables
(offset, order), indicating the number of PTEs from the start of the
PMD and the order of the potential collapse candidate. The algorithm
for consuming the bitmap works as follows:

1) push (0, HPAGE_PMD_ORDER) onto the stack
2) pop the stack
3) check whether the number of set bits in that (offset, order) pair
   satisfies the max_ptes_none threshold for that order
4) if yes, attempt collapse
5) if no (or the collapse fails), push two new stack items representing
   the left and right halves of the current bitmap range, at the next
   lower order
6) repeat from step (2) until the stack is empty

Below is a diagram representing the algorithm and stack items:

   offset           mid_offset
      |                 |
      |                 |
      v                 v
    ____________________________________
   |           PTE Page Table           |
    ------------------------------------
    <---------------><--------------->
         order-1          order-1

mTHP collapses reject regions containing swapped-out or shared pages,
because adding new entries can lead to new none pages, and these may
lead to constant promotion into a higher-order mTHP. A similar issue
can occur with max_ptes_none > HPAGE_PMD_NR/2, because a collapse
introduces at least 2x the number of pages, so a future scan would
satisfy the promotion condition once again. This is prevented via the
collapse_max_ptes_none() function, which imposes the max_ptes_none
restrictions above. We currently only support mTHP collapse for
max_ptes_none values of 0 and HPAGE_PMD_NR - 1, resulting in the
following behavior:

- max_ptes_none=0: never introduce new empty pages during collapse
- max_ptes_none=HPAGE_PMD_NR-1: always try to collapse to the highest
  available mTHP order

Any other max_ptes_none value will emit a warning and skip mTHP
collapse attempts. There should be no behavior change for PMD collapse.
Once we determine which mTHP size fits best in that PMD range, a
collapse is attempted.
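The stack-driven walk in steps (1)-(6) above can be sketched as a small
user-space model. This is a hedged illustration, not the kernel code: the
`SIM_` names, the byte-per-PTE occupancy array, and the per-order scaling
of max_ptes_none are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for PMD geometry; not kernel definitions. */
#define SIM_PMD_ORDER      9			/* 512 PTEs per PMD on common configs */
#define SIM_NR_PTES        (1u << SIM_PMD_ORDER)
#define SIM_MIN_MTHP_ORDER 2
#define SIM_STACK_SIZE     (1u << (SIM_PMD_ORDER - SIM_MIN_MTHP_ORDER))

struct sim_range { uint16_t offset; uint8_t order; };

/* Number of occupied PTEs in [offset, offset + nr). */
static unsigned sim_count_present(const unsigned char *bm, unsigned offset,
				  unsigned nr)
{
	unsigned i, weight = 0;

	for (i = 0; i < nr; i++)
		weight += bm[offset + i];
	return weight;
}

/*
 * Steps (1)-(6): push (0, PMD order); pop a range; if its occupancy meets
 * the threshold, "collapse" it; otherwise push its two half-ranges at the
 * next lower order. Returns how many PTEs were covered by collapses.
 */
static unsigned sim_collapse(const unsigned char *bm, unsigned max_ptes_none)
{
	struct sim_range stack[SIM_STACK_SIZE];
	int top = 0;
	unsigned collapsed = 0;

	stack[top++] = (struct sim_range){ 0, SIM_PMD_ORDER };
	while (top) {
		struct sim_range r = stack[--top];
		unsigned nr = 1u << r.order;
		/* Assumed scaling of the PMD-level threshold to this order. */
		unsigned none_limit = max_ptes_none >> (SIM_PMD_ORDER - r.order);

		if (sim_count_present(bm, r.offset, nr) >= nr - none_limit) {
			collapsed += nr;	/* the kernel would collapse here */
			continue;
		}
		if (r.order > SIM_MIN_MTHP_ORDER) {
			uint8_t next = r.order - 1;

			stack[top++] = (struct sim_range){
				(uint16_t)(r.offset + nr / 2), next };
			stack[top++] = (struct sim_range){ r.offset, next };
		}
	}
	return collapsed;
}
```

With max_ptes_none=0 only fully occupied ranges collapse (e.g. four occupied
PTEs at offset 0 yield a single order-2 collapse), while max_ptes_none of
SIM_NR_PTES - 1 lets the whole range collapse at the top order, matching the
two supported behaviors described above.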
A minimum collapse order of 2 is used, as this is the lowest order
supported by anonymous memory, as defined by THP_ORDERS_ALL_ANON.
Currently madvise_collapse is not supported and will only attempt PMD
collapse. We can also remove the is_khugepaged check inside the PMD
scan, as the collapse_max_ptes_none() function now handles this logic.

Signed-off-by: Nico Pache
---
 mm/khugepaged.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 174 insertions(+), 7 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fdbdc1a1cdd9..81ea7cbc54b2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -99,6 +99,31 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+#define KHUGEPAGED_MIN_MTHP_ORDER 2
+/*
+ * The maximum number of mTHP ranges that can be stored on the stack.
+ * This is calculated based on the number of PTE entries in a PTE page table
+ * and the minimum mTHP order.
+ *
+ * ilog2 is needed in place of HPAGE_PMD_ORDER due to some architectures
+ * (ie ppc64le) not defining HPAGE_PMD_ORDER until after build time.
+ *
+ * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges
+ */
+#define MTHP_STACK_SIZE (1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+
+/*
+ * Defines a range of PTE entries in a PTE page table which are being
+ * considered for mTHP collapse.
+ *
+ * @offset: the offset of the first PTE entry in a PMD range.
+ * @order: the order of the PTE entries being considered for collapse.
+ */
+struct mthp_range {
+	u16 offset;
+	u8 order;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -110,6 +135,12 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/* Each bit represents a single occupied (!none/zero) page. */
+	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
+	/* A mask of the current range being considered for mTHP collapse.
+	 */
+	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
 };
 
 /**
@@ -1389,22 +1420,142 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
 	return result;
 }
 
+static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
+				     u16 offset, u8 order)
+{
+	const int size = *stack_size;
+	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
+
+	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
+	stack->order = order;
+	stack->offset = offset;
+	(*stack_size)++;
+}
+
+static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
+						 int *stack_size)
+{
+	const int size = *stack_size;
+
+	VM_WARN_ON_ONCE(size <= 0);
+	(*stack_size)--;
+	return cc->mthp_bitmap_stack[size - 1];
+}
+
+static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
+						u16 offset, unsigned int nr_ptes)
+{
+	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
+	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+}
+
+/*
+ * mthp_collapse() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. The stack maintains a pair of
+ * variables (offset, order), indicating the number of PTEs from the start of
+ * the PMD, and the order of the potential collapse candidate respectively. We
+ * start at the PMD order and check if it is eligible for collapse; if not, we
+ * add two entries to the stack at a lower order to represent the left and right
+ * halves of the PTE page table we are examining.
+ *
+ *   offset           mid_offset
+ *      |                 |
+ *      v                 v
+ *    --------------------------------------
+ *   |            cc->mthp_bitmap           |
+ *    --------------------------------------
+ *    <---------------><--------------->
+ *         order-1          order-1
+ *
+ * For each of these, we determine how many PTE entries are occupied in the
+ * range of PTE entries we propose to collapse, then we compare this to a
+ * threshold number of PTE entries which would need to be occupied for a
+ * collapse to be permitted at that order (accounting for max_ptes_none).
+ *
+ * If a collapse is permitted, we attempt to collapse the PTE range into a
+ * mTHP.
+ */
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		unsigned long enabled_orders)
+{
+	unsigned int nr_occupied_ptes, nr_ptes;
+	int max_ptes_none, collapsed = 0, stack_size = 0;
+	unsigned long collapse_address;
+	struct mthp_range range;
+	u16 offset;
+	u8 order;
+
+	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
+
+	while (stack_size) {
+		range = collapse_mthp_stack_pop(cc, &stack_size);
+		order = range.order;
+		offset = range.offset;
+		nr_ptes = 1UL << order;
+
+		if (!test_bit(order, &enabled_orders))
+			goto next_order;
+
+		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
+
+		if (max_ptes_none < 0)
+			return collapsed;
+
+		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
+							       nr_ptes);
+
+		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
+			int ret;
+
+			collapse_address = address + offset * PAGE_SIZE;
+			ret = collapse_huge_page(mm, collapse_address, referenced,
+						 unmapped, cc, order);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += nr_ptes;
+				continue;
+			}
+		}
+
+next_order:
+		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+			const u8 next_order = order - 1;
+			const u16 mid_offset = offset + (nr_ptes / 2);
+
+			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
+						 next_order);
+			collapse_mthp_stack_push(cc, &stack_size, offset,
+						 next_order);
+		}
+	}
+	return collapsed;
+}
+
 static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 					  struct vm_area_struct *vma,
 					  unsigned long start_addr, bool *lock_dropped,
 					  struct collapse_control *cc)
 {
 	pmd_t *pmd;
-	pte_t *pte, *_pte;
-	int none_or_zero = 0, shared = 0, referenced = 0;
+	pte_t *pte, *_pte, pteval;
+	int i;
+	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
 	enum scan_result result = SCAN_FAIL;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
 	unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
 	unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
+	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
 
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
@@ -1414,8 +1565,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
+	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = collapse_allowable_orders(vma, tva_flags);
+
+	/*
+	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
+	 * scan all pages to populate the bitmap for mTHP collapse.
+	 */
+	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
+		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
+
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
 		cc->progress++;
@@ -1423,11 +1585,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, addr += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		_pte = pte + i;
+		addr = start_addr + i * PAGE_SIZE;
+		pteval = ptep_get(_pte);
+
 		cc->progress++;
 
-		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			if (++none_or_zero > max_ptes_none) {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1507,6 +1671,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 			}
 		}
 
+		/* Set bit for occupied pages */
+		__set_bit(i, cc->mthp_bitmap);
 		/*
 		 * Record which node the original page is from and save this
 		 * information to cc->node_load[].
@@ -1570,10 +1736,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
-	result = collapse_huge_page(mm, start_addr, referenced,
-				    unmapped, cc, HPAGE_PMD_ORDER);
+	nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
+				     cc, enabled_orders);
 	/* collapse_huge_page will return with the mmap_lock released */
 	*lock_dropped = true;
+	result = nr_collapsed ?
			 SCAN_SUCCEED : SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.53.0
From: Nico Pache
Subject: [PATCH 7.2 v16 11/13] mm/khugepaged: avoid unnecessary mTHP collapse attempts
Date: Sun, 19 Apr 2026 12:57:48 -0600
Message-ID: <20260419185750.260784-12-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>

There are cases where, if an attempted collapse fails, all subsequent
orders are guaranteed to also fail. Avoid these collapse attempts by
bailing out early.
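The key distinction this patch encodes is between failures that are
specific to the order just attempted and failures that doom the whole PMD
range. A minimal, standalone sketch of that classification (the `sim_`
enum values are illustrative assumptions mirroring only a subset of the
kernel's scan results):

```c
#include <stdbool.h>

/*
 * Illustrative-only subset of scan results; these mirror the kernel's
 * enum scan_result in spirit but are not the kernel definitions.
 */
enum sim_scan_result {
	SIM_SCAN_SUCCEED,
	SIM_SCAN_EXCEED_NONE_PTE,	/* too many empty PTEs at this order */
	SIM_SCAN_ALLOC_HUGE_PAGE_FAIL,	/* a smaller allocation may still fit */
	SIM_SCAN_VMA_NULL,		/* range-wide failure: no order helps */
};

/*
 * The bail-out policy: only retry at a lower order when the failure is
 * specific to the size just attempted; range-wide failures end the walk.
 */
static bool sim_lower_order_may_succeed(enum sim_scan_result r)
{
	switch (r) {
	case SIM_SCAN_EXCEED_NONE_PTE:
	case SIM_SCAN_ALLOC_HUGE_PAGE_FAIL:
		return true;
	default:
		return false;
	}
}
```

Failures like exceeding max_ptes_none shrink away at smaller orders, so
halving and retrying makes sense; a defunct mm or unsuitable VMA cannot be
fixed by trying a smaller order, so the walk returns immediately.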
Reviewed-by: Lorenzo Stoakes
Acked-by: Usama Arif
Acked-by: David Hildenbrand (Arm)
Signed-off-by: Nico Pache
---
 mm/khugepaged.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 81ea7cbc54b2..13b05bbb08e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1517,9 +1517,31 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
 			collapse_address = address + offset * PAGE_SIZE;
 			ret = collapse_huge_page(mm, collapse_address, referenced,
 						 unmapped, cc, order);
-			if (ret == SCAN_SUCCEED) {
+
+			switch (ret) {
+			/* Cases where we continue to next collapse candidate */
+			case SCAN_SUCCEED:
 				collapsed += nr_ptes;
+				fallthrough;
+			case SCAN_PTE_MAPPED_HUGEPAGE:
 				continue;
+			/* Cases where lower orders might still succeed */
+			case SCAN_LACK_REFERENCED_PAGE:
+			case SCAN_EXCEED_NONE_PTE:
+			case SCAN_EXCEED_SWAP_PTE:
+			case SCAN_EXCEED_SHARED_PTE:
+			case SCAN_PAGE_LOCK:
+			case SCAN_PAGE_COUNT:
+			case SCAN_PAGE_LRU:
+			case SCAN_PAGE_NULL:
+			case SCAN_DEL_PAGE_LRU:
+			case SCAN_PTE_NON_PRESENT:
+			case SCAN_PTE_UFFD_WP:
+			case SCAN_ALLOC_HUGE_PAGE_FAIL:
+				goto next_order;
+			/* Cases where no further collapse is possible */
+			default:
+				return collapsed;
 			}
 		}
 
-- 
2.53.0
From: Nico Pache
Subject: [PATCH 7.2 v16 12/13] mm/khugepaged: run khugepaged for all orders
Date: Sun, 19 Apr 2026 12:57:49 -0600
Message-ID: <20260419185750.260784-13-npache@redhat.com>
In-Reply-To: <20260419185750.260784-1-npache@redhat.com>

From: Baolin Wang

If any order of (m)THP is enabled, we should allow khugepaged to run and
attempt scanning and collapsing mTHPs. For khugepaged to operate when
only mTHP sizes are specified in sysfs, we must modify the predicate
function that determines whether it ought to run. This function is
currently called hugepage_pmd_enabled(); this patch renames it to
hugepage_enabled() and updates the logic to determine whether any valid
orders exist which would justify khugepaged running.

We must also update collapse_allowable_orders() to check all orders if
the vma is anonymous and the collapse is initiated by khugepaged. After
this patch, khugepaged mTHP collapse is fully enabled.

Reviewed-by: Lorenzo Stoakes
Reviewed-by: Lance Yang
Acked-by: Usama Arif
Acked-by: David Hildenbrand (Arm)
Signed-off-by: Baolin Wang
Signed-off-by: Nico Pache
---
 mm/khugepaged.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 13b05bbb08e7..7d48d4fbd5f3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -524,23 +524,23 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
 	       mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
-static bool hugepage_pmd_enabled(void)
+static bool hugepage_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
 	 * hugepages, when configured in, are determined by the global control.
-	 * Anon pmd-sized hugepages are determined by the pmd-size control.
+	 * Anon hugepages are determined by its per-size mTHP control.
 	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
 	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
 	 */
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && hugepage_global_enabled())
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
+	if (READ_ONCE(huge_anon_orders_always))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
+	if (READ_ONCE(huge_anon_orders_madvise))
 		return true;
-	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+	if (READ_ONCE(huge_anon_orders_inherit) &&
 	    hugepage_global_enabled())
 		return true;
 	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
@@ -581,7 +581,13 @@ void __khugepaged_enter(struct mm_struct *mm)
 static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
 					       enum tva_type tva_flags)
 {
-	unsigned long orders = BIT(HPAGE_PMD_ORDER);
+	unsigned long orders;
+
+	/* If khugepaged is scanning an anonymous vma, allow mTHP collapse */
+	if ((tva_flags & TVA_KHUGEPAGED) && vma_is_anonymous(vma))
+		orders = THP_ORDERS_ALL_ANON;
+	else
+		orders = BIT(HPAGE_PMD_ORDER);
 
 	return thp_vma_allowable_orders(vma, vma->vm_flags, tva_flags, orders);
 }
@@ -589,7 +595,7 @@ static unsigned long collapse_allowable_orders(struct vm_area_struct *vma,
 void khugepaged_enter_vma(struct vm_area_struct *vma)
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    hugepage_pmd_enabled()) {
+	    hugepage_enabled()) {
 		if (collapse_allowable_orders(vma, TVA_KHUGEPAGED))
 			__khugepaged_enter(vma->vm_mm);
 	}
@@ -2936,7 +2942,7 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
 
 static int khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+	return !list_empty(&khugepaged_scan.mm_head) && hugepage_enabled();
 }
 
 static int khugepaged_wait_event(void)
@@ -3009,7 +3015,7 @@ static
void khugepaged_wait_work(void) return; } =20 - if (hugepage_pmd_enabled()) + if (hugepage_enabled()) wait_event_freezable(khugepaged_wait, khugepaged_wait_event()); } =20 @@ -3040,7 +3046,7 @@ void set_recommended_min_free_kbytes(void) int nr_zones =3D 0; unsigned long recommended_min; =20 - if (!hugepage_pmd_enabled()) { + if (!hugepage_enabled()) { calculate_min_free_kbytes(); goto update_wmarks; } @@ -3090,7 +3096,7 @@ int start_stop_khugepaged(void) int err =3D 0; =20 mutex_lock(&khugepaged_mutex); - if (hugepage_pmd_enabled()) { + if (hugepage_enabled()) { if (!khugepaged_thread) khugepaged_thread =3D kthread_run(khugepaged, NULL, "khugepaged"); @@ -3116,7 +3122,7 @@ int start_stop_khugepaged(void) void khugepaged_min_free_kbytes_update(void) { mutex_lock(&khugepaged_mutex); - if (hugepage_pmd_enabled() && khugepaged_thread) + if (hugepage_enabled() && khugepaged_thread) set_recommended_min_free_kbytes(); mutex_unlock(&khugepaged_mutex); } --=20 2.53.0 From nobody Sun Apr 26 08:12:48 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 488AB41A8F for ; Sun, 19 Apr 2026 19:02:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776625344; cv=none; b=EVMB1UHi2KIjhwdsuERNFopqijivVKBFvZhHjbU39qjlNK4+I1UG6TXC1Cc5AjXBbYHCePVf1wYGdfa2UDPtL0UFgbwDjkdsHJwU7MVP76FJsUr1Iy2+Q4CcSW59uPORAp83PaAG1vQrqgEigO6MmiNvNjrL/C4YSi1v6AUbmDU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776625344; c=relaxed/simple; bh=zansVm5T2WAjzye9o7pv93toFYW0Nj4Zn6PSIfwQxDo=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=NJLFl7UwshCy8umOQ3QxM6MS5iJy2HoafGOtIZKxwc+ESbOuHqHA93nUwaoxR3oNRzNWY4Armc5/nJdxtMNyaqbec+pvwmU8AzMWGZh7/fZNvMzjPovC7OF1BIdI6m1W22EBY2ljrrICuOrA0J5GEmrA/hWuAiwvesa33SwylJA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=B1588n4r; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="B1588n4r" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1776625342; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=WbaolfbnRxGZaI+LbysHzTSE/a3XUQ1Jm7g+xQscRIs=; b=B1588n4riZcHUAPZjNbKOiBA3bOAXGNebjcP6v1zjcPuv7jsRQJ4eIKIje+O8WEVwWGPZG dVy16XZ5zVUTVr7IvXQVpuOizPcq+oXnPcKAB8WfJJ5MyHpGtQoY6roosdwzg1bd53/n1P h0gA8l49cdgSVJKXB9ZvgiUkLf2SEos= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-433-ph47BixWNc64nih335XQRQ-1; Sun, 19 Apr 2026 15:02:19 -0400 X-MC-Unique: ph47BixWNc64nih335XQRQ-1 X-Mimecast-MFC-AGG-ID: ph47BixWNc64nih335XQRQ_1776625337 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) 
(No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id D31211956089; Sun, 19 Apr 2026 19:02:17 +0000 (UTC) Received: from p1.redhat.com (unknown [10.22.74.5]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 5E2CC195608E; Sun, 19 Apr 2026 19:02:02 +0000 (UTC) From: Nico Pache To: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org Cc: aarcange@redhat.com, akpm@linux-foundation.org, anshuman.khandual@arm.com, apopple@nvidia.com, baohua@kernel.org, baolin.wang@linux.alibaba.com, byungchul@sk.com, catalin.marinas@arm.com, cl@gentwo.org, corbet@lwn.net, dave.hansen@linux.intel.com, david@kernel.org, dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org, hughd@google.com, jack@suse.cz, jackmanb@google.com, jannh@google.com, jglisse@google.com, joshua.hahnjy@gmail.com, kas@kernel.org, lance.yang@linux.dev, Liam.Howlett@oracle.com, ljs@kernel.org, mathieu.desnoyers@efficios.com, matthew.brost@intel.com, mhiramat@kernel.org, mhocko@suse.com, npache@redhat.com, peterx@redhat.com, pfalcato@suse.de, rakie.kim@sk.com, raquini@redhat.com, rdunlap@infradead.org, richard.weiyang@gmail.com, rientjes@google.com, rostedt@goodmis.org, rppt@kernel.org, ryan.roberts@arm.com, shivankg@amd.com, sunnanyong@huawei.com, surenb@google.com, thomas.hellstrom@linux.intel.com, tiwai@suse.de, usamaarif642@gmail.com, vbabka@suse.cz, vishal.moola@gmail.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yang@os.amperecomputing.com, ying.huang@linux.alibaba.com, ziy@nvidia.com, zokeefe@google.com, Bagas Sanjaya Subject: [PATCH 7.2 v16 13/13] Documentation: mm: update the admin guide for mTHP collapse Date: Sun, 19 Apr 2026 12:57:50 -0600 Message-ID: <20260419185750.260784-14-npache@redhat.com> In-Reply-To: <20260419185750.260784-1-npache@redhat.com> References: <20260419185750.260784-1-npache@redhat.com> Precedence: bulk 
X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 Content-Type: text/plain; charset="utf-8" Now that we can collapse to mTHPs lets update the admin guide to reflect these changes and provide proper guidance on how to utilize it. Reviewed-by: Lorenzo Stoakes Reviewed-by: Bagas Sanjaya Signed-off-by: Nico Pache --- Documentation/admin-guide/mm/transhuge.rst | 49 +++++++++++++--------- 1 file changed, 29 insertions(+), 20 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/adm= in-guide/mm/transhuge.rst index eebb1f6bbc6c..0ef13c451ac8 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -63,7 +63,8 @@ often. THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into PMD-sized huge pages. +collapses sequences of basic pages into huge pages of either PMD size +or mTHP sizes, if the system is configured to do so. =20 The THP behaviour is controlled via :ref:`sysfs ` interface and using madvise(2) and prctl(2) system calls. 
@@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and = enable it by writing echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused =20 -khugepaged will be automatically started when PMD-sized THP is enabled +khugepaged will be automatically started when any THP size is enabled (either of the per-size anon control or the top-level control are set to "always" or "madvise"), and it'll be automatically shutdown when -PMD-sized THP is disabled (when both the per-size anon control and the +all THP sizes are disabled (when both the per-size anon control and the top-level control are "never") =20 process THP controls @@ -264,11 +265,6 @@ support the following arguments:: Khugepaged controls ------------------- =20 -.. note:: - khugepaged currently only searches for opportunities to collapse to - PMD-sized THP and no attempt is made to collapse to other THP - sizes. - khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -296,11 +292,11 @@ allocation failure to throttle the next allocation at= tempt:: The khugepaged progress can be seen in the number of pages collapsed (note that this counter may not be an exact count of the number of pages collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by -one 2M hugepage. Each may happen independently, or together, depending on -the type of memory and the failures that occur. As such, this value should -be interpreted roughly as a sign of progress, and counters in /proc/vmstat -consulted for more accurate accounting):: +being replaced by a PMD mapping, or (2) physical pages replaced by one +hugepage of various sizes (PMD-sized or mTHP). 
Each may happen independent= ly, +or together, depending on the type of memory and the failures that occur. +As such, this value should be interpreted roughly as a sign of progress, +and counters in /proc/vmstat consulted for more accurate accounting):: =20 /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed =20 @@ -308,16 +304,20 @@ for each pass:: =20 /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans =20 -``max_ptes_none`` specifies how many extra small pages (that are -not already mapped) can be allocated when collapsing a group -of small pages into one large page:: +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed +when collapsing a group of small pages into one large page:: =20 /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none =20 -A higher value leads to use additional memory for programs. -A lower value leads to gain less thp performance. Value of -max_ptes_none can waste cpu time very little, you can -ignore it. +For PMD-sized THP collapse, this directly limits the number of empty pages +allowed in the 2MB region. + +For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. Any other v= alue +will emit a warning and no mTHP collapse will be attempted. + +A higher value allows more empty pages, potentially leading to more memory +usage but better THP performance. A lower value is more conservative and +may result in fewer THP collapses. =20 ``max_ptes_swap`` specifies how many pages can be brought in from swap when collapsing a group of pages into a transparent huge page:: @@ -337,6 +337,15 @@ that THP is shared. Exceeding the number would block t= he collapse:: =20 A higher value may increase memory footprint for some workloads. =20 +.. note:: + For mTHP collapse, khugepaged does not support collapsing regions that + contain shared or swapped out pages, as this could lead to continuous + promotion to higher orders. 
The collapse will fail if any shared or + swapped PTEs are encountered during the scan. + + Currently, madvise_collapse only supports collapsing to PMD-sized THPs + and does not attempt mTHP collapses. + Boot parameters =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =20 --=20 2.53.0