From nobody Mon Dec 1 21:33:00 2025
From: Nico Pache <npache@redhat.com>
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-doc@vger.kernel.org
Cc: david@redhat.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	ryan.roberts@arm.com, dev.jain@arm.com, corbet@lwn.net,
	rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
	akpm@linux-foundation.org, baohua@kernel.org, willy@infradead.org,
	peterx@redhat.com, wangkefeng.wang@huawei.com, usamaarif642@gmail.com,
	sunnanyong@huawei.com, vishal.moola@gmail.com,
	thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
	kas@kernel.org, aarcange@redhat.com, raquini@redhat.com,
	anshuman.khandual@arm.com, catalin.marinas@arm.com, tiwai@suse.de,
	will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz,
	cl@gentwo.org, jglisse@google.com, surenb@google.com,
	zokeefe@google.com, hannes@cmpxchg.org, rientjes@google.com,
	mhocko@suse.com, rdunlap@infradead.org, hughd@google.com,
	richard.weiyang@gmail.com, lance.yang@linux.dev, vbabka@suse.cz,
	rppt@kernel.org, jannh@google.com, pfalcato@suse.de
Subject: [PATCH v13 mm-new 13/16] khugepaged: Introduce mTHP collapse support
Date: Mon, 1 Dec 2025 10:46:24 -0700
Message-ID: <20251201174627.23295-14-npache@redhat.com>
In-Reply-To: <20251201174627.23295-1-npache@redhat.com>
References: <20251201174627.23295-1-npache@redhat.com>

Enable khugepaged to collapse to mTHP orders. This patch implements the
main scanning logic: a bitmap tracks occupied pages, and a stack
structure lets us find optimal collapse sizes.

Prior to this patch, PMD collapse had three main phases: a lightweight
scanning phase (mmap_read_lock) that identifies a potential PMD
collapse, an allocation phase (mmap unlocked), and finally a heavier
collapse phase (mmap_write_lock).

To enable mTHP collapse, we make the following changes:

During the PMD scan phase, track occupied pages in a bitmap. When mTHP
orders are enabled, we lift the max_ptes_none restriction during the
scan phase to avoid missing potential mTHP collapse candidates.

Once we have scanned the full PMD range and updated the bitmap to track
occupied pages, we use the bitmap to find the optimal mTHP size.
Implement mthp_collapse() to perform binary recursion on the bitmap and
determine the best eligible order for the collapse. A stack structure
is used instead of traditional recursion to manage the search. The
algorithm recursively splits the bitmap into smaller chunks to find the
highest-order mTHPs that satisfy the collapse criteria. We start by
attempting the PMD order, then move on to consecutively lower orders
(mTHP collapse). The stack maintains a pair of variables (offset,
order), indicating the number of PTEs from the start of the PMD and the
order of the potential collapse candidate.

The algorithm for consuming the bitmap works as follows (see the sketch
after this list):

1) push (0, HPAGE_PMD_ORDER) onto the stack
2) pop the stack
3) check whether the number of set bits in that (offset, order) pair
   satisfies the max_ptes_none threshold for that order
4) if yes, attempt collapse
5) if no (or the collapse fails), push two new stack items representing
   the left and right halves of the current bitmap range, at the next
   lower order
6) repeat from step (2) until the stack is empty
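To make the traversal concrete, here is a minimal userspace C model of
steps (1)-(6). This is an illustrative sketch, not kernel code: the
names (struct range, weight(), scan_bitmap()) are invented for the
example, a plain bool array stands in for cc->mthp_bitmap, the collapse
attempt is reduced to a printf(), and the per-order scaling of
max_ptes_none is an assumption standing in for collapse_max_ptes_none().
The authoritative implementation is mthp_collapse() in the diff below.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PMD_ORDER	9			/* 512 PTEs per PMD (x86-64) */
#define NR_PTES		(1u << PMD_ORDER)
#define MIN_MTHP_ORDER	2			/* KHUGEPAGED_MIN_MTHP_ORDER */
#define STACK_MAX	(1u << (PMD_ORDER - MIN_MTHP_ORDER))

struct range { uint16_t offset; uint8_t order; };  /* like struct mthp_range */

/* Count occupied (set) entries in bitmap[offset, offset + nr). */
static unsigned int weight(const bool *bitmap, uint16_t offset, unsigned int nr)
{
	unsigned int i, w = 0;

	for (i = 0; i < nr; i++)
		w += bitmap[offset + i];
	return w;
}

static void scan_bitmap(const bool *bitmap, unsigned int max_ptes_none)
{
	struct range stack[STACK_MAX];
	int top = 0;

	stack[top++] = (struct range){ 0, PMD_ORDER };		/* step 1 */
	while (top > 0) {					/* step 6 */
		struct range r = stack[--top];			/* step 2 */
		unsigned int nr = 1u << r.order;
		/*
		 * Step 3: assumed per-order scaling of max_ptes_none; the
		 * real threshold comes from collapse_max_ptes_none().
		 */
		unsigned int allowed_none = max_ptes_none >> (PMD_ORDER - r.order);

		if (weight(bitmap, r.offset, nr) >= nr - allowed_none) {
			/* step 4: the kernel would attempt collapse here */
			printf("collapse order %u at offset %u\n",
			       (unsigned)r.order, (unsigned)r.offset);
			continue;
		}
		/* step 5: split into left/right halves at the next lower order */
		if (r.order > MIN_MTHP_ORDER) {
			stack[top++] = (struct range){ r.offset + nr / 2, r.order - 1 };
			stack[top++] = (struct range){ r.offset, r.order - 1 };
		}
	}
}

int main(void)
{
	bool bitmap[NR_PTES] = { false };
	unsigned int i;

	/* Occupy only the first half of the PMD range. */
	for (i = 0; i < NR_PTES / 2; i++)
		bitmap[i] = true;
	scan_bitmap(bitmap, 0);		/* max_ptes_none == 0 */
	return 0;
}

With the first 256 of 512 PTEs occupied and max_ptes_none=0, this
prints a single "collapse order 8 at offset 0": the order-9 (PMD)
attempt fails the threshold, the fully occupied left half collapses at
order 8, and the empty right half recurses down to order 2 without
ever qualifying.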
Below is a diagram representing the algorithm and stack items:

  offset              mid_offset
    |                   |
    |                   |
    v                   v
    ________________________________________
   |             PTE Page Table             |
    ----------------------------------------
    <------------------><------------------>
          order-1             order-1

We currently only support mTHP collapse for max_ptes_none values of 0
and HPAGE_PMD_NR - 1, resulting in the following behavior:

- max_ptes_none=0: Never introduce new empty pages during collapse
- max_ptes_none=HPAGE_PMD_NR-1: Always try to collapse to the highest
  available mTHP order

Any other max_ptes_none value will emit a warning and skip mTHP
collapse attempts. There should be no behavior change for PMD collapse.

Once we determine which mTHP size fits best in the PMD range, a
collapse is attempted. A minimum collapse order of 2 is used, as this
is the lowest order supported for anonymous memory, as defined by
THP_ORDERS_ALL_ANON.

mTHP collapse rejects regions containing swapped-out or shared pages,
because adding new entries can create new none pages, and these may
lead to constant promotion into a higher-order (m)THP. A similar issue
can occur with "max_ptes_none > HPAGE_PMD_NR/2": a collapse introduces
at least 2x the number of occupied pages, so a future scan will satisfy
the promotion condition once again. This issue is prevented via the
collapse_max_ptes_none() function, which imposes the max_ptes_none
restrictions above.

Currently madv_collapse is not supported and will only attempt PMD
collapse. We can also remove the is_khugepaged check inside the PMD
scan, as the collapse_max_ptes_none() function now handles this logic.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 183 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 175 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 33b70ca070b4..fcb927cd5456 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,6 +92,32 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+#define KHUGEPAGED_MIN_MTHP_ORDER 2
+/*
+ * The maximum number of mTHP ranges that can be stored on the stack.
+ * This is calculated based on the number of PTE entries in a PTE page table
+ * and the minimum mTHP order.
+ *
+ * ilog2(MAX_PTRS_PER_PTE) is log2 of the maximum number of PTE entries.
+ * This gives you the PMD_ORDER, and is needed in place of HPAGE_PMD_ORDER due
+ * to restrictions of some architectures (ie ppc64le).
+ *
+ * At most there will be 1 << (PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER) mTHP ranges
+ */
+#define MTHP_STACK_SIZE (1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
+
+/*
+ * Defines a range of PTE entries in a PTE page table which are being
+ * considered for (m)THP collapse.
+ *
+ * @offset: the offset of the first PTE entry in a PMD range.
+ * @order: the order of the PTE entries being considered for collapse.
+ */
+struct mthp_range {
+	u16 offset;
+	u8 order;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -100,6 +126,11 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/* bitmap used for mTHP collapse */
+	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
+	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
+	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
 };
 
 /**
@@ -1375,6 +1406,121 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
 	return result;
 }
 
+static void mthp_stack_push(struct collapse_control *cc, int *stack_size,
+			    u16 offset, u8 order)
+{
+	const int size = *stack_size;
+	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
+
+	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
+	stack->order = order;
+	stack->offset = offset;
+	(*stack_size)++;
+}
+
+static struct mthp_range mthp_stack_pop(struct collapse_control *cc, int *stack_size)
+{
+	const int size = *stack_size;
+
+	VM_WARN_ON_ONCE(size <= 0);
+	(*stack_size)--;
+	return cc->mthp_bitmap_stack[size - 1];
+}
+
+static unsigned int mthp_nr_occupied_pte_entries(struct collapse_control *cc,
+		u16 offset, unsigned long nr_pte_entries)
+{
+	bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+	bitmap_set(cc->mthp_bitmap_mask, offset, nr_pte_entries);
+	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, HPAGE_PMD_NR);
+}
+
+/*
+ * mthp_collapse() consumes the bitmap that is generated during
+ * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
+ *
+ * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
+ * A stack structure cc->mthp_bitmap_stack is used to check different regions
+ * of the bitmap for collapse eligibility. The stack maintains a pair of
+ * variables (offset, order), indicating the number of PTEs from the start of
+ * the PMD, and the order of the potential collapse candidate respectively. We
+ * start at the PMD order and check if it is eligible for collapse; if not, we
+ * add two entries to the stack at a lower order to represent the left and right
+ * halves of the PTE page table we are examining.
+ *
+ *   offset              mid_offset
+ *     |                   |
+ *     |                   |
+ *     v                   v
+ *     ----------------------------------------
+ *    |             cc->mthp_bitmap            |
+ *     ----------------------------------------
+ *     <------------------><------------------>
+ *           order-1             order-1
+ *
+ * For each of these, we determine how many PTE entries are occupied in the
+ * range of PTE entries we propose to collapse, then we compare this to a
+ * threshold number of PTE entries which would need to be occupied for a
+ * collapse to be permitted at that order (accounting for max_ptes_none).
+ *
+ * If a collapse is permitted, we attempt to collapse the PTE range into a
+ * mTHP.
+ */
+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		bool *mmap_locked, unsigned long enabled_orders)
+{
+	unsigned int max_ptes_none, nr_occupied_ptes;
+	struct mthp_range range;
+	unsigned long collapse_address;
+	int collapsed = 0, stack_size = 0;
+	unsigned long nr_pte_entries;
+	u16 offset;
+	u8 order;
+
+	mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
+
+	while (stack_size > 0) {
+		range = mthp_stack_pop(cc, &stack_size);
+		order = range.order;
+		offset = range.offset;
+		nr_pte_entries = 1UL << order;
+
+		if (!test_bit(order, &enabled_orders))
+			goto next_order;
+
+		max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
+
+		if (max_ptes_none == -EINVAL)
+			return collapsed;
+
+		nr_occupied_ptes = mthp_nr_occupied_pte_entries(cc, offset, nr_pte_entries);
+
+		if (nr_occupied_ptes >= nr_pte_entries - max_ptes_none) {
+			int ret;
+
+			collapse_address = address + offset * PAGE_SIZE;
+			ret = collapse_huge_page(mm, collapse_address, referenced,
+						 unmapped, cc, mmap_locked,
+						 order);
+			if (ret == SCAN_SUCCEED) {
+				collapsed += nr_pte_entries;
+				continue;
+			}
+		}
+
+next_order:
+		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+			const u8 next_order = order - 1;
+			const u16 mid_offset = offset + (nr_pte_entries / 2);
+
+			mthp_stack_push(cc, &stack_size, mid_offset, next_order);
+			mthp_stack_push(cc, &stack_size, offset, next_order);
+		}
+	}
+	return collapsed;
+}
+
 static int collapse_scan_pmd(struct mm_struct *mm,
 			     struct vm_area_struct *vma,
 			     unsigned long start_addr, bool *mmap_locked,
@@ -1382,11 +1528,15 @@
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
-	int none_or_zero = 0, shared = 0;
+	int none_or_zero = 0, shared = 0, nr_collapsed = 0;
 	struct page *page = NULL;
+	unsigned int max_ptes_none;
 	struct folio *folio = NULL;
 	unsigned long addr;
+	unsigned long enabled_orders;
+	bool full_scan = true;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 
@@ -1396,22 +1546,34 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
+
+	/*
+	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
+	 * scan all pages to populate the bitmap for mTHP collapse.
+	 */
+	if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))
+		full_scan = false;
+	max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER, full_scan);
+
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
 		result = SCAN_NO_PTE_TABLE;
 		goto out;
 	}
 
-	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, addr += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		_pte = pte + i;
+		addr = start_addr + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			    none_or_zero <= max_ptes_none) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1479,6 +1641,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 			}
 		}
 
+		/* Set bit for occupied pages */
+		bitmap_set(cc->mthp_bitmap, i, 1);
 		/*
 		 * Record which node the original page is from and save this
 		 * information to cc->node_load[].
@@ -1535,9 +1699,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc, mmap_locked,
-					    HPAGE_PMD_ORDER);
+		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
+					     cc, mmap_locked, enabled_orders);
+		if (nr_collapsed > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.51.1