From nobody Mon Jun 8 08:36:53 2026 Received: from outbound.mr.icloud.com (mr-2001f-snip4-5.eps.apple.com [57.103.68.58]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B921522259F for ; Sun, 31 May 2026 04:27:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=57.103.68.58 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201671; cv=none; b=QNdFKsXGAtE2xXnaEPdPCPM71T1ogPceWC5YxotbdnYSy3FlGOPeub/YFUhgBOPBqEhbls0njysXEXNmUlr911y5H4VH20LmT3/AHoXf7yp8Dz5/R6L1JPEmyhI01/c0KYsJY1cVVJJeKZaWMitnS+CfGWrBevhjFuaAed5Jp40= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201671; c=relaxed/simple; bh=084sMr3iQzhpJqGMXyuB8m7l4qfHvL4h3EmXTtfZBiY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=WHiHEIP4JzN9rzUHIV3AoJQ4JL/7hiQjJBpR8vhRba2Ltd3AM3IO/PJoQiTDTT+1B5T6xki1ftoQuxNBxZijvrdqj7enrGM8G4Us1iKhfL6dOjELvuRQgJ3hr1pMaaOJuAA6CY4RE3NWe/4F9bcB2/MHG7ZCDa+Dm35axjrD7ss= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=mVBpWhAU; arc=none smtp.client-ip=57.103.68.58 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="mVBpWhAU" Received: from outbound.mr.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPS id 4819C1800158; Sun, 31 May 2026 04:27:45 +0000 (UTC) X-ICL-Out-Info: 
HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhJB0MFXwteDUAdVAVLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEEVABWB5WXloXXk1FCA9CAVhbCFsEDx9MDFECQgVWXlQKHQRUB10FXVZQAlpLQgRLRWhcBVwcQBdIHV9qS1YUBBFQAVgeVl5aF15NWgJWTQVKA18BWwdDCFVHBUc0UR9VFFIdRA5tGFAWR0BBWh9BFEAFWwRYCxNdTFBfVitGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1780201669; x=1782793669; bh=8oGvyeNEYGpSfA/4MLm942YXzmR0k9AWn0ouuVGHwI8=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=mVBpWhAU74vlRlwTdWZFB5cADCcKL72fGpcQOLiXThFFL/fCQydsljYkaAcyoTz4ZOlEarh2qIu45z9UAQG8KBkhP9JHGyNkGMC7nDvci0NnO/j3xUJRzoDH7lc7jbo4uSUxSVbjl6n/6SYrkb0H/76K64MDr+kCxsB15E7hZpPhbGvI6Rxa7aeIh781NGr7SdfCRwOWEeHpMB9XeMG5XvCtNRMPjP+0/BCvcLLF26IKhdNK26VI7MeFKb4reymh6eoN42HCpqNQG7y58dFwxeZSyo8T0fKPVusxvJiHoNgVeBQWp/hds0f008wpxIBKZX6WwrQ0mWq+56VLxgYv4w== Received: from [127.0.0.1] (unknown [17.57.152.38]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPSA id 9F6201800119; Sun, 31 May 2026 04:27:34 +0000 (UTC) From: Luka Bai Date: Sun, 31 May 2026 12:27:17 +0800 Subject: [PATCH 1/5] mm/khugepaged: add framework for khugepaged collapse hint Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260531-thp_collapse_hint-v1-1-866339cd4c2a@tencent.com> References: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> In-Reply-To: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. 
Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Kairui Song , Qi Zheng , Shakeel Butt , Axel Rasmussen , Yuanchu Xie , Wei Xu , Rik van Riel , Harry Yoo , Jann Horn , Johannes Weiner , linux-kernel@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1780201643; l=20295; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=Qc6j7M54Jdgp5Rr/qpD9Ky5jlwdbb7ICSoG/i3XfXBk=; b=pNNF5Uh/IkP1VMu0QtLH/9u5na1YR9uuBGE0RujWThhBjAlfDjzNUGZg/xAiXGn+GHGREofIW nwZUwJtqxZ9DXQvZDRrhZSNJyjpy4+7ie7rI755fB/WY5PPqEqmwlfY X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Proofpoint-ORIG-GUID: CRy-x07Sgl5wvKKtz6DT6bjsmscF4q-U X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTMxMDA0NiBTYWx0ZWRfXySqhpiHcFi9l +uAwyxXwYAf3ItjJYfX9hGaNV1+1Nn9bTpmFXK25Zg24OlSznqX3O9WPvVV3sC7gwQbe03MmZfQ FWqLTa7CFWooqPujABORjVZPlkwoaQH7QK9iGos1VxYJKgnzYQhjtnDm/bAH7jyeHKyHgU6Lh4r iTdZ0ALxs/HPULds4yveoj44CnzQHzAOQx+xwKYKGIKdpYTQAFk3W/C293EYrivf2VortYAzaY+ SXYZvOO6zMcW4pm1LDmPKiRAoEXJ96kj/nvLCnfu49NW47KX/c6LqzBlsffPad32cNDZ7MEyYMu 49H7yfzI1YetwJwn3+koVuXkQYl4T0rDl7AWx6G/Ua1uNIusvN9mso1mB+tak4= X-Proofpoint-GUID: CRy-x07Sgl5wvKKtz6DT6bjsmscF4q-U X-Authority-Info-Out: v=2.4 cv=ULfQ3Sfy c=1 sm=1 tr=0 ts=6a1bb8c4 cx=c_apl:c_pps:t_out a=9OgfyREA4BUYbbCgc0Y0oA==:117 a=9OgfyREA4BUYbbCgc0Y0oA==:17 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=ptP84xlJL3H7308lbwQA:9 a=QEXdDO2ut3YA:10

From: Luka Bai

Currently khugepaged simply scans all eligible mm_structs in round-robin
order when collapsing. This is inefficient when the address space is
huge, and it may waste precious large folio resources on cold memory
areas that are seldom accessed. At the same time, khugepaged is a very
useful tool for asynchronous large folio merging.
So this patch introduces a khugepaged collapse hint framework that lets
khugepaged prioritize hot memory areas when collapsing. An indication of
a hot area is called a "collapse hint". Each collapse hint carries an
address and a vma, which together describe a specific hot area that
should preferably be collapsed.

All hints are aggregated both by priority and by the mm_struct they
belong to. When khugepaged tries to collapse, it first scans the global
priority queues that store these hints, finds the first
khugepaged_mm_slot that has pending hints, and then tries to collapse
the address given by each hint. (We added struct khugepaged_mm_slot and
wrapped the old per-mm mm_slot inside it.)

An example is shown below (mm_slot here stands for the
khugepaged_mm_slot mentioned above):

    prio 0 ------()----------------------------------()---------------
           mm_slot0(process A)               mm_slot1(process B)
                 |                                   |
           hint0---hint1---hint2---hint3     hint4---hint5---hint6

    prio 1 ------()----------------------------------()---------------
           mm_slot0(process A)               mm_slot1(process B)
                 |                                   |
              -------                          hint7---hint8

khugepaged first scans the queue of prio 0 (a lower prio number means a
higher priority). It walks the list, checks the first
khugepaged_mm_slot, which is mm_slot0, and goes through all the hints in
it (hint0 ~ hint3 in the graph above). After a hint is handled, it is
deleted, no matter whether the collapse succeeded or failed. If a
khugepaged_mm_slot has no hints, khugepaged scans the next mm_slot; if
no hints are left in prio 0, khugepaged scans prio 1; if there are no
hints in any priority queue, it falls back to round-robin scanning as
before.

Each struct khugepaged_mm_slot embeds NR_KHUGEPAGED_PRIORITY_LEVEL
(currently 2) instances of struct khugepaged_collapse_requests.
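The scan order described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code from this patch: the model_* names, MAX_HINTS and the budget parameter are hypothetical, arrays stand in for the kernel lists, and locking, the failure counter and VMA re-validation are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PRIO   2	/* mirrors NR_KHUGEPAGED_PRIORITY_LEVEL from the patch */
#define MAX_HINTS 8	/* hypothetical bound, only needed by this model */

/*
 * Hypothetical stand-in for struct khugepaged_collapse_requests:
 * a FIFO of hint ids for one mm at one priority level.
 */
struct model_requests {
	int hints[MAX_HINTS];
	size_t head;	/* index of the next hint to handle */
	size_t tail;	/* index of the next free position */
};

/* Hypothetical stand-in for struct khugepaged_mm_slot. */
struct model_mm_slot {
	struct model_requests request[NR_PRIO];
};

/* Append a hint; returns 0 on success, -1 if the hint is dropped. */
static int model_add_hint(struct model_mm_slot *slot, int prio, int hint_id)
{
	struct model_requests *req;

	if (prio < 0 || prio >= NR_PRIO)
		return -1;
	req = &slot->request[prio];
	if (req->tail == MAX_HINTS)
		return -1;
	req->hints[req->tail++] = hint_id;
	return 0;
}

/*
 * Handle at most @budget hints (budget plays the role of progress_max).
 * Priorities are walked from index 0 upwards, slots in array order, and
 * the hints of one slot in FIFO order. Handled hint ids are written to
 * @out; the number of handled hints is returned.
 */
static size_t model_drain(struct model_mm_slot *slots, size_t nr_slots,
			  int *out, size_t budget)
{
	size_t handled = 0;
	size_t s;
	int prio;

	for (prio = 0; prio < NR_PRIO; prio++) {
		for (s = 0; s < nr_slots; s++) {
			struct model_requests *req = &slots[s].request[prio];

			/* A handled hint is consumed whether or not it "succeeds". */
			while (req->head < req->tail && handled < budget)
				out[handled++] = req->hints[req->head++];
		}
	}
	return handled;
}
```

With the layout from the graph above (hint0 ~ hint3 for process A at prio 0, hint4 ~ hint6 for process B at prio 0, hint7 and hint8 for process B at prio 1), model_drain() hands the hints back in the order hint0 through hint8, regardless of the order in which they were added across slots and priorities.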
Each struct khugepaged_collapse_requests links its mm_struct into the
global priority queue of the matching level. We give each mm_struct a
node in every priority queue to allow for hint dispersion and balancing
that may be introduced in the future, and to get a better locking
pattern. Currently the khugepaged_collapse_requests[] entries are linked
into the global queues in __khugepaged_enter() and stay there until the
mm_struct exits khugepaged.

Callers can use khugepaged_add_collapse_hint() to add a new hint for a
specific mm_struct. This patch does not introduce any callers yet; they
are added in the following patches.

Signed-off-by: Luka Bai
--- include/linux/khugepaged.h | 13 ++ mm/khugepaged.c | 348 +++++++++++++++++++++++++++++++++++++++++= +++- 2 files changed, 355 insertions(+), 6 deletions(-) diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index d7a9053ff4fe..815ae87f0f8e 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -17,6 +17,10 @@ extern void khugepaged_enter_vma(struct vm_area_struct *= vma, vm_flags_t vm_flags); extern void khugepaged_min_free_kbytes_update(void); extern bool current_is_khugepaged(void); +extern void khugepaged_add_collapse_hint(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + int priority, int max_order); void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, bool install_pmd); =20 @@ -31,6 +35,9 @@ static inline void khugepaged_exit(struct mm_struct *mm) if (mm_flags_test(MMF_VM_HUGEPAGE, mm)) __khugepaged_exit(mm); } + +#define NR_KHUGEPAGED_PRIORITY_LEVEL 2 + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct = *oldmm) { @@ -55,6 +62,12 @@ static inline bool current_is_khugepaged(void) { return false; } +static inline void khugepaged_add_collapse_hint(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + int priority, int max_order) +{ +} #endif /*
CONFIG_TRANSPARENT_HUGEPAGE */ =20 #endif /* _LINUX_KHUGEPAGED_H */ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 35a5f8c44c18..5090ffae73f3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -99,6 +99,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLO= TS_HASH_BITS); =20 static struct kmem_cache *mm_slot_cache __ro_after_init; =20 +#define KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL 10 + #define KHUGEPAGED_MIN_MTHP_ORDER 2 /* * mthp_collapse() does an iterative DFS over a binary tree, from @@ -160,6 +162,53 @@ static struct khugepaged_scan khugepaged_scan =3D { .mm_head =3D LIST_HEAD_INIT(khugepaged_scan.mm_head), }; =20 +/** + * struct khugepaged_collapse_hint - one collapse hint for a specific addr= ess + * @node: list node on khugepaged_collapse_requests.hints + * @vma: hint pointer to the target VMA + * @address: PMD-aligned virtual address inside @vma to attempt collapsing= on + */ +struct khugepaged_collapse_hint { + struct list_head node; + struct vm_area_struct *vma; + unsigned long address; +}; + +/** + * struct khugepaged_collapse_requests - per-mm, per-priority collapse hin= ts list + * @node: list node on the matching khugepaged_priority_queue[] list + * @hints: list of pending struct khugepaged_collapse_hint for this mm at + * this priority level + * + * Each khugepaged_mm_slot embeds one request struct per priority level. At + * __khugepaged_enter() time, every request is added to the corresponding + * khugepaged_priority_queue[] list and stays on that list until the mm + * exits khugepaged. 
While queued, hints for the mm at a given priority are + * appended to that priority's @hints; + */ +struct khugepaged_collapse_requests { + struct list_head node; + struct list_head hints; +}; + +/** + * struct khugepaged_mm_slot - khugepaged information per mm that is being= scanned + * @slot: hash lookup from mm to mm_slot + * @request: per-mm collapse requests, one per priority level, each linked + * into the corresponding khugepaged_priority_queue[] list + */ +struct khugepaged_mm_slot { + struct mm_slot slot; + struct khugepaged_collapse_requests request[NR_KHUGEPAGED_PRIORITY_LEVEL]; +}; + +/* + * One queue per priority level. Lower index means higher priority. The + * scanner drains queues in ascending index order, so all hints at higher + * priority are processed before any hint at a lower priority. + */ +static struct list_head khugepaged_priority_queue[NR_KHUGEPAGED_PRIORITY_L= EVEL]; + #ifdef CONFIG_SYSFS static ssize_t scan_sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, @@ -500,10 +549,15 @@ int hugepage_madvise(struct vm_area_struct *vma, =20 int __init khugepaged_init(void) { - mm_slot_cache =3D KMEM_CACHE(mm_slot, 0); + int i; + + mm_slot_cache =3D KMEM_CACHE(khugepaged_mm_slot, 0); if (!mm_slot_cache) return -ENOMEM; =20 + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) + INIT_LIST_HEAD(&khugepaged_priority_queue[i]); + khugepaged_pages_to_scan =3D HPAGE_PMD_NR * 8; khugepaged_max_ptes_none =3D KHUGEPAGED_MAX_PTES_LIMIT; khugepaged_max_ptes_swap =3D HPAGE_PMD_NR / 8; @@ -560,21 +614,27 @@ static bool hugepage_enabled(void) =20 void __khugepaged_enter(struct mm_struct *mm) { + struct khugepaged_mm_slot *khp_mm_slot; struct mm_slot *slot; int wakeup; + int i; =20 /* __khugepaged_exit() must not run from under us */ VM_BUG_ON_MM(collapse_test_exit(mm), mm); =20 - slot =3D mm_slot_alloc(mm_slot_cache); - if (!slot) + khp_mm_slot =3D mm_slot_alloc(mm_slot_cache); + if (!khp_mm_slot) return; =20 if 
(unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm))) { - mm_slot_free(mm_slot_cache, slot); + mm_slot_free(mm_slot_cache, khp_mm_slot); return; } =20 + slot =3D &khp_mm_slot->slot; + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) + INIT_LIST_HEAD(&khp_mm_slot->request[i].hints); + spin_lock(&khugepaged_mm_lock); mm_slot_insert(mm_slots_hash, mm, slot); /* @@ -583,6 +643,12 @@ void __khugepaged_enter(struct mm_struct *mm) */ wakeup =3D list_empty(&khugepaged_scan.mm_head); list_add_tail(&slot->mm_node, &khugepaged_scan.mm_head); + /* + * Link this mm into every priority queue. + */ + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) + list_add_tail(&khp_mm_slot->request[i].node, + &khugepaged_priority_queue[i]); spin_unlock(&khugepaged_mm_lock); =20 mmgrab(mm); @@ -613,23 +679,59 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, __khugepaged_enter(vma->vm_mm); } =20 +static void khugepaged_release_collapse_hints( + struct khugepaged_collapse_requests *req) +{ + struct khugepaged_collapse_hint *hint, *tmp; + + list_for_each_entry_safe(hint, tmp, &req->hints, node) { + list_del(&hint->node); + kfree(hint); + } +} + +/* + * Caller must hold khugepaged_mm_lock when removing the request nodes from + * the priority queues; + */ +static void khugepaged_remove_priority_requests(struct khugepaged_mm_slot = *khp_mm_slot) +{ + int i; + + lockdep_assert_held(&khugepaged_mm_lock); + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) + list_del(&khp_mm_slot->request[i].node); +} + +static void khugepaged_release_all_hints(struct khugepaged_mm_slot *khp_mm= _slot) +{ + int i; + + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) + khugepaged_release_collapse_hints(&khp_mm_slot->request[i]); +} + void __khugepaged_exit(struct mm_struct *mm) { + struct khugepaged_mm_slot *khp_mm_slot =3D NULL; struct mm_slot *slot; int free =3D 0; =20 spin_lock(&khugepaged_mm_lock); slot =3D mm_slot_lookup(mm_slots_hash, mm); if (slot && khugepaged_scan.mm_slot !=3D slot) { + 
khp_mm_slot =3D mm_slot_entry(slot, struct khugepaged_mm_slot, slot); hash_del(&slot->hash); list_del(&slot->mm_node); + khugepaged_remove_priority_requests(khp_mm_slot); free =3D 1; } spin_unlock(&khugepaged_mm_lock); =20 if (free) { mm_flags_clear(MMF_VM_HUGEPAGE, mm); - mm_slot_free(mm_slot_cache, slot); + khugepaged_release_all_hints(khp_mm_slot); + mm_slot_free(mm_slot_cache, khp_mm_slot); mmdrop(mm); } else if (slot) { /* @@ -1804,6 +1906,8 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, =20 static void collect_mm_slot(struct mm_slot *slot) { + struct khugepaged_mm_slot *khp_mm_slot =3D + mm_slot_entry(slot, struct khugepaged_mm_slot, slot); struct mm_struct *mm =3D slot->mm; =20 lockdep_assert_held(&khugepaged_mm_lock); @@ -1812,6 +1916,7 @@ static void collect_mm_slot(struct mm_slot *slot) /* free mm_slot */ hash_del(&slot->hash); list_del(&slot->mm_node); + khugepaged_remove_priority_requests(khp_mm_slot); =20 /* * Not strictly needed because the mm exited already. @@ -1820,7 +1925,8 @@ static void collect_mm_slot(struct mm_slot *slot) */ =20 /* khugepaged_mm_lock actually not necessary for the below */ - mm_slot_free(mm_slot_cache, slot); + khugepaged_release_all_hints(khp_mm_slot); + mm_slot_free(mm_slot_cache, khp_mm_slot); mmdrop(mm); } } @@ -2848,6 +2954,211 @@ static enum scan_result collapse_single_pmd(unsigne= d long addr, return result; } =20 +/* + * khugepaged_add_collapse_hint - enqueue a collapse hint + * @mm: target mm + * @vma: hint pointer to the VMA covering @address (treated as a h= int) + * @address: virtual address; rounded down to HPAGE_PMD_SIZE + * @priority: priority bucket the hint should land in. Lower number =3D= =3D higher + * priority; must be in [0, NR_KHUGEPAGED_PRIORITY_LEVEL). + * @max_order: max order of continuous pt entries inside this target pmd= , used + * to decide whether we need to collapse it. + * + * Tell khugepaged to prioritize collapsing the PMD covering @address in @= mm. 
+ * The next time collapse_scan_mm_slot() runs it will drain these entries + * before the regular round-robin scan, walking priority queues from + * highest priority (lowest index) to lowest. + * + * Hints are aggregated per-mm and per-priority: __khugepaged_enter() + * pre-installs one collapse_request per priority level on the matching + * khugepaged_priority_queue[] list, and this function appends a + * (vma, address) hint to the request that matches @priority. + * + * Caller must keep @vma alive across this call (mmap_lock, per-VMA lock, + * or a corresponding rmap-side lock such as anon_vma_lock_read / + * i_mmap_lock_read are all sufficient). + * + * @vma->vm_flags is read with collapse_allowable_orders(). When the + * caller does not hold mmap_lock or a per-VMA lock, the result is + * advisory; the real validation happens later in + * collapse_scan_one_priority_entry() under mmap_read_lock. + * + * Caller must also guarantee @mm is alive across this call so the underly= ing + * mm_slot cannot be freed while we append. + */ +void khugepaged_add_collapse_hint(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long address, + int priority, int max_order) +{ + struct khugepaged_mm_slot *khp_mm_slot; + struct khugepaged_collapse_hint *hint; + struct mm_slot *slot; + int orders; + + if (!mm || !vma) + return; + if (priority < 0 || priority >=3D NR_KHUGEPAGED_PRIORITY_LEVEL) + return; + + orders =3D collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED); + if (highest_order(orders) <=3D max_order) + return; + + /* + * Make sure the mm is enrolled in khugepaged so that its embedded + * collapse_request[] entries are on khugepaged_priority_queue[]. 
+ */ + khugepaged_enter_vma(vma, vma->vm_flags); + if (!mm_flags_test(MMF_VM_HUGEPAGE, mm)) + return; + + hint =3D kmalloc_obj(struct khugepaged_collapse_hint); + if (!hint) + return; + + hint->vma =3D vma; + hint->address =3D address & HPAGE_PMD_MASK; + + /* + * Just use try lock to avoid lock contention because collapse hints are + * just "best-effort" optimization. + */ + if (!spin_trylock(&khugepaged_mm_lock)) { + kfree(hint); + return; + } + + slot =3D mm_slot_lookup(mm_slots_hash, mm); + if (!slot) { + spin_unlock(&khugepaged_mm_lock); + kfree(hint); + return; + } + khp_mm_slot =3D mm_slot_entry(slot, struct khugepaged_mm_slot, slot); + list_add_tail(&hint->node, &khp_mm_slot->request[priority].hints); + spin_unlock(&khugepaged_mm_lock); + + wake_up_interruptible(&khugepaged_wait); +} + +/* + * Each enrolled mm owns one request struct per priority level, all of whi= ch + * live on the matching khugepaged_priority_queue[] list for the lifetime = of + * the mm_slot. The caller iterates priorities from highest to lowest, and + * call collapse_scan_one_priority_entry() to process all mms at this prio= rity, + * and handle pending collapse hints for each mm. Repeat until either + * @progress_max is reached, the per-mm-slot failure exceeds certain thres= hold, + * or no hints remain for this mm at this priority. + * + * Caller must hold khugepaged_mm_lock. + * + * Returns 1 if an mm was processed at this priority, 0 if no mm on + * khugepaged_priority_queue[@priority] had any pending hints. 
+ */ +static int collapse_scan_one_priority_entry(unsigned int progress_max, + enum scan_result *result, + struct collapse_control *cc, + int priority, + int *fail_count) + __releases(&khugepaged_mm_lock) + __acquires(&khugepaged_mm_lock) +{ + struct khugepaged_collapse_requests *iter_req; + struct khugepaged_mm_slot *khp_mm_slot =3D NULL, *iter_slot; + struct mm_struct *mm =3D NULL; + bool lock_dropped =3D true; + + /* + * We have to call mmget_not_zero() under khugepaged_mm_lock so that + * __khugepaged_exit() cannot free the embedding khugepaged_mm_slot from + * under us once we drop the spinlock. + */ + list_for_each_entry(iter_req, &khugepaged_priority_queue[priority], node)= { + if (list_empty(&iter_req->hints)) + continue; + iter_slot =3D container_of(iter_req, struct khugepaged_mm_slot, + request[priority]); + if (mmget_not_zero(iter_slot->slot.mm)) { + khp_mm_slot =3D iter_slot; + mm =3D iter_slot->slot.mm; + break; + } + } + if (!khp_mm_slot) + return 0; + + spin_unlock(&khugepaged_mm_lock); + + /* + * Drain hints for this mm while we hold mmap_read_lock. + * collapse_single_pmd() may drop the mmap_lock; if so, try once to + * retake it for the next hint. + */ + while (cc->progress < progress_max && + *fail_count < KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL) { + struct khugepaged_collapse_hint *hint =3D NULL; + struct vm_area_struct *vma; + unsigned long addr; + + if (lock_dropped) { + if (!mmap_read_trylock(mm)) { + (*fail_count)++; + continue; + } + lock_dropped =3D false; + } + + spin_lock(&khugepaged_mm_lock); + if (!list_empty(&khp_mm_slot->request[priority].hints)) { + hint =3D list_first_entry(&khp_mm_slot->request[priority].hints, + struct khugepaged_collapse_hint, + node); + list_del(&hint->node); + } + spin_unlock(&khugepaged_mm_lock); + + if (!hint) + break; + + cc->progress++; + addr =3D hint->address; + + if (unlikely(collapse_test_exit_or_disable(mm))) { + kfree(hint); + break; + } + + /* + * Re-validate the cached VMA hint under mmap_read_lock. 
If the + * address is now covered by a different VMA, or no VMA at all, + * drop the entry. Note that the vma may be a different object + * than the one passed in at enqueue time, but that's a false + * positive that we can safely ignore. + */ + vma =3D vma_lookup(mm, addr); + if (!vma || vma !=3D hint->vma) + goto skip_hint; + if (!collapse_allowable_orders(vma, vma->vm_flags, TVA_KHUGEPAGED)) + goto skip_hint; + if (addr < ALIGN(vma->vm_start, HPAGE_PMD_SIZE) || + addr + HPAGE_PMD_SIZE > ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE)) + goto skip_hint; + + *result =3D collapse_single_pmd(addr, vma, &lock_dropped, cc); + if (*result !=3D SCAN_SUCCEED) + (*fail_count)++; +skip_hint: + kfree(hint); + } + + if (!lock_dropped) + mmap_read_unlock(mm); + mmput(mm); + spin_lock(&khugepaged_mm_lock); + return 1; +} + static void collapse_scan_mm_slot(unsigned int progress_max, enum scan_result *result, struct collapse_control *cc) __releases(&khugepaged_mm_lock) @@ -2858,10 +3169,35 @@ static void collapse_scan_mm_slot(unsigned int prog= ress_max, struct mm_struct *mm; struct vm_area_struct *vma; unsigned int progress_prev =3D cc->progress; + int priority_queue_fail_times =3D 0; + int prio; =20 lockdep_assert_held(&khugepaged_mm_lock); *result =3D SCAN_FAIL; =20 + /* + * Drain explicit hints in priority order before the mm_slot scan. + * Iterate priorities from highest (lowest index) to lowest. For each + * priority, handle every mm with hints queued at that priority + * before we move on to the next, lower priority. 
+ */ + for (prio =3D 0; prio < NR_KHUGEPAGED_PRIORITY_LEVEL; prio++) { + while (priority_queue_fail_times < KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL && + cc->progress < progress_max) { + if (collapse_scan_one_priority_entry(progress_max, result, cc, + prio, &priority_queue_fail_times) =3D=3D 0) + break; + } + + if (cc->progress >=3D progress_max || + priority_queue_fail_times >=3D KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL) + break; + } + + if (list_empty(&khugepaged_scan.mm_head) || + cc->progress >=3D progress_max) + return; + if (khugepaged_scan.mm_slot) { slot =3D khugepaged_scan.mm_slot; } else { --=20 2.52.0 From nobody Mon Jun 8 08:36:53 2026 Received: from outbound.mr.icloud.com (mr-2001f-snip4-6.eps.apple.com [57.103.68.59]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 77EA02D3750 for ; Sun, 31 May 2026 04:27:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=57.103.68.59 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201679; cv=none; b=A4YGElZuEGyP2uPwLh34kUHwnT+2sn/12y/lUJRIRBiHj+tN1BeS0abIpIKsuOgiN2rwfCc/Jt4zK4DDSYTm9zHZWH4jTzGqaTXRlmsdaQ2Ng9kRdOPlH+Q/Nq0ZAewruIrY9SF0Ja0bwS7QaDN/dBrobfIEGZmUA9S2fcpPFtc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201679; c=relaxed/simple; bh=74V7JE4yvLIG0JN3uxxDarz9QqLQvqPMNULvc4WFvUI=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=d1p4NpHp0H/YyIyBxvUzjoj8b5PgEGPeB5aq4KVAWoIqwDwRwG/RSdDXshO4V+5nbeALkcl4fQImT1AcnWvGAufe+t9hOrcsKIymqb7Rjel+FoHLKjKg2N9/iADDUiPfNwlh4+FxAsy3+sNIJ9brWAzBsUi1X1fIAtfbYjEwLbs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=RIb6NQoC; arc=none smtp.client-ip=57.103.68.59 
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="RIb6NQoC" Received: from outbound.mr.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPS id 60997180013D; Sun, 31 May 2026 04:27:55 +0000 (UTC) X-ICL-Out-Info: HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhJB0MFXwteDUAdVAVLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEEVABWB5WXloXXk1FCA9CAVhbCFsEDx9MDFECQgVWXlQKHQRUB10FXVZQAlpLQgRLRWhcBVwcQBdIHV9qS1YUBBFQAVgeVl5aF15NWgJWTQVKA18BWwdDCFVHBUc0UR9VFFIdRA5tGFAWR0BBWh9CFEAFWwRYCxNdTFBfVitGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1780201678; x=1782793678; bh=t9h/Ynv08eT/7kBNC5M5hKRxLaGxANkcJA7nimD4RAY=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=RIb6NQoCWqkSi9d8/57Seu80aFsgI8o7zbmXDvewzfEnKpyd4abxPqmg7JUvcEdPeF767vTiPvjSCZeC0jLxbYlklqg1bxTVQbyqKYETw/2poKWKh2iygi9MUxcuDdGVxms9DL8EN1ugB8CFZOxRk72ep4WXzlvPmTCIP6taHUKVWgmXKwHbhLEfy7Fibr220vL8ZE7LGcwJrrc3yXqw+EgrnSloUvBKfdgXLQSXvMICfXS+cytRgClkr66gQtfqWnwgl91KHwUr8NB9alxHX/rZr7xC+DRvoZgHdccCionoq3Ohsx0BHMnmIGOitW3batTBjAKjd7gSxlQNaoS6+Q== Received: from [127.0.0.1] (unknown [17.57.152.38]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPSA id 472341800114; Sun, 31 May 2026 04:27:44 +0000 (UTC) From: Luka Bai Date: Sun, 31 May 2026 12:27:18 +0800 Subject: [PATCH 2/5] mm/khugepaged: use slab cache instead of normal kmalloc Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260531-thp_collapse_hint-v1-2-866339cd4c2a@tencent.com> References: 
<20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> In-Reply-To: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Kairui Song , Qi Zheng , Shakeel Butt , Axel Rasmussen , Yuanchu Xie , Wei Xu , Rik van Riel , Harry Yoo , Jann Horn , Johannes Weiner , linux-kernel@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1780201643; l=2998; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=IK4miDrmdbabkQyFGW59A5b6GPx3fnGu+q5Q4xvrQwg=; b=25BP/QidmIIMdY6i6uWSXAANFJ6EHVM+SNF76MvPHXem3VllQIhb9Nvnq+lwJnpu0kI7vaE5S 3jx0hHZdCprDk+Lb0G+7XRUQnZJiFQTPMntsOsR0LRastmpEfON1GBL X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Proofpoint-GUID: Lk8E2F88_Z5a2ahnBfM5agzJL3mWuPqD X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTMxMDA0NiBTYWx0ZWRfX/80evBJfXAM+ q1w3c3bdQEPP8SHCaqi28rjymeG/kLhpf+OaZxD6rrTOd2gtH49AwAaK9fXPb0/gR0Ua8lIzYOQ hLKm23YnwH4rawxd/vQfCxDyqd6AE/AeoV2PCvHo7fyiORbalzRRKsfXxzTXIzVRLGYk7pNCIDY emMAS+Cn8bqJn9ieuAvaZGftx84yGga7r22ii9weVkjhnFfPb0g/TYTi9ZVQm4fHtffI3Ghs3/1 Zwlq5S2gjQ8w4C4XdRNQj49drENXSBeEJgCgOiyAOWuRgGHsU3KFJIayYKX4dBcLLIdCGuF4TQZ CwiWsEtOOCNcCLyh7DuUnR1xNiHJgqD/MXlZTPExUHcHtBJA1A+gN9PDm0WjYI= X-Proofpoint-ORIG-GUID: Lk8E2F88_Z5a2ahnBfM5agzJL3mWuPqD X-Authority-Info-Out: v=2.4 cv=H57WAuYi c=1 sm=1 tr=0 ts=6a1bb8cd cx=c_apl:c_pps:t_out a=9OgfyREA4BUYbbCgc0Y0oA==:117 a=9OgfyREA4BUYbbCgc0Y0oA==:17 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=gYzHG_maUEIFqAho4fMA:9 a=QEXdDO2ut3YA:10 From: Luka Bai We added a kmem slab cached called collapse_hint_cache for khugepaged collapse hint, to improve the performance in allocation 
and freeing for the hint structs. Signed-off-by: Luka Bai --- mm/khugepaged.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 5090ffae73f3..04cf85ea5557 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -98,6 +98,7 @@ static unsigned int khugepaged_max_ptes_shared __read_mos= tly; static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); =20 static struct kmem_cache *mm_slot_cache __ro_after_init; +static struct kmem_cache *collapse_hint_cache __ro_after_init; =20 #define KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL 10 =20 @@ -555,6 +556,13 @@ int __init khugepaged_init(void) if (!mm_slot_cache) return -ENOMEM; =20 + collapse_hint_cache =3D KMEM_CACHE(khugepaged_collapse_hint, 0); + if (!collapse_hint_cache) { + kmem_cache_destroy(mm_slot_cache); + mm_slot_cache =3D NULL; + return -ENOMEM; + } + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) INIT_LIST_HEAD(&khugepaged_priority_queue[i]); =20 @@ -569,6 +577,7 @@ int __init khugepaged_init(void) void __init khugepaged_destroy(void) { kmem_cache_destroy(mm_slot_cache); + kmem_cache_destroy(collapse_hint_cache); } =20 static inline int collapse_test_exit(struct mm_struct *mm) @@ -686,7 +695,7 @@ static void khugepaged_release_collapse_hints( =20 list_for_each_entry_safe(hint, tmp, &req->hints, node) { list_del(&hint->node); - kfree(hint); + kmem_cache_free(collapse_hint_cache, hint); } } =20 @@ -3013,7 +3022,7 @@ void khugepaged_add_collapse_hint(struct mm_struct *m= m, if (!mm_flags_test(MMF_VM_HUGEPAGE, mm)) return; =20 - hint =3D kmalloc_obj(struct khugepaged_collapse_hint); + hint =3D kmem_cache_alloc(collapse_hint_cache, GFP_KERNEL); if (!hint) return; =20 @@ -3025,14 +3034,14 @@ void khugepaged_add_collapse_hint(struct mm_struct = *mm, * just "best-effort" optimization. 
*/ if (!spin_trylock(&khugepaged_mm_lock)) { - kfree(hint); + kmem_cache_free(collapse_hint_cache, hint); return; } =20 slot =3D mm_slot_lookup(mm_slots_hash, mm); if (!slot) { spin_unlock(&khugepaged_mm_lock); - kfree(hint); + kmem_cache_free(collapse_hint_cache, hint); return; } khp_mm_slot =3D mm_slot_entry(slot, struct khugepaged_mm_slot, slot); @@ -3125,7 +3134,7 @@ static int collapse_scan_one_priority_entry(unsigned = int progress_max, addr =3D hint->address; =20 if (unlikely(collapse_test_exit_or_disable(mm))) { - kfree(hint); + kmem_cache_free(collapse_hint_cache, hint); break; } =20 @@ -3149,7 +3158,7 @@ static int collapse_scan_one_priority_entry(unsigned = int progress_max, if (*result !=3D SCAN_SUCCEED) (*fail_count)++; skip_hint: - kfree(hint); + kmem_cache_free(collapse_hint_cache, hint); } =20 if (!lock_dropped) --=20 2.52.0 From nobody Mon Jun 8 08:36:53 2026 Received: from outbound.mr.icloud.com (mr-2001e-snip4-7.eps.apple.com [57.103.68.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2622E1A9F85 for ; Sun, 31 May 2026 04:28:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=57.103.68.50 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201694; cv=none; b=mYpKPKlMRTKnb4o3LzKkezrvd9uYyBIO0HnmB84W117Pa9vQX2jy/J1MieKv4YiAy6R3ldNWW2LB5MDEwZvDDDHGb07djWSk2pxZIAdQGBvTRQSp/pvztPr5uXeuI8yHrIWoGoKKd5jtB+IAEzZV1y1oCXyIkMSMORkUqGyr/Uc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201694; c=relaxed/simple; bh=x5fMpXle0vA6Me0ePh/LPjrQrdUhc2UQl/+13ZVwlrc=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=lomZ69/hxDXwaUoaDgd03Jd+0l+cgfzz9mm9RyNQhrrytkge/RRBgz2AH6W5AWBQDRsNKe3Og/+tRjDCnmnHyO5strGOto5kF/B+44i26Pu7o+LI8pAhjLZj8bwrfyMjiuBKeE/2lB2Y4Xwm/iuF6IQIQJx4DIJVSkiGieSZ18U= ARC-Authentication-Results: i=1; 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=y1Jjwsgf; arc=none smtp.client-ip=57.103.68.50 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="y1Jjwsgf" Received: from outbound.mr.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPS id 158A11800119; Sun, 31 May 2026 04:28:06 +0000 (UTC) X-ICL-Out-Info: HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhJB0MFXwteDUAdVAVLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEEVABWB5WXloXXk1FCA9CAVhbCFsEDx9MDFECQgVWXlQKHQRUB10FXVZQAlpLQgRLRWhcBVwcQBdIHV9qS1YUBBFQAVgeVl5aF15NWgJWTQVKA18BWwdDCFVHBUc0UR9VFFIdRA5tGFAWR0BBWh9DFEAFWwRYCxNdTFBfVitGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1780201692; x=1782793692; bh=dismuyxgoUu32oVKxariHZH1KvI6gzuA1yZho9b0qn4=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=y1JjwsgfBvMs52AxyxBeGlVUqgjCWgV0TX+Nzge/UYjFeQWgOQY+NMXTkzUNM2qPb/L7wHfJ9KVyQa3HnewtBs2VrANFUg7rlHoW9T1eMCuSiZc3GFG0EGdI38rmOdhLGjI8WD4Lw38iwOx73fK26I4Hilh4//l/jICZ71sjtGvOiwlPwA1fvxve5FO+6PSWK3yecYS7JhAkkZffUMxHsRlMXfH0uqgMuHqM6d2VZpMPJhzx2W3Xfrm5yxNvuBgkXc3LtnY0rlgMU/K7ffgRsRm0R851O/YcIF9pwef/VTvdnODYaNCjcQ0iNb+BoJ/tTdssYGk6BSgiiulJs+hLuQ== Received: from [127.0.0.1] (unknown [17.57.152.38]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPSA id 7EF4E18000A8; Sun, 31 May 2026 04:27:55 +0000 (UTC) From: Luka Bai Date: Sun, 31 May 2026 12:27:19 +0800 Subject: [PATCH 3/5] mm/khugepaged: add deduplication when adding new collapse hint Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260531-thp_collapse_hint-v1-3-866339cd4c2a@tencent.com> References: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> In-Reply-To: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Kairui Song , Qi Zheng , Shakeel Butt , Axel Rasmussen , Yuanchu Xie , Wei Xu , Rik van Riel , Harry Yoo , Jann Horn , Johannes Weiner , linux-kernel@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1780201643; l=7996; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=N3P/iJQ8BlWvrseSb9JBmO8QdT3darHooCxwnlA5qr0=; b=i5egUqtHfUbtbATcvTL8RDNSKrSYd3FUakErTbxmly9nv8AO94Z/hCdS8r0+l/XMoUa1O1SKH fgD9aPwDeTiBnAYX2j6+OH3DtOeUICslAoC2Ed3/S0y7SMtuuUCHH5O X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Authority-Info-Out: v=2.4 cv=F9xat6hN c=1 sm=1 tr=0 ts=6a1bb8da cx=c_apl:c_pps:t_out a=9OgfyREA4BUYbbCgc0Y0oA==:117 a=9OgfyREA4BUYbbCgc0Y0oA==:17 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=nMWqTqnFzTP9CSPNFE4A:9 a=QEXdDO2ut3YA:10 X-Proofpoint-GUID: 3TSgtxcgck9PoY_i5WmbPItoV4rX2K-- X-Proofpoint-ORIG-GUID: 3TSgtxcgck9PoY_i5WmbPItoV4rX2K-- X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTMxMDA0NiBTYWx0ZWRfX8vlJYMDh8xS4 Kkwu3bFmLwfyB5pwK7bGaUVIJ6ShuM6wgX0voWelHxDosLbaLHhTTgDH9R6CK21Sa13FH5cGhyI 6jgDXmnqapoLJfQZ2/ZDbZr/eSsjmi1ELRPbH3K9g18mGKuWF05xCllDc/vFPu2UEGXEu4zdMdH 0ipt2lQ2EU+50Cs0x7/iQ7js7ht04CHmVm6uVpAUHh3zXBGoAhZmrv6TwZRfQgSo4FS/mb3REqq 
eernGYxQBgoiuPGhW4JaXSNrX+qKcWUl727Ntq1xVsep6tME+ApQKCqdnH3AOWYNwp9v7Da1LoM gMxVQlDOQnGwgV0sgrfAdB6q2VzIf9N2J3faJnrxgJzCyWFft0a3fEW/Dv8Lzc= From: Luka Bai We need to check for duplicates before adding a new collapse hint, and both the lookup and the insertion should be fast. There are several options for doing that: Option 1. Add a Bloom filter for the hint addresses, but that makes it hard to delete a hint after it has been handled. Option 2. Add a hashtable to each khugepaged_mm_slot. But for an efficient setup, the hashtable should have maybe 16 ~ 32 slots, which costs 128 to 256 bytes for each mm_struct. That seems a little wasteful. Option 3. Add an xarray to each khugepaged_mm_slot, which only takes 16 bytes for each mm_struct. However, each time we add a new entry to the xarray, it may cause a memory allocation. Collapse hints are supposed to be a best-effort mechanism, so introducing an xarray seems a little too heavy for the calling function. Option 4. Add a global hashtable for all the memory hints, keyed by their address and mm_struct pointer. The global hashtable mixes the mm_struct pointer and the address into the key, but to save memory the deduplication only compares the address. As a result, hints from different mms with the same address may collide. But as stated above, collapse hints are only best-effort, and such a collision is rare because the lower PMD_SHIFT bits of the address are always 0, which normally gives the mm_struct pointers about 2M of space to scatter in (the key is calculated as ptr of mm ^ PMD-aligned address). With option 4, since the hashtable is global, we decided to protect it with a global lock (we directly use khugepaged_mm_lock here). To avoid unnecessary lock spinning, we use trylock when adding a new hint and exit when contention happens. This is still harmless for the correctness of the mechanism. 
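For illustration only, the option 4 lookup can be sketched in userspace C as below. This is a minimal sketch under stated assumptions, not the kernel code: every sketch_* name is hypothetical, a 2M PMD (4K base pages) is assumed, and a plain multiplicative hash stands in for the bucket selection that the kernel hashtable helpers perform.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4K base pages, so one PMD covers 2M (PMD_SHIFT == 21). */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PMD_MASK  (~(((uintptr_t)1 << SKETCH_PMD_SHIFT) - 1))
#define SKETCH_HASH_BITS 9
#define SKETCH_NBUCKETS  (1u << SKETCH_HASH_BITS)

/* Hypothetical stand-in for a queued hint; only the dedup fields are kept. */
struct sketch_hint {
	struct sketch_hint *next;
	uintptr_t address;		/* PMD-aligned address */
};

static struct sketch_hint *sketch_table[SKETCH_NBUCKETS];

/* The key mixes the mm pointer with the PMD-aligned address. */
static uintptr_t sketch_key(const void *mm, uintptr_t aligned_addr)
{
	return (uintptr_t)mm ^ aligned_addr;
}

/* Stand-in bucket hash (assumption); keeps the top SKETCH_HASH_BITS bits. */
static unsigned int sketch_bucket(uintptr_t key)
{
	uint64_t h = (uint64_t)key * 0x9E3779B97F4A7C15ULL;

	return (unsigned int)(h >> (64 - SKETCH_HASH_BITS));
}

/*
 * Returns true if the hint was added, false if a pending hint with the same
 * PMD-aligned address already sits in the bucket selected by the key.
 * Only the address is compared, so a different mm that lands in the same
 * bucket with the same address is treated as a duplicate as well.
 */
static bool sketch_dedup_add(const void *mm, struct sketch_hint *hint,
			     uintptr_t address)
{
	uintptr_t aligned = address & SKETCH_PMD_MASK;
	unsigned int b = sketch_bucket(sketch_key(mm, aligned));
	struct sketch_hint *cur;

	for (cur = sketch_table[b]; cur; cur = cur->next)
		if (cur->address == aligned)
			return false;

	hint->address = aligned;
	hint->next = sketch_table[b];
	sketch_table[b] = hint;
	return true;
}
```

Two addresses inside the same 2M range of one mm map to the same key, so the second call is dropped; locking (the trylock on khugepaged_mm_lock) and removal of handled hints are left out of the sketch.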
Signed-off-by: Luka Bai --- mm/khugepaged.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++= ---- 1 file changed, 78 insertions(+), 5 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 04cf85ea5557..3f5eb8be06d1 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -100,6 +100,24 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_= SLOTS_HASH_BITS); static struct kmem_cache *mm_slot_cache __ro_after_init; static struct kmem_cache *collapse_hint_cache __ro_after_init; =20 +/* + * Global lookup table used by khugepaged_add_collapse_hint() to deduplica= te + * pending hints against an existing address. The key mixes mm and address + * but the dedup comparison only looks at @address. As a result, two + * different mms hinting the same address may collapse. This is rare + * since the aligned_addr is always 0 for the lower PMD_SHIFT bits, which + * normally gives mm struct about 2M size for scattering (for 4K paging). + * And it's also harmless if the collision happens. + */ +#define KHUGEPAGED_HINTS_HASH_BITS 9 +static DEFINE_HASHTABLE(khugepaged_hint_lookup, KHUGEPAGED_HINTS_HASH_BITS= ); + +static inline unsigned long khugepaged_hint_key(struct mm_struct *mm, + unsigned long aligned_addr) +{ + return (unsigned long)mm ^ aligned_addr; +} + #define KHUGEPAGED_PRIORITY_QUEUE_MAX_FAIL 10 =20 #define KHUGEPAGED_MIN_MTHP_ORDER 2 @@ -165,12 +183,15 @@ static struct khugepaged_scan khugepaged_scan =3D { =20 /** * struct khugepaged_collapse_hint - one collapse hint for a specific addr= ess - * @node: list node on khugepaged_collapse_requests.hints - * @vma: hint pointer to the target VMA - * @address: PMD-aligned virtual address inside @vma to attempt collapsing= on + * @node: list node on khugepaged_collapse_requests.hints + * @hash_node: hlist node on the global khugepaged_hint_lookup table, used + * for deduplication. 
+ * @vma: hint pointer to the target VMA + * @address: PMD-aligned virtual address inside @vma to attempt collapsi= ng on */ struct khugepaged_collapse_hint { struct list_head node; + struct hlist_node hash_node; struct vm_area_struct *vma; unsigned long address; }; @@ -688,6 +709,29 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, __khugepaged_enter(vma->vm_mm); } =20 +/* + * Unhash any hints still queued under @req. Caller must hold + * khugepaged_mm_lock so we can safely unhash each hint from the global + * khugepaged_hint_lookup table. + */ +static void khugepaged_unhash_collapse_hints( + struct khugepaged_collapse_requests *req) +{ + struct khugepaged_collapse_hint *hint, *tmp; + + lockdep_assert_held(&khugepaged_mm_lock); + + list_for_each_entry_safe(hint, tmp, &req->hints, node) { + hash_del(&hint->hash_node); + } +} + +/* + * Free any hints still queued under @req. No lock need to be held. Caller + * must make sure the hints are already unhashed from the global + * khugepaged_hint_lookup table and the mm_slot is removed from the + * khugepaged_priority_queue[]. 
+ */ static void khugepaged_release_collapse_hints( struct khugepaged_collapse_requests *req) { @@ -712,6 +756,14 @@ static void khugepaged_remove_priority_requests(struct= khugepaged_mm_slot *khp_m list_del(&khp_mm_slot->request[i].node); } =20 +static void khugepaged_unhash_all_hints(struct khugepaged_mm_slot *khp_mm_= slot) +{ + int i; + + for (i =3D 0; i < NR_KHUGEPAGED_PRIORITY_LEVEL; i++) + khugepaged_unhash_collapse_hints(&khp_mm_slot->request[i]); +} + static void khugepaged_release_all_hints(struct khugepaged_mm_slot *khp_mm= _slot) { int i; @@ -733,6 +785,7 @@ void __khugepaged_exit(struct mm_struct *mm) hash_del(&slot->hash); list_del(&slot->mm_node); khugepaged_remove_priority_requests(khp_mm_slot); + khugepaged_unhash_all_hints(khp_mm_slot); free =3D 1; } spin_unlock(&khugepaged_mm_lock); @@ -1933,6 +1986,7 @@ static void collect_mm_slot(struct mm_slot *slot) * mm_flags_clear(MMF_VM_HUGEPAGE, mm); */ =20 + khugepaged_unhash_all_hints(khp_mm_slot); /* khugepaged_mm_lock actually not necessary for the below */ khugepaged_release_all_hints(khp_mm_slot); mm_slot_free(mm_slot_cache, khp_mm_slot); @@ -3001,8 +3055,9 @@ void khugepaged_add_collapse_hint(struct mm_struct *m= m, int priority, int max_order) { struct khugepaged_mm_slot *khp_mm_slot; - struct khugepaged_collapse_hint *hint; + struct khugepaged_collapse_hint *hint, *existing; struct mm_slot *slot; + unsigned long aligned_addr, key; int orders; =20 if (!mm || !vma) @@ -3022,12 +3077,15 @@ void khugepaged_add_collapse_hint(struct mm_struct = *mm, if (!mm_flags_test(MMF_VM_HUGEPAGE, mm)) return; =20 + aligned_addr =3D address & HPAGE_PMD_MASK; + key =3D khugepaged_hint_key(mm, aligned_addr); + hint =3D kmem_cache_alloc(collapse_hint_cache, GFP_KERNEL); if (!hint) return; =20 hint->vma =3D vma; - hint->address =3D address & HPAGE_PMD_MASK; + hint->address =3D aligned_addr; =20 /* * Just use try lock to avoid lock contention because collapse hints are @@ -3045,7 +3103,21 @@ void 
khugepaged_add_collapse_hint(struct mm_struct *= mm, return; } khp_mm_slot =3D mm_slot_entry(slot, struct khugepaged_mm_slot, slot); + + /* + * For deduplication. The comparison only checks @address here. See comme= nts + * above khugepaged_hint_lookup definition for details. + */ + hash_for_each_possible(khugepaged_hint_lookup, existing, hash_node, key) { + if (existing->address =3D=3D aligned_addr) { + spin_unlock(&khugepaged_mm_lock); + kmem_cache_free(collapse_hint_cache, hint); + return; + } + } + list_add_tail(&hint->node, &khp_mm_slot->request[priority].hints); + hash_add(khugepaged_hint_lookup, &hint->hash_node, key); spin_unlock(&khugepaged_mm_lock); =20 wake_up_interruptible(&khugepaged_wait); @@ -3124,6 +3196,7 @@ static int collapse_scan_one_priority_entry(unsigned = int progress_max, struct khugepaged_collapse_hint, node); list_del(&hint->node); + hash_del(&hint->hash_node); } spin_unlock(&khugepaged_mm_lock); =20 --=20 2.52.0 From nobody Mon Jun 8 08:36:53 2026 Received: from outbound.mr.icloud.com (mr-2001f-snip4-8.eps.apple.com [57.103.68.61]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 88DF22EC09B for ; Sun, 31 May 2026 04:28:23 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=57.103.68.61 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201704; cv=none; b=fx43kXr79tuRx9CSagHqqxkj6BioQ6p16RPFdRPaJBIUu7N6l2LCIBjXiqSLiIePzwoMc+szBz13vTT6aUdiB7ZaiUbUuIeh83eLqToPWHyuutHwZrKFLI2YN7sx1xAtvbxu4lZ0pnXd1VIziCRGR0891I+IoKRjjG7vZJeTeeE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201704; c=relaxed/simple; bh=STH885JstUH7y5ZxJPHL4Ir8bFwC0BDenu+H0aByyxc=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; 
b=sWv9rf5yIiGZPzQMvslytxGx2OQ8mENgeEjZLwf7B/9zVC6XGnnLHRu8XapBAmw73sOeacHCKIrGSDpW8T3Hdy4V8jIbGZ2WN0oDi/Mwo2NyI1cCly2+o8nBNYFhEMhEudApPIwrj4iitairBgaaCDD7J4UtL4qgBsf+VMmT9rc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=VR3BYHPX; arc=none smtp.client-ip=57.103.68.61 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="VR3BYHPX" Received: from outbound.mr.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPS id 11D151800171; Sun, 31 May 2026 04:28:16 +0000 (UTC) X-ICL-Out-Info: HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhJB0MFXwteDUAdVAVLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEEVABWB5WXloXXk1FCA9CAVhbCFsEDx9MDFECQgVWXlQKHQRUB10FXVZQAlpLQgRLRWhcBVwcQBdIHV9qS1YUBBFQAVgeVl5aF15NWgJWTQVKA18BWwdDCFVHBUc0UR9VFFIdRA5tGFAWR0BBWh9EFEAFWwRYCxNdTFBfVitGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1780201703; x=1782793703; bh=bhbmzU8f+DtLeo7/JiKJOt5bFOzMaSbF+v57Q12naMo=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=VR3BYHPXC/5ooK91OfM9H7foeW+CQSHUnC+y9ZCelBzSsdaOv6XMTtXaOQs+HSqO5UeF7fLQ2IJIZA3Z7uVVyeQI+vCvWKqu/wstdMaHb70qGkAbdM2PLjM8YFq0I8Zp1zbcur1RDH9YmLLBI4iHjB9bElUEKgepwYJoZVOy6V7HeOZNjl10X8Phe+QJNFCyMgbqgYPVuYpC0ciuBjk7Rd6fCWR9Z89PJEJ9+TThbcYh5FsDyQo+W6XrCifuaaXxNH8T2pteqrRCQ7kbD5FVBoBgvNf4GRzV72pa2RbKE5nwCcddgBpyEIClC2/j4giwkKgpOh6wS1wH5awxvFgOEQ== Received: from [127.0.0.1] (unknown [17.57.152.38]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPSA id C45731800114; Sun, 31 May 
2026 04:28:06 +0000 (UTC) From: Luka Bai Date: Sun, 31 May 2026 12:27:20 +0800 Subject: [PATCH 4/5] mm/khugepaged: add accounting for successful hint or non-hint collapse Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260531-thp_collapse_hint-v1-4-866339cd4c2a@tencent.com> References: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> In-Reply-To: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Kairui Song , Qi Zheng , Shakeel Butt , Axel Rasmussen , Yuanchu Xie , Wei Xu , Rik van Riel , Harry Yoo , Jann Horn , Johannes Weiner , linux-kernel@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1780201643; l=3543; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=fMQSPKl1zRvOYMdFUia4IDMTy4aTjy06OEcQUlcbylM=; b=g6hskuajwDifuMlmQvwXr7m3gvsQsG3xboa6bdOJ6SgjH/Fyfhd7w2b5Q80kpTF2fGF4LNo8F 0yMQCfVmh3mArC+6szHFufSG++WLY6mSpGGl0aVod/sTQ7MAhojQ/Pz X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Authority-Info-Out: v=2.4 cv=AvTjHe9P c=1 sm=1 tr=0 ts=6a1bb8e4 cx=c_apl:c_pps:t_out a=9OgfyREA4BUYbbCgc0Y0oA==:117 a=9OgfyREA4BUYbbCgc0Y0oA==:17 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=0Ztl7DgICfUfD0EaiaYA:9 a=QEXdDO2ut3YA:10 X-Proofpoint-ORIG-GUID: yz54fiqE8TjX4RhqkROvSzCb1_eTk_JV X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTMxMDA0NiBTYWx0ZWRfX6jolbC6oTORJ jDnOhHXnV+NXz2xs37S9CoZAHvweRKv5SyS3VtzwTOkjyvw7C6aQkmgdFQ6Zim/eeFsF2jhmPod 
AoOFBOPvCz9EX/SSpHehGjp0fT6niK7Npb3Dzhac6aEIvkF5m09cSyMTw8CZbPRnww38EO4NYQp naXS1Wi4WIMDWuT6nN8LnShZL9p32T1od0eDceDXhBXp9Ob09iHPWS2okUGhXpCr/Rgq71n8Z5O Us38k33MJiQkrXn1TfNeAaQUGWiOz3yRh3JZiZEcqe1EH9oUY7uoib0htE5dLIX0fJpTFTcpAjU zIkJ/PsXG/MvhBUpS0PdxgFMxtZPOHe+yjr9erkbx0hLH+cdLl636DHL9DyjhU= X-Proofpoint-GUID: yz54fiqE8TjX4RhqkROvSzCb1_eTk_JV From: Luka Bai Add two mthp attributes for the accounting of the number of successful khugepaged collapse, either by hint or not by hint so that we can know them easily from userspace. Note that these two statistics only care about the collapse initiated by khugepaged, and they will not consider the collapse raised by MADV_COLLAPSE. Signed-off-by: Luka Bai --- include/linux/huge_mm.h | 2 ++ mm/huge_memory.c | 4 ++++ mm/khugepaged.c | 18 +++++++++++++++++- 3 files changed, 23 insertions(+), 1 deletion(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index edece3e26985..9df0d7f71e95 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -147,6 +147,8 @@ enum mthp_stat_item { MTHP_STAT_COLLAPSE_EXCEED_SWAP, MTHP_STAT_COLLAPSE_EXCEED_NONE, MTHP_STAT_COLLAPSE_EXCEED_SHARED, + MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT, + MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT, __MTHP_STAT_COUNT }; =20 diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bf9b480bb3b0..0031fb4b0b09 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -720,6 +720,8 @@ DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_ST= AT_NR_ANON_PARTIALLY_MAPP DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_= SWAP); DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_= NONE); DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEE= D_SHARED); +DEFINE_MTHP_STAT_ATTR(khugepaged_collapse_hint, MTHP_STAT_KHUGEPAGED_COLLA= PSE_HINT); +DEFINE_MTHP_STAT_ATTR(khugepaged_collapse_non_hint, MTHP_STAT_KHUGEPAGED_C= OLLAPSE_NON_HINT); =20 =20 static struct attribute *anon_stats_attrs[] =3D { @@ 
-775,6 +777,8 @@ static struct attribute *any_stats_attrs[] =3D { &split_failed_attr.attr, &collapse_alloc_attr.attr, &collapse_alloc_failed_attr.attr, + &khugepaged_collapse_hint_attr.attr, + &khugepaged_collapse_non_hint_attr.attr, NULL, }; =20 diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3f5eb8be06d1..2f21c0b6ab46 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -147,6 +147,15 @@ struct mthp_range { struct collapse_control { bool is_khugepaged; =20 + /* + * True while khugepaged is draining a collapse hint queued via + * khugepaged_add_collapse_hint(). Used by collapse_single_pmd() to + * attribute a successful collapse to MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT + * or MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT. Only meaningful when the + * collapse is initiated by khugepaged (is_khugepaged =3D=3D true). + */ + bool from_priority_hint; + /* Num pages scanned per node */ u32 node_load[MAX_NUMNODES]; =20 @@ -3012,8 +3021,13 @@ static enum scan_result collapse_single_pmd(unsigned= long addr, mmap_read_unlock(mm); } end: - if (cc->is_khugepaged && result =3D=3D SCAN_SUCCEED) + if (cc->is_khugepaged && result =3D=3D SCAN_SUCCEED) { ++khugepaged_pages_collapsed; + count_mthp_stat(HPAGE_PMD_ORDER, + cc->from_priority_hint ? 
+ MTHP_STAT_KHUGEPAGED_COLLAPSE_HINT : + MTHP_STAT_KHUGEPAGED_COLLAPSE_NON_HINT); + } return result; } =20 @@ -3227,7 +3241,9 @@ static int collapse_scan_one_priority_entry(unsigned = int progress_max, addr + HPAGE_PMD_SIZE > ALIGN_DOWN(vma->vm_end, HPAGE_PMD_SIZE)) goto skip_hint; =20 + cc->from_priority_hint =3D true; *result =3D collapse_single_pmd(addr, vma, &lock_dropped, cc); + cc->from_priority_hint =3D false; if (*result !=3D SCAN_SUCCEED) (*fail_count)++; skip_hint: --=20 2.52.0 From nobody Mon Jun 8 08:36:53 2026 Received: from outbound.mr.icloud.com (mr-2005a-snip4-7.eps.apple.com [57.103.71.110]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 47DFF2F3C07 for ; Sun, 31 May 2026 04:28:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=57.103.71.110 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201712; cv=none; b=WVmd4SKGxAwnAmWcPv4vodS9cdPG/IlMebWniqs1fwGM9Sg/PbkSHWkG9b+6uzqukFAtdmQbzPcSjDV0sdULoEZk+M8Sw/UQM+Iizflx8eZzxuv46cJ6wxiW8sS6DkFBDg0YS0wqF89ydX0OYcqlMPHq4m9yYTvvh2myI7Z3D5o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780201712; c=relaxed/simple; bh=0IoEA3iaPVlGIi3GbIOyhrlOIYwXDN6lMrvSKjm2mi0=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=DVDNz3CozdsE1+J7e5rDWy0dGf2ctC4fB+v2c2vyu6U/+7rNHg6qcozgzR3N1VHYlk+Jm898h74f/cLvLbsvwG3TsWXh726P6tVN0NBqnZ+GMByes45Hn1CJk8kms/2P6h2WmHjlRnkPVJ1c081tiQYCt57E9PHJOhYUSsHPE9w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com; spf=pass smtp.mailfrom=icloud.com; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b=bIgoOTdO; arc=none smtp.client-ip=57.103.71.110 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=icloud.com 
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=icloud.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=icloud.com header.i=@icloud.com header.b="bIgoOTdO" Received: from outbound.mr.icloud.com (unknown [127.0.0.2]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPS id 58840180016D; Sun, 31 May 2026 04:28:28 +0000 (UTC) X-ICL-Out-Info: HUtFAUMEWwJACUgBTUQeDx5WFlZNRAJCTQhJB0MFXwteDUAdVAVLVxQEFEYGVg1dE0wLcwRUB10FXVZQAlpLVBQEEVABWB5WXloXXk1FCA9CAVhbCFsEDx9MDFECQgVWXlQKHQRUB10FXVZQAlpLQgRLRWhcBVwcQBdIHV9qS1YUBBFQAVgeVl5aF15NWgJWTQVKA18BWwdDCFVHBUc0UR9VFFIdRA5tGFAWR0BBWh9FFEAFWwRYCxNdTFBfVitGFVcbVgNDRVEfVEYTGU4bV01QG18CQg8= Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=icloud.com; s=1a1hai; t=1780201710; x=1782793710; bh=aJuP8KM5OC30RVX6IAZKOJALOHyJ5CznciVNyRZ7oso=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:To:x-icloud-hme; b=bIgoOTdOpAwylAwKjgQxuhyRzeag/KbhuUHNBEXFCDFP6Pjql+r4QMHTAJSx6q1/PWq+2uxxEjn/PcymrNxBjaqAGkbIox3Kbt3h7mp3TKh5LCkeZ/Prc1lUgLvuNojcdf3AMbybbm/FVGtxkAH5LG0PzxnsaZcSkM09BBmcBoel4rRybCxZDhwODeg3BcNvwEAobjRRX79G6Y1ChmVxFQ3Y3e0WDfsgm1hn0ICKehdo6NMp0EJ0JkygolNN6fpWLCXHLiersfwRuBgGPYGj0F7SS5wCas0sEguS+sSKkzHXZJuQnrXjITly+Fs2p3qzlOpFfPYfJcax9AziER6r7A== Received: from [127.0.0.1] (unknown [17.57.152.38]) by p00-icloudmta-asmtp-us-west-2a-100-percent-4 (Postfix) with ESMTPSA id 1AA7D1800172; Sun, 31 May 2026 04:28:16 +0000 (UTC) From: Luka Bai Date: Sun, 31 May 2026 12:27:21 +0800 Subject: [PATCH 5/5] mm/khugepaged: add khugepaged collapse hint in mglru reference checking Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260531-thp_collapse_hint-v1-5-866339cd4c2a@tencent.com> References: <20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> In-Reply-To: 
<20260531-thp_collapse_hint-v1-0-866339cd4c2a@tencent.com> To: linux-mm@kvack.org Cc: Andrew Morton , David Hildenbrand , Lorenzo Stoakes , Zi Yan , Baolin Wang , "Liam R. Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Lance Yang , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Kairui Song , Qi Zheng , Shakeel Butt , Axel Rasmussen , Yuanchu Xie , Wei Xu , Rik van Riel , Harry Yoo , Jann Horn , Johannes Weiner , linux-kernel@vger.kernel.org, Luka Bai X-Mailer: b4 0.15.2 X-Developer-Signature: v=1; a=ed25519-sha256; t=1780201643; l=11044; i=lukabai@tencent.com; s=20260501; h=from:subject:message-id; bh=EoL6aiFTXZnLAEbw4iXTl+sM+J/nPblpmXjGgAwVipM=; b=oVy+ULdgJO194sz343uXkhFD88xvpUSgqR8u90G61/cMZuZX8QMYmgOb/gX884rYQQ7PEFWSr CuBjstk1ETLBtNUXvC5Gcc++AnNBpcJgHASwl7auzluTwUV12U+cNSW X-Developer-Key: i=lukabai@tencent.com; a=ed25519; pk=KeaVteSWd00GIAjFyWZnuFsKAKixjga1ZkLMcI66nPM= X-Proofpoint-ORIG-GUID: bAljHtwMxCkq2ssXyWcEEq_Fx4voKYmt X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTMxMDA0NiBTYWx0ZWRfX7utu1cuWg6Mk YX90SsXpWD6LlExkptCSwgAph1bl1zbOVg6AAo/Ec4Xt8BbFT5/juy0QyoZVkN1wMqwrRu3o3hx TdOXUhwhOCcsJah6gJRi92YjUHb6pDX3KMlvR9UULjiBQUrBt7VRo11F1JGZYs2aEvcQV3OOyMK 9OLvDvW8jrxwuxHkcpFHT1V2KNutBW+Gpk3km1i0evRw/IFBr844RAqaMAKnpnau8usRCDrxzRa X+R24heHLTKeBBHVWusgZgQTpcUAg2AZBNPlsqFeDL35OLqqTqtbBr6HSsTcGqeyNTnS5nukNzk TqjCcevBwPrTD+Di5BaoRj/sSdgb+fcmqWTxA6uQ3RWMdje71PNiL9tQOuRL7M= X-Authority-Info-Out: v=2.4 cv=Jov8bc4C c=1 sm=1 tr=0 ts=6a1bb8ee cx=c_apl:c_pps:t_out a=9OgfyREA4BUYbbCgc0Y0oA==:117 a=9OgfyREA4BUYbbCgc0Y0oA==:17 a=IkcTkHD0fZMA:10 a=NGcC8JguVDcA:10 a=x7bEGLp0ZPQA:10 a=UaoJkeuwEpQA:10 a=VkNPw1HP01LnGYTKEx00:22 a=GvQkQWPkAAAA:8 a=6HC9NpuDrtEngJUpRywA:9 a=QEXdDO2ut3YA:10 X-Proofpoint-GUID: bAljHtwMxCkq2ssXyWcEEq_Fx4voKYmt From: Luka Bai Function lru_gen_look_around() works for mglru, which is a good way to reduce the rmap iteration. It is called in folio_referenced_one() when it tried to reclaim a cold page. 
Once it holds the page table entry lock, it also checks the nearby ptes and updates their generation if they have been accessed too, relying on the locality seen in most workloads, and it puts the pmd that it considers full of hot pages into a Bloom filter for the walk in the next aging. Function walk_mm() is used in mglru during aging. It goes through the pmds of an mm_struct and uses the Bloom filter, which is set up in lru_gen_look_around() above, to tell that many pages under a pmd are frequently accessed. Since lru_gen_look_around() and walk_mm() already find hot pmd areas, we can also use their findings as a good source of khugepaged collapse hints, so we generate collapse hints from there. Note that lru_gen_look_around() is called with the ptl held, so we don't want to call khugepaged_add_collapse_hint() directly inside it because it may try to allocate memory. So we introduce a new struct area_access_info, use it to carry the access info out, and queue the collapse hints after the ptl is released. 
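The record-under-lock, hint-after-unlock pattern described above can be sketched in plain C as follows. This is only a sketch under stated assumptions: the sketch_* names are hypothetical, SKETCH_NR_PRIORITY is an assumed value standing in for NR_KHUGEPAGED_PRIORITY_LEVEL, and sketch_queue_hint() is a counting stub in place of khugepaged_add_collapse_hint().

```c
/* Assumption: the real value is NR_KHUGEPAGED_PRIORITY_LEVEL from patch 1. */
#define SKETCH_NR_PRIORITY 2
/* Mirrors NR_ACC_INFO_EACH_ITER: a few on-stack slots per rmap walk. */
#define SKETCH_NR_ACC_INFO 3

struct sketch_access_info {
	unsigned long address;
	int total;
	int young;
	int max_order;
};

/* Same rule as get_khp_collapse_priority(): at least half young -> 0. */
static int sketch_priority(int total, int young)
{
	if (young * 2 >= total)
		return 0;
	return SKETCH_NR_PRIORITY - 1;
}

/*
 * Called while the page table lock is held: only copies numbers into a
 * preallocated slot, never allocates. Extra records are silently dropped,
 * which is acceptable because hints are best-effort.
 */
static int sketch_record(struct sketch_access_info *slots, int count,
			 unsigned long address, int total, int young,
			 int max_order)
{
	if (count >= SKETCH_NR_ACC_INFO)
		return count;

	slots[count].address = address;
	slots[count].total = total;
	slots[count].young = young;
	slots[count].max_order = max_order;
	return count + 1;
}

/* Counting stub (assumption) in place of khugepaged_add_collapse_hint(). */
static int sketch_hints_queued;

static void sketch_queue_hint(unsigned long address, int priority,
			      int max_order)
{
	(void)address;
	(void)priority;
	(void)max_order;
	sketch_hints_queued++;
}

/* Called after the lock is dropped: now it is safe to allocate and queue. */
static void sketch_drain(const struct sketch_access_info *slots, int count)
{
	int i;

	for (i = count - 1; i >= 0; i--)
		sketch_queue_hint(slots[i].address,
				  sketch_priority(slots[i].total,
						  slots[i].young),
				  slots[i].max_order);
}
```

The fixed-size array keeps the locked section free of allocation; the cost is that a walk which finds more than SKETCH_NR_ACC_INFO hot areas loses the extra hints.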
Signed-off-by: Luka Bai --- include/linux/khugepaged.h | 7 +++++++ include/linux/mmzone.h | 17 +++++++++++++++-- mm/khugepaged.c | 12 ++++++++++++ mm/rmap.c | 27 ++++++++++++++++++++++++++- mm/vmscan.c | 33 +++++++++++++++++++++++++++++---- 5 files changed, 89 insertions(+), 7 deletions(-) diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h index 815ae87f0f8e..e0793569a9f0 100644 --- a/include/linux/khugepaged.h +++ b/include/linux/khugepaged.h @@ -17,6 +17,7 @@ extern void khugepaged_enter_vma(struct vm_area_struct *v= ma, vm_flags_t vm_flags); extern void khugepaged_min_free_kbytes_update(void); extern bool current_is_khugepaged(void); +extern int get_khp_collapse_priority(int total, int young); extern void khugepaged_add_collapse_hint(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, @@ -62,6 +63,12 @@ static inline bool current_is_khugepaged(void) { return false; } + +static inline int get_khp_collapse_priority(int total, int young) +{ + return 0; +} + static inline void khugepaged_add_collapse_hint(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1331a7b93f33..643dd500c121 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -441,6 +441,18 @@ enum lruvec_flags { =20 #endif /* !__GENERATING_BOUNDS_H */ =20 +/* + * Used to get the young and total counts for a memory area, + * and also the maximum order of all the page table entries + * during scanning. + */ +struct area_access_info { + unsigned long address; + int total; + int young; + int max_order; +}; + /* * Evictable folios are divided into multiple generations. The youngest an= d the * oldest generation numbers, max_seq and min_seq, are monotonically incre= asing. 
@@ -689,7 +701,8 @@ struct lru_gen_memcg {
 
 void lru_gen_init_pgdat(struct pglist_data *pgdat);
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr,
+			 struct area_access_info **acc_info_ptr);
 
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
 void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -712,7 +725,7 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 }
 
 static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw,
-				       unsigned int nr)
+		unsigned int nr, struct area_access_info **acc_info_ptr)
 {
 	return false;
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2f21c0b6ab46..50c363846720 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -3031,6 +3031,18 @@ static enum scan_result collapse_single_pmd(unsigned long addr,
 	return result;
 }
 
+/*
+ * The caller needs to make sure the pmd is at least qualified for the
+ * lowest priority of collapsing since this function will always return
+ * a legal priority value.
+ */
+int get_khp_collapse_priority(int total, int young)
+{
+	if (young * 2 >= total)
+		return 0;
+	return NR_KHUGEPAGED_PRIORITY_LEVEL - 1;
+}
+
 /*
  * khugepaged_add_collapse_hint - enqueue a collapse hint
  * @mm: target mm
diff --git a/mm/rmap.c b/mm/rmap.c
index 1c77d5dc06e9..1cd111e7b299 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -75,6 +75,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -911,6 +912,12 @@ struct folio_referenced_arg {
 	struct mem_cgroup *memcg;
 };
 
+/*
+ * acc_info is currently only used to track access patterns for khugepaged
+ * collapse hints. 3 entries are enough for most cases, and it's totally
+ * safe if we missed some hints.
+ */
+#define NR_ACC_INFO_EACH_ITER 3
 /*
  * arg: folio_referenced_arg will be passed
  */
@@ -921,6 +928,8 @@ static bool folio_referenced_one(struct folio *folio,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int ptes = 0, referenced = 0;
 	unsigned int nr;
+	struct area_access_info acc_info[NR_ACC_INFO_EACH_ITER] = {0};
+	int acc_info_count = 0;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
@@ -979,8 +988,16 @@ static bool folio_referenced_one(struct folio *folio,
 		 * simplest approach is to disable this look-around optimization.
 		 */
 		if (lru_gen_enabled() && !lru_gen_switching() && pvmw.pte) {
-			if (lru_gen_look_around(&pvmw, nr))
+			struct area_access_info *acc_info_ptr = NULL;
+
+			/* If the acc_info is full, skip the remaining ones */
+			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+			    acc_info_count < NR_ACC_INFO_EACH_ITER)
+				acc_info_ptr = &acc_info[acc_info_count];
+			if (lru_gen_look_around(&pvmw, nr, &acc_info_ptr))
 				referenced++;
+			if (acc_info_ptr && acc_info_ptr != &acc_info[acc_info_count])
+				acc_info_count++;
 		} else if (pvmw.pte) {
 			if (clear_flush_young_ptes_notify(vma, address, pvmw.pte, nr))
 				referenced++;
@@ -1019,6 +1036,14 @@ static bool folio_referenced_one(struct folio *folio,
 		pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
 	}
 
+	for (--acc_info_count; acc_info_count >= 0; acc_info_count--) {
+		khugepaged_add_collapse_hint(vma->vm_mm, vma,
+			acc_info[acc_info_count].address,
+			get_khp_collapse_priority(acc_info[acc_info_count].total,
+				acc_info[acc_info_count].young),
+			acc_info[acc_info_count].max_order);
+	}
+
 	if (!pra->mapcount)
 		return false; /* To break the loop */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8a90911bf88..a0caf5cac951 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3463,7 +3463,7 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
 }
 
 static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
-			   struct mm_walk *args)
+			   struct mm_walk *args, struct area_access_info *acc_info)
 {
 	int i;
 	bool dirty;
@@ -3472,6 +3472,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	unsigned long addr;
 	int total = 0;
 	int young = 0;
+	int max_order = 0;
 	struct folio *last = NULL;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3522,6 +3523,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 						   max_nr, FPB_MERGE_YOUNG_DIRTY);
 			total += nr - 1;
 			walk->mm_stats[MM_LEAF_TOTAL] += nr - 1;
+			max_order = max(max_order, folio_order(folio));
 		}
 
 		if (!test_and_clear_young_ptes_notify(args->vma, addr, cur_pte, nr))
@@ -3550,6 +3552,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	lazy_mmu_mode_disable();
 	pte_unmap_unlock(pte, ptl);
 
+	acc_info->young = young;
+	acc_info->max_order = max_order;
+	acc_info->total = total;
 	return suitable_to_scan(total, young);
 }
 
@@ -3667,6 +3672,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	vma = args->vma;
 	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
 		pmd_t val = pmdp_get_lockless(pmd + i);
+		struct area_access_info acc_info = {0};
 
 		next = pmd_addr_end(addr, end);
 
@@ -3699,11 +3705,16 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 
 		walk->mm_stats[MM_NONLEAF_FOUND]++;
 
-		if (!walk_pte_range(&val, addr, next, args))
+		if (!walk_pte_range(&val, addr, next, args, &acc_info))
 			continue;
 
 		walk->mm_stats[MM_NONLEAF_ADDED]++;
 
+		/* When acc_info has valid value */
+		if (acc_info.total > 0)
+			khugepaged_add_collapse_hint(vma->vm_mm, vma, addr,
+				get_khp_collapse_priority(acc_info.total, acc_info.young),
+				acc_info.max_order);
 		/* carry over to the next generation */
 		update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
 	}
@@ -4183,7 +4194,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
  * the PTE table to the Bloom filter. This forms a feedback loop between the
  * eviction and the aging.
  */
-bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr,
+			 struct area_access_info **acc_info_ptr)
 {
 	int i;
 	bool dirty;
@@ -4202,6 +4214,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
 	struct lru_gen_mm_state *mm_state;
 	unsigned long max_seq;
 	int gen;
+	unsigned int max_order = 0;
 
 	lockdep_assert_held(pvmw->ptl);
 	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
@@ -4265,6 +4278,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
 
 			nr = folio_pte_batch_flags(folio, NULL, pte, &ptent,
 						   max_nr, FPB_MERGE_YOUNG_DIRTY);
+			max_order = max(folio_order(folio), max_order);
 		}
 
 		if (!test_and_clear_young_ptes_notify(vma, addr, pte, nr))
@@ -4288,8 +4302,19 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw, unsigned int nr)
 	lazy_mmu_mode_disable();
 
 	/* feedback from rmap walkers to page table walkers */
-	if (mm_state && suitable_to_scan(i, young))
+	if (mm_state && suitable_to_scan(i, young)) {
+		if (*acc_info_ptr) {
+			struct area_access_info acc_info = {
+				.address = start,
+				.total = i,
+				.young = young,
+				.max_order = max_order
+			};
+			*(*acc_info_ptr) = acc_info;
+			(*acc_info_ptr)++;
+		}
 		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+	}
 
 	mem_cgroup_put(memcg);
 
-- 
2.52.0