From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net, 21cnbao@gmail.com, shakeel.butt@linux.dev, tj@kernel.org, lance.yang@linux.dev, rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Yafang Shao, Yang Shi
Subject: [RFC PATCH v10 mm-new 1/9] mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
Date: Wed, 15 Oct 2025 22:17:08 +0800
Message-Id: <20251015141716.887-2-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References: <20251015141716.887-1-laoar.shao@gmail.com>

The khugepaged_enter_vma() function must handle two specific scenarios:

1. New VMA creation
   When a new VMA is created (for an anonymous VMA, this is deferred to the
   page fault), if vma->vm_mm is not yet present in khugepaged_mm_slot, it
   must be added.
   In this case, khugepaged_enter_vma() is called after vma->vm_flags has
   been set, allowing direct use of the VMA's flags.

2. VMA flag modification
   When vma->vm_flags is modified (particularly when VM_HUGEPAGE is set),
   the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot.
   Currently, khugepaged_enter_vma() is called before the flag update, so
   the call must be relocated to occur after vma->vm_flags has been set.

khugepaged_enter_vma() is also called in the VMA merging path. For that
case, since VMA merging only occurs when the vm_flags of both VMAs are
identical (excluding special flags such as VM_SOFTDIRTY), we can safely use
target->vm_flags instead. (It is worth noting that khugepaged_enter_vma()
could be removed from the VMA merging path altogether, because the VMA has
already been added in the two cases above; that cleanup will be addressed
in a separate patch.)

After this change, the vm_flags parameter can also be removed from
thp_vma_allowable_order(). That is handled in a follow-up patch.
Signed-off-by: Yafang Shao
Cc: Yang Shi
Cc: Usama Arif
---
 include/linux/khugepaged.h | 10 ++++++----
 mm/huge_memory.c           |  2 +-
 mm/khugepaged.c            | 27 ++++++++++++++-------------
 mm/madvise.c               |  7 +++++++
 mm/vma.c                   |  6 +++---
 5 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..b30814d3d665 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -13,8 +13,8 @@ extern void khugepaged_destroy(void);
 extern int start_stop_khugepaged(void);
 extern void __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
-extern void khugepaged_enter_vma(struct vm_area_struct *vma,
-				 vm_flags_t vm_flags);
+extern void khugepaged_enter_vma(struct vm_area_struct *vma);
+extern void khugepaged_enter_mm(struct mm_struct *mm);
 extern void khugepaged_min_free_kbytes_update(void);
 extern bool current_is_khugepaged(void);
 extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
@@ -38,8 +38,10 @@ static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm
 static inline void khugepaged_exit(struct mm_struct *mm)
 {
 }
-static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
-					vm_flags_t vm_flags)
+static inline void khugepaged_enter_vma(struct vm_area_struct *vma)
+{
+}
+static inline void khugepaged_enter_mm(struct mm_struct *mm)
 {
 }
 static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..ac6601f30e65 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,7 +1346,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
-	khugepaged_enter_vma(vma, vma->vm_flags);
+	khugepaged_enter_vma(vma);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 	    !mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7ab2d1a42df3..5088eedafc35 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -353,12 +353,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 #endif
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
-		/*
-		 * If the vma become good for khugepaged to scan,
-		 * register it here without waiting a page fault that
-		 * may not happen any time soon.
-		 */
-		khugepaged_enter_vma(vma, *vm_flags);
 		break;
 	case MADV_NOHUGEPAGE:
 		*vm_flags &= ~VM_HUGEPAGE;
@@ -460,14 +454,21 @@ void __khugepaged_enter(struct mm_struct *mm)
 		wake_up_interruptible(&khugepaged_wait);
 }
 
-void khugepaged_enter_vma(struct vm_area_struct *vma,
-			  vm_flags_t vm_flags)
+void khugepaged_enter_mm(struct mm_struct *mm)
 {
-	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
-	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
-			__khugepaged_enter(vma->vm_mm);
-	}
+	if (mm_flags_test(MMF_VM_HUGEPAGE, mm))
+		return;
+	if (!hugepage_pmd_enabled())
+		return;
+
+	__khugepaged_enter(mm);
+}
+
+void khugepaged_enter_vma(struct vm_area_struct *vma)
+{
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		return;
+	khugepaged_enter_mm(vma->vm_mm);
 }
 
 void __khugepaged_exit(struct mm_struct *mm)
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..8de7c39305dd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1425,6 +1425,13 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
 
 	error = madvise_update_vma(new_flags, madv_behavior);
+	/*
+	 * If the vma become good for khugepaged to scan,
+	 * register it here without waiting a page fault that
+	 * may not happen any time soon.
+	 */
+	if (!error && new_flags & VM_HUGEPAGE)
+		khugepaged_enter_mm(vma->vm_mm);
 out:
 	/*
 	 * madvise() returns EAGAIN if kernel resources, such as
diff --git a/mm/vma.c b/mm/vma.c
index a1ec405bda25..6a548b0d64cd 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -973,7 +973,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
 	if (err || commit_merge(vmg))
 		goto abort;
 
-	khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+	khugepaged_enter_vma(vmg->target);
 	vmg->state = VMA_MERGE_SUCCESS;
 	return vmg->target;
 
@@ -1093,7 +1093,7 @@ struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg)
 	 * following VMA if we have VMAs on both sides.
 	 */
 	if (vmg->target && !vma_expand(vmg)) {
-		khugepaged_enter_vma(vmg->target, vmg->vm_flags);
+		khugepaged_enter_vma(vmg->target);
 		vmg->state = VMA_MERGE_SUCCESS;
 		return vmg->target;
 	}
@@ -2520,7 +2520,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	 * call covers the non-merge case.
 	 */
 	if (!vma_is_anonymous(vma))
-		khugepaged_enter_vma(vma, map->vm_flags);
+		khugepaged_enter_vma(vma);
 	*vmap = vma;
 	return 0;
 
-- 
2.47.3

From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com,
willy@infradead.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net, 21cnbao@gmail.com, shakeel.butt@linux.dev, tj@kernel.org, lance.yang@linux.dev, rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Yafang Shao
Subject: [RFC PATCH v10 mm-new 2/9] mm: thp: remove vm_flags parameter from thp_vma_allowable_order()
Date: Wed, 15 Oct 2025 22:17:09 +0800
Message-Id: <20251015141716.887-3-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References: <20251015141716.887-1-laoar.shao@gmail.com>

Because all callers of thp_vma_allowable_order() pass vma->vm_flags as the
vm_flags argument, the parameter can be removed and the function can access
vma->vm_flags directly.
Signed-off-by: Yafang Shao
Acked-by: Usama Arif
---
 fs/proc/task_mmu.c      |  3 +--
 include/linux/huge_mm.h | 16 ++++++++--------
 mm/huge_memory.c        |  4 ++--
 mm/khugepaged.c         | 10 +++++-----
 mm/memory.c             | 11 +++++------
 mm/shmem.c              |  2 +-
 6 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..e713d1905750 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1369,8 +1369,7 @@ static int show_smap(struct seq_file *m, void *v)
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %8u\n",
-		   !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
-					      THP_ORDERS_ALL));
+		   !!thp_vma_allowable_orders(vma, TVA_SMAPS, THP_ORDERS_ALL));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f327d62fc985..a635dcbb2b99 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,8 +101,8 @@ enum tva_type {
 	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
 };
 
-#define thp_vma_allowable_order(vma, vm_flags, type, order) \
-	(!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
+#define thp_vma_allowable_order(vma, type, order) \
+	(!!thp_vma_allowable_orders(vma, type, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
 
@@ -266,14 +266,12 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 }
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
-					 vm_flags_t vm_flags,
 					 enum tva_type type,
 					 unsigned long orders);
 
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
- * @vm_flags: use these vm_flags instead of vma->vm_flags
  * @type: TVA type
  * @orders: bitfield of all orders to consider
  *
@@ -287,10 +285,11 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  */
 static inline
 unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
-				       vm_flags_t vm_flags,
 				       enum tva_type type,
 				       unsigned long orders)
 {
+	vm_flags_t vm_flags = vma->vm_flags;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
@@ -309,7 +308,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 		return 0;
 	}
 
-	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
+	return __thp_vma_allowable_orders(vma, type, orders);
 }
 
 struct thpsize {
@@ -329,8 +328,10 @@ struct thpsize {
  * through madvise or prctl.
  */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
-				    vm_flags_t vm_flags, bool forced_collapse)
+				    bool forced_collapse)
 {
+	vm_flags_t vm_flags = vma->vm_flags;
+
 	/* Are THPs disabled for this VMA? */
 	if (vm_flags & VM_NOHUGEPAGE)
 		return true;
@@ -560,7 +561,6 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 }
 
 static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
-						     vm_flags_t vm_flags,
 						     enum tva_type type,
 						     unsigned long orders)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ac6601f30e65..1ac476fe6dc5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -98,7 +98,6 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 }
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
-					 vm_flags_t vm_flags,
 					 enum tva_type type,
 					 unsigned long orders)
 {
@@ -106,6 +105,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	const bool in_pf = type == TVA_PAGEFAULT;
 	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
 	unsigned long supported_orders;
+	vm_flags_t vm_flags = vma->vm_flags;
 
 	/* Check the intersection of requested and supported orders. */
 	if (vma_is_anonymous(vma))
@@ -122,7 +122,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if (!vma->vm_mm)		/* vdso */
 		return 0;
 
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma, forced_collapse))
 		return 0;
 
 	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5088eedafc35..b60f1856714a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -466,7 +466,7 @@ void khugepaged_enter_mm(struct mm_struct *mm)
 
 void khugepaged_enter_vma(struct vm_area_struct *vma)
 {
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, TVA_KHUGEPAGED, PMD_ORDER))
 		return;
 	khugepaged_enter_mm(vma->vm_mm);
 }
@@ -917,7 +917,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 
 	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, type, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1531,7 +1531,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	 * and map it by a PMD, regardless of sysfs THP settings. As such, let's
 	 * analogously elide sysfs THP settings here and force collapse.
 	 */
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2426,7 +2426,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!thp_vma_allowable_order(vma, TVA_KHUGEPAGED, PMD_ORDER)) {
 skip:
 			progress++;
 			continue;
@@ -2757,7 +2757,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
 
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return -EINVAL;
 
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 7e32eb79ba99..cd04e4894725 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4558,7 +4558,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
@@ -5107,7 +5107,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	 * for this vma. Then filter out the orders that can't be allocated over
 	 * the faulting address and still be fully contained in the vma.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
@@ -5379,7 +5379,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	 * PMD mappings if THPs are disabled. As we already have a THP,
 	 * behave as if we are forcing a collapse.
 	 */
-	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
+	if (thp_disabled_by_hw() || vma_thp_disabled(vma,
 						     /* forced_collapse=*/ true))
 		return ret;
 
@@ -6280,7 +6280,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
 	struct mm_struct *mm = vma->vm_mm;
-	vm_flags_t vm_flags = vma->vm_flags;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	vm_fault_t ret;
@@ -6295,7 +6294,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 retry_pud:
 	if (pud_none(*vmf.pud) &&
-	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
+	    thp_vma_allowable_order(vma, TVA_PAGEFAULT, PUD_ORDER)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -6329,7 +6328,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		goto retry_pud;
 
 	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+	    thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4855eee22731..cc2c90656b66 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1780,7 +1780,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
 	vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
 	unsigned int global_orders;
 
-	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
+	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, shmem_huge_force)))
 		return 0;
 
 	global_orders = shmem_huge_global_enabled(inode, index, write_end,
-- 
2.47.3

From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net, 21cnbao@gmail.com, shakeel.butt@linux.dev, tj@kernel.org, lance.yang@linux.dev, rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Yafang Shao
Subject: [RFC PATCH v10 mm-new 3/9] mm: thp: add support for BPF based THP order selection
Date: Wed, 15 Oct 2025 22:17:10 +0800
Message-Id: <20251015141716.887-4-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References: <20251015141716.887-1-laoar.shao@gmail.com>
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook, bpf_hook_thp_get_orders(), allowing BPF
programs to influence THP order selection based on factors such as:

- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap, or
  other paths.
- VMA memory-advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE.
- Memory pressure
  PSI system data or associated cgroup PSI metrics.

The kernel API of this new BPF hook is as follows:

/**
 * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @type: TVA type for current @vma
 * @orders: Bitmask of available THP orders for this allocation
 *
 * Return: The suggested THP order for allocation from the BPF program. Must be
 * a valid, available order.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
			   enum tva_type type,
			   unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

This BPF hook enables the implementation of flexible THP allocation
policies at the system, per-cgroup, or per-task level.

This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes, including potential removal, in future kernel versions.
Signed-off-by: Yafang Shao
---
 MAINTAINERS              |   1 +
 fs/exec.c                |   1 +
 include/linux/huge_mm.h  |  40 +++++
 include/linux/mm_types.h |  18 +++
 kernel/fork.c            |   1 +
 mm/Kconfig               |  22 +++
 mm/Makefile              |   1 +
 mm/huge_memory_bpf.c     | 306 +++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                |   1 +
 9 files changed, 391 insertions(+)
 create mode 100644 mm/huge_memory_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ca8e3d18eedd..7be34b2a64fd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16257,6 +16257,7 @@ F: include/linux/huge_mm.h
 F: include/linux/khugepaged.h
 F: include/trace/events/huge_memory.h
 F: mm/huge_memory.c
+F: mm/huge_memory_bpf.c
 F: mm/khugepaged.c
 F: mm/mm_slot.h
 F: tools/testing/selftests/mm/khugepaged.c
diff --git a/fs/exec.c b/fs/exec.c
index dbac0e84cc3e..9500aafb7eb5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -890,6 +890,7 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
+	bpf_thp_retain_mm(mm, old_mm);
 	lru_gen_add_mm(mm);
 	task_unlock(tsk);
 	lru_gen_use_mm(mm);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a635dcbb2b99..5ecc95f35453 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -269,6 +269,41 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 enum tva_type type,
 					 unsigned long orders);
 
+#ifdef CONFIG_BPF_THP
+
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+			unsigned long orders);
+
+void bpf_thp_exit_mm(struct mm_struct *mm);
+void bpf_thp_retain_mm(struct mm_struct *mm, struct mm_struct *old_mm);
+void bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm);
+
+#else
+
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+			unsigned long orders)
+{
+	return orders;
+}
+
+static inline void bpf_thp_exit_mm(struct mm_struct *mm)
+{
+}
+
+static inline void
+bpf_thp_retain_mm(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+}
+
+static inline void
+bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+}
+
+#endif
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma: the vm area to check
@@ -290,6 +325,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 {
 	vm_flags_t vm_flags = vma->vm_flags;
 
+	/* The BPF-specified order overrides which order is selected. */
+	orders &= bpf_hook_thp_get_orders(vma, type, orders);
+	if (!orders)
+		return 0;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 394d50fd3c65..835fbfdf7657 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -33,6 +33,7 @@
 struct address_space;
 struct futex_private_hash;
 struct mem_cgroup;
+struct bpf_mm_ops;
 
 typedef struct {
 	unsigned long f;
@@ -976,6 +977,19 @@ struct mm_cid {
 };
 #endif
 
+#ifdef CONFIG_BPF_THP
+struct bpf_thp_ops;
+#endif
+
+#ifdef CONFIG_BPF_MM
+struct bpf_mm_ops {
+#ifdef CONFIG_BPF_THP
+	struct bpf_thp_ops __rcu *bpf_thp;
+	struct list_head bpf_thp_list;
+#endif
+};
+#endif
+
 /*
  * Opaque type representing current mm_struct flag state. Must be accessed via
  * mm_flags_xxx() helper functions.
@@ -1268,6 +1282,10 @@ struct mm_struct {
 #ifdef CONFIG_MM_ID
 	mm_id_t mm_id;
 #endif /* CONFIG_MM_ID */
+
+#ifdef CONFIG_BPF_MM
+	struct bpf_mm_ops bpf_mm;
+#endif
 } __randomize_layout;
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 157612fd669a..6b7d56ecb19a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1130,6 +1130,7 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	bpf_thp_exit_mm(mm);
 	exit_mmap(mm);
 	mm_put_huge_zero_folio(mm);
 	set_mm_exe_file(mm, NULL);
diff --git a/mm/Kconfig b/mm/Kconfig
index bde9f842a4a8..18a83c0cbb51 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1371,6 +1371,28 @@ config PT_RECLAIM
 config FIND_NORMAL_PAGE
 	def_bool n
 
+menuconfig BPF_MM
+	bool "BPF-based Memory Management (EXPERIMENTAL)"
+	depends on BPF_SYSCALL
+	help
+	  Enable BPF-based Memory Management Policy. This feature is currently
+	  experimental.
+
+	  WARNING: This feature is unstable and may change in future kernel
+	  releases.
+
+if BPF_MM
+config BPF_THP
+	bool "BPF-based THP Policy (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && BPF_MM
+	help
+	  Enable dynamic THP policy adjustment using BPF programs. This feature
+	  is currently experimental.
+
+	  WARNING: This feature is unstable and may change in future kernel
+	  releases.
+endif # BPF_MM
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..4efca1c8a919 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_THP) += huge_memory_bpf.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..24ab432cbbaa
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao
+ */
+
+#include
+#include
+#include
+#include
+
+/**
+ * @thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @type: TVA type for current @vma
+ * @orders: Bitmask of available THP orders for this allocation
+ *
+ * Return: The suggested THP order for allocation from the BPF program. Must be
+ * a valid, available order.
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+			   enum tva_type type,
+			   unsigned long orders);
+
+struct bpf_thp_mm_list {
+	struct list_head list;
+};
+
+struct bpf_thp_ops {
+	pid_t pid;	/* The pid to attach */
+	thp_order_fn_t *thp_get_order;
+
+	/* private */
+	/* The list of mm_structs this ops operates on */
+	struct bpf_thp_mm_list mm_list;
+};
+
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+void bpf_thp_exit_mm(struct mm_struct *mm)
+{
+	if (!rcu_access_pointer(mm->bpf_mm.bpf_thp))
+		return;
+
+	spin_lock(&thp_ops_lock);
+	if (!rcu_access_pointer(mm->bpf_mm.bpf_thp)) {
+		spin_unlock(&thp_ops_lock);
+		return;
+	}
+	list_del(&mm->bpf_mm.bpf_thp_list);
+	RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, NULL);
+	spin_unlock(&thp_ops_lock);
+}
+
+void bpf_thp_retain_mm(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+	struct bpf_thp_ops *bpf_thp;
+
+	if (!old_mm || !rcu_access_pointer(old_mm->bpf_mm.bpf_thp))
+		return;
+
+	spin_lock(&thp_ops_lock);
+	bpf_thp = rcu_dereference_protected(old_mm->bpf_mm.bpf_thp,
+					    lockdep_is_held(&thp_ops_lock));
+	if (!bpf_thp) {
+		spin_unlock(&thp_ops_lock);
+		return;
+	}
+
+	/* The new mm is still under initialization */
+	RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, bpf_thp);
+
+	/* The old mm is being destroyed */
+	RCU_INIT_POINTER(old_mm->bpf_mm.bpf_thp, NULL);
+	list_replace(&old_mm->bpf_mm.bpf_thp_list, &mm->bpf_mm.bpf_thp_list);
+	spin_unlock(&thp_ops_lock);
+}
+
+void bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm)
+{
+	struct bpf_thp_mm_list *mm_list;
+	struct bpf_thp_ops *bpf_thp;
+
+	if (!rcu_access_pointer(old_mm->bpf_mm.bpf_thp))
+		return;
+
+	spin_lock(&thp_ops_lock);
+	bpf_thp = rcu_dereference_protected(old_mm->bpf_mm.bpf_thp,
+					    lockdep_is_held(&thp_ops_lock));
+	if (!bpf_thp) {
+		spin_unlock(&thp_ops_lock);
+		return;
+	}
+
+	/* The new mm is still under initialization */
+	RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, bpf_thp);
+
+	mm_list = &bpf_thp->mm_list;
+	list_add_tail(&mm->bpf_mm.bpf_thp_list, &mm_list->list);
+	spin_unlock(&thp_ops_lock);
+}
+
+unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
+				      enum tva_type type,
+				      unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct bpf_thp_ops *bpf_thp;
+	int bpf_order;
+
+	if (!mm)
+		return orders;
+
+	rcu_read_lock();
+	bpf_thp = rcu_dereference(mm->bpf_mm.bpf_thp);
+	if (!bpf_thp || !bpf_thp->thp_get_order)
+		goto out;
+
+	bpf_order = bpf_thp->thp_get_order(vma, type, orders);
+	orders &= BIT(bpf_order);
+
+out:
+	rcu_read_unlock();
+	return orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+	.get_func_proto = bpf_thp_get_func_proto,
+	.is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+				const struct btf_member *member,
+				const struct bpf_prog *prog)
+{
+	/* The call site operates under RCU protection.
+	 */
+	if (prog->sleepable)
+		return -EINVAL;
+	return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	const struct bpf_thp_ops *ubpf_thp;
+	struct bpf_thp_ops *kbpf_thp;
+	u32 moff;
+
+	ubpf_thp = (const struct bpf_thp_ops *)udata;
+	kbpf_thp = (struct bpf_thp_ops *)kdata;
+
+	moff = __btf_member_bit_offset(t, member) / 8;
+	switch (moff) {
+	case offsetof(struct bpf_thp_ops, pid):
+		kbpf_thp->pid = ubpf_thp->pid;
+		return 1;
+	}
+	return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *bpf_thp = kdata;
+	struct bpf_thp_mm_list *mm_list;
+	struct task_struct *p;
+	struct mm_struct *mm;
+	int err = -EINVAL;
+	pid_t pid;
+
+	pid = bpf_thp->pid;
+	p = find_get_task_by_vpid(pid);
+	if (!p)
+		return -EINVAL;
+	if (p->flags & PF_EXITING) {
+		put_task_struct(p);
+		return -EINVAL;
+	}
+
+	mm = get_task_mm(p);
+	put_task_struct(p);
+	if (!mm)
+		return -EINVAL;
+
+	err = -EBUSY;
+	spin_lock(&thp_ops_lock);
+	if (rcu_access_pointer(mm->bpf_mm.bpf_thp))
+		goto out_lock;
+	err = 0;
+	rcu_assign_pointer(mm->bpf_mm.bpf_thp, bpf_thp);
+
+	mm_list = &bpf_thp->mm_list;
+	INIT_LIST_HEAD(&mm_list->list);
+	list_add_tail(&mm->bpf_mm.bpf_thp_list, &mm_list->list);
+out_lock:
+	spin_unlock(&thp_ops_lock);
+	mmput(mm);
+	return err;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *bpf_thp = kdata;
+	struct bpf_mm_ops *bpf_mm;
+	struct list_head *pos, *n;
+
+	spin_lock(&thp_ops_lock);
+	list_for_each_safe(pos, n, &bpf_thp->mm_list.list) {
+		bpf_mm = list_entry(pos, struct bpf_mm_ops, bpf_thp_list);
+		WARN_ON_ONCE(!bpf_mm);
+		rcu_replace_pointer(bpf_mm->bpf_thp, NULL, lockdep_is_held(&thp_ops_lock));
+		list_del(pos);
+	}
+	spin_unlock(&thp_ops_lock);
+
+	synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *old_bpf_thp = old_kdata;
+	struct bpf_thp_ops *bpf_thp = kdata;
+	struct bpf_mm_ops *bpf_mm;
+	struct list_head *pos, *n;
+
+	INIT_LIST_HEAD(&bpf_thp->mm_list.list);
+
+	spin_lock(&thp_ops_lock);
+	list_for_each_safe(pos, n, &old_bpf_thp->mm_list.list) {
+		bpf_mm = list_entry(pos, struct bpf_mm_ops, bpf_thp_list);
+		WARN_ON_ONCE(!bpf_mm);
+		rcu_replace_pointer(bpf_mm->bpf_thp, bpf_thp, lockdep_is_held(&thp_ops_lock));
+		list_del(pos);
+		list_add_tail(&bpf_mm->bpf_thp_list, &bpf_thp->mm_list.list);
+	}
+	spin_unlock(&thp_ops_lock);
+
+	synchronize_rcu();
+	return 0;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	if (!ops->thp_get_order) {
+		pr_err("bpf_thp: required ops isn't implemented\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int bpf_thp_get_order(struct vm_area_struct *vma,
+			     enum tva_type type,
+			     unsigned long orders)
+{
+	return -1;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+	.thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+	.verifier_ops = &thp_bpf_verifier_ops,
+	.init = bpf_thp_init,
+	.check_member = bpf_thp_check_member,
+	.init_member = bpf_thp_init_member,
+	.reg = bpf_thp_reg,
+	.unreg = bpf_thp_unreg,
+	.update = bpf_thp_update,
+	.validate = bpf_thp_validate,
+	.cfi_stubs = &__bpf_thp_ops,
+	.owner = THIS_MODULE,
+	.name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+	int err;
+
+	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+	if (err)
+		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+	return err;
+}
+late_initcall(bpf_thp_ops_init);
diff --git a/mm/mmap.c b/mm/mmap.c
index 5fd3b80fda1d..8ac7d3046a33 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1844,6 +1844,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	vma_iter_free(&vmi);
 	if (!retval) {
 		mt_set_in_rcu(vmi.mas.tree);
+		bpf_thp_fork(mm, oldmm);
 		ksm_fork(mm, oldmm);
 		khugepaged_fork(mm, oldmm);
 	}
 else {

-- 
2.47.3
From: Yafang Shao
Subject: [RFC PATCH v10 mm-new 4/9] mm: thp: decouple THP allocation between swap and page fault paths
Date: Wed, 15 Oct 2025 22:17:11 +0800
Message-Id: <20251015141716.887-5-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>

The new BPF capability enables finer-grained THP policy decisions by
introducing separate handling for swap faults versus normal page faults.
As highlighted by Barry:

  We've observed that swapping in large folios can lead to more swap
  thrashing for some workloads, e.g. kernel builds. Consequently, some
  workloads might prefer swapping in smaller folios than those allocated
  by alloc_anon_folio().

While prctl() could potentially be extended to leverage this new policy,
doing so would require modifications to the uAPI.

Signed-off-by: Yafang Shao
Reviewed-by: Lorenzo Stoakes
Acked-by: Usama Arif
Cc: Barry Song <21cnbao@gmail.com>
---
 include/linux/huge_mm.h | 3 ++-
 mm/huge_memory.c        | 2 +-
 mm/memory.c             | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5ecc95f35453..9e4088ae0a32 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,9 +96,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 
 enum tva_type {
 	TVA_SMAPS,		/* Exposing "THPeligible:" in smaps. */
-	TVA_PAGEFAULT,		/* Serving a page fault. */
+	TVA_PAGEFAULT,		/* Serving a non-swap page fault. */
 	TVA_KHUGEPAGED,		/* Khugepaged collapse. */
 	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
+	TVA_SWAP_PAGEFAULT,	/* Serving a swap page fault.
 */
 };
 
 #define thp_vma_allowable_order(vma, type, order) \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1ac476fe6dc5..08372dfcb41a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -102,7 +102,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long orders)
 {
 	const bool smaps = type == TVA_SMAPS;
-	const bool in_pf = type == TVA_PAGEFAULT;
+	const bool in_pf = (type == TVA_PAGEFAULT || type == TVA_SWAP_PAGEFAULT);
 	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
 	unsigned long supported_orders;
 	vm_flags_t vm_flags = vma->vm_flags;
diff --git a/mm/memory.c b/mm/memory.c
index cd04e4894725..58ea0f93f79e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4558,7 +4558,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
-	orders = thp_vma_allowable_orders(vma, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, TVA_SWAP_PAGEFAULT,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),

-- 
2.47.3
From: Yafang Shao
Subject: [RFC PATCH v10 mm-new 5/9] mm: thp: enable THP allocation exclusively through khugepaged
Date: Wed, 15 Oct 2025 22:17:12 +0800
Message-Id: <20251015141716.887-6-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>

khugepaged_enter_vma() ultimately invokes any attached BPF function with
the TVA_KHUGEPAGED flag set when determining whether or not to enable
khugepaged THP for a freshly faulted-in VMA.

Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
invoked by create_huge_pmd(), and only when we have already checked to
see if an allowable TVA_PAGEFAULT order is specified. Since we might want
to disallow THP on fault-in but allow it via khugepaged, we move things
around so that we always attempt to enter khugepaged upon fault.

This change is safe because:

- khugepaged operates at the MM level rather than per-VMA. Since a THP
  allocation might fail during a page fault due to transient conditions
  (e.g., memory pressure), it is safe to add this MM to khugepaged for
  subsequent defragmentation.

- If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then
  __thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0.

While we could also extend prctl() to utilize this new policy, such a
change would require a uAPI modification to PR_SET_THP_DISABLE.
Signed-off-by: Yafang Shao
Acked-by: Lance Yang
Cc: Usama Arif
---
 mm/huge_memory.c |  1 -
 mm/memory.c      | 13 ++++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08372dfcb41a..2b155a734c78 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,7 +1346,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
-	khugepaged_enter_vma(vma);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 	    !mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/memory.c b/mm/memory.c
index 58ea0f93f79e..64f91191ffff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6327,11 +6327,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (pud_trans_unstable(vmf.pud))
 		goto retry_pud;
 
-	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
-		ret = create_huge_pmd(&vmf);
-		if (!(ret & VM_FAULT_FALLBACK))
-			return ret;
+	if (pmd_none(*vmf.pmd)) {
+		if (vma_is_anonymous(vma))
+			khugepaged_enter_vma(vma);
+		if (thp_vma_allowable_order(vma, TVA_PAGEFAULT, PMD_ORDER)) {
+			ret = create_huge_pmd(&vmf);
+			if (!(ret & VM_FAULT_FALLBACK))
+				return ret;
+		}
 	} else {
 		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);

-- 
2.47.3
From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org,
    daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com,
    rientjes@google.com, corbet@lwn.net, 21cnbao@gmail.com,
    shakeel.butt@linux.dev, tj@kernel.org, lance.yang@linux.dev,
    rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, Yafang Shao
Subject: [RFC PATCH v10 mm-new 6/9] bpf: mark mm->owner as __safe_rcu_or_null
Date: Wed, 15 Oct 2025 22:17:13 +0800
Message-Id: <20251015141716.887-7-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References: <20251015141716.887-1-laoar.shao@gmail.com>

When CONFIG_MEMCG is enabled, mm->owner can be accessed under RCU, and
the owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm, and then make
policy decisions based on the task's attributes.
The typical use case is as follows:

	bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
	@owner = @mm->owner; // mm_struct::owner is rcu trusted or null
	if (!@owner)
		goto out;
	/* Do something based on the task attribute */
out:
	bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko
Signed-off-by: Yafang Shao
Acked-by: Lorenzo Stoakes
---
 kernel/bpf/verifier.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c4f69a9e9af6..d400e18ee31e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7123,6 +7123,9 @@ BTF_TYPE_SAFE_RCU(struct cgroup_subsys_state) {
 /* RCU trusted: these fields are trusted in RCU CS and can be NULL */
 BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
 	struct file __rcu *exe_file;
+#ifdef CONFIG_MEMCG
+	struct task_struct __rcu *owner;
+#endif
 };
 
 /* skb->sk, req->sk are not RCU protected, but we mark them as such
-- 
2.47.3

From nobody Fri Dec 19 20:50:49 2025
From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org,
    daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com,
    rientjes@google.com,
    corbet@lwn.net, 21cnbao@gmail.com, shakeel.butt@linux.dev,
    tj@kernel.org, lance.yang@linux.dev, rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, Yafang Shao
Subject: [RFC PATCH v10 mm-new 7/9] bpf: mark vma->vm_mm as __safe_trusted_or_null
Date: Wed, 15 Oct 2025 22:17:14 +0800
Message-Id: <20251015141716.887-8-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References: <20251015141716.887-1-laoar.shao@gmail.com>

The vma->vm_mm pointer might be NULL, and it can be accessed outside of
RCU, so we can mark it as trusted_or_null. With this change, BPF helpers
can safely access vma->vm_mm to retrieve the associated mm_struct from
the VMA, and then make policy decisions based on the VMA.

The "trusted" annotation enables direct access to vma->vm_mm within
kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as
bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null"
enforcement requires all callsites using vma->vm_mm to perform NULL
checks.

The lsm selftest must be modified because it directly accesses
vma->vm_mm without a NULL pointer check; otherwise it would break due to
this change.
For the VMA-based THP policy, the use case is as follows:

	@mm = @vma->vm_mm;   // vm_area_struct::vm_mm is trusted or null
	if (!@mm)
		return;
	bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
	@owner = @mm->owner; // mm_struct::owner is rcu trusted or null
	if (!@owner)
		goto out;
	@cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
	/* make the decision based on the @cgroup1 attribute */
	bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
	bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to
inform policy decisions. Since upstream PSI support is currently limited
to cgroup v2, the following example demonstrates a cgroup v2
implementation:

	@owner = @mm->owner;
	if (@owner) {
		/* @ancestor_cgid is user-configured */
		@ancestor = bpf_cgroup_from_id(@ancestor_cgid);
		if (bpf_task_under_cgroup(@owner, @ancestor)) {
			@psi_group = @ancestor->psi;
			/* Extract PSI metrics from @psi_group and
			 * implement policy logic based on the values
			 */
		}
	}

Signed-off-by: Yafang Shao
Acked-by: Lorenzo Stoakes
Cc: "Liam R. Howlett"
---
 kernel/bpf/verifier.c                   | 5 +++++
 tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d400e18ee31e..b708b98f796c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
+	struct mm_struct *vm_mm;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
index 0c13b7409947..7de173daf27b 100644
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
 int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	     unsigned long reqprot, unsigned long prot, int ret)
 {
-	if (ret != 0)
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (ret != 0 || !mm)
 		return ret;
 
 	__s32 pid = bpf_get_current_pid_tgid() >> 32;
 	int is_stack = 0;
 
-	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
-		    vma->vm_end >= vma->vm_mm->start_stack);
+	is_stack = (vma->vm_start <= mm->start_stack &&
+		    vma->vm_end >= mm->start_stack);
 
 	if (is_stack && monitored_pid == pid) {
 		mprotect_count++;
-- 
2.47.3

From nobody Fri Dec 19 20:50:49 2025
From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org,
    daniel@iogearbox.net, andrii@kernel.org, ameryhung@gmail.com,
    rientjes@google.com, corbet@lwn.net, 21cnbao@gmail.com,
    shakeel.butt@linux.dev, tj@kernel.org, lance.yang@linux.dev,
    rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, Yafang Shao
Subject: [RFC PATCH v10 mm-new 8/9] selftests/bpf: add a simple BPF based THP policy
Date: Wed, 15 Oct 2025 22:17:15 +0800
Message-Id: <20251015141716.887-9-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References: <20251015141716.887-1-laoar.shao@gmail.com>

This test case implements a basic THP policy that sets THPeligible to 1
for a specific task and to 0 for all others. I selected THPeligible for
verification because its straightforward nature makes it ideal for
validating the BPF THP policy functionality.
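The verification step above reads the THPeligible field of the VMA covering the test mapping from /proc/<pid>/smaps. As a hedged sketch (not part of the patch), the same parsing logic can be written against an smaps-formatted string so it runs without /proc; the helper name `thp_eligible_for` is illustrative, not the selftest's:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the THPeligible value for the VMA containing `addr`, parsed
 * from smaps-formatted text; -1 if no covering VMA/field is found. */
int thp_eligible_for(const char *smaps, unsigned long addr)
{
	unsigned long start, end;
	int in_vma = 0, eligible = -1;
	char line[256];
	const char *p = smaps;

	while (*p) {
		const char *nl = strchr(p, '\n');
		size_t len = nl ? (size_t)(nl - p) : strlen(p);

		if (len >= sizeof(line))
			len = sizeof(line) - 1;
		memcpy(line, p, len);
		line[len] = '\0';

		/* A "start-end perms ..." header line begins a new VMA. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			in_vma = (addr >= start && addr < end);
		else if (in_vma && sscanf(line, "THPeligible: %d", &eligible) == 1)
			break;

		if (!nl)
			break;
		p = nl + 1;
	}
	return eligible;
}
```

The real selftest additionally exploits that smaps VMAs are sorted by address to stop scanning early.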
Below configs must be enabled for this test:

	CONFIG_BPF_THP=y
	CONFIG_MEMCG=y
	CONFIG_TRANSPARENT_HUGEPAGE=y

Signed-off-by: Yafang Shao
---
 MAINTAINERS                                        |   2 +
 tools/testing/selftests/bpf/config                 |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c          | 245 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c          |  23 ++
 4 files changed, 273 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7be34b2a64fd..c1219bcd27c1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16260,6 +16260,8 @@ F:	mm/huge_memory.c
 F:	mm/huge_memory_bpf.c
 F:	mm/khugepaged.c
 F:	mm/mm_slot.h
+F:	tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+F:	tools/testing/selftests/bpf/progs/test_thp_adjust*
 F:	tools/testing/selftests/mm/khugepaged.c
 F:	tools/testing/selftests/mm/split_huge_page_test.c
 F:	tools/testing/selftests/mm/transhuge-stress.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 8916ab814a3e..13711f773091 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -9,6 +9,7 @@ CONFIG_BPF_LIRC_MODE2=y
 CONFIG_BPF_LSM=y
 CONFIG_BPF_STREAM_PARSER=y
 CONFIG_BPF_SYSCALL=y
+CONFIG_BPF_THP=y
 # CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
 CONFIG_CGROUP_BPF=y
 CONFIG_CRYPTO_HMAC=y
@@ -51,6 +52,7 @@ CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LWTUNNEL=y
+CONFIG_MEMCG=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
@@ -114,6 +116,7 @@ CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
 CONFIG_SYN_COOKIES=y
 CONFIG_TEST_BPF=m
+CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_UDMABUF=y
 CONFIG_USERFAULTFD=y
 CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..b69f51948666
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <sys/mman.h>
+#include "test_thp_adjust.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
+#define PMD_SIZE_FILE "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+
+static struct test_thp_adjust *skel;
+static char old_mode[32];
+static long pagesize;
+
+static int thp_mode_save(void)
+{
+	const char *start, *end;
+	char buf[128];
+	int fd, err;
+	size_t len;
+
+	fd = open(THP_ENABLED_FILE, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	err = read(fd, buf, sizeof(buf) - 1);
+	if (err == -1)
+		goto close;
+
+	start = strchr(buf, '[');
+	end = start ? strchr(start, ']') : NULL;
+	if (!start || !end || end <= start) {
+		err = -1;
+		goto close;
+	}
+
+	len = end - start - 1;
+	if (len >= sizeof(old_mode))
+		len = sizeof(old_mode) - 1;
+	strncpy(old_mode, start + 1, len);
+	old_mode[len] = '\0';
+
+close:
+	close(fd);
+	return err;
+}
+
+static int thp_mode_set(const char *desired_mode)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_RDWR);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, desired_mode, strlen(desired_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_mode_reset(void)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, old_mode, strlen(old_mode));
+	close(fd);
+	return err;
+}
+
+static char *thp_alloc(void)
+{
+	char *addr;
+	int err, i;
+
+	addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (addr == MAP_FAILED)
+		return NULL;
+
+	err = madvise(addr, LEN, MADV_HUGEPAGE);
+	if (err == -1)
+		goto unmap;
+
+	/* Accessing a single byte within a page is sufficient to trigger a page fault. */
+	for (i = 0; i < LEN; i += pagesize)
+		addr[i] = 1;
+	return addr;
+
+unmap:
+	munmap(addr, LEN);
+	return NULL;
+}
+
+static void thp_free(char *ptr)
+{
+	munmap(ptr, LEN);
+}
+
+static int get_pmd_order(void)
+{
+	ssize_t bytes_read, size;
+	int fd, order, ret = -1;
+	char buf[64], *endptr;
+
+	fd = open(PMD_SIZE_FILE, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	bytes_read = read(fd, buf, sizeof(buf) - 1);
+	if (bytes_read <= 0)
+		goto close_fd;
+
+	/* Remove potential newline character */
+	if (buf[bytes_read - 1] == '\n')
+		buf[bytes_read - 1] = '\0';
+
+	size = strtoul(buf, &endptr, 10);
+	if (endptr == buf || *endptr != '\0')
+		goto close_fd;
+	if (size % pagesize != 0)
+		goto close_fd;
+	ret = size / pagesize;
+	if ((ret & (ret - 1)) == 0) {
+		order = 0;
+		while (ret > 1) {
+			ret >>= 1;
+			order++;
+		}
+		ret = order;
+	}
+
+close_fd:
+	close(fd);
+	return ret;
+}
+
+static int get_thp_eligible(pid_t pid, unsigned long addr)
+{
+	int this_vma = 0, eligible = -1;
+	unsigned long start, end;
+	char smaps_path[64];
+	FILE *smaps_file;
+	char line[4096];
+
+	snprintf(smaps_path, sizeof(smaps_path), "/proc/%d/smaps", pid);
+	smaps_file = fopen(smaps_path, "r");
+	if (!smaps_file)
+		return -1;
+
+	while (fgets(line, sizeof(line), smaps_file)) {
+		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
+			/* addr is monotonic */
+			if (addr < start)
+				break;
+			this_vma = (addr >= start && addr < end) ? 1 : 0;
+			continue;
+		}
+
+		if (!this_vma)
+			continue;
+
+		if (strstr(line, "THPeligible:")) {
+			sscanf(line, "THPeligible: %d", &eligible);
+			break;
+		}
+	}
+
+	fclose(smaps_file);
+	return eligible;
+}
+
+static void subtest_thp_eligible(void)
+{
+	struct bpf_link *ops_link;
+	int eligible;
+	char *ptr;
+
+	ops_link = bpf_map__attach_struct_ops(skel->maps.thp_eligible_ops);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		return;
+
+	ptr = thp_alloc();
+	if (!ASSERT_OK_PTR(ptr, "THP alloc"))
+		goto detach;
+
+	eligible = get_thp_eligible(getpid(), (unsigned long)ptr);
+	ASSERT_EQ(eligible, 1, "THPeligible");
+
+	thp_free(ptr);
+detach:
+	bpf_link__destroy(ops_link);
+}
+
+static int thp_adjust_setup(void)
+{
+	int err = -1, pmd_order;
+
+	pagesize = sysconf(_SC_PAGESIZE);
+	pmd_order = get_pmd_order();
+	if (!ASSERT_NEQ(pmd_order, -1, "get_pmd_order"))
+		return -1;
+
+	if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+		return -1;
+	if (!ASSERT_GE(thp_mode_set("madvise"), 0, "THP mode set"))
+		return -1;
+
+	skel = test_thp_adjust__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		goto thp_reset;
+
+	skel->bss->pmd_order = pmd_order;
+	skel->struct_ops.thp_eligible_ops->pid = getpid();
+
+	err = test_thp_adjust__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto destroy;
+	return 0;
+
+destroy:
+	test_thp_adjust__destroy(skel);
+thp_reset:
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+	return err;
+}
+
+static void thp_adjust_destroy(void)
+{
+	test_thp_adjust__destroy(skel);
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+}
+
+void test_thp_adjust(void)
+{
+	if (thp_adjust_setup() == -1)
+		return;
+
+	if (test__start_subtest("thp_eligible"))
+		subtest_thp_eligible();
+
+	thp_adjust_destroy();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..bc062d7feed4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int pmd_order;
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(thp_eligible, struct vm_area_struct *vma, enum tva_type type,
+	     unsigned long orders)
+{
+	if (type != TVA_SMAPS)
+		return 0;
+	return pmd_order;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_eligible_ops = {
+	.thp_get_order = (void *)thp_eligible,
+};
-- 
2.47.3

From nobody Fri Dec 19 20:50:49 2025
From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
	dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
	gutierrez.asier@huawei-partners.com, willy@infradead.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	ameryhung@gmail.com, rientjes@google.com, corbet@lwn.net,
	21cnbao@gmail.com, shakeel.butt@linux.dev, tj@kernel.org,
	lance.yang@linux.dev, rdunlap@infradead.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Yafang Shao <laoar.shao@gmail.com>
Subject: [RFC PATCH v10 mm-new 9/9] Documentation: add BPF-based THP policy management
Date: Wed, 15 Oct 2025 22:17:16 +0800
Message-Id: <20251015141716.887-10-laoar.shao@gmail.com>
In-Reply-To: <20251015141716.887-1-laoar.shao@gmail.com>
References:
<20251015141716.887-1-laoar.shao@gmail.com>

Add the documentation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 39 ++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 1654211cc6cf..f6991c674329 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -738,3 +738,42 @@ support enabled just fine as always. No difference can be noted in hugetlbfs
 other than there will be less overall fragmentation. All usual features
 belonging to hugetlbfs are preserved and unaffected. libhugetlbfs will also
 work fine as usual.
+
+BPF THP
+=======
+
+Overview
+--------
+
+When the system is configured with "always" or "madvise" THP mode, a BPF program
+can be used to adjust THP allocation policies dynamically. This enables
+fine-grained control over THP decisions based on various factors including
+workload identity, allocation context, and system memory pressure.
+
+Program Interface
+-----------------
+
+This feature implements a struct_ops BPF program with the following interface::
+
+  int thp_get_order(struct vm_area_struct *vma,
+                    enum tva_type type,
+                    unsigned long orders);
+
+Parameters::
+
+  @vma: vm_area_struct associated with the THP allocation
+  @type: TVA type for current @vma
+  @orders: Bitmask of available THP orders for this allocation
+
+Return value::
+
+  The suggested THP order for allocation from the BPF program. Must be
+  a valid, available order.
+
+Implementation Notes
+--------------------
+
+This is currently an experimental feature. CONFIG_BPF_THP (EXPERIMENTAL) must be
+enabled to use it. Only one BPF program can be attached at a time, but the
+program can be updated dynamically to adjust policies without requiring affected
+tasks to be restarted.
-- 
2.47.3