From nobody Mon Oct 6 03:13:54 2025
From: Usama Arif
To: Andrew Morton, david@redhat.com, linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org, corbet@lwn.net, rppt@kernel.org,
 surenb@google.com, mhocko@suse.com, hannes@cmpxchg.org, baohua@kernel.org,
 shakeel.butt@linux.dev, riel@surriel.com, ziy@nvidia.com,
 laoar.shao@gmail.com, dev.jain@arm.com, baolin.wang@linux.alibaba.com,
 npache@redhat.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 ryan.roberts@arm.com, vbabka@suse.cz, jannh@google.com, Arnd Bergmann,
 sj@kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 kernel-team@meta.com, Usama Arif, Matthew Wilcox
Subject: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Date: Fri, 25 Jul 2025 17:22:40 +0100
Message-ID: <20250725162258.1043176-2-usamaarif642@gmail.com>
In-Reply-To: <20250725162258.1043176-1-usamaarif642@gmail.com>
References: <20250725162258.1043176-1-usamaarif642@gmail.com>

From: David Hildenbrand

People want to make use of more THPs, for example, moving from the
"never" system policy to "madvise", or from "madvise" to "always". While
this is great news for every THP desperately waiting to get allocated
out there, apparently there are some workloads that require a bit of
care during that transition: individual processes may need to opt out of
this behavior for various reasons, and this should be possible without
requiring all other workloads on the system to similarly opt out.

The following scenarios are imaginable:

(1) Switch from the "never" system policy to "madvise"/"always", but keep
    THPs disabled for selected workloads.

(2) Stay at the "never" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from the "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at the "madvise" system policy, but enable THPs even when not
    advised for selected workloads -- the "always" policy.

One can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't
want THPs. It requires configuring all workloads, but that is a
user-space problem to sort out. (4) can be emulated through (3) in a
similar way.
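
(Editor's illustration, not part of this patch: with the existing
interface, a launcher emulating (2) via (1) could disable THPs completely
before exec'ing the workload. A minimal sketch; the fallback define is
only needed with older headers.)

	#include <stdio.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_SET_THP_DISABLE
	#define PR_SET_THP_DISABLE 41
	#endif

	/*
	 * Hypothetical launcher: with the system policy at "madvise" or
	 * "always", disable THPs completely for this process; the setting
	 * is inherited across fork()+execve(), so the workload runs
	 * without THPs.
	 */
	int main(int argc, char *argv[])
	{
		if (argc < 2) {
			fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
			return 1;
		}
		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
			perror("prctl");
		execvp(argv[1], &argv[1]);
		perror("execvp");
		return 1;
	}
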
Back when (1) first became relevant, as people started enabling THPs, we
added PR_SET_THP_DISABLE, so that workloads which were not ready yet
(e.g., Redis) could simply disable THPs completely. Redis still
implements the option to use this interface to disable THPs completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including its fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all
workloads that are not ready for THPs yet, while still enabling THPs
system-wide.

The quest for handling (3) and (4) started, but the current approaches
(a completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about the THP sizes to use, but this will certainly take a while as that
work has just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes could no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
imposes a problem for relevant workloads, because "no THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE to mean "disable THPs unless
explicitly advised by the app through MADV_HUGEPAGE"? *Maybe*, but this
would change the documented semantics quite a bit, as well as the
versatility of using it for debugging purposes, so I am not 100% sure
that is what we want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *fewer* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3 of
    prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0):
    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling. Previously,
    it returned 1 if THPs were disabled completely. Now it returns the
    set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly. Fortunately, there are only two instances
    outside of the prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THPs except for
    VMAs with VM_HUGEPAGE" -- essentially "thp=madvise" behavior.
    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely, not just partially.

For now, we don't add another interface to query whether THPs are
disabled only partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If ever
required, we could add a new entry.

The documented semantics of PR_SET_THP_DISABLE in the man page -- "is
inherited by a child created via fork(2) and is preserved across
execve(2)" -- are maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd, where we fork() a helper process that then exec()s).
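
(Editor's illustration, not part of this patch: with the extension
described in (A) and (B), opting into "THPs only when advised" and
reading the state back would look roughly like the following; the
fallback defines are only needed until headers catch up.)

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_THP_DISABLE
	#define PR_SET_THP_DISABLE 41
	#define PR_GET_THP_DISABLE 42
	#endif
	#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
	#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
	#endif

	int main(void)
	{
		/* Disable THPs, except in regions where they were advised. */
		if (prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0)) {
			perror("PR_SET_THP_DISABLE");
			return 1;
		}

		/*
		 * Returns 0 (not disabled), 1 (disabled completely), or
		 * 1 | PR_THP_DISABLE_EXCEPT_ADVISED == 3 (disabled except
		 * when advised).
		 */
		printf("PR_GET_THP_DISABLE: %d\n",
		       prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
		return 0;
	}
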
For now, MADV_COLLAPSE will *fail* in regions that have neither
VM_HUGEPAGE nor VM_NOHUGEPAGE. As MADV_COLLAPSE is clear advice that
user space thinks a THP is a good idea, we'll enable that separately
next (it requires a bit of cleanup first).

There is currently no way to prevent a process from issuing
PR_SET_THP_DISABLE itself to re-enable THPs. There are no real known
users for re-enabling them, and it's against the purpose of the original
interface. So if ever required, we could investigate forbidding
re-enabling, or make this somehow configurable.

Acked-by: Usama Arif
Tested-by: Usama Arif
Cc: Jonathan Corbet
Cc: Andrew Morton
Cc: Lorenzo Stoakes
Cc: Zi Yan
Cc: Baolin Wang
Cc: "Liam R. Howlett"
Cc: Nico Pache
Cc: Ryan Roberts
Cc: Dev Jain
Cc: Barry Song
Cc: Vlastimil Babka
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Michal Hocko
Cc: Usama Arif
Cc: SeongJae Park
Cc: Jann Horn
Cc: Liam R. Howlett
Cc: Yafang Shao
Cc: Matthew Wilcox
Signed-off-by: David Hildenbrand
Reviewed-by: Lorenzo Stoakes
---

At first, I thought "why not simply relax PR_SET_THP_DISABLE", but I
think there might be real use cases where we want to disable any THPs --
in particular around debugging THP-related problems, and "never" no
longer meaning "never" ever since we added MADV_COLLAPSE.
PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
helpful for debugging purposes.

Of course, I thought of having a system-wide config option to modify the
PR_SET_THP_DISABLE behavior, but I just don't like the semantics.

"prctl: allow overriding system THP policy to always"[1] proposed
"overriding policies to always", which is just the wrong way around: we
should not add mechanisms to "enable more" when we already have an
interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
weird otherwise.

"[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed setting
the default of VM_HUGEPAGE, which I now think is similarly the wrong way
around.

The ideas explored by Lorenzo to extend process_madvise()[3] and
mctrl()[4] similarly revolved around the "default for VM_HUGEPAGE" idea,
but after the discussion, I think we should better leave VM_HUGEPAGE
untouched.

Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED",
where we essentially want to say "leave advised regions alone" -- "keep
THPs enabled for advised regions".

The only thing I really dislike about this is using another MMF_* flag,
but well, there is no way around it -- and it seems like we could easily
support more than 32 if we want to (most users already treat it like a
proper bitmap).

I think this here (modifying an existing toggle) is the only prctl()
extension that we might be willing to accept. In general, I agree with
most others that prctl() is a very bad interface for this -- but
PR_SET_THP_DISABLE is already there and is getting used.

Long-term, I think the answer will be something based on bpf[5]. Maybe
in that context, there could still be value in easily disabling THPs for
selected workloads (esp. for debugging purposes).

Jann raised valid concerns[6] about new flags that are persistent across
exec. As this is a relaxation of the existing PR_SET_THP_DISABLE, I
consider it to carry a similar security risk as the existing
PR_SET_THP_DISABLE, but the devil is in the detail.
[1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
[2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
[3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
[4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
[5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
[6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com

Signed-off-by: David Hildenbrand
---
 Documentation/filesystems/proc.rst |  5 +--
 fs/proc/array.c                    |  2 +-
 include/linux/huge_mm.h            | 20 ++++++++---
 include/linux/mm_types.h           | 13 +++----
 include/uapi/linux/prctl.h         | 10 ++++++
 kernel/sys.c                       | 58 +++++++++++++++++++++++-------
 mm/khugepaged.c                    |  2 +-
 7 files changed, 81 insertions(+), 29 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b7235..915a3e44bc12 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
 HugetlbPages                size of hugetlb memory portions
 CoreDumping                 process's memory is currently being dumped
                             (killing the process may lead to a corrupted core)
- THP_enabled                process is allowed to use THP (returns 0 when
-                            PR_SET_THP_DISABLE is set on the process
+ THP_enabled                process is allowed to use THP (returns 0 when
+                            PR_SET_THP_DISABLE is set on the process to disable
+                            THP completely, not just partially)
 Threads                     number of threads
 SigQ                        number of signals queued/max. number for queue
 SigPnd                      bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..c4f91a784104 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
        bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
 
        if (thp_enabled)
-               thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+               thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
        seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..71db243a002e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -318,16 +318,26 @@ struct thpsize {
        (transparent_hugepage_flags &                                   \
         (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
                vm_flags_t vm_flags)
 {
+       /* Are THPs disabled for this VMA? */
+       if (vm_flags & VM_NOHUGEPAGE)
+               return true;
+       /* Are THPs disabled for all VMAs in the whole process? */
+       if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
+               return true;
        /*
-        * Explicitly disabled through madvise or prctl, or some
-        * architectures may disable THP for some mappings, for
-        * example, s390 kvm.
+        * Are THPs disabled only for VMAs where we didn't get an explicit
+        * advise to use them?
         */
-       return (vm_flags & VM_NOHUGEPAGE) ||
-              test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+       if (vm_flags & VM_HUGEPAGE)
+               return false;
+       return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
 }
 
 static inline bool thp_disabled_by_hw(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ec273b06691..123fefaa4b98 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1743,19 +1743,16 @@ enum {
 #define MMF_VM_MERGEABLE       16      /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE        17      /* set when mm is available for khugepaged */
 
-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED 18      /* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_PAGE     18      /* mm has ever used the global huge zero page */
 
 #define MMF_HAS_UPROBES        19      /* has uprobes */
 #define MMF_RECALC_UPROBES     20      /* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP           21      /* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE           22      /* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE     23      /* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP        24      /* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK   (1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_THP_EXCEPT_ADVISED 23 /* no THP except when advised (e.g., VM_HUGEPAGE) */
+#define MMF_DISABLE_THP_COMPLETELY     24 /* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK   ((1 << MMF_DISABLE_THP_COMPLETELY) |\
+                                (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
 #define MMF_OOM_REAP_QUEUED    25      /* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS       26      /* mm is shared between processes */
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559..9c1d6e49b8a9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -177,7 +177,17 @@ struct prctl_mm_map {
 
 #define PR_GET_TID_ADDRESS     40
 
+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
+ * return "1" when no flags were specified for PR_SET_THP_DISABLE.
+ */
 #define PR_SET_THP_DISABLE     41
+/*
+ * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
+ * VM_HUGEPAGE).
+ */
+# define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
 #define PR_GET_THP_DISABLE     42
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index b153fb345ada..b87d0acaab0b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
        return sizeof(mm->saved_auxv);
 }
 
+static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
+                                unsigned long arg4, unsigned long arg5)
+{
+       unsigned long *mm_flags = &current->mm->flags;
+
+       if (arg2 || arg3 || arg4 || arg5)
+               return -EINVAL;
+
+       if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
+               return 1;
+       else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
+               return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
+       return 0;
+}
+
+static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
+                                unsigned long arg4, unsigned long arg5)
+{
+       unsigned long *mm_flags = &current->mm->flags;
+
+       if (arg4 || arg5)
+               return -EINVAL;
+
+       /* Flags are only allowed when disabling. */
+       if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+               return -EINVAL;
+       if (mmap_write_lock_killable(current->mm))
+               return -EINTR;
+       if (thp_disable) {
+               if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
+                       clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+                       set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+               } else {
+                       set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+                       clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+               }
+       } else {
+               clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+               clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+       }
+       mmap_write_unlock(current->mm);
+       return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
                unsigned long, arg4, unsigned long, arg5)
 {
@@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
                        return -EINVAL;
                return task_no_new_privs(current) ? 1 : 0;
        case PR_GET_THP_DISABLE:
-               if (arg2 || arg3 || arg4 || arg5)
-                       return -EINVAL;
-               error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
+               error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
                break;
        case PR_SET_THP_DISABLE:
-               if (arg3 || arg4 || arg5)
-                       return -EINVAL;
-               if (mmap_write_lock_killable(me->mm))
-                       return -EINTR;
-               if (arg2)
-                       set_bit(MMF_DISABLE_THP, &me->mm->flags);
-               else
-                       clear_bit(MMF_DISABLE_THP, &me->mm->flags);
-               mmap_write_unlock(me->mm);
+               error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
                break;
        case PR_MPX_ENABLE_MANAGEMENT:
        case PR_MPX_DISABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ff0c7dd2be4..2c9008246785 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -410,7 +410,7 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
 {
        return hpage_collapse_test_exit(mm) ||
-              test_bit(MMF_DISABLE_THP, &mm->flags);
+              test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
 }
 
 static bool hugepage_pmd_enabled(void)
-- 
2.47.3

From nobody Mon Oct 6 03:13:54 2025
From: Usama Arif
Subject: [PATCH 2/5] mm/huge_memory: convert "tva_flags" to
"enum tva_type" for thp_vma_allowable_order*() Date: Fri, 25 Jul 2025 17:22:41 +0100 Message-ID: <20250725162258.1043176-3-usamaarif642@gmail.com> X-Mailer: git-send-email 2.47.3 In-Reply-To: <20250725162258.1043176-1-usamaarif642@gmail.com> References: <20250725162258.1043176-1-usamaarif642@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: David Hildenbrand Describing the context through a type is much clearer, and good enough for our case. We have: * smaps handling for showing "THPeligible" * Pagefault handling * khugepaged handling * Forced collapse handling: primarily MADV_COLLAPSE, but one other odd case Really, we want to ignore sysfs only when we are forcing a collapse through MADV_COLLAPSE, otherwise we want to enforce. With this change, we immediately know if we are in the forced collapse case, which will be valuable next. Signed-off-by: David Hildenbrand Acked-by: Usama Arif --- fs/proc/task_mmu.c | 4 ++-- include/linux/huge_mm.h | 30 ++++++++++++++++++------------ mm/huge_memory.c | 8 ++++---- mm/khugepaged.c | 18 +++++++++--------- mm/memory.c | 14 ++++++-------- 5 files changed, 39 insertions(+), 35 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3d6d8a9f13fc..d440df7b3d59 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); =20 seq_printf(m, "THPeligible: %8u\n", - !!thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL)); + !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS, + THP_ORDERS_ALL)); =20 if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 71db243a002e..b0ff54eee81c 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define THP_ORDERS_ALL \ (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAU= LT) =20 -#define TVA_SMAPS (1 << 0) /* Will be used for procfs */ -#define TVA_IN_PF (1 << 1) /* Page fault handler */ -#define TVA_ENFORCE_SYSFS (1 << 2) /* Obey sysfs configuration */ +enum tva_type { + TVA_SMAPS, /* Exposing "THPeligible:" in smaps. */ + TVA_PAGEFAULT, /* Serving a page fault. */ + TVA_KHUGEPAGED, /* Khugepaged collapse. */ + TVA_FORCED_COLLAPSE, /* Forced collapse (i.e., MADV_COLLAPSE). 
 fs/proc/task_mmu.c      |  4 ++--
 include/linux/huge_mm.h | 30 ++++++++++++++++++------------
 mm/huge_memory.c        |  8 ++++----
 mm/khugepaged.c         | 18 +++++++++---------
 mm/memory.c             | 14 ++++++--------
 5 files changed, 39 insertions(+), 35 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3d6d8a9f13fc..d440df7b3d59 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1293,8 +1293,8 @@ static int show_smap(struct seq_file *m, void *v)
        __show_smap(m, &mss, false);
 
        seq_printf(m, "THPeligible:    %8u\n",
-                  !!thp_vma_allowable_orders(vma, vma->vm_flags,
-                          TVA_SMAPS | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL));
+                  !!thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SMAPS,
+                                             THP_ORDERS_ALL));
 
        if (arch_pkeys_enabled())
                seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 71db243a002e..b0ff54eee81c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,12 +94,15 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define THP_ORDERS_ALL \
        (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_SPECIAL | THP_ORDERS_ALL_FILE_DEFAULT)
 
-#define TVA_SMAPS              (1 << 0)        /* Will be used for procfs */
-#define TVA_IN_PF              (1 << 1)        /* Page fault handler */
-#define TVA_ENFORCE_SYSFS      (1 << 2)        /* Obey sysfs configuration */
+enum tva_type {
+       TVA_SMAPS,              /* Exposing "THPeligible:" in smaps. */
+       TVA_PAGEFAULT,          /* Serving a page fault. */
+       TVA_KHUGEPAGED,         /* Khugepaged collapse. */
+       TVA_FORCED_COLLAPSE,    /* Forced collapse (i.e., MADV_COLLAPSE). */
+};
 
-#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
-       (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
+#define thp_vma_allowable_order(vma, vm_flags, type, order) \
+       (!!thp_vma_allowable_orders(vma, vm_flags, type, BIT(order)))
 
 #define split_folio(f) split_folio_to_list(f, NULL)
 
@@ -264,14 +267,14 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
                                         vm_flags_t vm_flags,
-                                        unsigned long tva_flags,
+                                        enum tva_type type,
                                         unsigned long orders);
 
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
  * @vm_flags: use these vm_flags instead of vma->vm_flags
- * @tva_flags: Which TVA flags to honour
+ * @type: TVA type
  * @orders: bitfield of all orders to consider
  *
  * Calculates the intersection of the requested hugepage orders and the allowed
@@ -285,11 +288,14 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 static inline
 unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                                       vm_flags_t vm_flags,
-                                      unsigned long tva_flags,
+                                      enum tva_type type,
                                       unsigned long orders)
 {
-       /* Optimization to check if required orders are enabled early. */
-       if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
+       /*
+        * Optimization to check if required orders are enabled early. Only
+        * forced collapse ignores sysfs configs.
+        */
+       if (type != TVA_FORCED_COLLAPSE && vma_is_anonymous(vma)) {
                unsigned long mask = READ_ONCE(huge_anon_orders_always);
 
                if (vm_flags & VM_HUGEPAGE)
@@ -303,7 +309,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                return 0;
        }
 
-       return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
+       return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
 }
 
 struct thpsize {
@@ -536,7 +542,7 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
 
 static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                                        vm_flags_t vm_flags,
-                                       unsigned long tva_flags,
+                                       enum tva_type type,
                                        unsigned long orders)
 {
        return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d..85252b468f80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -99,12 +99,12 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
                                         vm_flags_t vm_flags,
-                                        unsigned long tva_flags,
+                                        enum tva_type type,
                                         unsigned long orders)
 {
-       bool smaps = tva_flags & TVA_SMAPS;
-       bool in_pf = tva_flags & TVA_IN_PF;
-       bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
+       const bool smaps = type == TVA_SMAPS;
+       const bool in_pf = type == TVA_PAGEFAULT;
+       const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
        unsigned long supported_orders;
 
        /* Check the intersection of requested and supported orders. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2c9008246785..7a54b6f2a346 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -474,8 +474,7 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
        if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
            hugepage_pmd_enabled()) {
-               if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
-                                           PMD_ORDER))
+               if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
                        __khugepaged_enter(vma->vm_mm);
        }
 }
@@ -921,7 +920,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
                                   struct collapse_control *cc)
 {
        struct vm_area_struct *vma;
-       unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+       enum tva_type tva_type = cc->is_khugepaged ? TVA_KHUGEPAGED :
+                                TVA_FORCED_COLLAPSE;
 
        if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
                return SCAN_ANY_PROCESS;
@@ -932,7 +932,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 
        if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
                return SCAN_ADDRESS_RANGE;
-       if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+       if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_type, PMD_ORDER))
                return SCAN_VMA_CHECK;
        /*
         * Anon VMA expected, the address may be unmapped then
@@ -1532,9 +1532,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
         * in the page cache with a single hugepage. If a mm were to fault-in
         * this memory (mapped by a suitably aligned VMA), we'd get the hugepage
         * and map it by a PMD, regardless of sysfs THP settings. As such, let's
-        * analogously elide sysfs THP settings here.
+        * analogously elide sysfs THP settings here and pretend we are
+        * collapsing.
         */
-       if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
                return SCAN_VMA_CHECK;
 
        /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2431,8 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                        progress++;
                        break;
                }
-               if (!thp_vma_allowable_order(vma, vma->vm_flags,
-                                            TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+               if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
 skip:
                        progress++;
                        continue;
@@ -2766,7 +2766,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
        BUG_ON(vma->vm_start > start);
        BUG_ON(vma->vm_end < end);
 
-       if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
+       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
                return -EINVAL;
 
        cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1..be761753f240 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4369,8 +4369,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
         * Get a list of all the (large) orders below PMD_ORDER that are enabled
         * and suitable for swapping THP.
         */
-       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+                                         BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
        orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address,
                                          orders);
@@ -4917,8 +4917,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
         * for this vma. Then filter out the orders that can't be allocated over
         * the faulting address and still be fully contained in the vma.
         */
-       orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-                       TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+                                         BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
        if (!orders)
@@ -6108,8 +6108,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                return VM_FAULT_OOM;
 retry_pud:
        if (pud_none(*vmf.pud) &&
-           thp_vma_allowable_order(vma, vm_flags,
-                                   TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
+           thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PUD_ORDER)) {
                ret = create_huge_pud(&vmf);
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
@@ -6143,8 +6142,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                goto retry_pud;
 
        if (pmd_none(*vmf.pmd) &&
-           thp_vma_allowable_order(vma, vm_flags,
-                                   TVA_IN_PF | TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+           thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
                ret = create_huge_pmd(&vmf);
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
-- 
2.47.3

From nobody Mon Oct 6 03:13:54 2025
From: Usama Arif
Subject: [PATCH 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with
 PR_THP_DISABLE_EXCEPT_ADVISED
Date: Fri, 25 Jul 2025 17:22:42 +0100
Message-ID: <20250725162258.1043176-4-usamaarif642@gmail.com>
In-Reply-To: <20250725162258.1043176-1-usamaarif642@gmail.com>
References: <20250725162258.1043176-1-usamaarif642@gmail.com>

From: David Hildenbrand

Let's allow MADV_COLLAPSE to succeed on areas that have neither
VM_HUGEPAGE nor VM_NOHUGEPAGE when THPs were disabled unless explicitly
advised (PR_THP_DISABLE_EXCEPT_ADVISED): MADV_COLLAPSE is clear advice
that we want to collapse.

Note that we still respect the VM_NOHUGEPAGE flag, just like
MADV_COLLAPSE always does.
Consequently, with PR_THP_DISABLE_EXCEPT_ADVISED set, MADV_COLLAPSE is
now only refused on VM_NOHUGEPAGE areas.

Co-developed-by: Usama Arif
Signed-off-by: Usama Arif
Signed-off-by: David Hildenbrand
---
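(Editor's illustration, not part of this patch: under the semantics
above, a process that disabled THPs except-advised can still collapse a
region explicitly. A sketch that assumes a 2M PMD size and, for brevity,
ignores PMD alignment of the mapping; the MADV_COLLAPSE fallback define
matches the uapi value.)

	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* from <linux/mman.h> */
	#endif
	#ifndef PR_SET_THP_DISABLE
	#define PR_SET_THP_DISABLE 41
	#endif
	#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
	#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
	#endif

	int main(void)
	{
		size_t len = 2 * 1024 * 1024;
		char *p;

		prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0);

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 1;	/* fault in small pages first */

		/*
		 * Neither VM_HUGEPAGE nor VM_NOHUGEPAGE is set; with this
		 * patch the collapse may succeed, as MADV_COLLAPSE counts
		 * as advice. It would still fail on a MADV_NOHUGEPAGE
		 * region.
		 */
		return madvise(p, len, MADV_COLLAPSE) ? 1 : 0;
	}
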
 include/linux/huge_mm.h    | 8 +++++++-
 include/uapi/linux/prctl.h | 2 +-
 mm/huge_memory.c           | 5 +++--
 mm/memory.c                | 6 ++++--
 mm/shmem.c                 | 2 +-
 5 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b0ff54eee81c..aeaf93f8ac2e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -329,7 +329,7 @@ struct thpsize {
  * through madvise or prctl.
  */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
-               vm_flags_t vm_flags)
+               vm_flags_t vm_flags, bool forced_collapse)
 {
        /* Are THPs disabled for this VMA? */
        if (vm_flags & VM_NOHUGEPAGE)
@@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
         */
        if (vm_flags & VM_HUGEPAGE)
                return false;
+       /*
+        * Forcing a collapse (e.g., madv_collapse), is a clear advise to
+        * use THPs.
+        */
+       if (forced_collapse)
+               return false;
        return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
 }
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 9c1d6e49b8a9..ee4165738779 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -185,7 +185,7 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE     41
 /*
  * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
- * VM_HUGEPAGE).
+ * VM_HUGEPAGE / MADV_COLLAPSE).
  */
 # define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
 #define PR_GET_THP_DISABLE     42
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 85252b468f80..ef5ccb0ec5d5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 {
        const bool smaps = type == TVA_SMAPS;
        const bool in_pf = type == TVA_PAGEFAULT;
-       const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
+       const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
+       const bool enforce_sysfs = !forced_collapse;
        unsigned long supported_orders;
 
        /* Check the intersection of requested and supported orders. */
@@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
        if (!vma->vm_mm)                /* vdso */
                return 0;
 
-       if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
+       if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
                return 0;
 
        /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
diff --git a/mm/memory.c b/mm/memory.c
index be761753f240..bd04212d6f79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5186,9 +5186,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *page)
         * It is too late to allocate a small folio, we already have a large
         * folio in the pagecache: especially s390 KVM cannot tolerate any
         * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
-        * PMD mappings if THPs are disabled.
+        * PMD mappings if THPs are disabled. As we already have a THP ...
+        * behave as if we are forcing a collapse.
         */
-       if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
+       if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
+                                                    /* forced_collapse=*/ true))
                return ret;
 
        if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
diff --git a/mm/shmem.c b/mm/shmem.c
index e6cdfda08aed..30609197a266 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1816,7 +1816,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
        vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
        unsigned int global_orders;
 
-       if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
+       if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))
                return 0;
 
        global_orders = shmem_huge_global_enabled(inode, index, write_end,
-- 
2.47.3

From nobody Mon Oct 6 03:13:54 2025
From: Usama Arif
Subject: [PATCH 4/5] selftests: prctl: introduce tests for disabling THPs
 completely
Date: Fri, 25 Jul 2025 17:22:43 +0100
Message-ID: <20250725162258.1043176-5-usamaarif642@gmail.com>
In-Reply-To: <20250725162258.1043176-1-usamaarif642@gmail.com>
References: <20250725162258.1043176-1-usamaarif642@gmail.com>

The test sets the global system THP setting to "madvise" and the 2M
setting to "inherit" before it starts (and resets both to their original
values at teardown).

It checks whether the process can:

- successfully set and get the policy to disable THPs completely.
- never get a hugepage when THPs are disabled completely, including with
  MADV_HUGEPAGE and MADV_COLLAPSE.
- successfully reset the policy of the process.
- get hugepages only on MADV_HUGEPAGE and MADV_COLLAPSE after the reset.
- repeat the above checks in a forked process, to make sure the policy
  is carried across forks.
Signed-off-by: Usama Arif
---
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../testing/selftests/mm/prctl_thp_disable.c  | 162 ++++++++++++++++++
 3 files changed, 164 insertions(+)
 create mode 100644 tools/testing/selftests/mm/prctl_thp_disable.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index e7b23a8a05fe..eb023ea857b3 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -58,3 +58,4 @@ pkey_sighandler_tests_32
 pkey_sighandler_tests_64
 guard-regions
 merge
+prctl_thp_disable
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d13b3cef2a2b..2bb8d3ebc17c 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -86,6 +86,7 @@ TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += pagemap_ioctl
 TEST_GEN_FILES += pfnmap
 TEST_GEN_FILES += process_madv
+TEST_GEN_FILES += prctl_thp_disable
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
new file mode 100644
index 000000000000..52f7e6659b1f
--- /dev/null
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Basic tests for PR_GET/SET_THP_DISABLE prctl calls
+ *
+ * Author(s): Usama Arif
+ */
+#include <errno.h>
+#include <linux/mman.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest_harness.h"
+#include "thp_settings.h"
+#include "vm_util.h"
+
+#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
+#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
+#endif
+
+#define NR_HUGEPAGES 6
+
+static int sz2ord(size_t size, size_t pagesize)
+{
+       return __builtin_ctzll(size / pagesize);
+}
+
+enum madvise_buffer {
+       NONE,
+       HUGE,
+       COLLAPSE
+};
+
+/*
+ * Function to mmap a buffer, fault it in, madvise it appropriately (before
+ * the page fault for MADV_HUGEPAGE, and after for MADV_COLLAPSE), and check
+ * if the mmap region is huge.
+ * returns:
+ *  0 if the test doesn't produce a hugepage
+ *  1 if the test produces a hugepage
+ * -1 if mmap fails
+ */
+static int test_mmap_thp(enum madvise_buffer madvise_buf, size_t pmdsize)
+{
+       int ret;
+       int buf_size = NR_HUGEPAGES * pmdsize;
+
+       char *buffer = (char *)mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
+                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+       if (buffer == MAP_FAILED)
+               return -1;
+
+       if (madvise_buf == HUGE)
+               madvise(buffer, buf_size, MADV_HUGEPAGE);
+
+       /* Ensure memory is allocated */
+       memset(buffer, 1, buf_size);
+
+       if (madvise_buf == COLLAPSE)
+               madvise(buffer, buf_size, MADV_COLLAPSE);
+
+       ret = check_huge_anon(buffer, NR_HUGEPAGES, pmdsize);
+       munmap(buffer, buf_size);
+       return ret;
+}
+
+FIXTURE(prctl_thp_disable_completely)
+{
+       struct thp_settings settings;
+       size_t pmdsize;
+};
+
+FIXTURE_SETUP(prctl_thp_disable_completely)
+{
+       if (!thp_is_enabled())
+               SKIP(return, "Transparent Hugepages not available\n");
+
+       self->pmdsize = read_pmd_pagesize();
+       if (!self->pmdsize)
+               SKIP(return, "Unable to read PMD size\n");
+
+       thp_read_settings(&self->settings);
+       self->settings.thp_enabled = THP_MADVISE;
+       self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
+       thp_save_settings();
+       thp_push_settings(&self->settings);
+}
+
+FIXTURE_TEARDOWN(prctl_thp_disable_completely)
+{
+       thp_restore_settings();
+}
+
+/* The prctl_thp_disable_completely fixture sets the system THP setting to madvise */
+static void prctl_thp_disable_completely(struct __test_metadata *const _metadata,
+                                        size_t pmdsize)
+{
+       int res = 0;
+
+       res = prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL);
+       ASSERT_EQ(res, 1);
+
+       /* global = madvise, process = never; we shouldn't get HPs even with madvise */
+       res = test_mmap_thp(NONE, pmdsize);
+       ASSERT_EQ(res, 0);
+
+       res = test_mmap_thp(HUGE, pmdsize);
+       ASSERT_EQ(res, 0);
+
+       res = test_mmap_thp(COLLAPSE, pmdsize);
+       ASSERT_EQ(res, 0);
+
+       /* Reset to the system policy */
+       res = prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL);
+       ASSERT_EQ(res, 0);
+
+       /* global = madvise */
+       res = test_mmap_thp(NONE, pmdsize);
+       ASSERT_EQ(res, 0);
+
+       res = test_mmap_thp(HUGE, pmdsize);
+       ASSERT_EQ(res, 1);
+
+       res = test_mmap_thp(COLLAPSE, pmdsize);
+       ASSERT_EQ(res, 1);
+}
+
+TEST_F(prctl_thp_disable_completely, nofork)
+{
+       int res = 0;
+
+       res = prctl(PR_SET_THP_DISABLE, 1, NULL, NULL, NULL);
+       ASSERT_EQ(res, 0);
+
+       prctl_thp_disable_completely(_metadata, self->pmdsize);
+}
+
+TEST_F(prctl_thp_disable_completely, fork)
+{
+       int res = 0, ret = 0;
+       pid_t pid;
+
+       res = prctl(PR_SET_THP_DISABLE, 1, NULL, NULL, NULL);
+       ASSERT_EQ(res, 0);
+
+       /* Make sure the prctl changes are carried across fork */
+       pid = fork();
+       ASSERT_GE(pid, 0);
+
+       if (!pid)
+               prctl_thp_disable_completely(_metadata, self->pmdsize);
+
+       wait(&ret);
+       if (WIFEXITED(ret))
+               ret = WEXITSTATUS(ret);
+       else
+               ret = -EINVAL;
+       ASSERT_EQ(ret, 0);
+}
+
+TEST_HARNESS_MAIN
-- 
2.47.3

From nobody Mon Oct 6 03:13:54 2025
From: Usama Arif <usamaarif642@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>, david@redhat.com,
	linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org, corbet@lwn.net, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, hannes@cmpxchg.org,
	baohua@kernel.org, shakeel.butt@linux.dev, riel@surriel.com,
	ziy@nvidia.com, laoar.shao@gmail.com, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	ryan.roberts@arm.com, vbabka@suse.cz, jannh@google.com,
	Arnd Bergmann <arnd@arndb.de>, sj@kernel.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	kernel-team@meta.com, Usama Arif <usamaarif642@gmail.com>
Subject: [PATCH 5/5] selftests: prctl: introduce tests for disabling THPs
 except for madvise
Date: Fri, 25 Jul 2025 17:22:44 +0100
Message-ID: <20250725162258.1043176-6-usamaarif642@gmail.com>
In-Reply-To: <20250725162258.1043176-1-usamaarif642@gmail.com>
References: <20250725162258.1043176-1-usamaarif642@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The test sets the global system THP setting to "always" and the 2M
setting to "inherit" before it starts (and resets both to their original
values at teardown).

This tests whether the process can:
- successfully set and get the policy that disables THPs except for
  madvise.
- get hugepages only on MADV_HUGEPAGE and MADV_COLLAPSE after that policy
  is set.
- successfully reset the policy of the process.
- get hugepages in all cases after the reset.
- repeat the above tests in a forked process, to make sure the policy is
  carried across forks.
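For reference, the userspace pattern exercised here looks roughly as
follows (a minimal sketch, assuming a kernel with this series applied;
on an unpatched kernel the flagged prctl call fails with -EINVAL, and the
fallback define mirrors the one in the selftest):

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_THP_DISABLE_EXCEPT_ADVISED
	#define PR_THP_DISABLE_EXCEPT_ADVISED (1 << 1)
	#endif

	int main(void)
	{
		/* Disable THPs for this process, except for regions
		 * explicitly advised via MADV_HUGEPAGE or MADV_COLLAPSE.
		 */
		if (prctl(PR_SET_THP_DISABLE, 1,
			  PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL))
			perror("PR_SET_THP_DISABLE");

		/* In this state, PR_GET_THP_DISABLE reports 3
		 * (disabled + except-advised), as asserted below.
		 */

		/* Drop the override; fall back to the global THP policy. */
		if (prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL))
			perror("PR_SET_THP_DISABLE");

		return 0;
	}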
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 .../testing/selftests/mm/prctl_thp_disable.c  | 95 +++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/tools/testing/selftests/mm/prctl_thp_disable.c b/tools/testing/selftests/mm/prctl_thp_disable.c
index 52f7e6659b1f..288d5ad6ffbb 100644
--- a/tools/testing/selftests/mm/prctl_thp_disable.c
+++ b/tools/testing/selftests/mm/prctl_thp_disable.c
@@ -65,6 +65,101 @@ static int test_mmap_thp(enum madvise_buffer madvise_buf, size_t pmdsize)
 	munmap(buffer, buf_size);
 	return ret;
 }
+
+FIXTURE(prctl_thp_disable_except_madvise)
+{
+	struct thp_settings settings;
+	size_t pmdsize;
+};
+
+FIXTURE_SETUP(prctl_thp_disable_except_madvise)
+{
+	if (!thp_is_enabled())
+		SKIP(return, "Transparent Hugepages not available\n");
+
+	self->pmdsize = read_pmd_pagesize();
+	if (!self->pmdsize)
+		SKIP(return, "Unable to read PMD size\n");
+
+	thp_read_settings(&self->settings);
+	self->settings.thp_enabled = THP_ALWAYS;
+	self->settings.hugepages[sz2ord(self->pmdsize, getpagesize())].enabled = THP_INHERIT;
+	thp_save_settings();
+	thp_push_settings(&self->settings);
+
+}
+
+FIXTURE_TEARDOWN(prctl_thp_disable_except_madvise)
+{
+	thp_restore_settings();
+}
+
+/* The prctl_thp_disable_except_madvise fixture sets the system THP setting to always */
+static void prctl_thp_disable_except_madvise(struct __test_metadata *const _metadata,
+					     size_t pmdsize)
+{
+	int res = 0;
+
+	res = prctl(PR_GET_THP_DISABLE, NULL, NULL, NULL, NULL);
+	ASSERT_EQ(res, 3);
+
+	/* global = always, process = madvise, we shouldn't get HPs without madvise */
+	res = test_mmap_thp(NONE, pmdsize);
+	ASSERT_EQ(res, 0);
+
+	res = test_mmap_thp(HUGE, pmdsize);
+	ASSERT_EQ(res, 1);
+
+	res = test_mmap_thp(COLLAPSE, pmdsize);
+	ASSERT_EQ(res, 1);
+
+	/* Reset to the system policy */
+	res = prctl(PR_SET_THP_DISABLE, 0, NULL, NULL, NULL);
+	ASSERT_EQ(res, 0);
+
+	/* global = always, hence we should get HPs even without madvise */
+	res = test_mmap_thp(NONE, pmdsize);
+	ASSERT_EQ(res, 1);
+
+	res = test_mmap_thp(HUGE, pmdsize);
+	ASSERT_EQ(res, 1);
+
+	res = test_mmap_thp(COLLAPSE, pmdsize);
+	ASSERT_EQ(res, 1);
+}
+
+TEST_F(prctl_thp_disable_except_madvise, nofork)
+{
+	int res = 0;
+
+	res = prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL);
+	ASSERT_EQ(res, 0);
+	prctl_thp_disable_except_madvise(_metadata, self->pmdsize);
+}
+
+TEST_F(prctl_thp_disable_except_madvise, fork)
+{
+	int res = 0, ret = 0;
+	pid_t pid;
+
+	res = prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, NULL, NULL);
+	ASSERT_EQ(res, 0);
+
+	/* Make sure the prctl changes are carried across fork */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (!pid)
+		prctl_thp_disable_except_madvise(_metadata, self->pmdsize);
+
+	wait(&ret);
+	if (WIFEXITED(ret))
+		ret = WEXITSTATUS(ret);
+	else
+		ret = -EINVAL;
+	ASSERT_EQ(ret, 0);
+}
+
 FIXTURE(prctl_thp_disable_completely)
 {
 	struct thp_settings settings;
-- 
2.47.3