[PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE

Usama Arif posted 5 patches 2 months, 1 week ago
There is a newer version of this series
[PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by Usama Arif 2 months, 1 week ago
From: David Hildenbrand <david@redhat.com>

People want to make use of more THPs, for example, moving from
the "never" system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "none" system policy to "madvise"/"always", but keep THPs
    disabled for selected workloads.

(2) Stay at "none" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
    for selected workloads -- "always" policy.

Once can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs. It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
yet (i.e., used by Redis) were able to just disable THPs completely. Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.
That essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising. Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
imposes a problem for relevant workloads, because "not THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
would change the documented semantics quite a bit, and the versatility
to use it for debugging purposes, so I am not 100% sure that is what we
want -- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    Previously, it would return 1 if THPs were disabled completely. Now
    it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
    was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

    For now, we don't add another interface to obtained whether THPs
    are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
    ever required, we could add a new entry.

The documented semantics in the man page for PR_SET_THP_DISABLE
"is inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained. This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g.,
systemd where we fork() a helper process to then exec()).

For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space
thinks a THP is a good idea, we'll enable that separately next
(requiring a bit of cleanup first).

There is currently not way to prevent that a process will not issue
PR_SET_THP_DISABLE itself to re-enable THP. There are not really known
users for re-enabling it, and it's against the purpose of the original
interface. So if ever required, we could investigate just forbidding to
re-enable them, or make this somehow configurable.

Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Hildenbrand <david@redhat.com>

---

At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
think there might be real use cases where we want to disable any THPs --
in particular also around debugging THP-related problems, and
"never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
helpful for debugging purposes. Of course, I thought of having a
system-wide config option to modify PR_SET_THP_DISABLE behavior, but
I just don't like the semantics.

"prctl: allow overriding system THP policy to always"[1] proposed
"overriding policies to always", which is just the wrong way around: we
should not add mechanisms to "enable more" when we already have an
interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
weird otherwise.

"[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
setting the default of the VM_HUGEPAGE, which is similarly the wrong way
around I think now.

The ideas explored by Lorenzo to extend process_madvise()[3] and mctrl()[4]
similarly were around the "default for VM_HUGEPAGE" idea, but after the
discussion, I think we should better leave VM_HUGEPAGE untouched.

Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
we essentially want to say "leave advised regions alone" -- "keep THP
enabled for advised regions",

The only thing I really dislike about this is using another MMF_* flag,
but well, no way around it -- and seems like we could easily support
more than 32 if we want to (most users already treat it like a proper
bitmap).

I think this here (modifying an existing toggle) is the only prctl()
extension that we might be willing to accept. In general, I agree like
most others, that prctl() is a very bad interface for that -- but
PR_SET_THP_DISABLE is already there and is getting used.

Long-term, I think the answer will be something based on bpf[5]. Maybe
in that context, I there could still be value in easily disabling THPs for
selected workloads (esp. debugging purposes).

Jann raised valid concerns[6] about new flags that are persistent across
exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
consider it having a similar security risk as our existing
PR_SET_THP_DISABLE, but devil is in the detail.

[1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
[2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
[3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
[4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
[5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
[6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/filesystems/proc.rst |  5 +--
 fs/proc/array.c                    |  2 +-
 include/linux/huge_mm.h            | 20 ++++++++---
 include/linux/mm_types.h           | 13 +++----
 include/uapi/linux/prctl.h         | 10 ++++++
 kernel/sys.c                       | 58 +++++++++++++++++++++++-------
 mm/khugepaged.c                    |  2 +-
 7 files changed, 81 insertions(+), 29 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b7235..915a3e44bc12 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,8 +291,9 @@ It's slow but very precise.
  HugetlbPages                size of hugetlb memory portions
  CoreDumping                 process's memory is currently being dumped
                              (killing the process may lead to a corrupted core)
- THP_enabled		     process is allowed to use THP (returns 0 when
-			     PR_SET_THP_DISABLE is set on the process
+ THP_enabled                 process is allowed to use THP (returns 0 when
+                             PR_SET_THP_DISABLE is set on the process to disable
+                             THP completely, not just partially)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..c4f91a784104 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
 	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
 
 	if (thp_enabled)
-		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
 	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
 }
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..71db243a002e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -318,16 +318,26 @@ struct thpsize {
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
 
+/*
+ * Check whether THPs are explicitly disabled for this VMA, for example,
+ * through madvise or prctl.
+ */
 static inline bool vma_thp_disabled(struct vm_area_struct *vma,
 		vm_flags_t vm_flags)
 {
+	/* Are THPs disabled for this VMA? */
+	if (vm_flags & VM_NOHUGEPAGE)
+		return true;
+	/* Are THPs disabled for all VMAs in the whole process? */
+	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
+		return true;
 	/*
-	 * Explicitly disabled through madvise or prctl, or some
-	 * architectures may disable THP for some mappings, for
-	 * example, s390 kvm.
+	 * Are THPs disabled only for VMAs where we didn't get an explicit
+	 * advise to use them?
 	 */
-	return (vm_flags & VM_NOHUGEPAGE) ||
-	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
+	if (vm_flags & VM_HUGEPAGE)
+		return false;
+	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
 }
 
 static inline bool thp_disabled_by_hw(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ec273b06691..123fefaa4b98 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1743,19 +1743,16 @@ enum {
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
 
-/*
- * This one-shot flag is dropped due to necessity of changing exe once again
- * on NFS restore
- */
-//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
+#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
 #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
-#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
-#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
-#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
+#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except when advised (e.g., VM_HUGEPAGE) */
+#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
+#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
+				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
 #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
 #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
 /*
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559..9c1d6e49b8a9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -177,7 +177,17 @@ struct prctl_mm_map {
 
 #define PR_GET_TID_ADDRESS	40
 
+/*
+ * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
+ * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
+ * return "1" when no flags were specified for PR_SET_THP_DISABLE.
+ */
 #define PR_SET_THP_DISABLE	41
+/*
+ * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
+ * VM_HUGEPAGE).
+ */
+# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
 #define PR_GET_THP_DISABLE	42
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index b153fb345ada..b87d0acaab0b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
 	return sizeof(mm->saved_auxv);
 }
 
+static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
+				 unsigned long arg4, unsigned long arg5)
+{
+	unsigned long *mm_flags = &current->mm->flags;
+
+	if (arg2 || arg3 || arg4 || arg5)
+		return -EINVAL;
+
+	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
+		return 1;
+	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
+		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
+	return 0;
+}
+
+static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
+				 unsigned long arg4, unsigned long arg5)
+{
+	unsigned long *mm_flags = &current->mm->flags;
+
+	if (arg4 || arg5)
+		return -EINVAL;
+
+	/* Flags are only allowed when disabling. */
+	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
+		return -EINVAL;
+	if (mmap_write_lock_killable(current->mm))
+		return -EINTR;
+	if (thp_disable) {
+		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
+			clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+			set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+		} else {
+			set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+			clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+		}
+	} else {
+		clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
+		clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
+	}
+	mmap_write_unlock(current->mm);
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			return -EINVAL;
 		return task_no_new_privs(current) ? 1 : 0;
 	case PR_GET_THP_DISABLE:
-		if (arg2 || arg3 || arg4 || arg5)
-			return -EINVAL;
-		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
+		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_SET_THP_DISABLE:
-		if (arg3 || arg4 || arg5)
-			return -EINVAL;
-		if (mmap_write_lock_killable(me->mm))
-			return -EINTR;
-		if (arg2)
-			set_bit(MMF_DISABLE_THP, &me->mm->flags);
-		else
-			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
-		mmap_write_unlock(me->mm);
+		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ff0c7dd2be4..2c9008246785 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -410,7 +410,7 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
 {
 	return hpage_collapse_test_exit(mm) ||
-	       test_bit(MMF_DISABLE_THP, &mm->flags);
+	       test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
 }
 
 static bool hugepage_pmd_enabled(void)
-- 
2.47.3
Re: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by Lorenzo Stoakes 2 months, 1 week ago
An aside - we should update the man page for this - see
https://man7.org/linux/man-pages/man2/PR_SET_THP_DISABLE.2const.html

This has to be done separately from the series I think.

On Fri, Jul 25, 2025 at 05:22:40PM +0100, Usama Arif wrote:
> From: David Hildenbrand <david@redhat.com>
>
> People want to make use of more THPs, for example, moving from
> the "never" system policy to "madvise", or from "madvise" to "always".
>
> While this is great news for every THP desperately waiting to get
> allocated out there, apparently there are some workloads that require a
> bit of care during that transition: individual processes may need to
> opt-out from this behavior for various reasons, and this should be
> permitted without needing to make all other workloads on the system
> similarly opt-out.
>
> The following scenarios are imaginable:
>
> (1) Switch from "none" system policy to "madvise"/"always", but keep THPs
>     disabled for selected workloads.
>
> (2) Stay at "none" system policy, but enable THPs for selected
>     workloads, making only these workloads use the "madvise" or "always"
>     policy.
>
> (3) Switch from "madvise" system policy to "always", but keep the
>     "madvise" policy for selected workloads: allocate THPs only when
>     advised.
>
> (4) Stay at "madvise" system policy, but enable THPs even when not advised
>     for selected workloads -- "always" policy.
>
> Once can emulate (2) through (1), by setting the system policy to
> "madvise"/"always" while disabling THPs for all processes that don't want
> THPs. It requires configuring all workloads, but that is a user-space
> problem to sort out.
>
> (4) can be emulated through (3) in a similar way.
>
> Back when (1) was relevant in the past, as people started enabling THPs,
> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
> still implements the option to use this interface to disable THPs
> completely.
>
> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
> workload -- a process, including fork+exec'ed process hierarchy.
> That essentially made us support (1): simply disable THPs for all workloads
> that are not ready for THPs yet, while still enabling THPs system-wide.
>
> The quest for handling (3) and (4) started, but current approaches
> (completely new prctl, options to set other policies per process,
> alternatives to prctl -- mctrl, cgroup handling) don't look particularly
> promising. Likely, the future will use bpf or something similar to
> implement better policies, in particular to also make better decisions
> about THP sizes to use, but this will certainly take a while as that work
> just started.
>
> Long story short: a simple enable/disable is not really suitable for the
> future, so we're not willing to add completely new toggles.
>
> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
> completely for these processes, this is a step backwards, because these
> processes can no longer allocate THPs in regions where THPs were
> explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
> imposes a problem for relevant workloads, because "not THPs" is certainly
> worse than "THPs only when advised".
>
> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
> would change the documented semantics quite a bit, and the versatility
> to use it for debugging purposes, so I am not 100% sure that is what we
> want -- although it would certainly be much easier.
>
> So instead, as an easy way forward for (3) and (4), add an option to
> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>
> In essence, this patch:
>
> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>
>     prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).
>
> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>
>     Previously, it would return 1 if THPs were disabled completely. Now
>     it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
>     was set.
>
> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>     the semantics clearly.
>
>     Fortunately, there are only two instances outside of prctl() code.
>
> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>
>     Fortunately, we only have to extend vma_thp_disabled().
>
> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
>     disabled completely
>
>     Only indicating that THPs are disabled when they are really disabled
>     completely, not only partially.
>
>     For now, we don't add another interface to obtained whether THPs
>     are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
>     ever required, we could add a new entry.
>
> The documented semantics in the man page for PR_SET_THP_DISABLE
> "is inherited by a child created via fork(2) and is preserved across
> execve(2)" is maintained. This behavior, for example, allows for
> disabling THPs for a workload through the launching process (e.g.,
> systemd where we fork() a helper process to then exec()).
>
> For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
> VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space
> thinks a THP is a good idea, we'll enable that separately next
> (requiring a bit of cleanup first).
>
> There is currently not way to prevent that a process will not issue
> PR_SET_THP_DISABLE itself to re-enable THP. There are not really known
> users for re-enabling it, and it's against the purpose of the original
> interface. So if ever required, we could investigate just forbidding to
> re-enable them, or make this somehow configurable.
>
> Acked-by: Usama Arif <usamaarif642@gmail.com>
> Tested-by: Usama Arif <usamaarif642@gmail.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: SeongJae Park <sj@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> ---
>
> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
> think there might be real use cases where we want to disable any THPs --
> in particular also around debugging THP-related problems, and
> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
> helpful for debugging purposes. Of course, I thought of having a
> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
> I just don't like the semantics.
>
> "prctl: allow overriding system THP policy to always"[1] proposed
> "overriding policies to always", which is just the wrong way around: we
> should not add mechanisms to "enable more" when we already have an
> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
> weird otherwise.
>
> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
> around I think now.
>
> The ideas explored by Lorenzo to extend process_madvise()[3] and mctrl()[4]
> similarly were around the "default for VM_HUGEPAGE" idea, but after the
> discussion, I think we should better leave VM_HUGEPAGE untouched.
>
> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
> we essentially want to say "leave advised regions alone" -- "keep THP
> enabled for advised regions",
>
> The only thing I really dislike about this is using another MMF_* flag,
> but well, no way around it -- and seems like we could easily support
> more than 32 if we want to (most users already treat it like a proper
> bitmap).
>
> I think this here (modifying an existing toggle) is the only prctl()
> extension that we might be willing to accept. In general, I agree like
> most others, that prctl() is a very bad interface for that -- but
> PR_SET_THP_DISABLE is already there and is getting used.
>
> Long-term, I think the answer will be something based on bpf[5]. Maybe
> in that context, I there could still be value in easily disabling THPs for
> selected workloads (esp. debugging purposes).
>
> Jann raised valid concerns[6] about new flags that are persistent across
> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
> consider it having a similar security risk as our existing
> PR_SET_THP_DISABLE, but devil is in the detail.
>
> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  Documentation/filesystems/proc.rst |  5 +--
>  fs/proc/array.c                    |  2 +-
>  include/linux/huge_mm.h            | 20 ++++++++---
>  include/linux/mm_types.h           | 13 +++----
>  include/uapi/linux/prctl.h         | 10 ++++++
>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>  mm/khugepaged.c                    |  2 +-
>  7 files changed, 81 insertions(+), 29 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2971551b7235..915a3e44bc12 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -291,8 +291,9 @@ It's slow but very precise.
>   HugetlbPages                size of hugetlb memory portions
>   CoreDumping                 process's memory is currently being dumped
>                               (killing the process may lead to a corrupted core)
> - THP_enabled		     process is allowed to use THP (returns 0 when
> -			     PR_SET_THP_DISABLE is set on the process
> + THP_enabled                 process is allowed to use THP (returns 0 when
> +                             PR_SET_THP_DISABLE is set on the process to disable
> +                             THP completely, not just partially)
>   Threads                     number of threads
>   SigQ                        number of signals queued/max. number for queue
>   SigPnd                      bitmap of pending signals for the thread
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index d6a0369caa93..c4f91a784104 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>
>  	if (thp_enabled)
> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>  }
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7748489fde1b..71db243a002e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -318,16 +318,26 @@ struct thpsize {
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>
> +/*
> + * Check whether THPs are explicitly disabled for this VMA, for example,
> + * through madvise or prctl.
> + */
>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>  		vm_flags_t vm_flags)
>  {
> +	/* Are THPs disabled for this VMA? */
> +	if (vm_flags & VM_NOHUGEPAGE)
> +		return true;
> +	/* Are THPs disabled for all VMAs in the whole process? */
> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> +		return true;
>  	/*
> -	 * Explicitly disabled through madvise or prctl, or some
> -	 * architectures may disable THP for some mappings, for
> -	 * example, s390 kvm.
> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
> +	 * advise to use them?
>  	 */
> -	return (vm_flags & VM_NOHUGEPAGE) ||
> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> +	if (vm_flags & VM_HUGEPAGE)
> +		return false;

Hm is this correct? This means that VM_NOHUGEPAGE no longer results in THP being
disabled here no?

> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>  }
>
>  static inline bool thp_disabled_by_hw(void)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1ec273b06691..123fefaa4b98 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1743,19 +1743,16 @@ enum {
>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>
> -/*
> - * This one-shot flag is dropped due to necessity of changing exe once again
> - * on NFS restore
> - */
> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
> +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
>
>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
> -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
> -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except when advised (e.g., VM_HUGEPAGE) */
> +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */

It's almost a bit weird to have these as separate flags, since they're distinct
(of course, I don't think there's necessarily another way).

Though this makes me think maybe in future we can have a new mode where both
enabled == something else :P

But perhaps I've been infected with 'bit packing' disease.

Anyway as discussed in the THP meeting, I'm going to be (hopefully!) making
the mm flags a bitmap soon so we'll get more flags available.

> +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
> +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
>  #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
>  #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
>  /*
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43dec6eed559..9c1d6e49b8a9 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -177,7 +177,17 @@ struct prctl_mm_map {
>
>  #define PR_GET_TID_ADDRESS	40
>
> +/*
> + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
> + * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
> + * return "1" when no flags were specified for PR_SET_THP_DISABLE.
> + */
>  #define PR_SET_THP_DISABLE	41
> +/*
> + * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> + * VM_HUGEPAGE).
> + */
> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)

NO space after # please.

>  #define PR_GET_THP_DISABLE	42
>
>  /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index b153fb345ada..b87d0acaab0b 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>  	return sizeof(mm->saved_auxv);
>  }
>
> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
> +				 unsigned long arg4, unsigned long arg5)
> +{
> +	unsigned long *mm_flags = &current->mm->flags;
> +
> +	if (arg2 || arg3 || arg4 || arg5)
> +		return -EINVAL;
> +

Can we have a comment here about what we're doing below re: the return
value?

> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
> +		return 1;
> +	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
> +		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
> +	return 0;
> +}
> +
> +static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> +				 unsigned long arg4, unsigned long arg5)
> +{
> +	unsigned long *mm_flags = &current->mm->flags;
> +
> +	if (arg4 || arg5)
> +		return -EINVAL;
> +
> +	/* Flags are only allowed when disabling. */
> +	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
> +		return -EINVAL;
> +	if (mmap_write_lock_killable(current->mm))
> +		return -EINTR;
> +	if (thp_disable) {
> +		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
> +			clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +			set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +		} else {
> +			set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +			clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +		}
> +	} else {
> +		clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
> +		clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
> +	}
> +	mmap_write_unlock(current->mm);
> +	return 0;
> +}
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  		return task_no_new_privs(current) ? 1 : 0;
>  	case PR_GET_THP_DISABLE:
> -		if (arg2 || arg3 || arg4 || arg5)
> -			return -EINVAL;
> -		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
> +		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
>  		break;
>  	case PR_SET_THP_DISABLE:
> -		if (arg3 || arg4 || arg5)
> -			return -EINVAL;
> -		if (mmap_write_lock_killable(me->mm))
> -			return -EINTR;
> -		if (arg2)
> -			set_bit(MMF_DISABLE_THP, &me->mm->flags);
> -		else
> -			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> -		mmap_write_unlock(me->mm);
> +		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
>  		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1ff0c7dd2be4..2c9008246785 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -410,7 +410,7 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm)
>  static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
>  {
>  	return hpage_collapse_test_exit(mm) ||
> -	       test_bit(MMF_DISABLE_THP, &mm->flags);
> +	       test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>  }
>
>  static bool hugepage_pmd_enabled(void)
> --
> 2.47.3
>
Re: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by Usama Arif 2 months, 1 week ago

On 30/07/2025 20:31, Lorenzo Stoakes wrote:
> An aside - we should update the man page for this - see
> https://man7.org/linux/man-pages/man2/PR_SET_THP_DISABLE.2const.html
> 
> This has to be done separately from the series I think.
> 
> On Fri, Jul 25, 2025 at 05:22:40PM +0100, Usama Arif wrote:
>> From: David Hildenbrand <david@redhat.com>
>>
>> People want to make use of more THPs, for example, moving from
>> the "never" system policy to "madvise", or from "madvise" to "always".
>>
>> While this is great news for every THP desperately waiting to get
>> allocated out there, apparently there are some workloads that require a
>> bit of care during that transition: individual processes may need to
>> opt-out from this behavior for various reasons, and this should be
>> permitted without needing to make all other workloads on the system
>> similarly opt-out.
>>
>> The following scenarios are imaginable:
>>
>> (1) Switch from "none" system policy to "madvise"/"always", but keep THPs
>>     disabled for selected workloads.
>>
>> (2) Stay at "none" system policy, but enable THPs for selected
>>     workloads, making only these workloads use the "madvise" or "always"
>>     policy.
>>
>> (3) Switch from "madvise" system policy to "always", but keep the
>>     "madvise" policy for selected workloads: allocate THPs only when
>>     advised.
>>
>> (4) Stay at "madvise" system policy, but enable THPs even when not advised
>>     for selected workloads -- "always" policy.
>>
>> Once can emulate (2) through (1), by setting the system policy to
>> "madvise"/"always" while disabling THPs for all processes that don't want
>> THPs. It requires configuring all workloads, but that is a user-space
>> problem to sort out.
>>
>> (4) can be emulated through (3) in a similar way.
>>
>> Back when (1) was relevant in the past, as people started enabling THPs,
>> we added PR_SET_THP_DISABLE, so relevant workloads that were not ready
>> yet (i.e., used by Redis) were able to just disable THPs completely. Redis
>> still implements the option to use this interface to disable THPs
>> completely.
>>
>> With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
>> workload -- a process, including fork+exec'ed process hierarchy.
>> That essentially made us support (1): simply disable THPs for all workloads
>> that are not ready for THPs yet, while still enabling THPs system-wide.
>>
>> The quest for handling (3) and (4) started, but current approaches
>> (completely new prctl, options to set other policies per process,
>> alternatives to prctl -- mctrl, cgroup handling) don't look particularly
>> promising. Likely, the future will use bpf or something similar to
>> implement better policies, in particular to also make better decisions
>> about THP sizes to use, but this will certainly take a while as that work
>> just started.
>>
>> Long story short: a simple enable/disable is not really suitable for the
>> future, so we're not willing to add completely new toggles.
>>
>> While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
>> completely for these processes, this is a step backwards, because these
>> processes can no longer allocate THPs in regions where THPs were
>> explicitly advised: regions flagged as VM_HUGEPAGE. Apparently, that
>> imposes a problem for relevant workloads, because "not THPs" is certainly
>> worse than "THPs only when advised".
>>
>> Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
>> explicitly advised by the app through MAD_HUGEPAGE"? *maybe*, but this
>> would change the documented semantics quite a bit, and the versatility
>> to use it for debugging purposes, so I am not 100% sure that is what we
>> want -- although it would certainly be much easier.
>>
>> So instead, as an easy way forward for (3) and (4), add an option to
>> make PR_SET_THP_DISABLE disable *less* THPs for a process.
>>
>> In essence, this patch:
>>
>> (A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
>>     of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).
>>
>>     prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).
>>
>> (B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
>>     PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.
>>
>>     Previously, it would return 1 if THPs were disabled completely. Now
>>     it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
>>     was set.
>>
>> (C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
>>     the semantics clearly.
>>
>>     Fortunately, there are only two instances outside of prctl() code.
>>
>> (D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
>>     with VM_HUGEPAGE" -- essentially "thp=madvise" behavior
>>
>>     Fortunately, we only have to extend vma_thp_disabled().
>>
>> (E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
>>     disabled completely
>>
>>     Only indicating that THPs are disabled when they are really disabled
>>     completely, not only partially.
>>
>>     For now, we don't add another interface to obtained whether THPs
>>     are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
>>     ever required, we could add a new entry.
>>
>> The documented semantics in the man page for PR_SET_THP_DISABLE
>> "is inherited by a child created via fork(2) and is preserved across
>> execve(2)" is maintained. This behavior, for example, allows for
>> disabling THPs for a workload through the launching process (e.g.,
>> systemd where we fork() a helper process to then exec()).
>>
>> For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
>> VM_NOHUGEPAGE. As MADV_COLLAPSE is a clear advise that user space
>> thinks a THP is a good idea, we'll enable that separately next
>> (requiring a bit of cleanup first).
>>
>> There is currently not way to prevent that a process will not issue
>> PR_SET_THP_DISABLE itself to re-enable THP. There are not really known
>> users for re-enabling it, and it's against the purpose of the original
>> interface. So if ever required, we could investigate just forbidding to
>> re-enable them, or make this somehow configurable.
>>
>> Acked-by: Usama Arif <usamaarif642@gmail.com>
>> Tested-by: Usama Arif <usamaarif642@gmail.com>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Cc: SeongJae Park <sj@kernel.org>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Yafang Shao <laoar.shao@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> ---
>>
>> At first, I thought of "why not simply relax PR_SET_THP_DISABLE", but I
>> think there might be real use cases where we want to disable any THPs --
>> in particular also around debugging THP-related problems, and
>> "never" not meaning ... "never" anymore ever since we add MADV_COLLAPSE.
>> PR_SET_THP_DISABLE will also block MADV_COLLAPSE, which can be very
>> helpful for debugging purposes. Of course, I thought of having a
>> system-wide config option to modify PR_SET_THP_DISABLE behavior, but
>> I just don't like the semantics.
>>
>> "prctl: allow overriding system THP policy to always"[1] proposed
>> "overriding policies to always", which is just the wrong way around: we
>> should not add mechanisms to "enable more" when we already have an
>> interface/mechanism to "disable" them (PR_SET_THP_DISABLE). It all gets
>> weird otherwise.
>>
>> "[PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY"[2] proposed
>> setting the default of the VM_HUGEPAGE, which is similarly the wrong way
>> around I think now.
>>
>> The ideas explored by Lorenzo to extend process_madvise()[3] and mctrl()[4]
>> similarly were around the "default for VM_HUGEPAGE" idea, but after the
>> discussion, I think we should better leave VM_HUGEPAGE untouched.
>>
>> Happy to hear naming suggestions for "PR_THP_DISABLE_EXCEPT_ADVISED" where
>> we essentially want to say "leave advised regions alone" -- "keep THP
>> enabled for advised regions",
>>
>> The only thing I really dislike about this is using another MMF_* flag,
>> but well, no way around it -- and seems like we could easily support
>> more than 32 if we want to (most users already treat it like a proper
>> bitmap).
>>
>> I think this here (modifying an existing toggle) is the only prctl()
>> extension that we might be willing to accept. In general, I agree like
>> most others, that prctl() is a very bad interface for that -- but
>> PR_SET_THP_DISABLE is already there and is getting used.
>>
>> Long-term, I think the answer will be something based on bpf[5]. Maybe
>> in that context, I there could still be value in easily disabling THPs for
>> selected workloads (esp. debugging purposes).
>>
>> Jann raised valid concerns[6] about new flags that are persistent across
>> exec[6]. As this here is a relaxation to existing PR_SET_THP_DISABLE I
>> consider it having a similar security risk as our existing
>> PR_SET_THP_DISABLE, but devil is in the detail.
>>
>> [1] https://lore.kernel.org/r/20250507141132.2773275-1-usamaarif642@gmail.com
>> [2] https://lkml.kernel.org/r/20250515133519.2779639-2-usamaarif642@gmail.com
>> [3] https://lore.kernel.org/r/cover.1747686021.git.lorenzo.stoakes@oracle.com
>> [4] https://lkml.kernel.org/r/85778a76-7dc8-4ea8-8827-acb45f74ee05@lucifer.local
>> [5] https://lkml.kernel.org/r/20250608073516.22415-1-laoar.shao@gmail.com
>> [6] https://lore.kernel.org/r/CAG48ez3-7EnBVEjpdoW7z5K0hX41nLQN5Wb65Vg-1p8DdXRnjg@mail.gmail.com
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  Documentation/filesystems/proc.rst |  5 +--
>>  fs/proc/array.c                    |  2 +-
>>  include/linux/huge_mm.h            | 20 ++++++++---
>>  include/linux/mm_types.h           | 13 +++----
>>  include/uapi/linux/prctl.h         | 10 ++++++
>>  kernel/sys.c                       | 58 +++++++++++++++++++++++-------
>>  mm/khugepaged.c                    |  2 +-
>>  7 files changed, 81 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
>> index 2971551b7235..915a3e44bc12 100644
>> --- a/Documentation/filesystems/proc.rst
>> +++ b/Documentation/filesystems/proc.rst
>> @@ -291,8 +291,9 @@ It's slow but very precise.
>>   HugetlbPages                size of hugetlb memory portions
>>   CoreDumping                 process's memory is currently being dumped
>>                               (killing the process may lead to a corrupted core)
>> - THP_enabled		     process is allowed to use THP (returns 0 when
>> -			     PR_SET_THP_DISABLE is set on the process
>> + THP_enabled                 process is allowed to use THP (returns 0 when
>> +                             PR_SET_THP_DISABLE is set on the process to disable
>> +                             THP completely, not just partially)
>>   Threads                     number of threads
>>   SigQ                        number of signals queued/max. number for queue
>>   SigPnd                      bitmap of pending signals for the thread
>> diff --git a/fs/proc/array.c b/fs/proc/array.c
>> index d6a0369caa93..c4f91a784104 100644
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -422,7 +422,7 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
>>  	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
>>
>>  	if (thp_enabled)
>> -		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
>> +		thp_enabled = !test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>  	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
>>  }
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7748489fde1b..71db243a002e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -318,16 +318,26 @@ struct thpsize {
>>  	(transparent_hugepage_flags &					\
>>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
>>
>> +/*
>> + * Check whether THPs are explicitly disabled for this VMA, for example,
>> + * through madvise or prctl.
>> + */
>>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>>  		vm_flags_t vm_flags)
>>  {
>> +	/* Are THPs disabled for this VMA? */
>> +	if (vm_flags & VM_NOHUGEPAGE)
>> +		return true;

VM_NOHUGEPAGE will cause the THP being disabled here.

>> +	/* Are THPs disabled for all VMAs in the whole process? */
>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
>> +		return true;
>>  	/*
>> -	 * Explicitly disabled through madvise or prctl, or some
>> -	 * architectures may disable THP for some mappings, for
>> -	 * example, s390 kvm.
>> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
>> +	 * advise to use them?
>>  	 */
>> -	return (vm_flags & VM_NOHUGEPAGE) ||
>> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		return false;
> 
> Hm is this correct? This means that VM_NOHUGEPAGE no longer results in THP being
> disabled here no?

Lorenzo, pointed to VM_NOHUGEPAGE case above.> 
>> +	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>>  }
>>
>>  static inline bool thp_disabled_by_hw(void)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ec273b06691..123fefaa4b98 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1743,19 +1743,16 @@ enum {
>>  #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
>>  #define MMF_VM_HUGEPAGE		17	/* set when mm is available for khugepaged */
>>
>> -/*
>> - * This one-shot flag is dropped due to necessity of changing exe once again
>> - * on NFS restore
>> - */
>> -//#define MMF_EXE_FILE_CHANGED	18	/* see prctl_set_mm_exe_file() */
>> +#define MMF_HUGE_ZERO_PAGE	18      /* mm has ever used the global huge zero page */
>>
>>  #define MMF_HAS_UPROBES		19	/* has uprobes */
>>  #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
>>  #define MMF_OOM_SKIP		21	/* mm is of no interest for the OOM killer */
>>  #define MMF_UNSTABLE		22	/* mm is unstable for copy_from_user */
>> -#define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
>> -#define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
>> -#define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>> +#define MMF_DISABLE_THP_EXCEPT_ADVISED	23	/* no THP except when advised (e.g., VM_HUGEPAGE) */
>> +#define MMF_DISABLE_THP_COMPLETELY	24	/* no THP for all VMAs */
> 
> It's almost a bit weird to have these as separate flags, since they're distinct
> (of course, I don't think there's necessarily another way).
> 
> Though this makes me think maybe in future we can have a new mode where both
> enabled == something else :P
> 
> But perhaps I've been infected with 'bit packing' disease.
> 
> Anyway as discussed in the THP meeting, I'm going to be (hopefully!) making
> the mm flags a bitmap soon so we'll get more flags available.
> 
>> +#define MMF_DISABLE_THP_MASK	((1 << MMF_DISABLE_THP_COMPLETELY) |\
>> +				 (1 << MMF_DISABLE_THP_EXCEPT_ADVISED))
>>  #define MMF_OOM_REAP_QUEUED	25	/* mm was queued for oom_reaper */
>>  #define MMF_MULTIPROCESS	26	/* mm is shared between processes */
>>  /*
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 43dec6eed559..9c1d6e49b8a9 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -177,7 +177,17 @@ struct prctl_mm_map {
>>
>>  #define PR_GET_TID_ADDRESS	40
>>
>> +/*
>> + * Flags for PR_SET_THP_DISABLE are only applicable when disabling. Bit 0
>> + * is reserved, so PR_GET_THP_DISABLE can return "1 | flags", to effectively
>> + * return "1" when no flags were specified for PR_SET_THP_DISABLE.
>> + */
>>  #define PR_SET_THP_DISABLE	41
>> +/*
>> + * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
>> + * VM_HUGEPAGE).
>> + */
>> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
> 
> NO space after # please.
> 

I think this is following the file convention, the space is there in other flags
all over this file. I dont like the space as well.

>>  #define PR_GET_THP_DISABLE	42
>>
>>  /*
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index b153fb345ada..b87d0acaab0b 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>>  	return sizeof(mm->saved_auxv);
>>  }
>>
>> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
>> +				 unsigned long arg4, unsigned long arg5)
>> +{
>> +	unsigned long *mm_flags = &current->mm->flags;
>> +
>> +	if (arg2 || arg3 || arg4 || arg5)
>> +		return -EINVAL;
>> +
> 
> Can we have a comment here about what we're doing below re: the return
> value?
> 

Do you mean add returning 1 for MMF_DISABLE_THP_COMPLETELY and 3 for MMF_DISABLE_THP_EXCEPT_ADVISED?

I will add it in next revision.

>> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags))
>> +		return 1;
>> +	else if (test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags))
>> +		return 1 | PR_THP_DISABLE_EXCEPT_ADVISED;
>> +	return 0;
>> +}
>> +
>> +static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>> +				 unsigned long arg4, unsigned long arg5)
>> +{
>> +	unsigned long *mm_flags = &current->mm->flags;
>> +
>> +	if (arg4 || arg5)
>> +		return -EINVAL;
>> +
>> +	/* Flags are only allowed when disabling. */
>> +	if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>> +		return -EINVAL;
>> +	if (mmap_write_lock_killable(current->mm))
>> +		return -EINTR;
>> +	if (thp_disable) {
>> +		if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
>> +			clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
>> +			set_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
>> +		} else {
>> +			set_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
>> +			clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
>> +		}
>> +	} else {
>> +		clear_bit(MMF_DISABLE_THP_COMPLETELY, mm_flags);
>> +		clear_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, mm_flags);
>> +	}
>> +	mmap_write_unlock(current->mm);
>> +	return 0;
>> +}
>> +
>>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  		unsigned long, arg4, unsigned long, arg5)
>>  {
>> @@ -2596,20 +2640,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  			return -EINVAL;
>>  		return task_no_new_privs(current) ? 1 : 0;
>>  	case PR_GET_THP_DISABLE:
>> -		if (arg2 || arg3 || arg4 || arg5)
>> -			return -EINVAL;
>> -		error = !!test_bit(MMF_DISABLE_THP, &me->mm->flags);
>> +		error = prctl_get_thp_disable(arg2, arg3, arg4, arg5);
>>  		break;
>>  	case PR_SET_THP_DISABLE:
>> -		if (arg3 || arg4 || arg5)
>> -			return -EINVAL;
>> -		if (mmap_write_lock_killable(me->mm))
>> -			return -EINTR;
>> -		if (arg2)
>> -			set_bit(MMF_DISABLE_THP, &me->mm->flags);
>> -		else
>> -			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>> -		mmap_write_unlock(me->mm);
>> +		error = prctl_set_thp_disable(arg2, arg3, arg4, arg5);
>>  		break;
>>  	case PR_MPX_ENABLE_MANAGEMENT:
>>  	case PR_MPX_DISABLE_MANAGEMENT:
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 1ff0c7dd2be4..2c9008246785 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -410,7 +410,7 @@ static inline int hpage_collapse_test_exit(struct mm_struct *mm)
>>  static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
>>  {
>>  	return hpage_collapse_test_exit(mm) ||
>> -	       test_bit(MMF_DISABLE_THP, &mm->flags);
>> +	       test_bit(MMF_DISABLE_THP_COMPLETELY, &mm->flags);
>>  }
>>
>>  static bool hugepage_pmd_enabled(void)
>> --
>> 2.47.3
>>
Re: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by Lorenzo Stoakes 2 months ago
Just a ping on the man page stuff - you will do that right? :>)

Probably best at end of 6.18 cycle when this is about to go to Linus.

I can give you some details on how that all works if it'd be helpful.

On Wed, Jul 30, 2025 at 08:42:04PM +0100, Usama Arif wrote:
> >> Acked-by: Usama Arif <usamaarif642@gmail.com>
> >> Tested-by: Usama Arif <usamaarif642@gmail.com>
[snip]
> >> Signed-off-by: David Hildenbrand <david@redhat.com>

All looks good aside from nits, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> >> +/*
> >> + * Check whether THPs are explicitly disabled for this VMA, for example,
> >> + * through madvise or prctl.
> >> + */
> >>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> >>  		vm_flags_t vm_flags)
> >>  {
> >> +	/* Are THPs disabled for this VMA? */
> >> +	if (vm_flags & VM_NOHUGEPAGE)
> >> +		return true;
>
> VM_NOHUGEPAGE will cause the THP being disabled here.
>
> >> +	/* Are THPs disabled for all VMAs in the whole process? */
> >> +	if (test_bit(MMF_DISABLE_THP_COMPLETELY, &vma->vm_mm->flags))
> >> +		return true;
> >>  	/*
> >> -	 * Explicitly disabled through madvise or prctl, or some
> >> -	 * architectures may disable THP for some mappings, for
> >> -	 * example, s390 kvm.
> >> +	 * Are THPs disabled only for VMAs where we didn't get an explicit
> >> +	 * advise to use them?
> >>  	 */
> >> -	return (vm_flags & VM_NOHUGEPAGE) ||
> >> -	       test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags);
> >> +	if (vm_flags & VM_HUGEPAGE)
> >> +		return false;
> >
> > Hm is this correct? This means that VM_NOHUGEPAGE no longer results in THP being
> > disabled here no?
>
> Lorenzo, pointed to VM_NOHUGEPAGE case above.>

Haha doh! OK cool, obviously I did this review a little too late in the day
when Mr. Brain was not on best form :P

All good, thanks!

And of course this now enforces the 'except advised' logic below it.

> >> +/*
> >> + * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> >> + * VM_HUGEPAGE).
> >> + */
> >> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
> >
> > NO space after # please.
> >
>
> I think this is following the file convention, the space is there in other flags
> all over this file. I dont like the space as well.

Yeah yuck. It's not a big deal, but ideally I'd prefer us to be sane even
if the rest of the header is less so here.

>
> >>  #define PR_GET_THP_DISABLE	42
> >>
> >>  /*
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index b153fb345ada..b87d0acaab0b 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
> >>  	return sizeof(mm->saved_auxv);
> >>  }
> >>
> >> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
> >> +				 unsigned long arg4, unsigned long arg5)
> >> +{
> >> +	unsigned long *mm_flags = &current->mm->flags;
> >> +
> >> +	if (arg2 || arg3 || arg4 || arg5)
> >> +		return -EINVAL;
> >> +
> >
> > Can we have a comment here about what we're doing below re: the return
> > value?
> >
>
> Do you mean add returning 1 for MMF_DISABLE_THP_COMPLETELY and 3 for MMF_DISABLE_THP_EXCEPT_ADVISED?

Well more so something about we return essentially flags indicating what is
enabled or not, if bit 0 is set then it's disabled, if bit 1 is set then
it's that with the exception of VM_HUGEPAGE VMAs.

>
> I will add it in next revision.

Thanks!
Re: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by David Hildenbrand 2 months ago
Thanks Lorenzo for the review, I'll leave handling all that to Usama 
from this point :)

On 31.07.25 10:29, Lorenzo Stoakes wrote:
> Just a ping on the man page stuff - you will do that right? :>)
> 

I'm hoping that Usama can take over that part. If not, I'll handle it 
(had planned it for once it's in mm-stable / going upstream).

[ ... ]

>>>> +/*
>>>> + * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
>>>> + * VM_HUGEPAGE).
>>>> + */
>>>> +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
>>>
>>> NO space after # please.
>>>
>>
>> I think this is following the file convention, the space is there in other flags
>> all over this file. I dont like the space as well.
> 
> Yeah yuck. It's not a big deal, but ideally I'd prefer us to be sane even
> if the rest of the header is less so here.

I'm afraid us doing something different here will not make prctl() any 
better as a whole, so let's keep it consistent in this questionable file.

> 
>>
>>>>   #define PR_GET_THP_DISABLE	42
>>>>
>>>>   /*
>>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>>> index b153fb345ada..b87d0acaab0b 100644
>>>> --- a/kernel/sys.c
>>>> +++ b/kernel/sys.c
>>>> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>>>>   	return sizeof(mm->saved_auxv);
>>>>   }
>>>>
>>>> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
>>>> +				 unsigned long arg4, unsigned long arg5)
>>>> +{
>>>> +	unsigned long *mm_flags = &current->mm->flags;
>>>> +
>>>> +	if (arg2 || arg3 || arg4 || arg5)
>>>> +		return -EINVAL;
>>>> +
>>>
>>> Can we have a comment here about what we're doing below re: the return
>>> value?
>>>
>>
>> Do you mean add returning 1 for MMF_DISABLE_THP_COMPLETELY and 3 for MMF_DISABLE_THP_EXCEPT_ADVISED?
> 
> Well more so something about we return essentially flags indicating what is
> enabled or not, if bit 0 is set then it's disabled, if bit 1 is set then
> it's that with the exception of VM_HUGEPAGE VMAs.

We have that documented above the defines for flags etc. Maybe simply here:

/* If disabled, we return "1 | flags", otherwise 0. */

-- 
Cheers,

David / dhildenb
Re: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by Usama Arif 2 months ago

On 31/07/2025 09:38, David Hildenbrand wrote:
> Thanks Lorenzo for the review, I'll leave handling all that to Usama from this point :)
> 
> On 31.07.25 10:29, Lorenzo Stoakes wrote:
>> Just a ping on the man page stuff - you will do that right? :>)
>>
> 
> I'm hoping that Usama can take over that part. If not, I'll handle it (had planned it for once it's in mm-stable / going upstream).


Yes, plan to take care of man page, systemd, etc once the patch makes it to mm-stable. 

> 
> [ ... ]
> 
>>>>> +/*
>>>>> + * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
>>>>> + * VM_HUGEPAGE).
>>>>> + */
>>>>> +# define PR_THP_DISABLE_EXCEPT_ADVISED    (1 << 1)
>>>>
>>>> NO space after # please.
>>>>
>>>
>>> I think this is following the file convention, the space is there in other flags
>>> all over this file. I dont like the space as well.
>>
>> Yeah yuck. It's not a big deal, but ideally I'd prefer us to be sane even
>> if the rest of the header is less so here.
> 
> I'm afraid us doing something different here will not make prctl() any better as a whole, so let's keep it consistent in this questionable file.
> 
>>
>>>
>>>>>   #define PR_GET_THP_DISABLE    42
>>>>>
>>>>>   /*
>>>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>>>> index b153fb345ada..b87d0acaab0b 100644
>>>>> --- a/kernel/sys.c
>>>>> +++ b/kernel/sys.c
>>>>> @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
>>>>>       return sizeof(mm->saved_auxv);
>>>>>   }
>>>>>
>>>>> +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
>>>>> +                 unsigned long arg4, unsigned long arg5)
>>>>> +{
>>>>> +    unsigned long *mm_flags = &current->mm->flags;
>>>>> +
>>>>> +    if (arg2 || arg3 || arg4 || arg5)
>>>>> +        return -EINVAL;
>>>>> +
>>>>
>>>> Can we have a comment here about what we're doing below re: the return
>>>> value?
>>>>
>>>
>>> Do you mean add returning 1 for MMF_DISABLE_THP_COMPLETELY and 3 for MMF_DISABLE_THP_EXCEPT_ADVISED?
>>
>> Well more so something about we return essentially flags indicating what is
>> enabled or not, if bit 0 is set then it's disabled, if bit 1 is set then
>> it's that with the exception of VM_HUGEPAGE VMAs.
> 
> We have that documented above the defines for flags etc. Maybe simply here:
> 
> /* If disabled, we return "1 | flags", otherwise 0. */
> 

Thanks, will add this to next revision.

Re: [PATCH 1/5] prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE
Posted by Lorenzo Stoakes 2 months ago
On Thu, Jul 31, 2025 at 10:38:49AM +0200, David Hildenbrand wrote:
> Thanks Lorenzo for the review, I'll leave handling all that to Usama from
> this point :)
>
> On 31.07.25 10:29, Lorenzo Stoakes wrote:
> > Just a ping on the man page stuff - you will do that right? :>)
> >
>
> I'm hoping that Usama can take over that part. If not, I'll handle it (had
> planned it for once it's in mm-stable / going upstream).
>
> [ ... ]

Thanks

>
> > > > > +/*
> > > > > + * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> > > > > + * VM_HUGEPAGE).
> > > > > + */
> > > > > +# define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
> > > >
> > > > NO space after # please.
> > > >
> > >
> > > I think this is following the file convention, the space is there in other flags
> > > all over this file. I dont like the space as well.
> >
> > Yeah yuck. It's not a big deal, but ideally I'd prefer us to be sane even
> > if the rest of the header is less so here.
>
> I'm afraid us doing something different here will not make prctl() any
> better as a whole, so let's keep it consistent in this questionable file.

Sure not a big deal.

>
> >
> > >
> > > > >   #define PR_GET_THP_DISABLE	42
> > > > >
> > > > >   /*
> > > > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > > > index b153fb345ada..b87d0acaab0b 100644
> > > > > --- a/kernel/sys.c
> > > > > +++ b/kernel/sys.c
> > > > > @@ -2423,6 +2423,50 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
> > > > >   	return sizeof(mm->saved_auxv);
> > > > >   }
> > > > >
> > > > > +static int prctl_get_thp_disable(unsigned long arg2, unsigned long arg3,
> > > > > +				 unsigned long arg4, unsigned long arg5)
> > > > > +{
> > > > > +	unsigned long *mm_flags = &current->mm->flags;
> > > > > +
> > > > > +	if (arg2 || arg3 || arg4 || arg5)
> > > > > +		return -EINVAL;
> > > > > +
> > > >
> > > > Can we have a comment here about what we're doing below re: the return
> > > > value?
> > > >
> > >
> > > Do you mean add returning 1 for MMF_DISABLE_THP_COMPLETELY and 3 for MMF_DISABLE_THP_EXCEPT_ADVISED?
> >
> > Well more so something about we return essentially flags indicating what is
> > enabled or not, if bit 0 is set then it's disabled, if bit 1 is set then
> > it's that with the exception of VM_HUGEPAGE VMAs.
>
> We have that documented above the defines for flags etc. Maybe simply here:
>
> /* If disabled, we return "1 | flags", otherwise 0. */

Yup I know it was documented, but nice to have here as I'd definitely read this
code and think 'huh?'.

What you suggest is fine!

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks, Lorenzo