[PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()

Breno Leitao posted 4 patches 3 weeks, 6 days ago
Posted by Breno Leitao 3 weeks, 6 days ago
Refactor enabled_store() to use a new change_enabled() helper.
Introduce a separate enum global_enabled_mode and
global_enabled_mode_strings[], mirroring the anon_enabled_mode
pattern from the previous commit.

A separate enum is necessary because the global THP setting does
not support "inherit", only "always", "madvise", and "never".
Reusing anon_enabled_mode would leave a NULL gap in the string
array, causing sysfs_match_string() to stop early and fail to
match entries after the gap.

The helper uses the same loop pattern as set_anon_enabled_mode(),
iterating over an array of flag bit positions and using
__test_and_set_bit()/__test_and_clear_bit() to track whether the state
actually changed.

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f6af90e6cf05d..8949393de7a51 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -330,30 +330,63 @@ static const char * const anon_enabled_mode_strings[] = {
 	[ANON_ENABLED_NEVER]	= "never",
 };
 
+enum global_enabled_mode {
+	GLOBAL_ENABLED_ALWAYS	= 0,
+	GLOBAL_ENABLED_MADVISE	= 1,
+	GLOBAL_ENABLED_NEVER	= 2,
+};
+
+static const char * const global_enabled_mode_strings[] = {
+	[GLOBAL_ENABLED_ALWAYS]		= "always",
+	[GLOBAL_ENABLED_MADVISE]	= "madvise",
+	[GLOBAL_ENABLED_NEVER]		= "never",
+};
+
+static bool set_global_enabled_mode(enum global_enabled_mode mode)
+{
+	static const unsigned long thp_flags[] = {
+		TRANSPARENT_HUGEPAGE_FLAG,
+		TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	};
+	enum global_enabled_mode m;
+	bool changed = false;
+
+	for (m = 0; m < ARRAY_SIZE(thp_flags); m++) {
+		if (m == mode)
+			changed |= !__test_and_set_bit(thp_flags[m],
+						       &transparent_hugepage_flags);
+		else
+			changed |= __test_and_clear_bit(thp_flags[m],
+							&transparent_hugepage_flags);
+	}
+
+	return changed;
+}
+
 static ssize_t enabled_store(struct kobject *kobj,
 			     struct kobj_attribute *attr,
 			     const char *buf, size_t count)
 {
-	ssize_t ret = count;
+	int mode;
 
-	if (sysfs_streq(buf, "always")) {
-		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-	} else if (sysfs_streq(buf, "madvise")) {
-		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (sysfs_streq(buf, "never")) {
-		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else
-		ret = -EINVAL;
+	mode = sysfs_match_string(global_enabled_mode_strings, buf);
+	if (mode < 0)
+		return -EINVAL;
 
-	if (ret > 0) {
+	if (set_global_enabled_mode(mode)) {
 		int err = start_stop_khugepaged();
+
 		if (err)
-			ret = err;
+			return err;
+	} else {
+		/*
+		 * Recalculate watermarks even when the mode didn't
+		 * change, as the previous code always called
+		 * start_stop_khugepaged() which does this internally.
+		 */
+		set_recommended_min_free_kbytes();
 	}
-	return ret;
+	return count;
 }
 
 static struct kobj_attribute enabled_attr = __ATTR_RW(enabled);

-- 
2.52.0
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Sohil Mehta 3 weeks, 3 days ago
Hello,

On 3/11/2026 3:17 AM, Breno Leitao wrote:

> +static bool set_global_enabled_mode(enum global_enabled_mode mode)
> +{
> +	static const unsigned long thp_flags[] = {
> +		TRANSPARENT_HUGEPAGE_FLAG,
> +		TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +	};
> +	enum global_enabled_mode m;
> +	bool changed = false;
> +
> +	for (m = 0; m < ARRAY_SIZE(thp_flags); m++) {
> +		if (m == mode)
> +			changed |= !__test_and_set_bit(thp_flags[m],
> +						       &transparent_hugepage_flags);
> +		else
> +			changed |= __test_and_clear_bit(thp_flags[m],
> +							&transparent_hugepage_flags);

I am not a mm expert and typically do not follow the mm list. Is there
an issue with the usage of non-atomic variants here? The commit message
says this uses the same pattern as set_anon_enabled_mode().

However, set_anon_enabled_mode() has a spinlock=>huge_anon_orders_lock
protecting the access. But, transparent_hugepage_flags seems to be
unprotected in that regard.

IIUC, David's suggestion to use the lockless version was also based on
the use of a lock in that context.
https://lore.kernel.org/all/4f2abf42-983f-4cc2-92f5-c81827e7b7e2@kernel.org/

Note: We ran into this issue while experimenting with an AI review agent
internally. The patch was selected at random from the mailing list. I
analyzed the issue independently and came to the same conclusion. I
apologize if this is a false alarm.


> +	}
> +
> +	return changed;
> +}
> +
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Breno Leitao 3 weeks, 1 day ago
Hello Sohil,

On Fri, Mar 13, 2026 at 03:31:25PM -0700, Sohil Mehta wrote:
> Hello,
> 
> On 3/11/2026 3:17 AM, Breno Leitao wrote:
> 
> > +static bool set_global_enabled_mode(enum global_enabled_mode mode)
> > +{
> > +	static const unsigned long thp_flags[] = {
> > +		TRANSPARENT_HUGEPAGE_FLAG,
> > +		TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> > +	};
> > +	enum global_enabled_mode m;
> > +	bool changed = false;
> > +
> > +	for (m = 0; m < ARRAY_SIZE(thp_flags); m++) {
> > +		if (m == mode)
> > +			changed |= !__test_and_set_bit(thp_flags[m],
> > +						       &transparent_hugepage_flags);
> > +		else
> > +			changed |= __test_and_clear_bit(thp_flags[m],
> > +							&transparent_hugepage_flags);
> 
> I am not a mm expert and typically do not follow the mm list. Is there
> an issue with the usage of non-atomic variants here? The commit message
> says this uses the same pattern as set_anon_enabled_mode().
> 
> However, set_anon_enabled_mode() has a spinlock=>huge_anon_orders_lock
> protecting the access. But, transparent_hugepage_flags seems to be
> unprotected in that regard.

I don't think atomic vs non-atomic helps much here, given this is a
compound operation. Whether or not the individual bit operations are
atomic, this is racy against anyone else changing these fields
(transparent_hugepage_flags). In other words, atomic ops make each
individual bit flip safe, but set_global_enabled_mode() and
defrag_store() need to flip multiple bits as a group. With atomic ops,
two concurrent writers can still interleave and leave the flags in an
invalid state.

That said, although I don't think this patch makes things worse, I
believe there is a race here that we can improve.

My suggestion is to move the remaining helpers (defrag_store()) to use
sysfs_match_string(), and then add a thp_flags_lock spinlock to
protect operations on transparent_hugepage_flags. Any concerns about
this approach?

Thanks for reviewing it,
--breno
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Sohil Mehta 3 weeks ago
On 3/16/2026 3:12 AM, Breno Leitao wrote:

>> I am not a mm expert and typically do not follow the mm list. Is there
>> an issue with the usage of non-atomic variants here? The commit message
>> says this uses the same pattern as set_anon_enabled_mode().
>>
>> However, set_anon_enabled_mode() has a spinlock=>huge_anon_orders_lock
>> protecting the access. But, transparent_hugepage_flags seems to be
>> unprotected in that regard.
> 
> I don't think that the atomic vs non-atomic will not help much, given
> this is a compoud operation. Independently if this is atomic or not, it
> is racy with anyone changing these fields (transparent_hugepage_flags).
> In other words, Atomic ops make each individual bit flip safe, but
> set_global_enabled_mode() and defrag_store() need to flip multiple bits
> as a group. With atomic ops, two concurrent writers can still interleave
> and leave the flags in an invalid state.

You are right, it is a compound operation. So there is an existing issue
where two concurrent writers can leave the flags in an invalid state.

But, I was wondering if there is a slightly different issue now due to
the non-atomic part. Some updates could be lost depending on the timing
of the operation.

For example,

CPU1 does:				CPU2 does:
set_global_enabled_mode("always")	defrag_store("always")

__test_and_set_bit():
// Trying to set bit 1

old = *p
// reads flags, sees defrag bit=0

					set_bit(DEFRAG_DIRECT_FLAG)
					// atomic: sets bit 3
					

*p = old | TRANSPARENT_HUGEPAGE_FLAG;
// writes back old value with bit 1 set
// DEFRAG_DIRECT_FLAG is lost (reverted to 0)

IIUC, this issue didn't exist before. I think it might be safer to use
the atomic test_and_set_bit() to be compatible with the older code.
Though, I'll leave it up to you as I don't have expertise here.

Overall, as you mentioned below, protecting transparent_hugepage_flags
with a spinlock seems like a better, long-term solution to me as well.

> 
> That said, Although I don't think this patch is making it worse, I think
> the is a racy issue here that we can make better. 
> 
> My suggestion is to move the rest of the helpers (defrag_store()) to use
> sysfs_match_string(), and then create a thp_flags_lock spinlock to
> protect operations against transparent_hugepage_flags. Any concern about
> this approach?
>
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Lorenzo Stoakes (Oracle) 3 weeks ago
On Mon, Mar 16, 2026 at 04:26:30PM -0700, Sohil Mehta wrote:
> On 3/16/2026 3:12 AM, Breno Leitao wrote:
>
> >> I am not a mm expert and typically do not follow the mm list. Is there
> >> an issue with the usage of non-atomic variants here? The commit message
> >> says this uses the same pattern as set_anon_enabled_mode().
> >>
> >> However, set_anon_enabled_mode() has a spinlock=>huge_anon_orders_lock
> >> protecting the access. But, transparent_hugepage_flags seems to be
> >> unprotected in that regard.
> >
> > I don't think that the atomic vs non-atomic will not help much, given
> > this is a compoud operation. Independently if this is atomic or not, it
> > is racy with anyone changing these fields (transparent_hugepage_flags).
> > In other words, Atomic ops make each individual bit flip safe, but
> > set_global_enabled_mode() and defrag_store() need to flip multiple bits
> > as a group. With atomic ops, two concurrent writers can still interleave
> > and leave the flags in an invalid state.
>
> You are right it is a compound operation. So, there is an existing issue
> with two concurrent writers which can leave the flags in an invalid state.
>
> But, I was wondering if there is a slightly different issue now due to
> the non-atomic part. Some updates could be lost depending on the timing
> of the operation.
>
> For example,
>
> CPU1 does:				CPU2 does:
> set_global_enabled_mode("always")	defrag_store("always")
>
> ___test_and_set_bit():
> // Trying to set bit 1
>
> old = *p
> // reads flags, sees defrag bit=0
>
> 					set_bit(DEFRAG_DIRECT_FLAG)
> 					// atomic: sets bit 3
>
>
> *p = old | TRANSPARENT_HUGEPAGE_FLAG;

> // writes back old value with bit 1 set
> // DEFRAG_DIRECT_FLAG is lost (reverted to 0)
>
> IIUC, this issue didn't exist before. I think it might be safer to use
> the atomic test_and_set_bit() to be compatible with the older code.
> Though, I'll leave it up to you as I don't have expertise here.

No, it's up to the maintainers :)

Given the above I think we should switch back to the atomic accessors.

We can address the broader issues with this horrible code in a separate
series.

>
> Overall, as you mentioned below, protecting transparent_hugepage_flags
> with a spinlock seems like a better, long-term solution to me as well.

Yeah, let's look at doing a follow up that cleans this up in general and
address that then.

>
> >
> > That said, Although I don't think this patch is making it worse, I think
> > the is a racy issue here that we can make better.
> >
> > My suggestion is to move the rest of the helpers (defrag_store()) to use
> > sysfs_match_string(), and then create a thp_flags_lock spinlock to
> > protect operations against transparent_hugepage_flags. Any concern about
> > this approach?
> >
>
>

Cheers, Lorenzo
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Breno Leitao 3 weeks ago
On Tue, Mar 17, 2026 at 09:37:39AM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Mon, Mar 16, 2026 at 04:26:30PM -0700, Sohil Mehta wrote:
> > On 3/16/2026 3:12 AM, Breno Leitao wrote:
> >
> > >> I am not a mm expert and typically do not follow the mm list. Is there
> > >> an issue with the usage of non-atomic variants here? The commit message
> > >> says this uses the same pattern as set_anon_enabled_mode().
> > >>
> > >> However, set_anon_enabled_mode() has a spinlock=>huge_anon_orders_lock
> > >> protecting the access. But, transparent_hugepage_flags seems to be
> > >> unprotected in that regard.
> > >
> > > I don't think that the atomic vs non-atomic will not help much, given
> > > this is a compoud operation. Independently if this is atomic or not, it
> > > is racy with anyone changing these fields (transparent_hugepage_flags).
> > > In other words, Atomic ops make each individual bit flip safe, but
> > > set_global_enabled_mode() and defrag_store() need to flip multiple bits
> > > as a group. With atomic ops, two concurrent writers can still interleave
> > > and leave the flags in an invalid state.
> >
> > You are right it is a compound operation. So, there is an existing issue
> > with two concurrent writers which can leave the flags in an invalid state.
> >
> > But, I was wondering if there is a slightly different issue now due to
> > the non-atomic part. Some updates could be lost depending on the timing
> > of the operation.
> >
> > For example,
> >
> > CPU1 does:				CPU2 does:
> > set_global_enabled_mode("always")	defrag_store("always")
> >
> > ___test_and_set_bit():
> > // Trying to set bit 1
> >
> > old = *p
> > // reads flags, sees defrag bit=0
> >
> > 					set_bit(DEFRAG_DIRECT_FLAG)
> > 					// atomic: sets bit 3
> >
> >
> > *p = old | TRANSPARENT_HUGEPAGE_FLAG;
> 
> > // writes back old value with bit 1 set
> > // DEFRAG_DIRECT_FLAG is lost (reverted to 0)
> >
> > IIUC, this issue didn't exist before. I think it might be safer to use
> > the atomic test_and_set_bit() to be compatible with the older code.
> > Though, I'll leave it up to you as I don't have expertise here.
> 
> No, it's up to the maintainers :)
> 
> Given the above I think we should switch back to the atomic accessors.
> 
> We can address the broader issues with this horrible code in a separate
> series.

Ack. Let me respin then.

> > Overall, as you mentioned below, protecting transparent_hugepage_flags
> > with a spinlock seems like a better, long-term solution to me as well.
> 
> Yeah, let's look at doing a follow up that cleans this up in general and
> address that then.

Sure. I am planning to improve defrag_store() as the next piece of work,
and then add the spinlock for transparent_hugepage_flags.
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Lorenzo Stoakes (Oracle) 3 weeks ago
On Tue, Mar 17, 2026 at 04:23:37AM -0700, Breno Leitao wrote:
> On Tue, Mar 17, 2026 at 09:37:39AM +0000, Lorenzo Stoakes (Oracle) wrote:
> >
> > Given the above I think we should switch back to the atomic accessors.
> >
> > We can address the broader issues with this horrible code in a separate
> > series.
>
> Ack. let me respin then.

Thanks!

>
> > > Overall, as you mentioned below, protecting transparent_hugepage_flags
> > > with a spinlock seems like a better, long-term solution to me as well.
> >
> > Yeah, let's look at doing a follow up that cleans this up in general and
> > address that then.
>
> Sure. I am planning to improve defrag_store() as the next work, and then
> come up with this additional spinlock for transparent_hugepage_flags.
>

To be clear, by the way: by 'horrible code' I meant the existing logic with
the globals etc., not your change, which is positive and welcome :)

Cheers, Lorenzo
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Breno Leitao 3 weeks ago
On Tue, Mar 17, 2026 at 11:25:23AM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 17, 2026 at 04:23:37AM -0700, Breno Leitao wrote:
> > On Tue, Mar 17, 2026 at 09:37:39AM +0000, Lorenzo Stoakes (Oracle) wrote:
> > >
> > > Given the above I think we should switch back to the atomic accessors.
> > >
> > > We can address the broader issues with this horrible code in a separate
> > > series.
> >
> > Ack. let me respin then.
> 
> Thanks!
> 
> >
> > > > Overall, as you mentioned below, protecting transparent_hugepage_flags
> > > > with a spinlock seems like a better, long-term solution to me as well.
> > >
> > > Yeah, let's look at doing a follow up that cleans this up in general and
> > > address that then.
> >
> > Sure. I am planning to improve defrag_store() as the next work, and then
> > come up with this additional spinlock for transparent_hugepage_flags.
> >
> 
> To be clear by the way by 'horrible code' I meant the existing logic with the
> globals etc. not your change which is positive and welcome :)

lol. Not once did it cross my mind that you might be referring to my
changes. I have never written horrible code in my life. :-P
Re: [PATCH v6 3/4] mm: huge_memory: refactor enabled_store() with change_enabled()
Posted by Lance Yang 3 weeks, 6 days ago

On 2026/3/11 18:17, Breno Leitao wrote:
> Refactor enabled_store() to use a new change_enabled() helper.
> Introduce a separate enum global_enabled_mode and
> global_enabled_mode_strings[], mirroring the anon_enabled_mode
> pattern from the previous commit.
> 
> A separate enum is necessary because the global THP setting does
> not support "inherit", only "always", "madvise", and "never".
> Reusing anon_enabled_mode would leave a NULL gap in the string
> array, causing sysfs_match_string() to stop early and fail to
> match entries after the gap.
> 
> The helper uses the same loop pattern as set_anon_enabled_mode(),
> iterating over an array of flag bit positions and using
> __test_and_set_bit()/__test_and_clear_bit() to track whether the state
> actually changed.
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---

Thanks!

Tested-by: Lance Yang <lance.yang@linux.dev>