collapse_file() requires FSes supporting large folio with at least
PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
huge option turned on also sets large folio order on mapping, so the check
also applies to shmem.
While at it, replace VM_BUG_ON with returning failure values.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/khugepaged.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d06d84219e1b..45b12ffb1550 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
int nr_none = 0;
bool is_shmem = shmem_file(file);
- VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
- VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
+ /* "huge" shmem sets mapping folio order and passes the check below */
+ if (mapping_max_folio_order(mapping) < PMD_ORDER)
+ return SCAN_FAIL;
+ if (start & (HPAGE_PMD_NR - 1))
+ return SCAN_ADDRESS_RANGE;
result = alloc_charge_folio(&new_folio, mm, cc);
if (result != SCAN_SUCCEED)
--
2.43.0
On 3/27/26 02:42, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> huge option turned on also sets large folio order on mapping, so the check
> also applies to shmem.
>
> While at it, replace VM_BUG_ON with returning failure values.

Why not VM_WARN_ON_ONCE() ?

These are conditions that must be checked earlier, no?

--
Cheers,

David
On 27 Mar 2026, at 9:37, David Hildenbrand (Arm) wrote:

> On 3/27/26 02:42, Zi Yan wrote:
>> collapse_file() requires FSes supporting large folio with at least
>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>> huge option turned on also sets large folio order on mapping, so the check
>> also applies to shmem.
>>
>> While at it, replace VM_BUG_ON with returning failure values.
>
> Why not VM_WARN_ON_ONCE() ?
>
> These are conditions that must be checked earlier, no?

start & (HPAGE_PMD_NR - 1) yes. I can convert it to VM_WARN_ON_ONCE().

For mapping_max_folio_order(mapping) < PMD_ORDER, I probably should move it
to collapse_scan_file() to prevent wasting scanning time if the file does
not support large folio. Then, I can turn it into a VM_WARN_ON_ONCE().

Best Regards,
Yan, Zi
On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> huge option turned on also sets large folio order on mapping, so the check
> also applies to shmem.
>
> While at it, replace VM_BUG_ON with returning failure values.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/khugepaged.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d06d84219e1b..45b12ffb1550 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> int nr_none = 0;
> bool is_shmem = shmem_file(file);
>
> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> + /* "huge" shmem sets mapping folio order and passes the check below */
I think this isn't quite clear and could be improved to e.g.:
/*
* Either anon shmem supports huge pages as set by shmem_enabled sysfs,
* or a shmem file system mounted with the "huge" option.
*/
> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> + return SCAN_FAIL;
As per rest of thread, this looks correct.
> + if (start & (HPAGE_PMD_NR - 1))
> + return SCAN_ADDRESS_RANGE;
Hmm, we're kinda making this - presumably buggy situation - into a valid input
that just fails the scan.
Maybe just make it a VM_WARN_ON_ONCE()? Or if we want to avoid propagating the
bug that'd cause it any further:
if (start & (HPAGE_PMD_NR - 1)) {
VM_WARN_ON_ONCE(true);
return SCAN_ADDRESS_RANGE;
}
Or similar.
>
> result = alloc_charge_folio(&new_folio, mm, cc);
> if (result != SCAN_SUCCEED)
> --
> 2.43.0
>
Cheers, Lorenzo
On 27 Mar 2026, at 8:07, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote:
>> collapse_file() requires FSes supporting large folio with at least
>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>> huge option turned on also sets large folio order on mapping, so the check
>> also applies to shmem.
>>
>> While at it, replace VM_BUG_ON with returning failure values.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
>
>
>> ---
>> mm/khugepaged.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d06d84219e1b..45b12ffb1550 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>> int nr_none = 0;
>> bool is_shmem = shmem_file(file);
>>
>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>> + /* "huge" shmem sets mapping folio order and passes the check below */
>
> I think this isn't quite clear and could be improved to e.g.:
>
> /*
> * Either anon shmem supports huge pages as set by shmem_enabled sysfs,
> * or a shmem file system mounted with the "huge" option.
> */
>
>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>> + return SCAN_FAIL;
>
> As per rest of thread, this looks correct.
Will respond to that thread.
>
>> + if (start & (HPAGE_PMD_NR - 1))
>> + return SCAN_ADDRESS_RANGE;
>
> Hmm, we're kinda making this - presumably buggy situation - into a valid input
> that just fails the scan.
>
> Maybe just make it a VM_WARN_ON_ONCE()? Or if we want to avoid propagating the
> bug that'd cause it any further:
>
> if (start & (HPAGE_PMD_NR - 1)) {
> VM_WARN_ON_ONCE(true);
> return SCAN_ADDRESS_RANGE;
> }
>
> Or similar.
As I responded to David, will change it to VM_WARN_ON_ONCE().
>
>>
>> result = alloc_charge_folio(&new_folio, mm, cc);
>> if (result != SCAN_SUCCEED)
>> --
>> 2.43.0
>>
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
On Fri, Mar 27, 2026 at 12:07:22PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > + return SCAN_FAIL;
>
> As per rest of thread, this looks correct.

Actually, no :)
On 3/27/26 9:42 AM, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> huge option turned on also sets large folio order on mapping, so the check
> also applies to shmem.
>
> While at it, replace VM_BUG_ON with returning failure values.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> mm/khugepaged.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d06d84219e1b..45b12ffb1550 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> int nr_none = 0;
> bool is_shmem = shmem_file(file);
>
> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> + /* "huge" shmem sets mapping folio order and passes the check below */
> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> + return SCAN_FAIL;
This is not true for anonymous shmem, since its large order allocation
logic is similar to anonymous memory. That means it will not call
mapping_set_large_folios() for anonymous shmem.
So I think the check should be:
if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
return SCAN_FAIL;
On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>
>
> On 3/27/26 9:42 AM, Zi Yan wrote:
> > collapse_file() requires FSes supporting large folio with at least
> > PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> > huge option turned on also sets large folio order on mapping, so the check
> > also applies to shmem.
> >
> > While at it, replace VM_BUG_ON with returning failure values.
> >
> > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > ---
> > mm/khugepaged.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d06d84219e1b..45b12ffb1550 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > int nr_none = 0;
> > bool is_shmem = shmem_file(file);
> > - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> > + /* "huge" shmem sets mapping folio order and passes the check below */
> > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > + return SCAN_FAIL;
>
> This is not true for anonymous shmem, since its large order allocation logic
> is similar to anonymous memory. That means it will not call
> mapping_set_large_folios() for anonymous shmem.
>
> So I think the check should be:
>
> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> return SCAN_FAIL;

Hmm but in shmem_init() we have:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
	else
		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */

	/*
	 * Default to setting PMD-sized THP to inherit the global setting and
	 * disable all other multi-size THPs.
	 */
	if (!shmem_orders_configured)
		huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
#endif

And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
shmem_enabled_store() updates that if necessary.

So we're still fine right?

__shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
__shmem_get_inode() which has:

	if (sbinfo->huge)
		mapping_set_large_folios(inode->i_mapping);

Shared for both anon shmem and tmpfs-style shmem.

So I think it's fine as-is.

Cheers, Lorenzo
On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>> collapse_file() requires FSes supporting large folio with at least
>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>> huge option turned on also sets large folio order on mapping, so the check
>>> also applies to shmem.
>>>
>>> While at it, replace VM_BUG_ON with returning failure values.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>> mm/khugepaged.c | 7 +++++--
>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d06d84219e1b..45b12ffb1550 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> int nr_none = 0;
>>> bool is_shmem = shmem_file(file);
>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>> + return SCAN_FAIL;
>>
>> This is not true for anonymous shmem, since its large order allocation logic
>> is similar to anonymous memory. That means it will not call
>> mapping_set_large_folios() for anonymous shmem.
>>
>> So I think the check should be:
>>
>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>> return SCAN_FAIL;
>
> Hmm but in shmem_init() we have:
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> else
> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>
> /*
> * Default to setting PMD-sized THP to inherit the global setting and
> * disable all other multi-size THPs.
> */
> if (!shmem_orders_configured)
> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> #endif
>
> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
> shmem_enabled_store() updates that if necessary.
>
> So we're still fine right?
>
> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
> __shmem_get_inode() which has:
>
> if (sbinfo->huge)
> mapping_set_large_folios(inode->i_mapping);
>
> Shared for both anon shmem and tmpfs-style shmem.
>
> So I think it's fine as-is.
I'm afraid not. Sorry, I should have been clearer.
First, anonymous shmem large order allocation is dynamically controlled
via the global interface
(/sys/kernel/mm/transparent_hugepage/shmem_enabled) and the mTHP
interfaces
(/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
This means that during anonymous shmem initialization, these interfaces
might be set to 'never'. so it will not call mapping_set_large_folios()
because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
Even if shmem large order allocation is subsequently enabled via the
interfaces, __shmem_file_setup -> mapping_set_large_folios() is not
called again.
Anonymous shmem behaves similarly to anonymous pages: it is controlled
by the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders()
to check for allowed large orders, rather than relying on
mapping_max_folio_order().
The mapping_max_folio_order() is intended to control large page
allocation only for tmpfs mounts. Therefore, I find the current code
confusing and think it needs to be fixed:
/* Don't consider 'deny' for emergencies and 'force' for testing */
if (sb != shm_mnt->mnt_sb && sbinfo->huge)
mapping_set_large_folios(inode->i_mapping);
On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>
>
> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/27/26 9:42 AM, Zi Yan wrote:
> > > > collapse_file() requires FSes supporting large folio with at least
> > > > PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> > > > huge option turned on also sets large folio order on mapping, so the check
> > > > also applies to shmem.
> > > >
> > > > While at it, replace VM_BUG_ON with returning failure values.
> > > >
> > > > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > > > ---
> > > > mm/khugepaged.c | 7 +++++--
> > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index d06d84219e1b..45b12ffb1550 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > int nr_none = 0;
> > > > bool is_shmem = shmem_file(file);
> > > > - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > > - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> > > > + /* "huge" shmem sets mapping folio order and passes the check below */
> > > > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > > > + return SCAN_FAIL;
> > >
> > > This is not true for anonymous shmem, since its large order allocation logic
> > > is similar to anonymous memory. That means it will not call
> > > mapping_set_large_folios() for anonymous shmem.
> > >
> > > So I think the check should be:
> > >
> > > if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> > > return SCAN_FAIL;
> >
> > Hmm but in shmem_init() we have:
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
> > SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> > else
> > shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> >
> > /*
> > * Default to setting PMD-sized THP to inherit the global setting and
> > * disable all other multi-size THPs.
> > */
> > if (!shmem_orders_configured)
> > huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> > #endif
> >
> > And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
> > shmem_enabled_store() updates that if necessary.
> >
> > So we're still fine right?
> >
> > __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
> > __shmem_get_inode() which has:
> >
> > if (sbinfo->huge)
> > mapping_set_large_folios(inode->i_mapping);
> >
> > Shared for both anon shmem and tmpfs-style shmem.
> >
> > So I think it's fine as-is.
>
> I'm afraid not. Sorry, I should have been clearer.
>
> First, anonymous shmem large order allocation is dynamically controlled via
> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
> the mTHP interfaces
> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>
> This means that during anonymous shmem initialization, these interfaces
> might be set to 'never'. so it will not call mapping_set_large_folios()
> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>
> Even if shmem large order allocation is subsequently enabled via the
> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
> again.

I see your point, oh this is all a bit of a mess...

It feels like entirely the wrong abstraction anyway, since at best you're
getting a global 'is enabled'.

I guess what happened before was we'd never call into this with ! r/o thp for fs
&& ! is_shmem.

But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
suggestion is correct I think actually.

I do hate:

	if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)

As a bit of code though. It's horrible.

Let's abstract that...

It'd be nice if we could find a way to clean things up in the lead up to changes
in series like this instead of sticking with the mess, but I guess since it
mostly removes stuff that's ok for now.

>
> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
> check for allowed large orders, rather than relying on
> mapping_max_folio_order().
>
> The mapping_max_folio_order() is intended to control large page allocation
> only for tmpfs mounts. Therefore, I find the current code confusing and
> think it needs to be fixed:
>
> /* Don't consider 'deny' for emergencies and 'force' for testing */
> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
> mapping_set_large_folios(inode->i_mapping);

Cheers, Lorenzo
On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 3/27/26 9:42 AM, Zi Yan wrote:
>>>>> collapse_file() requires FSes supporting large folio with at least
>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
>>>>> huge option turned on also sets large folio order on mapping, so the check
>>>>> also applies to shmem.
>>>>>
>>>>> While at it, replace VM_BUG_ON with returning failure values.
>>>>>
>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>>> ---
>>>>> mm/khugepaged.c | 7 +++++--
>>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index d06d84219e1b..45b12ffb1550 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>> int nr_none = 0;
>>>>> bool is_shmem = shmem_file(file);
>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */
>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER)
>>>>> + return SCAN_FAIL;
>>>>
>>>> This is not true for anonymous shmem, since its large order allocation logic
>>>> is similar to anonymous memory. That means it will not call
>>>> mapping_set_large_folios() for anonymous shmem.
>>>>
>>>> So I think the check should be:
>>>>
>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>>>> return SCAN_FAIL;
>>>
>>> Hmm but in shmem_init() we have:
>>>
>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>>> else
>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>>>
>>> /*
>>> * Default to setting PMD-sized THP to inherit the global setting and
>>> * disable all other multi-size THPs.
>>> */
>>> if (!shmem_orders_configured)
>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>>> #endif
>>>
>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
>>> shmem_enabled_store() updates that if necessary.
>>>
>>> So we're still fine right?
>>>
>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
>>> __shmem_get_inode() which has:
>>>
>>> if (sbinfo->huge)
>>> mapping_set_large_folios(inode->i_mapping);
>>>
>>> Shared for both anon shmem and tmpfs-style shmem.
>>>
>>> So I think it's fine as-is.
>>
>> I'm afraid not. Sorry, I should have been clearer.
>>
>> First, anonymous shmem large order allocation is dynamically controlled via
>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
>> the mTHP interfaces
>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
>>
>> This means that during anonymous shmem initialization, these interfaces
>> might be set to 'never'. so it will not call mapping_set_large_folios()
>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
>>
>> Even if shmem large order allocation is subsequently enabled via the
>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
>> again.
>
> I see your point, oh this is all a bit of a mess...
>
> It feels like entirely the wrong abstraction anyway, since at best you're
> getting a global 'is enabled'.
>
> I guess what happened before was we'd never call into this with ! r/o thp for fs
> && ! is_shmem.

Right.

> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
> suggestion is correct I think actually.
>
> I do hate:
>
> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
>
> As a bit of code though. It's horrible.

Indeed.

> Let's abstract that...
>
> It'd be nice if we could find a way to clean things up in the lead up to changes
> in series like this instead of sticking with the mess, but I guess since it
> mostly removes stuff that's ok for now.

I think this check can be removed from this patch.

During the khugepaged's scan, it will call thp_vma_allowable_order() to
check if the VMA is allowed to collapse into a PMD.

Specifically, within the call chain thp_vma_allowable_order() ->
__thp_vma_allowable_orders(), shmem is checked via
shmem_allowable_huge_orders(), while other FSes are checked via
file_thp_enabled().

For those other filesystems, Patch 5 has already added the following check,
which I think is sufficient to filter out those FSes that do not support
large folios:

	if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
		return false;

>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by
>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
>> check for allowed large orders, rather than relying on
>> mapping_max_folio_order().
>>
>> The mapping_max_folio_order() is intended to control large page allocation
>> only for tmpfs mounts. Therefore, I find the current code confusing and
>> think it needs to be fixed:
>>
>> /* Don't consider 'deny' for emergencies and 'force' for testing */
>> if (sb != shm_mnt->mnt_sb && sbinfo->huge)
>> mapping_set_large_folios(inode->i_mapping);
>
> Cheers, Lorenzo
On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote:
>
>
> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote:
> > > > On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote:
> > > > >
> > > > >
> > > > > On 3/27/26 9:42 AM, Zi Yan wrote:
> > > > > > collapse_file() requires FSes supporting large folio with at least
> > > > > > PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with
> > > > > > huge option turned on also sets large folio order on mapping, so the check
> > > > > > also applies to shmem.
> > > > > >
> > > > > > While at it, replace VM_BUG_ON with returning failure values.
> > > > > >
> > > > > > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > > > > > ---
> > > > > > mm/khugepaged.c | 7 +++++--
> > > > > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index d06d84219e1b..45b12ffb1550 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > > > > > int nr_none = 0;
> > > > > > bool is_shmem = shmem_file(file);
> > > > > > - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> > > > > > - VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> > > > > > + /* "huge" shmem sets mapping folio order and passes the check below */
> > > > > > + if (mapping_max_folio_order(mapping) < PMD_ORDER)
> > > > > > + return SCAN_FAIL;
> > > > >
> > > > > This is not true for anonymous shmem, since its large order allocation logic
> > > > > is similar to anonymous memory. That means it will not call
> > > > > mapping_set_large_folios() for anonymous shmem.
> > > > >
> > > > > So I think the check should be:
> > > > >
> > > > > if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> > > > > return SCAN_FAIL;
> > > >
> > > > Hmm but in shmem_init() we have:
> > > >
> > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
> > > > SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> > > > else
> > > > shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> > > >
> > > > /*
> > > > * Default to setting PMD-sized THP to inherit the global setting and
> > > > * disable all other multi-size THPs.
> > > > */
> > > > if (!shmem_orders_configured)
> > > > huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
> > > > #endif
> > > >
> > > > And shm_mnt->mnt_sb is the superblock used for anon shmem. Also
> > > > shmem_enabled_store() updates that if necessary.
> > > >
> > > > So we're still fine right?
> > > >
> > > > __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() ->
> > > > __shmem_get_inode() which has:
> > > >
> > > > if (sbinfo->huge)
> > > > mapping_set_large_folios(inode->i_mapping);
> > > >
> > > > Shared for both anon shmem and tmpfs-style shmem.
> > > >
> > > > So I think it's fine as-is.
> > >
> > > I'm afraid not. Sorry, I should have been clearer.
> > >
> > > First, anonymous shmem large order allocation is dynamically controlled via
> > > the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and
> > > the mTHP interfaces
> > > (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled).
> > >
> > > This means that during anonymous shmem initialization, these interfaces
> > > might be set to 'never'. so it will not call mapping_set_large_folios()
> > > because sbinfo->huge is 'SHMEM_HUGE_NEVER'.
> > >
> > > Even if shmem large order allocation is subsequently enabled via the
> > > interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called
> > > again.
> >
> > I see your point, oh this is all a bit of a mess...
> >
> > It feels like entirely the wrong abstraction anyway, since at best you're
> > getting a global 'is enabled'.
> >
> > I guess what happened before was we'd never call into this with ! r/o thp for fs
> > && ! is_shmem.
>
> Right.
>
> > But now we are allowing it, but should STILL be gating on !is_shmem so yeah your
> > suggestion is correct I think actually.
> >
> > I do hate:
> >
> > if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER)
> >
> > As a bit of code though. It's horrible.
>
> Indeed.
>
> > Let's abstract that...
> >
> > It'd be nice if we could find a way to clean things up in the lead up to changes
> > in series like this instead of sticking with the mess, but I guess since it
> > mostly removes stuff that's ok for now.
>
> I think this check can be removed from this patch.
>
> During the khugepaged's scan, it will call thp_vma_allowable_order() to
> check if the VMA is allowed to collapse into a PMD.
>
> Specifically, within the call chain thp_vma_allowable_order() ->
> __thp_vma_allowable_orders(), shmem is checked via
> shmem_allowable_huge_orders(), while other FSes are checked via
> file_thp_enabled().

It sucks not to have an assert. Maybe in that case make it a
VM_WARN_ON_ONCE().

I hate that you're left tracing things back like that...

>
> For those other filesystems, Patch 5 has already added the following check,
> which I think is sufficient to filter out those FSes that do not support
> large folios:
>
> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER)
> return false;

2 < 5, we won't tolerate bisection hazards.

>
> > > Anonymous shmem behaves similarly to anonymous pages: it is controlled by
> > > the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to
> > > check for allowed large orders, rather than relying on
> > > mapping_max_folio_order().
> > >
> > > The mapping_max_folio_order() is intended to control large page allocation
> > > only for tmpfs mounts. Therefore, I find the current code confusing and
> > > think it needs to be fixed:
> > >
> > > /* Don't consider 'deny' for emergencies and 'force' for testing */
> > > if (sb != shm_mnt->mnt_sb && sbinfo->huge)
> > > mapping_set_large_folios(inode->i_mapping);
> >
> > Cheers, Lorenzo

Cheers, Lorenzo
On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote: > On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote: >> >> >> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote: >>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote: >>>> >>>> >>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote: >>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote: >>>>>> >>>>>> >>>>>> On 3/27/26 9:42 AM, Zi Yan wrote: >>>>>>> collapse_file() requires FSes supporting large folio with at least >>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with >>>>>>> huge option turned on also sets large folio order on mapping, so the check >>>>>>> also applies to shmem. >>>>>>> >>>>>>> While at it, replace VM_BUG_ON with returning failure values. >>>>>>> >>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>>> --- >>>>>>> mm/khugepaged.c | 7 +++++-- >>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>>>>>> index d06d84219e1b..45b12ffb1550 100644 >>>>>>> --- a/mm/khugepaged.c >>>>>>> +++ b/mm/khugepaged.c >>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, >>>>>>> int nr_none = 0; >>>>>>> bool is_shmem = shmem_file(file); >>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); >>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); >>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */ >>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>> + return SCAN_FAIL; >>>>>> >>>>>> This is not true for anonymous shmem, since its large order allocation logic >>>>>> is similar to anonymous memory. That means it will not call >>>>>> mapping_set_large_folios() for anonymous shmem. 
>>>>>> >>>>>> So I think the check should be: >>>>>> >>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>> return SCAN_FAIL; >>>>> >>>>> Hmm but in shmem_init() we have: >>>>> >>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) >>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; >>>>> else >>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ >>>>> >>>>> /* >>>>> * Default to setting PMD-sized THP to inherit the global setting and >>>>> * disable all other multi-size THPs. >>>>> */ >>>>> if (!shmem_orders_configured) >>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); >>>>> #endif >>>>> >>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also >>>>> shmem_enabled_store() updates that if necessary. >>>>> >>>>> So we're still fine right? >>>>> >>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() -> >>>>> __shmem_get_inode() which has: >>>>> >>>>> if (sbinfo->huge) >>>>> mapping_set_large_folios(inode->i_mapping); >>>>> >>>>> Shared for both anon shmem and tmpfs-style shmem. >>>>> >>>>> So I think it's fine as-is. >>>> >>>> I'm afraid not. Sorry, I should have been clearer. >>>> >>>> First, anonymous shmem large order allocation is dynamically controlled via >>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and >>>> the mTHP interfaces >>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled). >>>> >>>> This means that during anonymous shmem initialization, these interfaces >>>> might be set to 'never'. so it will not call mapping_set_large_folios() >>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'. >>>> >>>> Even if shmem large order allocation is subsequently enabled via the >>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called >>>> again. >>> >>> I see your point, oh this is all a bit of a mess... 
>>> >>> It feels like entirely the wrong abstraction anyway, since at best you're >>> getting a global 'is enabled'. >>> >>> I guess what happened before was we'd never call into this with ! r/o thp for fs >>> && ! is_shmem. >> >> Right. >> >>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your >>> suggestion is correct I think actualyl. >>> >>> I do hate: >>> >>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>> >>> As a bit of code though. It's horrible. >> >> Indeed. >> >>> Let's abstract that... >>> >>> It'd be nice if we could find a way to clean things up in the lead up to changes >>> in series like this instead of sticking with the mess, but I guess since it >>> mostly removes stuff that's ok for now. >> >> I think this check can be removed from this patch. >> >> During the khugepaged's scan, it will call thp_vma_allowable_order() to >> check if the VMA is allowed to collapse into a PMD. >> >> Specifically, within the call chain thp_vma_allowable_order() -> >> __thp_vma_allowable_orders(), shmem is checked via >> shmem_allowable_huge_orders(), while other FSes are checked via >> file_thp_enabled(). But for the madvise(MADV_COLLAPSE) case, IIRC, it ignores shmem huge config and can perform collapse anyway. This means without !is_shmem the check will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since I was in that TVA_FORCED_COLLAPSE email thread but do not remember everything there. > > It sucks not to have an assert. Maybe in that case make it a > VM_WARN_ON_ONCE(). Will do that as I replied to David already. > > I hate that you're left tracing things back like that... > >> >> For those other filesystems, Patch 5 has already added the following check, >> which I think is sufficient to filter out those FSes that do not support >> large folios: >> >> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER) >> return false; > > 2 < 5, we won't tolerate bisection hazards. 
> >> >> >>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by >>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to >>>> check for allowed large orders, rather than relying on >>>> mapping_max_folio_order(). >>>> >>>> The mapping_max_folio_order() is intended to control large page allocation >>>> only for tmpfs mounts. Therefore, I find the current code confusing and >>>> think it needs to be fixed: >>>> >>>> /* Don't consider 'deny' for emergencies and 'force' for testing */ >>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge) >>>> mapping_set_large_folios(inode->i_mapping); >>> Hi Baolin, Do you want to send a fix for this? Also I wonder how I can distinguish between anonymous shmem code and tmpfs code. I thought they are the same thing except that they have different user interface, but it seems that I was wrong. Best Regards, Yan, Zi
On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote: >On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote: > >> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote: >>> >>> >>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote: >>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote: >>>>> >>>>> >>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote: >>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote: >>>>>>> >>>>>>> >>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote: >>>>>>>> collapse_file() requires FSes supporting large folio with at least >>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with >>>>>>>> huge option turned on also sets large folio order on mapping, so the check >>>>>>>> also applies to shmem. >>>>>>>> >>>>>>>> While at it, replace VM_BUG_ON with returning failure values. >>>>>>>> >>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>>>> --- >>>>>>>> mm/khugepaged.c | 7 +++++-- >>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>>>> >>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>>>>>>> index d06d84219e1b..45b12ffb1550 100644 >>>>>>>> --- a/mm/khugepaged.c >>>>>>>> +++ b/mm/khugepaged.c >>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, >>>>>>>> int nr_none = 0; >>>>>>>> bool is_shmem = shmem_file(file); >>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); >>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); >>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */ >>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>>> + return SCAN_FAIL; >>>>>>> >>>>>>> This is not true for anonymous shmem, since its large order allocation logic >>>>>>> is similar to anonymous memory. That means it will not call >>>>>>> mapping_set_large_folios() for anonymous shmem. 
>>>>>>> >>>>>>> So I think the check should be: >>>>>>> >>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>> return SCAN_FAIL; >>>>>> >>>>>> Hmm but in shmem_init() we have: >>>>>> >>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) >>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; >>>>>> else >>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ >>>>>> >>>>>> /* >>>>>> * Default to setting PMD-sized THP to inherit the global setting and >>>>>> * disable all other multi-size THPs. >>>>>> */ >>>>>> if (!shmem_orders_configured) >>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); >>>>>> #endif >>>>>> >>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also >>>>>> shmem_enabled_store() updates that if necessary. >>>>>> >>>>>> So we're still fine right? >>>>>> >>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() -> >>>>>> __shmem_get_inode() which has: >>>>>> >>>>>> if (sbinfo->huge) >>>>>> mapping_set_large_folios(inode->i_mapping); >>>>>> >>>>>> Shared for both anon shmem and tmpfs-style shmem. >>>>>> >>>>>> So I think it's fine as-is. >>>>> >>>>> I'm afraid not. Sorry, I should have been clearer. >>>>> >>>>> First, anonymous shmem large order allocation is dynamically controlled via >>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and >>>>> the mTHP interfaces >>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled). >>>>> >>>>> This means that during anonymous shmem initialization, these interfaces >>>>> might be set to 'never'. so it will not call mapping_set_large_folios() >>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'. >>>>> >>>>> Even if shmem large order allocation is subsequently enabled via the >>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called >>>>> again. >>>> >>>> I see your point, oh this is all a bit of a mess... 
>>>> >>>> It feels like entirely the wrong abstraction anyway, since at best you're >>>> getting a global 'is enabled'. >>>> >>>> I guess what happened before was we'd never call into this with ! r/o thp for fs >>>> && ! is_shmem. >>> >>> Right. >>> >>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your >>>> suggestion is correct I think actualyl. >>>> >>>> I do hate: >>>> >>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>> >>>> As a bit of code though. It's horrible. >>> >>> Indeed. >>> >>>> Let's abstract that... >>>> >>>> It'd be nice if we could find a way to clean things up in the lead up to changes >>>> in series like this instead of sticking with the mess, but I guess since it >>>> mostly removes stuff that's ok for now. >>> >>> I think this check can be removed from this patch. >>> >>> During the khugepaged's scan, it will call thp_vma_allowable_order() to >>> check if the VMA is allowed to collapse into a PMD. >>> >>> Specifically, within the call chain thp_vma_allowable_order() -> >>> __thp_vma_allowable_orders(), shmem is checked via >>> shmem_allowable_huge_orders(), while other FSes are checked via >>> file_thp_enabled(). > >But for madvise(MADV_COLLAPSE) case, IIRC, it ignores shmem huge config >and can perform collapse anyway. This means without !is_shmem the check >will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since Right. That will break MADV_COLLAPSE, IIUC. For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by whether the inode mapping got mapping_set_large_folios() at creation time. Using mmap(MAP_SHARED | MAP_ANONYMOUS): - create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never - collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always With the !is_shmem guard, collapse succeeds. Without it, the same setup fails with -EINVAL. 
Thanks, Lance >I was in that TVA_FORCED_COLLAPSE email thread but does not remember >everything there. > > >> >> It sucks not to have an assert. Maybe in that case make it a >> VM_WARN_ON_ONCE(). > >Will do that as I replied to David already. > >> >> I hate that you're left tracing things back like that... >> >>> >>> For those other filesystems, Patch 5 has already added the following check, >>> which I think is sufficient to filter out those FSes that do not support >>> large folios: >>> >>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER) >>> return false; >> >> 2 < 5, we won't tolerate bisection hazards. >> >>> >>> >>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by >>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to >>>>> check for allowed large orders, rather than relying on >>>>> mapping_max_folio_order(). >>>>> >>>>> The mapping_max_folio_order() is intended to control large page allocation >>>>> only for tmpfs mounts. Therefore, I find the current code confusing and >>>>> think it needs to be fixed: >>>>> >>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */ >>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge) >>>>> mapping_set_large_folios(inode->i_mapping); >>>> > >Hi Baolin, > >Do you want to send a fix for this? > >Also I wonder how I can distinguish between anonymous shmem code and tmpfs code. >I thought they are the same thing except that they have different user interface, >but it seems that I was wrong. > > >Best Regards, >Yan, Zi > >
On 27 Mar 2026, at 12:22, Lance Yang wrote: > On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote: >> On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote: >> >>> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote: >>>> >>>> >>>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote: >>>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote: >>>>>> >>>>>> >>>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote: >>>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote: >>>>>>>>> collapse_file() requires FSes supporting large folio with at least >>>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with >>>>>>>>> huge option turned on also sets large folio order on mapping, so the check >>>>>>>>> also applies to shmem. >>>>>>>>> >>>>>>>>> While at it, replace VM_BUG_ON with returning failure values. >>>>>>>>> >>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>>>>> --- >>>>>>>>> mm/khugepaged.c | 7 +++++-- >>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>>>>>>>> index d06d84219e1b..45b12ffb1550 100644 >>>>>>>>> --- a/mm/khugepaged.c >>>>>>>>> +++ b/mm/khugepaged.c >>>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, >>>>>>>>> int nr_none = 0; >>>>>>>>> bool is_shmem = shmem_file(file); >>>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); >>>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); >>>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */ >>>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>>>> + return SCAN_FAIL; >>>>>>>> >>>>>>>> This is not true for anonymous shmem, since its large order allocation logic >>>>>>>> is similar to anonymous memory. That means it will not call >>>>>>>> mapping_set_large_folios() for anonymous shmem. 
>>>>>>>> >>>>>>>> So I think the check should be: >>>>>>>> >>>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>>> return SCAN_FAIL; >>>>>>> >>>>>>> Hmm but in shmem_init() we have: >>>>>>> >>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) >>>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; >>>>>>> else >>>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ >>>>>>> >>>>>>> /* >>>>>>> * Default to setting PMD-sized THP to inherit the global setting and >>>>>>> * disable all other multi-size THPs. >>>>>>> */ >>>>>>> if (!shmem_orders_configured) >>>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); >>>>>>> #endif >>>>>>> >>>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also >>>>>>> shmem_enabled_store() updates that if necessary. >>>>>>> >>>>>>> So we're still fine right? >>>>>>> >>>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() -> >>>>>>> __shmem_get_inode() which has: >>>>>>> >>>>>>> if (sbinfo->huge) >>>>>>> mapping_set_large_folios(inode->i_mapping); >>>>>>> >>>>>>> Shared for both anon shmem and tmpfs-style shmem. >>>>>>> >>>>>>> So I think it's fine as-is. >>>>>> >>>>>> I'm afraid not. Sorry, I should have been clearer. >>>>>> >>>>>> First, anonymous shmem large order allocation is dynamically controlled via >>>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and >>>>>> the mTHP interfaces >>>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled). >>>>>> >>>>>> This means that during anonymous shmem initialization, these interfaces >>>>>> might be set to 'never'. so it will not call mapping_set_large_folios() >>>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'. >>>>>> >>>>>> Even if shmem large order allocation is subsequently enabled via the >>>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called >>>>>> again. 
>>>>> >>>>> I see your point, oh this is all a bit of a mess... >>>>> >>>>> It feels like entirely the wrong abstraction anyway, since at best you're >>>>> getting a global 'is enabled'. >>>>> >>>>> I guess what happened before was we'd never call into this with ! r/o thp for fs >>>>> && ! is_shmem. >>>> >>>> Right. >>>> >>>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your >>>>> suggestion is correct I think actualyl. >>>>> >>>>> I do hate: >>>>> >>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>>> >>>>> As a bit of code though. It's horrible. >>>> >>>> Indeed. >>>> >>>>> Let's abstract that... >>>>> >>>>> It'd be nice if we could find a way to clean things up in the lead up to changes >>>>> in series like this instead of sticking with the mess, but I guess since it >>>>> mostly removes stuff that's ok for now. >>>> >>>> I think this check can be removed from this patch. >>>> >>>> During the khugepaged's scan, it will call thp_vma_allowable_order() to >>>> check if the VMA is allowed to collapse into a PMD. >>>> >>>> Specifically, within the call chain thp_vma_allowable_order() -> >>>> __thp_vma_allowable_orders(), shmem is checked via >>>> shmem_allowable_huge_orders(), while other FSes are checked via >>>> file_thp_enabled(). >> >> But for madvise(MADV_COLLAPSE) case, IIRC, it ignores shmem huge config >> and can perform collapse anyway. This means without !is_shmem the check >> will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since > > Right. That will break MADV_COLLAPSE, IIUC. > > For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the > TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by > whether the inode mapping got mapping_set_large_folios() at creation > time. 
> > Using mmap(MAP_SHARED | MAP_ANONYMOUS): > - create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never > - collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always > > With the !is_shmem guard, collapse succeeds. Without it, the same setup > fails with -EINVAL. Thank you for the confirmation. I will fix it. > > Thanks, > Lance > >> I was in that TVA_FORCED_COLLAPSE email thread but does not remember >> everything there. >> >> >>> >>> It sucks not to have an assert. Maybe in that case make it a >>> VM_WARN_ON_ONCE(). >> >> Will do that as I replied to David already. >> >>> >>> I hate that you're left tracing things back like that... >>> >>>> >>>> For those other filesystems, Patch 5 has already added the following check, >>>> which I think is sufficient to filter out those FSes that do not support >>>> large folios: >>>> >>>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER) >>>> return false; >>> >>> 2 < 5, we won't tolerate bisection hazards. >>> >>>> >>>> >>>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by >>>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to >>>>>> check for allowed large orders, rather than relying on >>>>>> mapping_max_folio_order(). >>>>>> >>>>>> The mapping_max_folio_order() is intended to control large page allocation >>>>>> only for tmpfs mounts. Therefore, I find the current code confusing and >>>>>> think it needs to be fixed: >>>>>> >>>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */ >>>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge) >>>>>> mapping_set_large_folios(inode->i_mapping); >>>>> >> >> Hi Baolin, >> >> Do you want to send a fix for this? >> >> Also I wonder how I can distinguish between anonymous shmem code and tmpfs code. >> I thought they are the same thing except that they have different user interface, >> but it seems that I was wrong. >> >> >> Best Regards, >> Yan, Zi >> >> Best Regards, Yan, Zi
On 3/28/26 12:30 AM, Zi Yan wrote: > On 27 Mar 2026, at 12:22, Lance Yang wrote: > >> On Fri, Mar 27, 2026 at 11:00:26AM -0400, Zi Yan wrote: >>> On 27 Mar 2026, at 10:31, Lorenzo Stoakes (Oracle) wrote: >>> >>>> On Fri, Mar 27, 2026 at 10:26:53PM +0800, Baolin Wang wrote: >>>>> >>>>> >>>>> On 3/27/26 10:12 PM, Lorenzo Stoakes (Oracle) wrote: >>>>>> On Fri, Mar 27, 2026 at 09:45:03PM +0800, Baolin Wang wrote: >>>>>>> >>>>>>> >>>>>>> On 3/27/26 8:02 PM, Lorenzo Stoakes (Oracle) wrote: >>>>>>>> On Fri, Mar 27, 2026 at 05:44:49PM +0800, Baolin Wang wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 3/27/26 9:42 AM, Zi Yan wrote: >>>>>>>>>> collapse_file() requires FSes supporting large folio with at least >>>>>>>>>> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with >>>>>>>>>> huge option turned on also sets large folio order on mapping, so the check >>>>>>>>>> also applies to shmem. >>>>>>>>>> >>>>>>>>>> While at it, replace VM_BUG_ON with returning failure values. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>>>>>> --- >>>>>>>>>> mm/khugepaged.c | 7 +++++-- >>>>>>>>>> 1 file changed, 5 insertions(+), 2 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>>>>>>>>> index d06d84219e1b..45b12ffb1550 100644 >>>>>>>>>> --- a/mm/khugepaged.c >>>>>>>>>> +++ b/mm/khugepaged.c >>>>>>>>>> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, >>>>>>>>>> int nr_none = 0; >>>>>>>>>> bool is_shmem = shmem_file(file); >>>>>>>>>> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); >>>>>>>>>> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); >>>>>>>>>> + /* "huge" shmem sets mapping folio order and passes the check below */ >>>>>>>>>> + if (mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>>>>> + return SCAN_FAIL; >>>>>>>>> >>>>>>>>> This is not true for anonymous shmem, since its large order allocation logic >>>>>>>>> is similar to anonymous memory. 
That means it will not call >>>>>>>>> mapping_set_large_folios() for anonymous shmem. >>>>>>>>> >>>>>>>>> So I think the check should be: >>>>>>>>> >>>>>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>>>>> return SCAN_FAIL; >>>>>>>> >>>>>>>> Hmm but in shmem_init() we have: >>>>>>>> >>>>>>>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>>> if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) >>>>>>>> SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; >>>>>>>> else >>>>>>>> shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ >>>>>>>> >>>>>>>> /* >>>>>>>> * Default to setting PMD-sized THP to inherit the global setting and >>>>>>>> * disable all other multi-size THPs. >>>>>>>> */ >>>>>>>> if (!shmem_orders_configured) >>>>>>>> huge_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER); >>>>>>>> #endif >>>>>>>> >>>>>>>> And shm_mnt->mnt_sb is the superblock used for anon shmem. Also >>>>>>>> shmem_enabled_store() updates that if necessary. >>>>>>>> >>>>>>>> So we're still fine right? >>>>>>>> >>>>>>>> __shmem_file_setup() (used for anon shmem) calls shmem_get_inode() -> >>>>>>>> __shmem_get_inode() which has: >>>>>>>> >>>>>>>> if (sbinfo->huge) >>>>>>>> mapping_set_large_folios(inode->i_mapping); >>>>>>>> >>>>>>>> Shared for both anon shmem and tmpfs-style shmem. >>>>>>>> >>>>>>>> So I think it's fine as-is. >>>>>>> >>>>>>> I'm afraid not. Sorry, I should have been clearer. >>>>>>> >>>>>>> First, anonymous shmem large order allocation is dynamically controlled via >>>>>>> the global interface (/sys/kernel/mm/transparent_hugepage/shmem_enabled) and >>>>>>> the mTHP interfaces >>>>>>> (/sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled). >>>>>>> >>>>>>> This means that during anonymous shmem initialization, these interfaces >>>>>>> might be set to 'never'. so it will not call mapping_set_large_folios() >>>>>>> because sbinfo->huge is 'SHMEM_HUGE_NEVER'. 
>>>>>>> >>>>>>> Even if shmem large order allocation is subsequently enabled via the >>>>>>> interfaces, __shmem_file_setup -> mapping_set_large_folios() is not called >>>>>>> again. >>>>>> >>>>>> I see your point, oh this is all a bit of a mess... >>>>>> >>>>>> It feels like entirely the wrong abstraction anyway, since at best you're >>>>>> getting a global 'is enabled'. >>>>>> >>>>>> I guess what happened before was we'd never call into this with ! r/o thp for fs >>>>>> && ! is_shmem. >>>>> >>>>> Right. >>>>> >>>>>> But now we are allowing it, but should STILL be gating on !is_shmem so yeah your >>>>>> suggestion is correct I think actualyl. >>>>>> >>>>>> I do hate: >>>>>> >>>>>> if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) >>>>>> >>>>>> As a bit of code though. It's horrible. >>>>> >>>>> Indeed. >>>>> >>>>>> Let's abstract that... >>>>>> >>>>>> It'd be nice if we could find a way to clean things up in the lead up to changes >>>>>> in series like this instead of sticking with the mess, but I guess since it >>>>>> mostly removes stuff that's ok for now. >>>>> >>>>> I think this check can be removed from this patch. >>>>> >>>>> During the khugepaged's scan, it will call thp_vma_allowable_order() to >>>>> check if the VMA is allowed to collapse into a PMD. >>>>> >>>>> Specifically, within the call chain thp_vma_allowable_order() -> >>>>> __thp_vma_allowable_orders(), shmem is checked via >>>>> shmem_allowable_huge_orders(), while other FSes are checked via >>>>> file_thp_enabled(). >>> >>> But for madvise(MADV_COLLAPSE) case, IIRC, it ignores shmem huge config >>> and can perform collapse anyway. This means without !is_shmem the check >>> will break madvise(MADV_COLLAPSE). Let me know if I get it wrong, since >> >> Right. That will break MADV_COLLAPSE, IIUC. 
>> >> For MADV_COLLAPSE on anonymous shmem, eligibility is determined by the >> TVA_FORCED_COLLAPSE path via shmem_allowable_huge_orders(), not by >> whether the inode mapping got mapping_set_large_folios() at creation >> time. >> >> Using mmap(MAP_SHARED | MAP_ANONYMOUS): >> - create time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=never >> - collapse time: shmem_enabled=never, hugepages-2048kB/shmem_enabled=always >> >> With the !is_shmem guard, collapse succeeds. Without it, the same setup >> fails with -EINVAL. Right. So my suggestion is that the check should be: if (!is_shmem && mapping_max_folio_order(mapping) < PMD_ORDER) or just keep a single VM_WARN_ONCE() here, because I hope the thp_vma_allowable_order() will filter out those FSes that do not support large folios. >>> I was in that TVA_FORCED_COLLAPSE email thread but does not remember >>> everything there. >>> >>> >>>> >>>> It sucks not to have an assert. Maybe in that case make it a >>>> VM_WARN_ON_ONCE(). >>> >>> Will do that as I replied to David already. >>> >>>> >>>> I hate that you're left tracing things back like that... >>>> >>>>> >>>>> For those other filesystems, Patch 5 has already added the following check, >>>>> which I think is sufficient to filter out those FSes that do not support >>>>> large folios: >>>>> >>>>> if (mapping_max_folio_order(inode->i_mapping) < PMD_ORDER) >>>>> return false; >>>> >>>> 2 < 5, we won't tolerate bisection hazards. >>>> >>>>> >>>>> >>>>>>> Anonymous shmem behaves similarly to anonymous pages: it is controlled by >>>>>>> the 'shmem_enabled' interfaces and uses shmem_allowable_huge_orders() to >>>>>>> check for allowed large orders, rather than relying on >>>>>>> mapping_max_folio_order(). >>>>>>> >>>>>>> The mapping_max_folio_order() is intended to control large page allocation >>>>>>> only for tmpfs mounts. 
Therefore, I find the current code confusing and >>>>>>> think it needs to be fixed: >>>>>>> >>>>>>> /* Don't consider 'deny' for emergencies and 'force' for testing */ >>>>>>> if (sb != shm_mnt->mnt_sb && sbinfo->huge) >>>>>>> mapping_set_large_folios(inode->i_mapping); >>>>>> >>> >>> Hi Baolin, >>> >>> Do you want to send a fix for this? >>> >>> Also I wonder how I can distinguish between anonymous shmem code and tmpfs code. >>> I thought they are the same thing except that they have different user interface, >>> but it seems that I was wrong. Sure. I can send a patch to make the code clear.
On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote: >collapse_file() requires FSes supporting large folio with at least >PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with >huge option turned on also sets large folio order on mapping, so the check >also applies to shmem. > >While at it, replace VM_BUG_ON with returning failure values. > >Signed-off-by: Zi Yan <ziy@nvidia.com> >--- > mm/khugepaged.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > >diff --git a/mm/khugepaged.c b/mm/khugepaged.c >index d06d84219e1b..45b12ffb1550 100644 >--- a/mm/khugepaged.c >+++ b/mm/khugepaged.c >@@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, > int nr_none = 0; > bool is_shmem = shmem_file(file); > >- VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); >- VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); >+ /* "huge" shmem sets mapping folio order and passes the check below */ >+ if (mapping_max_folio_order(mapping) < PMD_ORDER) >+ return SCAN_FAIL; Yep, for shmem inodes, if the mount has huge= enabled, inode creation marks the mapping are large-folio capable: /* Don't consider 'deny' for emergencies and 'force' for testing */ if (sbinfo->huge) mapping_set_large_folios(inode->i_mapping); LGTM! Reviewed-by: Lance Yang <lance.yang@linux.dev>
On 2026/3/27 15:29, Lance Yang wrote: > > On Thu, Mar 26, 2026 at 09:42:47PM -0400, Zi Yan wrote: >> collapse_file() requires FSes supporting large folio with at least >> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. shmem with >> huge option turned on also sets large folio order on mapping, so the check >> also applies to shmem. >> >> While at it, replace VM_BUG_ON with returning failure values. >> >> Signed-off-by: Zi Yan <ziy@nvidia.com> >> --- >> mm/khugepaged.c | 7 +++++-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index d06d84219e1b..45b12ffb1550 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1899,8 +1899,11 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, >> int nr_none = 0; >> bool is_shmem = shmem_file(file); >> >> - VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem); >> - VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); >> + /* "huge" shmem sets mapping folio order and passes the check below */ >> + if (mapping_max_folio_order(mapping) < PMD_ORDER) >> + return SCAN_FAIL; > > Yep, for shmem inodes, if the mount has huge= enabled, inode creation > marks the mapping are large-folio capable: Oops, s/are/as/ > > /* Don't consider 'deny' for emergencies and 'force' for testing */ > if (sbinfo->huge) > mapping_set_large_folios(inode->i_mapping); > > LGTM! > > Reviewed-by: Lance Yang <lance.yang@linux.dev>