mm/huge_memory.c | 3 +++ 1 file changed, 3 insertions(+)
file_thp_enabled() incorrectly returns true for guest_memfd and secretmem
inodes because they appear as regular read-only files when
CONFIG_READ_ONLY_THP_FOR_FS is enabled. This allows khugepaged and
MADV_COLLAPSE to create large folios in the page cache, but their fault
handlers do not support large folios.
Add explicit checks for GUEST_MEMFD_MAGIC and SECRETMEM_MAGIC to reject
these filesystems early in file_thp_enabled().
Reported-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44
Tested-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
---
mm/huge_memory.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21..4f57c78b57dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -93,6 +93,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
return false;
inode = file_inode(vma->vm_file);
+ if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC ||
+ inode->i_sb->s_magic == SECRETMEM_MAGIC)
+ return false;
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
--
2.43.0
Hi Deepanshu,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Deepanshu-Kartikey/mm-thp-Deny-THP-for-guest_memfd-and-secretmem-in-file_thp_enabled/20260209-113800
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260209033558.22943-1-kartikey406%40gmail.com
patch subject: [PATCH] mm: thp: Deny THP for guest_memfd and secretmem in file_thp_enabled()
config: loongarch-defconfig (https://download.01.org/0day-ci/archive/20260211/202602110124.Y72YFz1K-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260211/202602110124.Y72YFz1K-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602110124.Y72YFz1K-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/huge_memory.c:96:30: error: use of undeclared identifier 'GUEST_MEMFD_MAGIC'
96 | if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC ||
| ^
>> mm/huge_memory.c:97:30: error: use of undeclared identifier 'SECRETMEM_MAGIC'
97 | inode->i_sb->s_magic == SECRETMEM_MAGIC)
| ^
2 errors generated.
vim +/GUEST_MEMFD_MAGIC +96 mm/huge_memory.c
84
85 static inline bool file_thp_enabled(struct vm_area_struct *vma)
86 {
87 struct inode *inode;
88
89 if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
90 return false;
91
92 if (!vma->vm_file)
93 return false;
94
95 inode = file_inode(vma->vm_file);
> 96 if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC ||
> 97 inode->i_sb->s_magic == SECRETMEM_MAGIC)
98 return false;
99
100 return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
101 }
102
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Deepanshu,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Deepanshu-Kartikey/mm-thp-Deny-THP-for-guest_memfd-and-secretmem-in-file_thp_enabled/20260209-113800
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20260209033558.22943-1-kartikey406%40gmail.com
patch subject: [PATCH] mm: thp: Deny THP for guest_memfd and secretmem in file_thp_enabled()
config: arc-randconfig-001-20260210 (https://download.01.org/0day-ci/archive/20260210/202602100727.b1U4CHAA-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 14.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260210/202602100727.b1U4CHAA-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602100727.b1U4CHAA-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/huge_memory.c: In function 'file_thp_enabled':
>> mm/huge_memory.c:96:37: error: 'GUEST_MEMFD_MAGIC' undeclared (first use in this function)
96 | if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC ||
| ^~~~~~~~~~~~~~~~~
mm/huge_memory.c:96:37: note: each undeclared identifier is reported only once for each function it appears in
>> mm/huge_memory.c:97:37: error: 'SECRETMEM_MAGIC' undeclared (first use in this function)
97 | inode->i_sb->s_magic == SECRETMEM_MAGIC)
| ^~~~~~~~~~~~~~~
vim +/GUEST_MEMFD_MAGIC +96 mm/huge_memory.c
84
85 static inline bool file_thp_enabled(struct vm_area_struct *vma)
86 {
87 struct inode *inode;
88
89 if (!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
90 return false;
91
92 if (!vma->vm_file)
93 return false;
94
95 inode = file_inode(vma->vm_file);
> 96 if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC ||
> 97 inode->i_sb->s_magic == SECRETMEM_MAGIC)
98 return false;
99
100 return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
101 }
102
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 2/9/26 04:35, Deepanshu Kartikey wrote: > file_thp_enabled() incorrectly returns true for guest_memfd and secretmem > inodes because they appear as regular read-only files when > CONFIG_READ_ONLY_THP_FOR_FS is enabled. This allows khugepaged and > MADV_COLLAPSE to create large folios in the page cache, but their fault > handlers do not support large folios. > > Add explicit checks for GUEST_MEMFD_MAGIC and SECRETMEM_MAGIC to reject > these filesystems early in file_thp_enabled(). > > Reported-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com > Closes: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 > Tested-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com > Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> So we were able to reproduce this with secretmem, right? We want to add "Fixes:" for the introducing commits, which would be he commits that enable secretmem and mapping of guest_memfd pages to user space. Can you identify them? And also Cc: stable@vger.kernel.org > --- > mm/huge_memory.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 40cf59301c21..4f57c78b57dd 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -93,6 +93,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) > return false; > > inode = file_inode(vma->vm_file); > + if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC || > + inode->i_sb->s_magic == SECRETMEM_MAGIC) > + return false; That's nasty. We want some way to identify that through the mapping. Unfortunately CONFIG_READ_ONLY_THP_FOR_FS ignores any mapping_set_large_folios() configs by design. And CONFIG_READ_ONLY_THP_FOR_FS might go away soon, but we need a fix until then. While we can identify secretmem through vma_is_secretmem(), we can't do the same for guest_memfd as it's built as a module. Unfortunately AS_NO_DIRECT_MAP[1] won't work. Maybe introduce a AS_NO_READ_ONLY_THP_FOR_FS, which we can just easily rip out along with CONFIG_READ_ONLY_THP_FOR_FS later? [1] https://lore.kernel.org/r/20260126164445.11867-6-kalyazin@amazon.com -- Cheers, David
On 2/9/26 11:24, David Hildenbrand (Arm) wrote: > On 2/9/26 04:35, Deepanshu Kartikey wrote: >> file_thp_enabled() incorrectly returns true for guest_memfd and secretmem >> inodes because they appear as regular read-only files when >> CONFIG_READ_ONLY_THP_FOR_FS is enabled. This allows khugepaged and >> MADV_COLLAPSE to create large folios in the page cache, but their fault >> handlers do not support large folios. >> >> Add explicit checks for GUEST_MEMFD_MAGIC and SECRETMEM_MAGIC to reject >> these filesystems early in file_thp_enabled(). >> >> Reported-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com >> Closes: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 >> Tested-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com >> Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> > > So we were able to reproduce this with secretmem, right? > > We want to add "Fixes:" for the introducing commits, which would be he > commits that enable secretmem and mapping of guest_memfd pages to user > space. Can you identify them? > > And also > > Cc: stable@vger.kernel.org > >> --- >> mm/huge_memory.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 40cf59301c21..4f57c78b57dd 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -93,6 +93,9 @@ static inline bool file_thp_enabled(struct >> vm_area_struct *vma) >> return false; >> inode = file_inode(vma->vm_file); >> + if (inode->i_sb->s_magic == GUEST_MEMFD_MAGIC || >> + inode->i_sb->s_magic == SECRETMEM_MAGIC) >> + return false; > > That's nasty. We want some way to identify that through the mapping. > > Unfortunately CONFIG_READ_ONLY_THP_FOR_FS ignores any > mapping_set_large_folios() configs by design. > > And CONFIG_READ_ONLY_THP_FOR_FS might go away soon, but we need a fix > until then. > > While we can identify secretmem through vma_is_secretmem(), we can't do > the same for guest_memfd as it's built as a module. > > Unfortunately AS_NO_DIRECT_MAP[1] won't work. > > Maybe introduce a AS_NO_READ_ONLY_THP_FOR_FS, which we can just easily > rip out along with CONFIG_READ_ONLY_THP_FOR_FS later? On second thought, why do we pass the !inode_is_open_for_write(inode) in file_thp_enabled()? Isn't that the main problem for these memfd things? Maybe a get_write_access() is missing somewhere? -- Cheers, David
On Mon, Feb 9, 2026 at 4:12 PM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> > Maybe introduce a AS_NO_READ_ONLY_THP_FOR_FS, which we can just easily
> > rip out along with CONFIG_READ_ONLY_THP_FOR_FS later?
>
> On second thought, why do we pass the
>
> !inode_is_open_for_write(inode)
>
> in file_thp_enabled()?
>
> Isn't that the main problem for these memfd things?
>
> Maybe a get_write_access() is missing somewhere?
>
Hi David,
Thanks for the suggestion. I looked into the get_write_access() path.
Both guest_memfd and secretmem use alloc_file_pseudo() which skips
calling get_write_access(), so i_writecount stays 0. That's why
file_thp_enabled() sees them as read-only files.
We could add get_write_access() after alloc_file_pseudo() in both, but
I think that would be a hack rather than a proper fix:
- i_writecount has a specific semantic: tracking how many fds have the
file open for writing. We'd be bumping it just to influence
file_thp_enabled() behavior.
- It doesn't express the actual intent. The real issue is that
CONFIG_READ_ONLY_THP_FOR_FS was never meant for pseudo-filesystem
backed files.
I think the AS_NO_READ_ONLY_THP_FOR_FS flag you suggested earlier is
the cleaner approach. It is explicit, has no side effects, and is easy
to rip out when CONFIG_READ_ONLY_THP_FOR_FS goes away.
Here is the diff:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f886..23f559fc1a4c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -211,6 +211,7 @@ enum mapping_flags {
AS_KERNEL_FILE = 10, /* mapping for a fake kernel file that shouldn't
account usage to user cgroups */
AS_NO_DATA_INTEGRITY = 11, /* no data integrity guarantees */
+ AS_NO_READ_ONLY_THP_FOR_FS = 12,
/* Bits 16-25 are used for FOLIO_ORDER */
AS_FOLIO_ORDER_BITS = 5,
AS_FOLIO_ORDER_MIN = 16,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21..4bdda92ce01e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -94,6 +94,9 @@ static inline bool file_thp_enabled(struct
vm_area_struct *vma)
inode = file_inode(vma->vm_file);
+ if (test_bit(AS_NO_READ_ONLY_THP_FOR_FS, &inode->i_mapping->flags))
+ return false;
+
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
diff --git a/mm/secretmem.c b/mm/secretmem.c
index edf111e0a1bb..56d93a74f5fc 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -205,7 +205,8 @@ static struct file *secretmem_file_create(unsigned
long flags)
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
mapping_set_unevictable(inode->i_mapping);
+ set_bit(AS_NO_READ_ONLY_THP_FOR_FS, &inode->i_mapping->flags);
inode->i_op = &secretmem_iops;
inode->i_mapping->a_ops = &secretmem_aops;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fdaea3422c30..b93a324c81bd 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -597,6 +597,7 @@ static int __kvm_gmem_create(struct kvm *kvm,
loff_t size, u64 flags)
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
mapping_set_inaccessible(inode->i_mapping);
+ set_bit(AS_NO_READ_ONLY_THP_FOR_FS, &inode->i_mapping->flags);
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
Please let me know if this looks good and I will send a formal v2.
Thanks,
Deepanshu
Deepanshu Kartikey <kartikey406@gmail.com> writes: > On Mon, Feb 9, 2026 at 4:12 PM David Hildenbrand (Arm) <david@kernel.org> wrote: >> >> > Maybe introduce a AS_NO_READ_ONLY_THP_FOR_FS, which we can just easily >> > rip out along with CONFIG_READ_ONLY_THP_FOR_FS later? >> >> On second thought, why do we pass the >> >> !inode_is_open_for_write(inode) >> >> in file_thp_enabled()? >> >> Isn't that the main problem for these memfd things? >> >> Maybe a get_write_access() is missing somewhere? >> > > Hi David, > > Thanks for the suggestion. I looked into the get_write_access() path. > > Both guest_memfd and secretmem use alloc_file_pseudo() which skips > calling get_write_access(), so i_writecount stays 0. That's why > file_thp_enabled() sees them as read-only files. > > We could add get_write_access() after alloc_file_pseudo() in both, but > I think that would be a hack rather than a proper fix: > > - i_writecount has a specific semantic: tracking how many fds have the > file open for writing. We'd be bumping it just to influence > file_thp_enabled() behavior. > I agree re-using i_writecount feels odd since it is abusing the idea of being written to. I might have misunderstood the full context of i_writecount though. > - It doesn't express the actual intent. The real issue is that > CONFIG_READ_ONLY_THP_FOR_FS was never meant for pseudo-filesystem > backed files. > > I think the AS_NO_READ_ONLY_THP_FOR_FS flag you suggested earlier is > the cleaner approach. It is explicit, has no side effects, and is easy > to rip out when CONFIG_READ_ONLY_THP_FOR_FS goes away. > I was considering other address space flags and I think the best might be to make khugepaged respect AS_FOLIO_ORDER_MAX and have somewhere in __vma_thp_allowable_orders() check the maximum allowed order for the address space. khugepaged is about consolidating memory to huge pages, so if the address space doesn't allow a larger folio order, then khugepaged should not operate on that memory. The other options are + AS_UNEVICTABLE: Sounds like khugepaged should respect AS_UNEVICTABLE, but IIUC evictability is more closely related to swapping and khugepaged might operate on swappable memory? Both guest_memfd and secretmem set AS_UNEVICTABLE. + AS_INACCESSIBLE: This is only used by guest_memfd, and is mostly used to block migration. khugepaged kind of migrates the memory contents too, but someday we want guest_memfd to support migration, and at that time we would still want to block khugepaged, so I don't think we want to reuse a flag that couples khugepaged to migration. > > [...snip...] >
On 2/9/26 19:22, Ackerley Tng wrote: > Deepanshu Kartikey <kartikey406@gmail.com> writes: > >> On Mon, Feb 9, 2026 at 4:12 PM David Hildenbrand (Arm) <david@kernel.org> wrote: >>> >>> >>> On second thought, why do we pass the >>> >>> !inode_is_open_for_write(inode) >>> >>> in file_thp_enabled()? >>> >>> Isn't that the main problem for these memfd things? >>> >>> Maybe a get_write_access() is missing somewhere? >>> >> >> Hi David, >> >> Thanks for the suggestion. I looked into the get_write_access() path. >> >> Both guest_memfd and secretmem use alloc_file_pseudo() which skips >> calling get_write_access(), so i_writecount stays 0. That's why >> file_thp_enabled() sees them as read-only files. >> >> We could add get_write_access() after alloc_file_pseudo() in both, but >> I think that would be a hack rather than a proper fix: >> >> - i_writecount has a specific semantic: tracking how many fds have the >> file open for writing. We'd be bumping it just to influence >> file_thp_enabled() behavior. >> > > I agree re-using i_writecount feels odd since it is abusing the idea of > being written to. I might have misunderstood the full context of > i_writecount though. i_writecount means "the file is open with write access" IIUC. So one can mmap(PROT_WRITE) it etc. And that's kind of the thing: the virtual file is open with write access. That's why I am still wondering whether mimicking that is actually the right fix. > >> - It doesn't express the actual intent. The real issue is that >> CONFIG_READ_ONLY_THP_FOR_FS was never meant for pseudo-filesystem >> backed files. >> >> I think the AS_NO_READ_ONLY_THP_FOR_FS flag you suggested earlier is >> the cleaner approach. It is explicit, has no side effects, and is easy >> to rip out when CONFIG_READ_ONLY_THP_FOR_FS goes away. >> > > I was considering other address space flags and I think the best might > be to make khugepaged respect AS_FOLIO_ORDER_MAX and have somewhere in > __vma_thp_allowable_orders() check the maximum allowed order for the > address space. The thing is that CONFIG_READ_ONLY_THP_FOR_FS explicitly bypasses these folio order checks. Changing it would degrade filesystems that do not support large folios yet. IOW, it would be similar to ripping out CONFIG_READ_ONLY_THP_FOR_FS. Which we plan for one of the next releases :) > > khugepaged is about consolidating memory to huge pages, so if the > address space doesn't allow a larger folio order, then khugepaged should > not operate on that memory. > > The other options are > > + AS_UNEVICTABLE: Sounds like khugepaged should respect AS_UNEVICTABLE, > but IIUC evictability is more closely related to swapping and > khugepaged might operate on swappable memory? Right, it does not really make sense > + AS_INACCESSIBLE: This is only used by guest_memfd, and is mostly used > to block migration. khugepaged kind of migrates the memory contents > too, but someday we want guest_memfd to support migration, and at that > time we would still want to block khugepaged, so I don't think we want > to reuse a flag that couples khugepaged to migration. It could be used at least for the time being and to fix the issue. -- Cheers, David
On 2/9/26 20:45, David Hildenbrand (Arm) wrote:
> On 2/9/26 19:22, Ackerley Tng wrote:
>> Deepanshu Kartikey <kartikey406@gmail.com> writes:
>>
>>> On Mon, Feb 9, 2026 at 4:12 PM David Hildenbrand (Arm)
>>> <david@kernel.org> wrote:
>>>
>>> Hi David,
>>>
>>> Thanks for the suggestion. I looked into the get_write_access() path.
>>>
>>> Both guest_memfd and secretmem use alloc_file_pseudo() which skips
>>> calling get_write_access(), so i_writecount stays 0. That's why
>>> file_thp_enabled() sees them as read-only files.
>>>
>>> We could add get_write_access() after alloc_file_pseudo() in both, but
>>> I think that would be a hack rather than a proper fix:
>>>
>>> - i_writecount has a specific semantic: tracking how many fds have the
>>> file open for writing. We'd be bumping it just to influence
>>> file_thp_enabled() behavior.
>>>
>>
>> I agree re-using i_writecount feels odd since it is abusing the idea of
>> being written to. I might have misunderstood the full context of
>> i_writecount though.
>
> i_writecount means "the file is open with write access" IIUC. So one can
> mmap(PROT_WRITE) it etc.
>
> And that's kind of the thing: the virtual file is open with write
> access. That's why I am still wondering whether mimicking that is
> actually the right fix.
>
>>
>>> - It doesn't express the actual intent. The real issue is that
>>> CONFIG_READ_ONLY_THP_FOR_FS was never meant for pseudo-filesystem
>>> backed files.
>>>
>>> I think the AS_NO_READ_ONLY_THP_FOR_FS flag you suggested earlier is
>>> the cleaner approach. It is explicit, has no side effects, and is easy
>>> to rip out when CONFIG_READ_ONLY_THP_FOR_FS goes away.
>>>
>>
>> I was considering other address space flags and I think the best might
>> be to make khugepaged respect AS_FOLIO_ORDER_MAX and have somewhere in
>> __vma_thp_allowable_orders() check the maximum allowed order for the
>> address space.
>
> The thing is that CONFIG_READ_ONLY_THP_FOR_FS explicitly bypasses these
> folio order checks. Changing it would degrade filesystems that do not
> support large folios yet. IOW, it would be similar to ripping out
> CONFIG_READ_ONLY_THP_FOR_FS. Which we plan for one of the next releases :)
>
>>
>> khugepaged is about consolidating memory to huge pages, so if the
>> address space doesn't allow a larger folio order, then khugepaged should
>> not operate on that memory.
>>
>> The other options are
>>
>> + AS_UNEVICTABLE: Sounds like khugepaged should respect AS_UNEVICTABLE,
>> but IIUC evictability is more closely related to swapping and
>> khugepaged might operate on swappable memory?
> Right, it does not really make sense
>
>> + AS_INACCESSIBLE: This is only used by guest_memfd, and is mostly used
>> to block migration. khugepaged kind of migrates the memory contents
>> too, but someday we want guest_memfd to support migration, and at that
>> time we would still want to block khugepaged, so I don't think we want
>> to reuse a flag that couples khugepaged to migration.
>
> It could be used at least for the time being and to fix the issue.
mapping_inaccessible(mapping) indeed looks like the easiest fix, given that
shmem "somehow" works, lol.
BUT, something just occurred to me.
We added the mc-handling in
commit 98c76c9f1ef7599b39bfd4bd99b8a760d4a8cd3b
Author: Jiaqi Yan <jiaqiyan@google.com>
Date: Wed Mar 29 08:11:19 2023 -0700
mm/khugepaged: recover from poisoned anonymous memory
..
So I assume kernels before that would crash when collapsing?
Looking at 5.15.199, it does not contain 98c76c9f1e [1].
So I suspect we need a fix+stable backport.
Who volunteers to try a secretmem reproducer on a stable kernel? :)
The following is a bit nasty as well but should do the trick until we rip
out the CONFIG_READ_ONLY_THP_FOR_FS stuff.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03886d4ccecc..4ac1cb36b861 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
#include <linux/pgalloc.h>
#include <linux/pgalloc_tag.h>
#include <linux/pagewalk.h>
+#include <linux/secretmem.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -94,6 +95,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
inode = file_inode(vma->vm_file);
+ if (mapping_inaccessible(inode->i_mapping) ||
+ secretmem_mapping(inode->i_mapping))
+ return false;
+
return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}
[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/khugepaged.c?h=v5.15.199
--
Cheers,
David
On Tue, Feb 10, 2026 at 1:43 AM David Hildenbrand (Arm) <david@kernel.org> wrote: > > The following is a bit nasty as well but should do the trick until we rip > out the CONFIG_READ_ONLY_THP_FOR_FS stuff. > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 03886d4ccecc..4ac1cb36b861 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -40,6 +40,7 @@ > #include <linux/pgalloc.h> > #include <linux/pgalloc_tag.h> > #include <linux/pagewalk.h> > +#include <linux/secretmem.h> > > #include <asm/tlb.h> > #include "internal.h" > @@ -94,6 +95,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) > > inode = file_inode(vma->vm_file); > > + if (mapping_inaccessible(inode->i_mapping) || > + secretmem_mapping(inode->i_mapping)) > + return false; > + > return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); > } > > Hi David, Agreed, using mapping_inaccessible() for guest_memfd and secretmem_mapping() for secretmem is much simpler than introducing a new AS flag. No changes needed outside of file_thp_enabled(). I will send a v2 with your suggested diff and test it on syzbot. Thanks, Deepanshu
On 2/10/26 02:51, Deepanshu Kartikey wrote: > On Tue, Feb 10, 2026 at 1:43 AM David Hildenbrand (Arm) > <david@kernel.org> wrote: >> > >> The following is a bit nasty as well but should do the trick until we rip >> out the CONFIG_READ_ONLY_THP_FOR_FS stuff. >> >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 03886d4ccecc..4ac1cb36b861 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -40,6 +40,7 @@ >> #include <linux/pgalloc.h> >> #include <linux/pgalloc_tag.h> >> #include <linux/pagewalk.h> >> +#include <linux/secretmem.h> >> >> #include <asm/tlb.h> >> #include "internal.h" >> @@ -94,6 +95,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) >> >> inode = file_inode(vma->vm_file); >> >> + if (mapping_inaccessible(inode->i_mapping) || >> + secretmem_mapping(inode->i_mapping)) >> + return false; >> + >> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); >> } >> >> > > Hi David, > > Agreed, using mapping_inaccessible() for guest_memfd and > secretmem_mapping() for secretmem is much simpler than introducing a > new AS flag. No changes needed outside of file_thp_enabled(). > > I will send a v2 with your suggested diff and test it on syzbot. Let's wait a bit until we are in agreement that this is the right thing to do :) -- Cheers, David
"David Hildenbrand (Arm)" <david@kernel.org> writes:
> On 2/9/26 20:45, David Hildenbrand (Arm) wrote:
>> On 2/9/26 19:22, Ackerley Tng wrote:
>>> Deepanshu Kartikey <kartikey406@gmail.com> writes:
>>>
>>>> On Mon, Feb 9, 2026 at 4:12 PM David Hildenbrand (Arm)
>>>> <david@kernel.org> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> Thanks for the suggestion. I looked into the get_write_access() path.
>>>>
>>>> Both guest_memfd and secretmem use alloc_file_pseudo() which skips
>>>> calling get_write_access(), so i_writecount stays 0. That's why
>>>> file_thp_enabled() sees them as read-only files.
>>>>
>>>> We could add get_write_access() after alloc_file_pseudo() in both, but
>>>> I think that would be a hack rather than a proper fix:
>>>>
>>>> - i_writecount has a specific semantic: tracking how many fds have the
>>>> file open for writing. We'd be bumping it just to influence
>>>> file_thp_enabled() behavior.
>>>>
>>>
>>> I agree re-using i_writecount feels odd since it is abusing the idea of
>>> being written to. I might have misunderstood the full context of
>>> i_writecount though.
>>
>> i_writecount means "the file is open with write access" IIUC. So one can
>> mmap(PROT_WRITE) it etc.
>>
>> And that's kind of the thing: the virtual file is open with write
>> access. That's why I am still wondering whether mimicking that is
>> actually the right fix.
>>
>>>
>>>> - It doesn't express the actual intent. The real issue is that
>>>> CONFIG_READ_ONLY_THP_FOR_FS was never meant for pseudo-filesystem
>>>> backed files.
>>>>
>>>> I think the AS_NO_READ_ONLY_THP_FOR_FS flag you suggested earlier is
>>>> the cleaner approach. It is explicit, has no side effects, and is easy
>>>> to rip out when CONFIG_READ_ONLY_THP_FOR_FS goes away.
>>>>
>>>
>>> I was considering other address space flags and I think the best might
>>> be to make khugepaged respect AS_FOLIO_ORDER_MAX and have somewhere in
>>> __vma_thp_allowable_orders() check the maximum allowed order for the
>>> address space.
>>
>> The thing is that CONFIG_READ_ONLY_THP_FOR_FS explicitly bypasses these
>> folio order checks.
Ah that's true.
>> Changing it would degrade filesystems that do not
>> support large folios yet. IOW, it would be similar to ripping out
>> CONFIG_READ_ONLY_THP_FOR_FS. Which we plan for one of the next releases :)
>>
>>>
>>> khugepaged is about consolidating memory to huge pages, so if the
>>> address space doesn't allow a larger folio order, then khugepaged should
>>> not operate on that memory.
>>>
>>> The other options are
>>>
>>> + AS_UNEVICTABLE: Sounds like khugepaged should respect AS_UNEVICTABLE,
>>> but IIUC evictability is more closely related to swapping and
>>> khugepaged might operate on swappable memory?
>> Right, it does not really make sense
>>
>>> + AS_INACCESSIBLE: This is only used by guest_memfd, and is mostly used
>>> to block migration. khugepaged kind of migrates the memory contents
>>> too, but someday we want guest_memfd to support migration, and at that
>>> time we would still want to block khugepaged, so I don't think we want
>>> to reuse a flag that couples khugepaged to migration.
>>
>> It could be used at least for the time being and to fix the issue.
>
> mapping_inaccessible(mapping) indeed looks like the easiest fix, given that
> shmem "somehow" works, lol.
>
I could also check shmem, but I'm not sure which conditions to set up
shmem for, since shmem could be used in so many ways. Any suggestions?
Off the top of my head, shmem lots of special-casing in the khugepaged
flow...
> BUT, something just occurred to me.
>
> We added the mc-handling in
>
> commit 98c76c9f1ef7599b39bfd4bd99b8a760d4a8cd3b
> Author: Jiaqi Yan <jiaqiyan@google.com>
> Date: Wed Mar 29 08:11:19 2023 -0700
>
> mm/khugepaged: recover from poisoned anonymous memory
>
> ..
>
> So I assume kernels before that would crash when collapsing?
>
> Looking at 5.15.199, it does not contain 98c76c9f1e [1].
>
> So I suspect we need a fix+stable backport.
>
> Who volunteers to try a secretmem reproducer on a stable kernel? :)
>
I could give this a shot. 5.15.199 doesn't have AS_INACCESSIBLE. Should
we backport AS_INACCESSIBLE there or could the fix for 5.15.199 just be
special-casing secretmem like you suggested below?
>
> The following is a bit nasty as well but should do the trick until we rip
> out the CONFIG_READ_ONLY_THP_FOR_FS stuff.
>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 03886d4ccecc..4ac1cb36b861 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -40,6 +40,7 @@
> #include <linux/pgalloc.h>
> #include <linux/pgalloc_tag.h>
> #include <linux/pagewalk.h>
> +#include <linux/secretmem.h>
>
> #include <asm/tlb.h>
> #include "internal.h"
> @@ -94,6 +95,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
>
> inode = file_inode(vma->vm_file);
>
> + if (mapping_inaccessible(inode->i_mapping) ||
> + secretmem_mapping(inode->i_mapping))
> + return false;
> +
Regarding the degradation of filesystems that don't support large folios
yet: Do you mean having the collapse function respect AS_FOLIO_ORDER_MAX
would disable collapsing for filesystems that actually want pages to be
collapsed, but don't update max folio order and hence appear to not
support large folios yet?
What about a check like this instead
if (!mapping_large_folio_support())
return false;
And then when CONFIG_READ_ONLY_THP_FOR_FS is removed, part of that work
would involve getting filesystems to update AS_FOLIO_ORDER_MAX?
> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
> }
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/khugepaged.c?h=v5.15.199
>
> --
> Cheers,
>
> David
>> BUT, something just occurred to me. >> >> We added the mc-handling in >> >> commit 98c76c9f1ef7599b39bfd4bd99b8a760d4a8cd3b >> Author: Jiaqi Yan <jiaqiyan@google.com> >> Date: Wed Mar 29 08:11:19 2023 -0700 >> >> mm/khugepaged: recover from poisoned anonymous memory >> >> .. >> >> So I assume kernels before that would crash when collapsing? >> >> Looking at 5.15.199, it does not contain 98c76c9f1e [1]. >> >> So I suspect we need a fix+stable backport. >> >> Who volunteers to try a secretmem reproducer on a stable kernel? :) >> > > I could give this a shot. 5.15.199 doesn't have AS_INACCESSIBLE. Should > we backport AS_INACCESSIBLE there or could the fix for 5.15.199 just be > special-casing secretmem like you suggested below? Yes. If there is no guest_memfd we wouldn't need it. > >> >> The following is a bit nasty as well but should do the trick until we rip >> out the CONFIG_READ_ONLY_THP_FOR_FS stuff. >> >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index 03886d4ccecc..4ac1cb36b861 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -40,6 +40,7 @@ >> #include <linux/pgalloc.h> >> #include <linux/pgalloc_tag.h> >> #include <linux/pagewalk.h> >> +#include <linux/secretmem.h> >> >> #include <asm/tlb.h> >> #include "internal.h" >> @@ -94,6 +95,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) >> >> inode = file_inode(vma->vm_file); >> >> + if (mapping_inaccessible(inode->i_mapping) || >> + secretmem_mapping(inode->i_mapping)) >> + return false; >> + > > Regarding the degradation of filesystems that don't support large folios > yet: Do you mean having the collapse function respect AS_FOLIO_ORDER_MAX > would disable collapsing for filesystems that actually want pages to be > collapsed, but don't update max folio order and hence appear to not > support large folios yet? > > What about a check like this instead > > if (!mapping_large_folio_support()) > return false; That would essentially disable CONFIG_READ_ONLY_THP_FOR_FS (support for THP before filesystems started supporting large folios officially), no? -- Cheers, David
© 2016 - 2026 Red Hat, Inc.