They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
large folio support, so that read-only THPs created in these FSes are not
seen by the FSes when the underlying fd becomes writable. Now read-only PMD
THPs only appear in a FS with large folio support and the supported orders
include PMD_ORDRE.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
fs/open.c | 27 ---------------------------
include/linux/pagemap.h | 29 -----------------------------
mm/filemap.c | 1 -
mm/huge_memory.c | 1 -
mm/khugepaged.c | 29 ++---------------------------
5 files changed, 2 insertions(+), 85 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 91f1139591ab..cef382d9d8b8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -970,33 +970,6 @@ static int do_dentry_open(struct file *f,
if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
return -EINVAL;
- /*
- * XXX: Huge page cache doesn't support writing yet. Drop all page
- * cache for this file before processing writes.
- */
- if (f->f_mode & FMODE_WRITE) {
- /*
- * Depends on full fence from get_write_access() to synchronize
- * against collapse_file() regarding i_writecount and nr_thps
- * updates. Ensures subsequent insertion of THPs into the page
- * cache will fail.
- */
- if (filemap_nr_thps(inode->i_mapping)) {
- struct address_space *mapping = inode->i_mapping;
-
- filemap_invalidate_lock(inode->i_mapping);
- /*
- * unmap_mapping_range just need to be called once
- * here, because the private pages is not need to be
- * unmapped mapping (e.g. data segment of dynamic
- * shared libraries here).
- */
- unmap_mapping_range(mapping, 0, 0, 0);
- truncate_inode_pages(mapping, 0);
- filemap_invalidate_unlock(inode->i_mapping);
- }
- }
-
return 0;
cleanup_all:
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f886..dad3f8846cdc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -530,35 +530,6 @@ static inline size_t mapping_max_folio_size(const struct address_space *mapping)
return PAGE_SIZE << mapping_max_folio_order(mapping);
}
-static inline int filemap_nr_thps(const struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- return atomic_read(&mapping->nr_thps);
-#else
- return 0;
-#endif
-}
-
-static inline void filemap_nr_thps_inc(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- if (!mapping_large_folio_support(mapping))
- atomic_inc(&mapping->nr_thps);
-#else
- WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
-#endif
-}
-
-static inline void filemap_nr_thps_dec(struct address_space *mapping)
-{
-#ifdef CONFIG_READ_ONLY_THP_FOR_FS
- if (!mapping_large_folio_support(mapping))
- atomic_dec(&mapping->nr_thps);
-#else
- WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
-#endif
-}
-
struct address_space *folio_mapping(const struct folio *folio);
/**
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b933a1da9bd..4248e7cdecf3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -189,7 +189,6 @@ static void filemap_unaccount_folio(struct address_space *mapping,
lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
} else if (folio_test_pmd_mappable(folio)) {
lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
- filemap_nr_thps_dec(mapping);
}
if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
mod_node_page_state(folio_pgdat(folio),
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b2a6060b3c20..c7873dbdc470 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3833,7 +3833,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
} else {
lruvec_stat_mod_folio(folio,
NR_FILE_THPS, -nr);
- filemap_nr_thps_dec(mapping);
}
}
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 45b12ffb1550..8004ab8de6d2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2104,20 +2104,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
goto xa_unlocked;
}
- if (!is_shmem) {
- filemap_nr_thps_inc(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure i_writecount is up to date and the update to nr_thps
- * is visible. Ensures the page cache will be truncated if the
- * file is opened writable.
- */
- smp_mb();
- if (inode_is_open_for_write(mapping->host)) {
- result = SCAN_FAIL;
- filemap_nr_thps_dec(mapping);
- }
- }
+ if (!is_shmem && inode_is_open_for_write(mapping->host))
+ result = SCAN_FAIL;
xa_locked:
xas_unlock_irq(&xas);
@@ -2296,19 +2284,6 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
folio_putback_lru(folio);
folio_put(folio);
}
- /*
- * Undo the updates of filemap_nr_thps_inc for non-SHMEM
- * file only. This undo is not needed unless failure is
- * due to SCAN_COPY_MC.
- */
- if (!is_shmem && result == SCAN_COPY_MC) {
- filemap_nr_thps_dec(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure the update to nr_thps is visible.
- */
- smp_mb();
- }
new_folio->mapping = NULL;
--
2.43.0
On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
> large folio support, so that read-only THPs created in these FSes are not
> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
> THPs only appear in a FS with large folio support and the supported orders
> include PMD_ORDRE.
Typo: PMD_ORDRE -> PMD_ORDER
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
This looks obviously-correct since this stuff wouldn't have been invoked for
large folio file systems before + they already had to handle it separately, and
this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
suggests you didn't miss anything), so:
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
> fs/open.c | 27 ---------------------------
> include/linux/pagemap.h | 29 -----------------------------
> mm/filemap.c | 1 -
> mm/huge_memory.c | 1 -
> mm/khugepaged.c | 29 ++---------------------------
> 5 files changed, 2 insertions(+), 85 deletions(-)
>
> diff --git a/fs/open.c b/fs/open.c
> index 91f1139591ab..cef382d9d8b8 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -970,33 +970,6 @@ static int do_dentry_open(struct file *f,
> if ((f->f_flags & O_DIRECT) && !(f->f_mode & FMODE_CAN_ODIRECT))
> return -EINVAL;
>
> - /*
> - * XXX: Huge page cache doesn't support writing yet. Drop all page
> - * cache for this file before processing writes.
> - */
> - if (f->f_mode & FMODE_WRITE) {
> - /*
> - * Depends on full fence from get_write_access() to synchronize
> - * against collapse_file() regarding i_writecount and nr_thps
> - * updates. Ensures subsequent insertion of THPs into the page
> - * cache will fail.
> - */
> - if (filemap_nr_thps(inode->i_mapping)) {
> - struct address_space *mapping = inode->i_mapping;
> -
> - filemap_invalidate_lock(inode->i_mapping);
> - /*
> - * unmap_mapping_range just need to be called once
> - * here, because the private pages is not need to be
> - * unmapped mapping (e.g. data segment of dynamic
> - * shared libraries here).
> - */
> - unmap_mapping_range(mapping, 0, 0, 0);
> - truncate_inode_pages(mapping, 0);
> - filemap_invalidate_unlock(inode->i_mapping);
> - }
> - }
> -
> return 0;
>
> cleanup_all:
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ec442af3f886..dad3f8846cdc 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -530,35 +530,6 @@ static inline size_t mapping_max_folio_size(const struct address_space *mapping)
> return PAGE_SIZE << mapping_max_folio_order(mapping);
> }
>
> -static inline int filemap_nr_thps(const struct address_space *mapping)
> -{
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - return atomic_read(&mapping->nr_thps);
> -#else
> - return 0;
> -#endif
> -}
> -
> -static inline void filemap_nr_thps_inc(struct address_space *mapping)
> -{
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - if (!mapping_large_folio_support(mapping))
> - atomic_inc(&mapping->nr_thps);
> -#else
> - WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
> -#endif
> -}
> -
> -static inline void filemap_nr_thps_dec(struct address_space *mapping)
> -{
> -#ifdef CONFIG_READ_ONLY_THP_FOR_FS
> - if (!mapping_large_folio_support(mapping))
> - atomic_dec(&mapping->nr_thps);
> -#else
> - WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);
> -#endif
> -}
> -
> struct address_space *folio_mapping(const struct folio *folio);
>
> /**
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2b933a1da9bd..4248e7cdecf3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -189,7 +189,6 @@ static void filemap_unaccount_folio(struct address_space *mapping,
> lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr);
> } else if (folio_test_pmd_mappable(folio)) {
> lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
> - filemap_nr_thps_dec(mapping);
> }
> if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags))
> mod_node_page_state(folio_pgdat(folio),
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b2a6060b3c20..c7873dbdc470 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3833,7 +3833,6 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> } else {
> lruvec_stat_mod_folio(folio,
> NR_FILE_THPS, -nr);
> - filemap_nr_thps_dec(mapping);
> }
> }
> }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 45b12ffb1550..8004ab8de6d2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2104,20 +2104,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> goto xa_unlocked;
> }
>
> - if (!is_shmem) {
> - filemap_nr_thps_inc(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure i_writecount is up to date and the update to nr_thps
> - * is visible. Ensures the page cache will be truncated if the
> - * file is opened writable.
> - */
> - smp_mb();
> - if (inode_is_open_for_write(mapping->host)) {
> - result = SCAN_FAIL;
> - filemap_nr_thps_dec(mapping);
> - }
> - }
> + if (!is_shmem && inode_is_open_for_write(mapping->host))
> + result = SCAN_FAIL;
>
> xa_locked:
> xas_unlock_irq(&xas);
> @@ -2296,19 +2284,6 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> folio_putback_lru(folio);
> folio_put(folio);
> }
> - /*
> - * Undo the updates of filemap_nr_thps_inc for non-SHMEM
> - * file only. This undo is not needed unless failure is
> - * due to SCAN_COPY_MC.
> - */
> - if (!is_shmem && result == SCAN_COPY_MC) {
> - filemap_nr_thps_dec(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure the update to nr_thps is visible.
> - */
> - smp_mb();
> - }
>
> new_folio->mapping = NULL;
>
> --
> 2.43.0
>
Cheers, Lorenzo
On 3/27/26 13:23, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
>> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
>> large folio support, so that read-only THPs created in these FSes are not
>> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
>> THPs only appear in a FS with large folio support and the supported orders
>> include PMD_ORDRE.
>
> Typo: PMD_ORDRE -> PMD_ORDER
>
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> This looks obviously-correct since this stuff wouldn't have been invoked for
> large folio file systems before + they already had to handle it separately, and
> this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
> suggests you didn't miss anything), so:

There could now be a race between collapsing and the file getting opened
r/w.

Are we sure that all code can really deal with that?

IOW, "they already had to handle it separately" -- is that true?
khugepaged would have never collapse in writable files, so I wonder if
all code paths are prepared for that.

--
Cheers,

David
On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
> On 3/27/26 13:23, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
> >> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
> >> large folio support, so that read-only THPs created in these FSes are not
> >> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
> >> THPs only appear in a FS with large folio support and the supported orders
> >> include PMD_ORDRE.
> >
> > Typo: PMD_ORDRE -> PMD_ORDER
> >
> >>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >
> > This looks obviously-correct since this stuff wouldn't have been invoked for
> > large folio file systems before + they already had to handle it separately, and
> > this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
> > suggests you didn't miss anything), so:
>
> There could now be a race between collapsing and the file getting opened
> r/w.
>
> Are we sure that all code can really deal with that?
>
> IOW, "they already had to handle it separately" -- is that true?
> khugepaged would have never collapse in writable files, so I wonder if
> all code paths are prepared for that.
OK I guess I overlooked a part of this code... :) see below.
This is fine and would be a no-op anyway
- if (f->f_mode & FMODE_WRITE) {
- /*
- * Depends on full fence from get_write_access() to synchronize
- * against collapse_file() regarding i_writecount and nr_thps
- * updates. Ensures subsequent insertion of THPs into the page
- * cache will fail.
- */
- if (filemap_nr_thps(inode->i_mapping)) {
But this:
- if (!is_shmem) {
- filemap_nr_thps_inc(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure i_writecount is up to date and the update to nr_thps
- * is visible. Ensures the page cache will be truncated if the
- * file is opened writable.
- */
- smp_mb();
We can drop barrier
- if (inode_is_open_for_write(mapping->host)) {
- result = SCAN_FAIL;
But this is a functional change!
Yup missed this.
- filemap_nr_thps_dec(mapping);
- }
- }
For below:
- /*
- * Undo the updates of filemap_nr_thps_inc for non-SHMEM
- * file only. This undo is not needed unless failure is
- * due to SCAN_COPY_MC.
- */
- if (!is_shmem && result == SCAN_COPY_MC) {
- filemap_nr_thps_dec(mapping);
- /*
- * Paired with the fence in do_dentry_open() -> get_write_access()
- * to ensure the update to nr_thps is visible.
- */
- smp_mb();
- }
Here it is probably fine to remove the barrier if it is _only_ for nr_thps.
>
> --
> Cheers,
>
> David
Sorry Zi, R-b tag withdrawn... :( I missed that 1 functional change there.
Cheers, Lorenzo
On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/27/26 13:23, Lorenzo Stoakes (Oracle) wrote:
>>> On Thu, Mar 26, 2026 at 09:42:48PM -0400, Zi Yan wrote:
>>>> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
>>>> large folio support, so that read-only THPs created in these FSes are not
>>>> seen by the FSes when the underlying fd becomes writable. Now read-only PMD
>>>> THPs only appear in a FS with large folio support and the supported orders
>>>> include PMD_ORDRE.
>>>
>>> Typo: PMD_ORDRE -> PMD_ORDER
>>>
>>>>
>>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>>
>>> This looks obviously-correct since this stuff wouldn't have been invoked for
>>> large folio file systems before + they already had to handle it separately, and
>>> this function is only tied to CONFIG_READ_ONLY_THP_FOR_FS (+ a quick grep
>>> suggests you didn't miss anything), so:
>>
>> There could now be a race between collapsing and the file getting opened
>> r/w.
>>
>> Are we sure that all code can really deal with that?
>>
>> IOW, "they already had to handle it separately" -- is that true?
>> khugepaged would have never collapse in writable files, so I wonder if
>> all code paths are prepared for that.
>
> OK I guess I overlooked a part of this code... :) see below.
>
> This is fine and would be a no-op anyway
>
> - if (f->f_mode & FMODE_WRITE) {
> - /*
> - * Depends on full fence from get_write_access() to synchronize
> - * against collapse_file() regarding i_writecount and nr_thps
> - * updates. Ensures subsequent insertion of THPs into the page
> - * cache will fail.
> - */
> - if (filemap_nr_thps(inode->i_mapping)) {
>
> But this:
>
> - if (!is_shmem) {
> - filemap_nr_thps_inc(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure i_writecount is up to date and the update to nr_thps
> - * is visible. Ensures the page cache will be truncated if the
> - * file is opened writable.
> - */
> - smp_mb();
>
> We can drop barrier
>
> - if (inode_is_open_for_write(mapping->host)) {
> - result = SCAN_FAIL;
>
> But this is a functional change!
>
> Yup missed this.
But I added
+ if (!is_shmem && inode_is_open_for_write(mapping->host))
+ result = SCAN_FAIL;
That keeps the original bail out, right?
>
> - filemap_nr_thps_dec(mapping);
> - }
> - }
>
> For below:
>
> - /*
> - * Undo the updates of filemap_nr_thps_inc for non-SHMEM
> - * file only. This undo is not needed unless failure is
> - * due to SCAN_COPY_MC.
> - */
> - if (!is_shmem && result == SCAN_COPY_MC) {
> - filemap_nr_thps_dec(mapping);
> - /*
> - * Paired with the fence in do_dentry_open() -> get_write_access()
> - * to ensure the update to nr_thps is visible.
> - */
> - smp_mb();
> - }
>
> Here it is probably fine to remove the barrier if it is _only_ for nr_thps.
>
>>
>> --
>> Cheers,
>>
>> David
>
> Sorry Zi, R-b tag withdrawn... :( I missed that 1 functional change there.
>
> Cheers, Lorenzo
Best Regards,
Yan, Zi
On 3/27/26 16:05, Zi Yan wrote:
> On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
>
>> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>>>
>>> There could now be a race between collapsing and the file getting opened
>>> r/w.
>>>
>>> Are we sure that all code can really deal with that?
>>>
>>> IOW, "they already had to handle it separately" -- is that true?
>>> khugepaged would have never collapse in writable files, so I wonder if
>>> all code paths are prepared for that.
>>
>> OK I guess I overlooked a part of this code... :) see below.
>>
>> This is fine and would be a no-op anyway
>>
>> - if (f->f_mode & FMODE_WRITE) {
>> - /*
>> - * Depends on full fence from get_write_access() to synchronize
>> - * against collapse_file() regarding i_writecount and nr_thps
>> - * updates. Ensures subsequent insertion of THPs into the page
>> - * cache will fail.
>> - */
>> - if (filemap_nr_thps(inode->i_mapping)) {
>>
>> But this:
>>
>> - if (!is_shmem) {
>> - filemap_nr_thps_inc(mapping);
>> - /*
>> - * Paired with the fence in do_dentry_open() -> get_write_access()
>> - * to ensure i_writecount is up to date and the update to nr_thps
>> - * is visible. Ensures the page cache will be truncated if the
>> - * file is opened writable.
>> - */
>> - smp_mb();
>>
>> We can drop barrier
>>
>> - if (inode_is_open_for_write(mapping->host)) {
>> - result = SCAN_FAIL;
>>
>> But this is a functional change!
>>
>> Yup missed this.
>
> But I added
>
> + if (!is_shmem && inode_is_open_for_write(mapping->host))
> + result = SCAN_FAIL;
>
> That keeps the original bail out, right?
Independent of that, are we sure that the possible race we allow is ok?
--
Cheers,
David
On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
> On 3/27/26 16:05, Zi Yan wrote:
>> On 27 Mar 2026, at 10:23, Lorenzo Stoakes (Oracle) wrote:
>>
>>> On Fri, Mar 27, 2026 at 02:58:12PM +0100, David Hildenbrand (Arm) wrote:
>>>>
>>>> There could now be a race between collapsing and the file getting opened
>>>> r/w.
>>>>
>>>> Are we sure that all code can really deal with that?
>>>>
>>>> IOW, "they already had to handle it separately" -- is that true?
>>>> khugepaged would have never collapse in writable files, so I wonder if
>>>> all code paths are prepared for that.
>>>
>>> OK I guess I overlooked a part of this code... :) see below.
>>>
>>> This is fine and would be a no-op anyway
>>>
>>> - if (f->f_mode & FMODE_WRITE) {
>>> - /*
>>> - * Depends on full fence from get_write_access() to synchronize
>>> - * against collapse_file() regarding i_writecount and nr_thps
>>> - * updates. Ensures subsequent insertion of THPs into the page
>>> - * cache will fail.
>>> - */
>>> - if (filemap_nr_thps(inode->i_mapping)) {
>>>
>>> But this:
>>>
>>> - if (!is_shmem) {
>>> - filemap_nr_thps_inc(mapping);
>>> - /*
>>> - * Paired with the fence in do_dentry_open() -> get_write_access()
>>> - * to ensure i_writecount is up to date and the update to nr_thps
>>> - * is visible. Ensures the page cache will be truncated if the
>>> - * file is opened writable.
>>> - */
>>> - smp_mb();
>>>
>>> We can drop barrier
>>>
>>> - if (inode_is_open_for_write(mapping->host)) {
>>> - result = SCAN_FAIL;
>>>
>>> But this is a functional change!
>>>
>>> Yup missed this.
>>
>> But I added
>>
>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>> + result = SCAN_FAIL;
>>
>> That keeps the original bail out, right?
>
> Independent of that, are we sure that the possible race we allow is ok?
Let me think.
do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
inode->i_writecount atomically and it turns inode_is_open_for_write()
to true. Then, do_dentry_open() also truncates all pages
if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
a fd with write when there is a read-only THP.
After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
on FSes with large folio support (to be precise THP support). If a fd
is opened for write before inode_is_open_for_write() check, khugepaged
will stop. It is fine. But if a fd is opened for write after
inode_is_open_for_write() check, khugepaged will try to collapse a read-only
THP and the fd can be written at the same time.
I notice that fd write requires locking the to-be-written folio first
(I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
f_ops->write() has the same locking requirement) and khugepaged has already
locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
fd is opened for write after inode_is_open_for_write() check, its write
will wait for khugepaged collapse and see a new THP. Since the FS
supports THP, writing to the new THP should be fine.
Let me know if my analysis above makes sense. If yes, I will add it
to the commit message and add a succinct comment about it before
inode_is_open_for_write().
Best Regards,
Yan, Zi
On 4/1/26 17:32, Zi Yan wrote:
> On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
>
>> On 3/27/26 16:05, Zi Yan wrote:
>>>
>>>
>>> But I added
>>>
>>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>>> + result = SCAN_FAIL;
>>>
>>> That keeps the original bail out, right?
>>
>> Independent of that, are we sure that the possible race we allow is ok?
>
> Let me think.
>
> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
> inode->i_writecount atomically and it turns inode_is_open_for_write()
> to true. Then, do_dentry_open() also truncates all pages
> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
> a fd with write when there is a read-only THP.
>
> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
> on FSes with large folio support (to be precise THP support). If a fd
> is opened for write before inode_is_open_for_write() check, khugepaged
> will stop. It is fine. But if a fd is opened for write after
> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
> THP and the fd can be written at the same time.

Exactly, that's the race I mean.

>
> I notice that fd write requires locking the to-be-written folio first
> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
> f_ops->write() has the same locking requirement) and khugepaged has already
> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
> fd is opened for write after inode_is_open_for_write() check, its write
> will wait for khugepaged collapse and see a new THP. Since the FS
> supports THP, writing to the new THP should be fine.
>
> Let me know if my analysis above makes sense. If yes, I will add it
> to the commit message and add a succinct comment about it before
> inode_is_open_for_write().

khugepaged code is the only code that replaces folios in the pagecache
by other folios. So my main concern is if that is problematic on
concurrent write access.

You argue that the folio lock is sufficient. That's certainly true for
individual folios, but I am more concerned about the replacement part.

I don't have anything concrete, primarily just pointing out that this is
a change that might unlock some code paths that could not have been
triggered before.

--
Cheers,

David
On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:

> On 4/1/26 17:32, Zi Yan wrote:
>> On 1 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
>>
>>> On 3/27/26 16:05, Zi Yan wrote:
>>>>
>>>>
>>>> But I added
>>>>
>>>> + if (!is_shmem && inode_is_open_for_write(mapping->host))
>>>> + result = SCAN_FAIL;
>>>>
>>>> That keeps the original bail out, right?
>>>
>>> Independent of that, are we sure that the possible race we allow is ok?
>>
>> Let me think.
>>
>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>> to true. Then, do_dentry_open() also truncates all pages
>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>> a fd with write when there is a read-only THP.
>>
>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>> on FSes with large folio support (to be precise THP support). If a fd
>> is opened for write before inode_is_open_for_write() check, khugepaged
>> will stop. It is fine. But if a fd is opened for write after
>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>> THP and the fd can be written at the same time.
>
> Exactly, that's the race I mean.
>
>>
>> I notice that fd write requires locking the to-be-written folio first
>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>> f_ops->write() has the same locking requirement) and khugepaged has already
>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>> fd is opened for write after inode_is_open_for_write() check, its write
>> will wait for khugepaged collapse and see a new THP. Since the FS
>> supports THP, writing to the new THP should be fine.
>>
>> Let me know if my analysis above makes sense. If yes, I will add it
>> to the commit message and add a succinct comment about it before
>> inode_is_open_for_write().
>
> khugepaged code is the only code that replaces folios in the pagecache
> by other folios. So my main concern is if that is problematic on
> concurrent write access.

folio_split() does it too, although it replaces a large folio with
a bunch of after-split folios. It is a kinda reverse process of
collapse_file().

>
> You argue that the folio lock is sufficient. That's certainly true for
> individual folios, but I am more concerned about the replacement part.

For the replacement part, both old and new folios are locked during
the process. A parallel writer uses filemap_get_entry() to get the folio
from mapping, but all of them check folio->mapping after acquiring the
folio lock, except mincore_page() which is a reader. A writer can see
either old folio or new folio during the process, but

1. if it sees the old one, it waits on the old folio lock. After
it acquires the lock, it sees old_folio->mapping is NULL, no longer
matches the original mapping. The writer will try again.

2. if it sees the new one, it waits on the new folio lock. After
it acquires the lock, it sees new_folio->mapping matches the
original mapping and proceeds to its writes.

3. if khugepaged needs to do a rollback, the old folio will stay
the same and the writer will see the old one after it gets the old
folio lock.

>
> I don't have anything concrete, primarily just pointing out that this is
> a change that might unlock some code paths that could not have been
> triggered before.

Yes, the concern makes sense.

BTW, Claude is trying to convince me that even inode_is_open_for_write()
is unnecessary since 1) folio_test_dirty() before it has
made sure the folio is clean, 2) try_to_unmap() and the locked folio prevents
further writes.

But then we find a hole between folio_test_dirty() and
try_to_unmap() where a write via a writable mmap PTE can dirty the folio
after folio_test_dirty() and before try_to_unmap(). To remove that hole,
the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))”
needs to be moved after try_to_unmap(). With that, all to-be-collapsed
folios will be clean, unmapped, and locked, where unmapped means
writes via mmap need to fault and take the folio lock, locked means
writes via mmap and write() need to wait until the folio is unlocked.

Let me know if my reasoning makes sense. It is definitely worth the time
and effort to ensure this patchset does not introduce any unexpected race
condition or issue.

Best Regards,
Yan, Zi
On 4/1/26 22:33, Zi Yan wrote:
> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
>
>> On 4/1/26 17:32, Zi Yan wrote:
>>>
>>>
>>> Let me think.
>>>
>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>>> inode->i_writecount atomically and it turns inode_is_open_for_write()
>>> to true. Then, do_dentry_open() also truncates all pages
>>> if filemap_nr_thps() is not zero. This pairs with khugepaged’s first
>>> filemap_nr_thps_inc() then inode_is_open_for_write() to prevent opening
>>> a fd with write when there is a read-only THP.
>>>
>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only THPs
>>> on FSes with large folio support (to be precise THP support). If a fd
>>> is opened for write before inode_is_open_for_write() check, khugepaged
>>> will stop. It is fine. But if a fd is opened for write after
>>> inode_is_open_for_write() check, khugepaged will try to collapse a read-only
>>> THP and the fd can be written at the same time.
>>
>> Exactly, that's the race I mean.
>>
>>>
>>> I notice that fd write requires locking the to-be-written folio first
>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and assume
>>> f_ops->write() has the same locking requirement) and khugepaged has already
>>> locked the to-be-collapsed folio before inode_is_open_for_write(). So if the
>>> fd is opened for write after inode_is_open_for_write() check, its write
>>> will wait for khugepaged collapse and see a new THP. Since the FS
>>> supports THP, writing to the new THP should be fine.
>>>
>>> Let me know if my analysis above makes sense. If yes, I will add it
>>> to the commit message and add a succinct comment about it before
>>> inode_is_open_for_write().
>>
>> khugepaged code is the only code that replaces folios in the pagecache
>> by other folios. So my main concern is if that is problematic on
>> concurrent write access.
>
> folio_split() does it too, although it replaces a large folio with
> a bunch of after-split folios. It is a kinda reverse process of
> collapse_file().

Right. You won't start looking at a small folio and suddenly there is
something larger.

>
>>
>> You argue that the folio lock is sufficient. That's certainly true for
>> individual folios, but I am more concerned about the replacement part.
>
> For the replacement part, both old and new folios are locked during
> the process. A parallel writer uses filemap_get_entry() to get the folio
> from mapping, but all of them check folio->mapping after acquiring the
> folio lock, except mincore_page() which is a reader. A writer can see
> either old folio or new folio during the process, but
>
> 1. if it sees the old one, it waits on the old folio lock. After
> it acquires the lock, it sees old_folio->mapping is NULL, no longer
> matches the original mapping. The writer will try again.
>
> 2. if it sees the new one, it waits on the new folio lock. After
> it acquires the lock, it sees new_folio->mapping matches the
> original mapping and proceeds to its writes.
>
> 3. if khugepaged needs to do a rollback, the old folio will stay
> the same and the writer will see the old one after it gets the old
> folio lock.

I am primarily wondering about what would happen if someone traverses
the page cache, and found+processed 3 small folios. Suddenly there is a
large folio that covers the 3 small folios processed before.

I suspect that is fine, because the code likely had to deal with
concurrent truncation+population if relevant locks are dropped already.
Just raising it.

>
>>
>> I don't have anything concrete, primarily just pointing out that this is
>> a change that might unlock some code paths that could not have been
>> triggered before.
>
> Yes, the concern makes sense.
>
> BTW, Claude is trying to convince me that even inode_is_open_for_write()
> is unnecessary since 1) folio_test_dirty() before it has
> made sure the folio is clean, 2) try_to_unmap() and the locked folio prevents
> further writes.
>
> But then we find a hole between folio_test_dirty() and
> try_to_unmap() where a write via a writable mmap PTE can dirty the folio
> after folio_test_dirty() and before try_to_unmap(). To remove that hole,
> the “if (!is_shmem && (folio_test_dirty(...) || folio_test_writeback(...))”
> needs to be moved after try_to_unmap(). With that, all to-be-collapsed
> folios will be clean, unmapped, and locked, where unmapped means
> writes via mmap need to fault and take the folio lock, locked means
> writes via mmap and write() need to wait until the folio is unlocked.
>
> Let me know if my reasoning makes sense. It is definitely worth the time
> and effort to ensure this patchset does not introduce any unexpected race
> condition or issue.

Makes sense.

Please clearly spell out that there is a slight change now, where we
might be collapsing after the file has been opened for write. Then you
can document that the folio locks should be protecting us from that.

Implying that collapsing in writable files could likely be "easily"
done in the future.

--
Cheers,

David
On 2 Apr 2026, at 10:35, David Hildenbrand (Arm) wrote:
> On 4/1/26 22:33, Zi Yan wrote:
>> On 1 Apr 2026, at 15:15, David Hildenbrand (Arm) wrote:
>>
>>> On 4/1/26 17:32, Zi Yan wrote:
>>>>
>>>> Let me think.
>>>>
>>>> do_dentry_open() -> file_get_write_access() -> get_write_access() bumps
>>>> inode->i_writecount atomically and turns inode_is_open_for_write()
>>>> to true. Then, do_dentry_open() also truncates all pages
>>>> if filemap_nr_thps() is not zero. This pairs with khugepaged's
>>>> filemap_nr_thps_inc() followed by inode_is_open_for_write() to prevent
>>>> opening a fd for write while there is a read-only THP.
>>>>
>>>> After removing READ_ONLY_THP_FOR_FS, khugepaged only creates read-only
>>>> THPs on FSes with large folio support (to be precise, THP support). If
>>>> a fd is opened for write before the inode_is_open_for_write() check,
>>>> khugepaged will stop. That is fine. But if a fd is opened for write
>>>> after the inode_is_open_for_write() check, khugepaged will try to
>>>> collapse a read-only THP while the fd can be written at the same time.
>>>
>>> Exactly, that's the race I mean.
>>>
>>>>
>>>> I notice that a fd write requires locking the to-be-written folio first
>>>> (I see it from f_ops->write_iter() -> write_begin_get_folio() and
>>>> assume f_ops->write() has the same locking requirement) and khugepaged
>>>> has already locked the to-be-collapsed folio before
>>>> inode_is_open_for_write(). So if the fd is opened for write after the
>>>> inode_is_open_for_write() check, its write will wait for the khugepaged
>>>> collapse and see a new THP. Since the FS supports THP, writing to the
>>>> new THP should be fine.
>>>>
>>>> Let me know if my analysis above makes sense. If yes, I will add it
>>>> to the commit message and add a succinct comment about it before
>>>> inode_is_open_for_write().
>>>
>>> khugepaged code is the only code that replaces folios in the pagecache
>>> with other folios. So my main concern is whether that is problematic on
>>> concurrent write access.
>>
>> folio_split() does it too, although it replaces a large folio with
>> a bunch of after-split folios. It is kind of the reverse process of
>> collapse_file().
>
> Right. You won't start looking at a small folio and suddenly find
> something larger.
>
>>
>>> You argue that the folio lock is sufficient. That's certainly true for
>>> individual folios, but I am more concerned about the replacement part.
>>
>> For the replacement part, both old and new folios are locked during
>> the process. A parallel writer uses filemap_get_entry() to get the folio
>> from the mapping, but all of them check folio->mapping after acquiring
>> the folio lock, except mincore_page(), which is a reader. A writer can
>> see either the old folio or the new folio during the process, but:
>>
>> 1. if it sees the old one, it waits on the old folio lock. After
>> it acquires the lock, it sees old_folio->mapping is NULL, which no
>> longer matches the original mapping. The writer will try again.
>>
>> 2. if it sees the new one, it waits on the new folio lock. After
>> it acquires the lock, it sees new_folio->mapping matches the
>> original mapping and proceeds with its writes.
>>
>> 3. if khugepaged needs to do a rollback, the old folio will stay
>> the same and the writer will see the old one after it gets the old
>> folio lock.
>
> I am primarily wondering what would happen if someone traverses
> the pagecache and found+processed 3 small folios, and suddenly there is
> a large folio that covers the 3 small folios processed before.
>
> I suspect that is fine, because the code likely had to deal with
> concurrent truncation+population if the relevant locks were dropped
> already.
>
> Just raising it.
>
>>
>>> I don't have anything concrete, primarily just pointing out that this
>>> is a change that might unlock some code paths that could not have been
>>> triggered before.
>>
>> Yes, the concern makes sense.
>>
>> BTW, Claude is trying to convince me that even inode_is_open_for_write()
>> is unnecessary since 1) folio_test_dirty() before it has
>> made sure the folio is clean, and 2) try_to_unmap() and the locked folio
>> prevent further writes.
>>
>> But then we found a hole between folio_test_dirty() and
>> try_to_unmap(), where a write via a writable mmap PTE can dirty the
>> folio after folio_test_dirty() and before try_to_unmap(). To remove
>> that hole, the "if (!is_shmem && (folio_test_dirty(...) ||
>> folio_test_writeback(...))" check needs to be moved after
>> try_to_unmap(). With that, all to-be-collapsed folios will be clean,
>> unmapped, and locked, where unmapped means writes via mmap need to
>> fault and take the folio lock, and locked means writes via mmap and
>> write() need to wait until the folio is unlocked.
>>
>> Let me know if my reasoning makes sense. It is definitely worth the time
>> and effort to ensure this patchset does not introduce any unexpected
>> race condition or issue.
>
> Makes sense.
>
> Please clearly spell out that there is a slight change now, where we
> might be collapsing after the file has been opened for write. Then you
> can document that the folio locks should be protecting us from that.
>
> This implies that collapsing in writable files could likely be "easily"
> done in the future.

Definitely. Thank you for all the inputs. :)

Best Regards,
Yan, Zi
On 2026/3/27 09:42, Zi Yan wrote:
> They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
> large folio support, so that read-only THPs created in these FSes are
> not seen by the FSes when the underlying fd becomes writable. Now
> read-only PMD THPs only appear in a FS with large folio support whose
> supported orders include PMD_ORDER.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---

LGTM, feel free to add:

Reviewed-by: Lance Yang <lance.yang@linux.dev>