[PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()

Shakeel Butt posted 1 patch 1 week, 1 day ago
mm/khugepaged.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
[PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Shakeel Butt 1 week, 1 day ago
In META's fleet, we are seeing high level cgroups with zero file memcg
stat but their descendants have non-zero file stat. This should not be
possible. On further inspection by looking at kernel data structures
though drgn, it was revealed that the high level cgroups have negative
file stat which was aggregated from their children.

Another interesting point was that this specific issue start happening
more often as we started deploying thp-always more widely which
indicates some correlation between file memory and THPs and indeed it
was found that file memcg stat accounting is buggy in the collapse code
path from the start.

When collapse_file() replaces small folios with a large THP, it fails to
properly update the NR_FILE_PAGES memcg stat for both the old folios
being freed and the new THP being added. It assumes the old and new
folios belong to the same cgroup. However this assumption breaks in
couple of scenarios:

1. Binary (executable) package downloader running in a different cgroup
   than the actual job executing the downloaded package.

2. File shared and mapped by processes running in different cgroups. One
   process read-in the file and the second process either through
   madvise(COLLAPSE) or khugepaged on behalf of second process
   collapsing the file.

So, the current code has two bugs:

1. For non-shmem files, NR_FILE_PAGES is never incremented for the new
   THP because nr_none is always 0 for non-shmem, and the stat update is
   inside the "if (nr_none)" block.

2. When freeing old folios, NR_FILE_PAGES is never decremented because
   folio->mapping is set to NULL directly without calling
   filemap_unaccount_folio().

These bugs cause incorrect per-memcg accounting when the process
triggering the collapse (MADV_COLLAPSE or khugepaged) belongs to a
different memcg than the process that originally faulted in the pages:

  - Process A (memcg X) reads file, creating 512 small page cache folios
    charged to memcg X (NR_FILE_PAGES += 512 for memcg X)

  - Process B (memcg Y) triggers collapse via MADV_COLLAPSE or khugepaged
    scans B's mm. The new THP is charged to memcg Y.

  - Old folios freed: NR_FILE_PAGES not decremented (bug)
    New THP added: NR_FILE_PAGES not incremented (bug)

  - Later, THP removed from page cache: NR_FILE_PAGES -= 512 for memcg Y

Result: memcg X has +512 inflated pages, memcg Y has -512 (negative!)

Fix this by:
1. Always incrementing NR_FILE_PAGES by HPAGE_PMD_NR for the new THP
2. Decrementing NR_FILE_PAGES for each old folio before clearing its
   mapping pointer

For shmem with holes (nr_none > 0), the net change is still +nr_none
since we decrement (HPAGE_PMD_NR - nr_none) old pages and increment
HPAGE_PMD_NR new pages.

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/khugepaged.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1d994b6c58c6..1cf8e154e214 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2200,8 +2200,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	else
 		lruvec_stat_mod_folio(new_folio, NR_FILE_THPS, HPAGE_PMD_NR);
 
+	lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, HPAGE_PMD_NR);
 	if (nr_none) {
-		lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, nr_none);
 		/* nr_none is always 0 for non-shmem. */
 		lruvec_stat_mod_folio(new_folio, NR_SHMEM, nr_none);
 	}
@@ -2238,6 +2238,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	 */
 	list_for_each_entry_safe(folio, tmp, &pagelist, lru) {
 		list_del(&folio->lru);
+		lruvec_stat_mod_folio(folio, NR_FILE_PAGES,
+				      -folio_nr_pages(folio));
 		folio->mapping = NULL;
 		folio_clear_active(folio);
 		folio_clear_unevictable(folio);
-- 
2.47.3
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Johannes Weiner 1 week, 1 day ago
On Thu, Jan 29, 2026 at 10:40:54AM -0800, Shakeel Butt wrote:
> @@ -2200,8 +2200,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	else
>  		lruvec_stat_mod_folio(new_folio, NR_FILE_THPS, HPAGE_PMD_NR);
>  
> +	lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, HPAGE_PMD_NR);

The memcg breakage is more visible, but I think this has been broken
for NUMA stats even longer. new_folio could also come from a different
node than the subpages, after all.

>  	if (nr_none) {
> -		lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, nr_none);
>  		/* nr_none is always 0 for non-shmem. */
>  		lruvec_stat_mod_folio(new_folio, NR_SHMEM, nr_none);

So AFAICT NR_SHMEM needs the same treatment.

It looks like that's been broken since f3f0e1d2150b ("khugepaged: add
support of collapse for tmpfs/shmem pages").

Anon pages avoided this, because their accounting is done in rmap.c
when the pmd is mapped and the ptes zapped.
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Shakeel Butt 1 week, 1 day ago
On Thu, Jan 29, 2026 at 07:32:14PM -0500, Johannes Weiner wrote:
> On Thu, Jan 29, 2026 at 10:40:54AM -0800, Shakeel Butt wrote:
> > @@ -2200,8 +2200,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >  	else
> >  		lruvec_stat_mod_folio(new_folio, NR_FILE_THPS, HPAGE_PMD_NR);
> >  
> > +	lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, HPAGE_PMD_NR);
> 
> The memcg breakage is more visible, but I think this has been broken
> for NUMA stats even longer. new_folio could also come from a different
> node than the subpages, after all.

Indeed you are right, so I should blame Kiryl instead of Song :P

> 
> >  	if (nr_none) {
> > -		lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, nr_none);
> >  		/* nr_none is always 0 for non-shmem. */
> >  		lruvec_stat_mod_folio(new_folio, NR_SHMEM, nr_none);
> 
> So AFAICT NR_SHMEM needs the same treatment.
> 
> It looks like that's been broken since f3f0e1d2150b ("khugepaged: add
> support of collapse for tmpfs/shmem pages").

Thanks, I will send v2 with correct handling of NR_SHMEM as well.
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Usama Arif 1 week, 1 day ago

On 29/01/2026 18:40, Shakeel Butt wrote:
> In META's fleet, we are seeing high level cgroups with zero file memcg
> stat but their descendants have non-zero file stat. This should not be
> possible. On further inspection by looking at kernel data structures
> though drgn, it was revealed that the high level cgroups have negative
> file stat which was aggregated from their children.
> 
> Another interesting point was that this specific issue start happening
> more often as we started deploying thp-always more widely which
> indicates some correlation between file memory and THPs and indeed it
> was found that file memcg stat accounting is buggy in the collapse code
> path from the start.
> 
> When collapse_file() replaces small folios with a large THP, it fails to
> properly update the NR_FILE_PAGES memcg stat for both the old folios
> being freed and the new THP being added. It assumes the old and new
> folios belong to the same cgroup. However this assumption breaks in
> couple of scenarios:
> 
> 1. Binary (executable) package downloader running in a different cgroup
>    than the actual job executing the downloaded package.
> 
> 2. File shared and mapped by processes running in different cgroups. One
>    process read-in the file and the second process either through
>    madvise(COLLAPSE) or khugepaged on behalf of second process
>    collapsing the file.
> 
> So, the current code has two bugs:
> 
> 1. For non-shmem files, NR_FILE_PAGES is never incremented for the new
>    THP because nr_none is always 0 for non-shmem, and the stat update is
>    inside the "if (nr_none)" block.
> 
> 2. When freeing old folios, NR_FILE_PAGES is never decremented because
>    folio->mapping is set to NULL directly without calling
>    filemap_unaccount_folio().
> 
> These bugs cause incorrect per-memcg accounting when the process
> triggering the collapse (MADV_COLLAPSE or khugepaged) belongs to a
> different memcg than the process that originally faulted in the pages:
> 
>   - Process A (memcg X) reads file, creating 512 small page cache folios
>     charged to memcg X (NR_FILE_PAGES += 512 for memcg X)
> 
>   - Process B (memcg Y) triggers collapse via MADV_COLLAPSE or khugepaged
>     scans B's mm. The new THP is charged to memcg Y.
> 
>   - Old folios freed: NR_FILE_PAGES not decremented (bug)
>     New THP added: NR_FILE_PAGES not incremented (bug)
> 
>   - Later, THP removed from page cache: NR_FILE_PAGES -= 512 for memcg Y
> 
> Result: memcg X has +512 inflated pages, memcg Y has -512 (negative!)
> 
> Fix this by:
> 1. Always incrementing NR_FILE_PAGES by HPAGE_PMD_NR for the new THP
> 2. Decrementing NR_FILE_PAGES for each old folio before clearing its
>    mapping pointer
> 
> For shmem with holes (nr_none > 0), the net change is still +nr_none
> since we decrement (HPAGE_PMD_NR - nr_none) old pages and increment
> HPAGE_PMD_NR new pages.
> 
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev

Acked-by: Usama Arif <usamaarif642@gmail.com>
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Matthew Wilcox 1 week, 1 day ago
On Thu, Jan 29, 2026 at 10:40:54AM -0800, Shakeel Butt wrote:
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")

Are you sure this is the right Fixes?  99cb0dbd47a1 wasn't cgroup
aware:

        if (nr_none) {
                struct zone *zone = page_zone(new_page);

                __mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
-               __mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
+               if (is_shmem)
+                       __mod_node_page_state(zone->zone_pgdat,
+                                             NR_SHMEM, nr_none);
        }

b8eddff8886b added cgroup support:

        if (is_shmem)
-               __inc_node_page_state(new_page, NR_SHMEM_THPS);
+               __inc_lruvec_page_state(new_page, NR_SHMEM_THPS);
        else {
-               __inc_node_page_state(new_page, NR_FILE_THPS);
+               __inc_lruvec_page_state(new_page, NR_FILE_THPS);
                filemap_nr_thps_inc(mapping);
        }

which would seem like the right Fixes: to me?

> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  mm/khugepaged.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1d994b6c58c6..1cf8e154e214 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2200,8 +2200,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	else
>  		lruvec_stat_mod_folio(new_folio, NR_FILE_THPS, HPAGE_PMD_NR);
>  
> +	lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, HPAGE_PMD_NR);
>  	if (nr_none) {
> -		lruvec_stat_mod_folio(new_folio, NR_FILE_PAGES, nr_none);
>  		/* nr_none is always 0 for non-shmem. */
>  		lruvec_stat_mod_folio(new_folio, NR_SHMEM, nr_none);
>  	}
> @@ -2238,6 +2238,8 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	 */
>  	list_for_each_entry_safe(folio, tmp, &pagelist, lru) {
>  		list_del(&folio->lru);
> +		lruvec_stat_mod_folio(folio, NR_FILE_PAGES,
> +				      -folio_nr_pages(folio));
>  		folio->mapping = NULL;
>  		folio_clear_active(folio);
>  		folio_clear_unevictable(folio);
> -- 
> 2.47.3
> 
>
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Shakeel Butt 1 week, 1 day ago
On Thu, Jan 29, 2026 at 10:47:16PM +0000, Matthew Wilcox wrote:
> On Thu, Jan 29, 2026 at 10:40:54AM -0800, Shakeel Butt wrote:
> > Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> 
> Are you sure this is the right Fixes?  99cb0dbd47a1 wasn't cgroup
> aware:
> 
>         if (nr_none) {
>                 struct zone *zone = page_zone(new_page);
> 
>                 __mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
> -               __mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
> +               if (is_shmem)
> +                       __mod_node_page_state(zone->zone_pgdat,
> +                                             NR_SHMEM, nr_none);
>         }
> 
> b8eddff8886b added cgroup support:
> 
>         if (is_shmem)
> -               __inc_node_page_state(new_page, NR_SHMEM_THPS);
> +               __inc_lruvec_page_state(new_page, NR_SHMEM_THPS);
>         else {
> -               __inc_node_page_state(new_page, NR_FILE_THPS);
> +               __inc_lruvec_page_state(new_page, NR_FILE_THPS);
>                 filemap_nr_thps_inc(mapping);
>         }
> 
> which would seem like the right Fixes: to me?
> 

b8eddff8886b is about different stats NR_SHMEM_THPS & NR_FILE_THPS.
However as Johannes responded this code was broken even before memcg
awareness for NR_FILE_PAGES and NR_SHMEM.

I will send v2 with correct handling of NR_SHMEM as well.
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Andrew Morton 1 week, 1 day ago
On Thu, 29 Jan 2026 10:40:54 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> In META's fleet, we are seeing high level cgroups with zero file memcg
> stat but their descendants have non-zero file stat. This should not be
> possible. On further inspection by looking at kernel data structures
> though drgn, it was revealed that the high level cgroups have negative
> file stat which was aggregated from their children.
> 
> Another interesting point was that this specific issue start happening
> more often as we started deploying thp-always more widely which
> indicates some correlation between file memory and THPs and indeed it
> was found that file memcg stat accounting is buggy in the collapse code
> path from the start.

So this has no known runtime effect apart from incorrect accounting?

> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")

Should we cc:stable?
Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
Posted by Shakeel Butt 1 week, 1 day ago
On Thu, Jan 29, 2026 at 10:50:15AM -0800, Andrew Morton wrote:
> On Thu, 29 Jan 2026 10:40:54 -0800 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > In META's fleet, we are seeing high level cgroups with zero file memcg
> > stat but their descendants have non-zero file stat. This should not be
> > possible. On further inspection by looking at kernel data structures
> > though drgn, it was revealed that the high level cgroups have negative
> > file stat which was aggregated from their children.
> > 
> > Another interesting point was that this specific issue start happening
> > more often as we started deploying thp-always more widely which
> > indicates some correlation between file memory and THPs and indeed it
> > was found that file memcg stat accounting is buggy in the collapse code
> > path from the start.
> 
> So this has no known runtime effect apart from incorrect accounting?

Yes just an accounting bug.

> 
> > Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> 
> Should we cc:stable?

I think so.