RRe: Possible memory leak in 6.17.7

RRe: Possible memory leak in 6.17.7
Posted by David Wang 5 days, 5 hours ago


At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
>"David Wang" <00107082@163.com> wrote:
>
>> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:
>> >On Mon,  8 Dec 2025 19:08:29 +0800
>> >David Wang <00107082@163.com> wrote:
>> >  
>> >> On Mon, 10 Nov 2025 18:20:08 +1000
>> >> Mal Haak <malcolm@haak.id.au> wrote:  
>> >> > Hello,
>> >> > 
>> >> > I have found a memory leak in 6.17.7 but I am unsure how to
>> >> > track it down effectively.
>> >> > 
>> >> >     
>> >> 
>> >> I think the `memory allocation profiling` feature can help.
>> >> https://docs.kernel.org/mm/allocation-profiling.html
>> >> 
>> >> You would need to build a kernel with 
>> >> CONFIG_MEM_ALLOC_PROFILING=y
>> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
>> >> 
>> >> And check /proc/allocinfo for the suspicious allocations which take
>> >> more memory than expected.
>> >> 
>> >> (I once caught a nvidia driver memory leak.)
>> >> 
>> >> 
>> >> FYI
>> >> David
>> >>   
>> >
>> >Thank you for your suggestion. I have some results.
>> >
>> >Ran the rsync workload for about 9 hours. It started to look like it
>> >was happening.
>> ># smem -pw
>> >Area                           Used      Cache   Noncache 
>> >firmware/hardware             0.00%      0.00%      0.00% 
>> >kernel image                  0.00%      0.00%      0.00% 
>> >kernel dynamic memory        80.46%     65.80%     14.66% 
>> >userspace memory              0.35%      0.16%      0.19% 
>> >free memory                  19.19%     19.19%      0.00% 
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> >         22M     5609 mm/memory.c:1190 func:folio_prealloc 
>> >         23M     1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem 
>> >         24M    24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc 
>> >         27M     6693 mm/memory.c:1192 func:folio_prealloc 
>> >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
>> >        258M      129 mm/khugepaged.c:1069 func:alloc_charge_folio 
>> >        430M   770788 lib/xarray.c:378 func:xas_alloc 
>> >        545M    36444 mm/slub.c:3059 func:alloc_slab_page 
>> >        9.8G  2563617 mm/readahead.c:189 func:ractl_alloc_folio 
>> >         20G  5164004 mm/filemap.c:2012 func:__filemap_get_folio 
>> >
>> >
>> >So I stopped the workload and dropped caches to confirm.
>> >
>> ># echo 3 > /proc/sys/vm/drop_caches
>> ># smem -pw
>> >Area                           Used      Cache   Noncache 
>> >firmware/hardware             0.00%      0.00%      0.00% 
>> >kernel image                  0.00%      0.00%      0.00% 
>> >kernel dynamic memory        33.45%      0.09%     33.36% 
>> >userspace memory              0.36%      0.16%      0.19% 
>> >free memory                  66.20%     66.20%      0.00% 
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> >         12M     2987 mm/execmem.c:41 func:execmem_vmalloc 
>> >         12M        3 kernel/dma/pool.c:96 func:atomic_pool_expand 
>> >         13M      751 mm/slub.c:3061 func:alloc_slab_page 
>> >         16M        8 mm/khugepaged.c:1069 func:alloc_charge_folio 
>> >         18M     4355 mm/memory.c:1190 func:folio_prealloc 
>> >         24M     6119 mm/memory.c:1192 func:folio_prealloc 
>> >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
>> >         61M    15448 mm/readahead.c:189 func:ractl_alloc_folio 
>> >         79M     6726 mm/slub.c:3059 func:alloc_slab_page 
>> >         11G  2674488 mm/filemap.c:2012 func:__filemap_get_folio

Maybe narrowing down which callers of __filemap_get_folio account for the "Noncache" memory would help clarify things.
(It could be by design, and the memory may need some route other than dropping caches to be released; just a guess.)
If you want, you can modify the code to split the accounting for __filemap_get_folio according to its callers.

The following is a draft patch (based on v6.18):

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 09b581c1d878..ba8c659a6ae3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
 }
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+
+#define __filemap_get_folio(...)			\
+	alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
+
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 024b71da5224..e1c1c26d7cb3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
@@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			err = -ENOMEM;
 			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
-			folio = filemap_alloc_folio(alloc_gfp, order);
+			folio = filemap_alloc_folio_noprof(alloc_gfp, order);
 			if (!folio)
 				continue;
 
@@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_clear_dropbehind(folio);
 	return folio;
 }
-EXPORT_SYMBOL(__filemap_get_folio);
+EXPORT_SYMBOL(__filemap_get_folio_noprof);
 
 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 		xa_mark_t mark)

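With the patch applied (and the kernel and modules rebuilt), the
remaining folios should be attributed to the call sites of
__filemap_get_folio rather than to mm/filemap.c itself. Something like
the following should then show whether the leftover memory belongs to
a ceph/netfs caller (the grep pattern is only a guess at the likely
paths):

# sort -g /proc/allocinfo | tail -n 20 | numfmt --to=iec
# grep -E 'ceph|netfs' /proc/allocinfo | sort -g | tail | numfmt --to=iec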



FYI
David

>> >
>> >So if I'm reading this correctly, something is causing folios to
>> >collect and not be able to be freed?  
>> 
>> CC cephfs, maybe someone there can make sense of that
>> folio usage more easily
>> 
>> 
>> >
>> >Also it's clear that some of the folios are counted as cache and
>> >some aren't. 
>> >
>> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm now
>> >going to manually walk through previous kernel releases and find
>> >where it first starts happening purely because I'm having issues
>> >building earlier kernels due to rust stuff and other python
>> >incompatibilities making doing a git-bisect a bit fun.
>> >
>> >I'll do it the packages way until I get closer, then solve the build
>> >issues. 
>> >
>> >Thanks,
>> >Mal
>> >  
>Thanks David.
>
>I've contacted the ceph developers as well. 
>
>There was a suggestion it was due to the change from, to quote:
>folio.free() to folio.put() or something like this.
>
>The change happened around 6.14/6.15
>
>I've found an easier reproducer. 
>
>There has been a suggestion that perhaps the ceph team might not fix
>this as "you can just reboot before the machine becomes unstable" and
>"Nobody else has encountered this bug"
>
>I'll leave that to other people to make a call on but I'd assume the
>lack of reports is due to the fact that most stable distros are still
>on a far too early kernel and/or are using the fuse driver with k8s.
>
>Anyway, thanks for your assistance.
Re: RRe: Possible memory leak in 6.17.7
Posted by Mal Haak 5 days, 4 hours ago
On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
> >"David Wang" <00107082@163.com> wrote:
> >  
> >> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:  
> >> >On Mon,  8 Dec 2025 19:08:29 +0800
> >> >David Wang <00107082@163.com> wrote:
> >> >    
> >> >> On Mon, 10 Nov 2025 18:20:08 +1000
> >> >> Mal Haak <malcolm@haak.id.au> wrote:    
> >> >> > Hello,
> >> >> > 
> >> >> > I have found a memory leak in 6.17.7 but I am unsure how to
> >> >> > track it down effectively.
> >> >> > 
> >> >> >       
> >> >> 
> >> >> I think the `memory allocation profiling` feature can help.
> >> >> https://docs.kernel.org/mm/allocation-profiling.html
> >> >> 
> >> >> You would need to build a kernel with 
> >> >> CONFIG_MEM_ALLOC_PROFILING=y
> >> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> >> >> 
> >> >> And check /proc/allocinfo for the suspicious allocations which
> >> >> take more memory than expected.
> >> >> 
> >> >> (I once caught a nvidia driver memory leak.)
> >> >> 
> >> >> 
> >> >> FYI
> >> >> David
> >> >>     
> >> >
> >> >Thank you for your suggestion. I have some results.
> >> >
> >> >Ran the rsync workload for about 9 hours. It started to look like
> >> >it was happening.
> >> ># smem -pw
> >> >Area                           Used      Cache   Noncache 
> >> >firmware/hardware             0.00%      0.00%      0.00% 
> >> >kernel image                  0.00%      0.00%      0.00% 
> >> >kernel dynamic memory        80.46%     65.80%     14.66% 
> >> >userspace memory              0.35%      0.16%      0.19% 
> >> >free memory                  19.19%     19.19%      0.00% 
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> >         22M     5609 mm/memory.c:1190 func:folio_prealloc 
> >> >         23M     1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem 
> >> >         24M    24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc 
> >> >         27M     6693 mm/memory.c:1192 func:folio_prealloc 
> >> >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
> >> >        258M      129 mm/khugepaged.c:1069 func:alloc_charge_folio 
> >> >        430M   770788 lib/xarray.c:378 func:xas_alloc 
> >> >        545M    36444 mm/slub.c:3059 func:alloc_slab_page 
> >> >        9.8G  2563617 mm/readahead.c:189 func:ractl_alloc_folio 
> >> >         20G  5164004 mm/filemap.c:2012 func:__filemap_get_folio 
> >> >
> >> >
> >> >So I stopped the workload and dropped caches to confirm.
> >> >
> >> ># echo 3 > /proc/sys/vm/drop_caches
> >> ># smem -pw
> >> >Area                           Used      Cache   Noncache 
> >> >firmware/hardware             0.00%      0.00%      0.00% 
> >> >kernel image                  0.00%      0.00%      0.00% 
> >> >kernel dynamic memory        33.45%      0.09%     33.36% 
> >> >userspace memory              0.36%      0.16%      0.19% 
> >> >free memory                  66.20%     66.20%      0.00% 
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> >         12M     2987 mm/execmem.c:41 func:execmem_vmalloc 
> >> >         12M        3 kernel/dma/pool.c:96 func:atomic_pool_expand 
> >> >         13M      751 mm/slub.c:3061 func:alloc_slab_page 
> >> >         16M        8 mm/khugepaged.c:1069 func:alloc_charge_folio 
> >> >         18M     4355 mm/memory.c:1190 func:folio_prealloc 
> >> >         24M     6119 mm/memory.c:1192 func:folio_prealloc 
> >> >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
> >> >         61M    15448 mm/readahead.c:189 func:ractl_alloc_folio 
> >> >         79M     6726 mm/slub.c:3059 func:alloc_slab_page 
> >> >         11G  2674488 mm/filemap.c:2012 func:__filemap_get_folio  
> 
> Maybe narrowing down the "Noncache" caller of __filemap_get_folio
> would help clarify things. (It could be designed that way, and  needs
> other route than dropping-cache to release the memory, just
> guess....) If you want, you can modify code to split the accounting
> for __filemap_get_folio according to its callers.


Thanks again, I'll add this patch in and see where I end up. 

The issue is that nothing will cause the memory to be freed. Dropping
caches doesn't work, memory pressure doesn't work, and unmounting the
filesystems doesn't work. Removing the cephfs and netfs kernel modules
doesn't work either. 
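
One crude way to apply that pressure and re-check is something like
this (stress-ng is just an example here, any memory hog will do):

# stress-ng --vm 2 --vm-bytes 75% --timeout 120s
# grep 'mm/filemap.c' /proc/allocinfo   # would drop if the folios were reclaimable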

This is why I feel it's a ref_count (or similar) issue. 

I've also found that it seems to be a fixed amount leaked each time,
per file. Simply doing lots of IO on one large file doesn't leak as
fast as doing it on lots of "small" files (greater than 10MB but less
than 100MB seems to be the sweet spot).
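
A rough way to put a number on the per-file leak would be something
like the following (the /mnt/cephfs path is only a placeholder for
wherever cephfs is mounted):

# grep 'mm/filemap.c' /proc/allocinfo   # baseline
# for i in $(seq 1 100); do dd if=/dev/zero of=/mnt/cephfs/leaktest.$i bs=1M count=50 status=none; done
# echo 3 > /proc/sys/vm/drop_caches
# grep 'mm/filemap.c' /proc/allocinfo   # (new total - baseline) / 100 ~ bytes stuck per file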

Also, dropping caches while the workload is running actually amplifies
the issue. So it very much feels like something is wrong in the reclaim
code.

Anyway I'll get this patch applied and see where I end up. 

I now have crash dumps (after enabling crash_on_oom), so I'm going to
try to find these structures and see what state they are in.
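
Roughly, I expect that to look something like the following with the
crash utility, assuming a matching debuginfo vmlinux (paths and
addresses are placeholders):

# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<dump-dir>/vmcore
crash> kmem -i                  # overall memory accounting seen in the dump
crash> kmem <address>           # map a suspect address back to its page/slab
crash> struct folio <address>   # inspect the folio's refcount, mapcount and flags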

Thanks again. 

Mal


> Following is a draft patch: (based on v6.18)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 09b581c1d878..ba8c659a6ae3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
>  }
>  
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t
> index); -struct folio *__filemap_get_folio(struct address_space
> *mapping, pgoff_t index, +
> +#define __filemap_get_folio(...)			\
> +	alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
> +
> +struct folio *__filemap_get_folio_noprof(struct address_space
> *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp);
>  struct page *pagecache_get_page(struct address_space *mapping,
> pgoff_t index, fgf_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 024b71da5224..e1c1c26d7cb3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space
> *mapping, pgoff_t index) *
>   * Return: The found folio or an ERR_PTR() otherwise.
>   */
> -struct folio *__filemap_get_folio(struct address_space *mapping,
> pgoff_t index, +struct folio *__filemap_get_folio_noprof(struct
> address_space *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
> @@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct
> address_space *mapping, pgoff_t index, err = -ENOMEM;
>  			if (order > min_order)
>  				alloc_gfp |= __GFP_NORETRY |
> __GFP_NOWARN;
> -			folio = filemap_alloc_folio(alloc_gfp,
> order);
> +			folio =
> filemap_alloc_folio_noprof(alloc_gfp, order); if (!folio)
>  				continue;
>  
> @@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct
> address_space *mapping, pgoff_t index, folio_clear_dropbehind(folio);
>  	return folio;
>  }
> -EXPORT_SYMBOL(__filemap_get_folio);
> +EXPORT_SYMBOL(__filemap_get_folio_noprof);
>  
>  static inline struct folio *find_get_entry(struct xa_state *xas,
> pgoff_t max, xa_mark_t mark)
> 
> 
> 
> 
> FYI
> David
> 
> >> >
> >> >So if I'm reading this correctly something is causing folios
> >> >collect and not be able to be freed?    
> >> 
> >> CC cephfs, maybe someone could have an easy reading out of those
> >> folio usage
> >> 
> >>   
> >> >
> >> >Also it's clear that some of the folio's are counting as cache and
> >> >some aren't. 
> >> >
> >> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm
> >> >now going to manually walk through previous kernel releases and
> >> >find where it first starts happening purely because I'm having
> >> >issues building earlier kernels due to rust stuff and other python
> >> >incompatibilities making doing a git-bisect a bit fun.
> >> >
> >> >I'll do it the packages way until I get closer, then solve the
> >> >build issues. 
> >> >
> >> >Thanks,
> >> >Mal
> >> >    
> >Thanks David.
> >
> >I've contacted the ceph developers as well. 
> >
> >There was a suggestion it was due to the change from, to quote:
> >folio.free() to folio.put() or something like this.
> >
> >The change happened around 6.14/6.15
> >
> >I've found an easier reproducer. 
> >
> >There has been a suggestion that perhaps the ceph team might not fix
> >this as "you can just reboot before the machine becomes unstable" and
> >"Nobody else has encountered this bug"
> >
> >I'll leave that to other people to make a call on but I'd assume the
> >lack of reports is due to the fact that most stable distros are still
> >on a a far too early kernel and/or are using the fuse driver with
> >k8s.
> >
> >Anyway, thanks for your assistance.
RE: RRe: Possible memory leak in 6.17.7
Posted by Viacheslav Dubeyko 13 hours ago
Hi Mal,

On Thu, 2025-12-11 at 14:23 +1000, Mal Haak wrote:
> On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
> "David Wang" <00107082@163.com> wrote:
> 
> > At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
> > > On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
> > > "David Wang" <00107082@163.com> wrote:
> > >  
> > > > At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:  
> > > > > On Mon,  8 Dec 2025 19:08:29 +0800
> > > > > David Wang <00107082@163.com> wrote:
> > > > >    
> > > > > > On Mon, 10 Nov 2025 18:20:08 +1000
> > > > > > Mal Haak <malcolm@haak.id.au> wrote:    
> > > > > > > Hello,
> > > > > > > 
> > > > > > > I have found a memory leak in 6.17.7 but I am unsure how to
> > > > > > > track it down effectively.
> > > > > > > 
> > > > > > >       
> > > > > > 
> > > > > > I think the `memory allocation profiling` feature can help.
> > > > > > https://docs.kernel.org/mm/allocation-profiling.html  
> > > > > > 
> > > > > > You would need to build a kernel with 
> > > > > > CONFIG_MEM_ALLOC_PROFILING=y
> > > > > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> > > > > > 
> > > > > > And check /proc/allocinfo for the suspicious allocations which
> > > > > > take more memory than expected.
> > > > > > 
> > > > > > (I once caught a nvidia driver memory leak.)
> > > > > > 
> > > > > > 
> > > > > > FYI
> > > > > > David
> > > > > >     
> > > > > 
> > > > > Thank you for your suggestion. I have some results.
> > > > > 
> > > > > Ran the rsync workload for about 9 hours. It started to look like
> > > > > it was happening.
> > > > > # smem -pw
> > > > > Area                           Used      Cache   Noncache 
> > > > > firmware/hardware             0.00%      0.00%      0.00% 
> > > > > kernel image                  0.00%      0.00%      0.00% 
> > > > > kernel dynamic memory        80.46%     65.80%     14.66% 
> > > > > userspace memory              0.35%      0.16%      0.19% 
> > > > > free memory                  19.19%     19.19%      0.00% 
> > > > > # sort -g /proc/allocinfo|tail|numfmt --to=iec
> > > > >         22M     5609 mm/memory.c:1190 func:folio_prealloc 
> > > > >         23M     1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem 
> > > > >         24M    24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc 
> > > > >         27M     6693 mm/memory.c:1192 func:folio_prealloc 
> > > > >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
> > > > >        258M      129 mm/khugepaged.c:1069 func:alloc_charge_folio 
> > > > >        430M   770788 lib/xarray.c:378 func:xas_alloc 
> > > > >        545M    36444 mm/slub.c:3059 func:alloc_slab_page 
> > > > >        9.8G  2563617 mm/readahead.c:189 func:ractl_alloc_folio 
> > > > >         20G  5164004 mm/filemap.c:2012 func:__filemap_get_folio 
> > > > > 
> > > > > 
> > > > > So I stopped the workload and dropped caches to confirm.
> > > > > 
> > > > > # echo 3 > /proc/sys/vm/drop_caches
> > > > > # smem -pw
> > > > > Area                           Used      Cache   Noncache 
> > > > > firmware/hardware             0.00%      0.00%      0.00% 
> > > > > kernel image                  0.00%      0.00%      0.00% 
> > > > > kernel dynamic memory        33.45%      0.09%     33.36% 
> > > > > userspace memory              0.36%      0.16%      0.19% 
> > > > > free memory                  66.20%     66.20%      0.00% 
> > > > > # sort -g /proc/allocinfo|tail|numfmt --to=iec
> > > > >         12M     2987 mm/execmem.c:41 func:execmem_vmalloc 
> > > > >         12M        3 kernel/dma/pool.c:96 func:atomic_pool_expand 
> > > > >         13M      751 mm/slub.c:3061 func:alloc_slab_page 
> > > > >         16M        8 mm/khugepaged.c:1069 func:alloc_charge_folio 
> > > > >         18M     4355 mm/memory.c:1190 func:folio_prealloc 
> > > > >         24M     6119 mm/memory.c:1192 func:folio_prealloc 
> > > > >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
> > > > >         61M    15448 mm/readahead.c:189 func:ractl_alloc_folio 
> > > > >         79M     6726 mm/slub.c:3059 func:alloc_slab_page 
> > > > >         11G  2674488 mm/filemap.c:2012 func:__filemap_get_folio  
> > 
> > Maybe narrowing down the "Noncache" caller of __filemap_get_folio
> > would help clarify things. (It could be designed that way, and  needs
> > other route than dropping-cache to release the memory, just
> > guess....) If you want, you can modify code to split the accounting
> > for __filemap_get_folio according to its callers.
> 
> 
> Thanks again, I'll add this patch in and see where I end up. 
> 
> The issue is nothing will cause the memory to be freed. Dropping caches
> doesn't work, memory pressure doesn't work, unmounting the filesystems
> doesn't work. Removing the cephfs and netfs kernel modules also doesn't
> work. 
> 
> This is why I feel it's a ref_count (or similar) issue. 
> 
> I've also found it seems to be a fixed amount leaked each time, per
> file. Simply doing lots of IO on one large file doesn't leak as fast as
> lots of "small" (greater than 10MB less than 100MB seems to be a sweet
> spot) 
> 
> Also, dropping caches while the workload is running actually amplifies
> the issue. So it very much feels like something is wrong in the reclaim
> code.
> 
> Anyway I'll get this patch applied and see where I end up. 
> 
> I now have crash dumps (after enabling crash_on_oom) so I'm going to
> try and see if I can find these structures and see what state they are
> in
> 
> 

Thanks a lot for reporting the issue. Finally, I can see the discussion on the
mailing list. :) Are you working on a patch with the fix? Should we wait for
your fix, or should I start reproducing and investigating the issue? I am simply
trying to avoid colliding patches, and I also have multiple other issues to fix
in the CephFS kernel client. :)

Thanks,
Slava.
Re: RRe: Possible memory leak in 6.17.7
Posted by Mal Haak 7 hours ago
On Mon, 15 Dec 2025 19:42:56 +0000
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Hi Mal,
> 
<SNIP> 
> 
> Thanks a lot for reporting the issue. Finally, I can see the
> discussion in email list. :) Are you working on the patch with the
> fix? Should we wait for the fix or I need to start the issue
> reproduction and investigation? I am simply trying to avoid patches
> collision and, also, I have multiple other issues for the fix in
> CephFS kernel client. :)
> 
> Thanks,
> Slava.

Hello,

Unfortunately, creating a patch is just outside my comfort zone; I've
lived too long in Lustre land.

I have been trying to narrow down a consistent reproducer that's as
fast as my production workload (which crashes a 32GB VM in 2 hours),
and I haven't got anything quite as fast yet. I think the dd workload
is too well behaved. 

I can confirm the issue appeared with the major patch set that went
into the 6.15 kernel, i.e. during the more complete pages-to-folios
switch, and that nothing about the bug's behaviour has changed since
then. I did have a look at all the diffs from 6.14 to 6.18 on addr.c
and didn't see any changes post-6.15 that looked like they would
impact the bug's behaviour. 
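
For reference, those diffs can be regenerated from a mainline tree
with something like:

# git log --oneline v6.14..v6.18 -- fs/ceph/addr.c
# git diff v6.14 v6.18 -- fs/ceph/addr.c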

Again, I'm not super familiar with the CephFS code, but to hazard a
guess, the fact that the web download workload triggers things faster
suggests that unaligned writes might make things worse. But again, I'm
not 100% sure. I can't find a reproducer as fast as downloading a
dataset; rsync of lots and lots of tiny files is a tad faster than the
dd case.

I did see some changes in ceph_check_page_before_write, where the
previous code unlocked pages and then continued, whereas the changed
folio code just returns ENODATA and doesn't unlock anything, with most
of the rest of the logic unchanged. This might be perfectly fine, but
in my, admittedly limited, reading of the code I couldn't figure out
where anything that was locked before this is called would get
unlocked the way it did prior to the change. Again, I could be miles
off here, and one of the bulk reclaim/unlock passes that was added
might be cleaning this up correctly, or some other functional change
might take care of it, but it looks to be potentially in the code path
I'm exercising, and it has had some unlock logic changed. 

I've spent most of my time trying to find a solid, quick reproducer.
Not that it takes long to start leaking folios, but I wanted something
that aggressively triggers it, so that a small VM would OOM quickly
and, combined with crash_on_oom, it could potentially be used for
regression testing by way of "did the VM crash?".
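
For anyone replicating that setup: the usual way to get a dump on OOM
is the vm.panic_on_oom sysctl plus a loaded kdump kernel, roughly:

# sysctl -w vm.panic_on_oom=1
# grep -o 'crashkernel=[^ ]*' /proc/cmdline   # confirm memory is reserved for kdump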

I'm not sure how much it will help, but I'll provide what details I
can about the actual workload that really sets it off. It's a
Python-based tool for downloading datasets. Datasets are split into N
chunks, and the tool downloads them in parallel, 100 at a time, until
all N chunks are down. The compressed dataset is then unpacked and
reassembled for use with workloads. 
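
A crude way to mimic that access pattern without the real downloader
(all paths here are placeholders on the cephfs mount) would be
something like:

# mkdir -p /mnt/cephfs/scratch
# seq 1 100 | xargs -P 100 -I{} dd if=/dev/zero of=/mnt/cephfs/scratch/part.{} bs=1M count=64 status=none
# cat /mnt/cephfs/scratch/part.* > /mnt/cephfs/scratch/dataset.bin
# rm /mnt/cephfs/scratch/part.*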

This is replicating a common home folder usecase in HPC. CephFS is very
attractive for home folders due to it's "NFS-like" utility and
performance. And many tools use a similar method for fetching large
datasets. Tools are frequently written in python or go. 

None of my customers have hit this yet, nor have any enterprise
customers, as none have moved to a new enough kernel yet due to slow
upgrade cycles. Even Proxmox has only just started testing on a kernel
version > 6.14. 

I'm more than happy to help however I can with testing; I can run
instrumented kernels, test patches, or whatever you need. I'm sorry I
haven't been able to produce a super clean, fast reproducer (my test
cluster at home is all spinners and only 500TB usable), but I figured
I needed to get the word out ASAP, as distros and soon customers are
going to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
marches on, especially those wanting to take full advantage of CacheFS
and encryption functionality. 

Again, thanks for looking at this, and do reach out if I can help in
any way. I am on the ceph Slack if it's faster to reach out that way.

Regards

Mal Haak
Re: Possible memory leak in 6.17.7
Posted by David Wang an hour ago
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Mon, 15 Dec 2025 19:42:56 +0000
>Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
>> Hi Mal,
>> 
><SNIP> 
>> 
>> Thanks a lot for reporting the issue. Finally, I can see the
>> discussion in email list. :) Are you working on the patch with the
>> fix? Should we wait for the fix or I need to start the issue
>> reproduction and investigation? I am simply trying to avoid patches
>> collision and, also, I have multiple other issues for the fix in
>> CephFS kernel client. :)
>> 
>> Thanks,
>> Slava.
>
>Hello,
>
>Unfortunately creating a patch is just outside my comfort zone, I've
>lived too long in Lustre land.

Hi, just out of curiosity, have you narrowed down which caller of __filemap_get_folio
is causing the memory problem? Or are you having trouble applying the debug patch for
memory allocation profiling?

David 

>
>I've have been trying to narrow down a consistent reproducer that's as
>fast as my production workload. (It crashes a 32GB VM in 2hrs) And I
>haven't got it quite as fast. I think the dd workload is too well
>behaved. 
>
>I can confirm the issue appeared in the major patch set that was
>applied as part of the 6.15 kernel. So during the more complete pages
>to folios switch and that nothing has changed in the bug behaviour since
>then. I did have a look at all the diffs from 6.14 to 6.18 on addr.c
>and didn't see any changes post 6.15 that looked like they would impact
>the bug behavior. 
>
>Again, I'm not super familiar with the CephFS code but to hazard a
>guess, but I think that the web download workload triggers things faster
>suggests that unaligned writes might make things worse. But again, I'm
>not 100% sure. I can't find a reproducer as fast as downloading a
>dataset. Rsync of lots and lots of tiny files is a tad faster than the
>dd case.
>
>I did see some changes in ceph_check_page_before_write where the
>previous code unlocked pages and then continued where as the changed
>folio code just returns ENODATA and doesn't unlock anything with most
>of the rest of the logic unchanged. This might be perfectly fine, but
>in my, admittedly limited, reading of the code I couldn't figure out
>where anything that was locked prior to this being called would get
>unlocked like it did prior to the change. Again, I could be miles off
>here and one of the bulk reclaim/unlock passes that was added might be
>cleaning this up correctly or some other functional change might take
>care of this, but it looks to be potentially in the code path I'm
>excising and it has had some unlock logic changed. 
>
>I've spent most of my time trying to find a solid quick reproducer. Not
>that it takes long to start leaking folios, but I wanted something that
>aggressively triggered it so a small vm would oom quickly and when
>combined with crash_on_oom it could potentially be used for regression
>testing by way of "did vm crash?".
>
>I'm not sure if it will super help, but I'll provide what details I can
>about the actual workload that really sets it off. It's a python based
>tool for downloading datasets. Datasets are split into N chunks and the
>tool downloads them in parallel 100 at a time until all N chunks are
>down. The compressed dataset is then unpacked and reassembled for
>use with workloads. 
>
>This is replicating a common home folder usecase in HPC. CephFS is very
>attractive for home folders due to it's "NFS-like" utility and
>performance. And many tools use a similar method for fetching large
>datasets. Tools are frequently written in python or go. 
>
>None of my customers have hit this yet, not have any enterprise
>customers as none have moved to a new enough kernel yet due to slow
>upgrade cycles. Even Proxmox have only just started testing on a kernel
>version > 6.14. 
>
>I'm more than happy to help however I can with testing. I can run
>instrumented kernels or test patches or whatever you need. I am sorry I
>haven't been able to produce a super clean, fast reproducer (my test
>cluster at home is all spinners and only 500TB usable). But I figured I
>needed to get the word out asap as distros and soon customers are going
>to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
>marches on. Especially those wanting to take full advantage of CacheFS
>and encryption functionality. 
>
>Again thanks for looking at this and do reach out if I can help in
>anyway. I am in the ceph slack if it's faster to reach out that way.
>
>Regards
>
>Mal Haak
Re: Possible memory leak in 6.17.7
Posted by Mal Haak an hour ago
On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Mon, 15 Dec 2025 19:42:56 +0000
> >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> >  
> >> Hi Mal,
> >>   
> ><SNIP>   
> >> 
> >> Thanks a lot for reporting the issue. Finally, I can see the
> >> discussion in email list. :) Are you working on the patch with the
> >> fix? Should we wait for the fix or I need to start the issue
> >> reproduction and investigation? I am simply trying to avoid patches
> >> collision and, also, I have multiple other issues for the fix in
> >> CephFS kernel client. :)
> >> 
> >> Thanks,
> >> Slava.  
> >
> >Hello,
> >
> >Unfortunately creating a patch is just outside my comfort zone, I've
> >lived too long in Lustre land.  
> 
> Hi, just out of curiosity, have you narrowed down the caller of
> __filemap_get_folio causing the memory problem? Or do you have
> trouble applying the debug patch for memory allocation profiling?
> 
> David 
> 
Hi David,

I hadn't yet, as I was testing XFS and NFS to see if they replicated
the behaviour, and they did not. 

But actually this could speed things up considerably. I will do that
now and see what I get.

Thanks

Mal

> >
> >I've have been trying to narrow down a consistent reproducer that's
> >as fast as my production workload. (It crashes a 32GB VM in 2hrs)
> >And I haven't got it quite as fast. I think the dd workload is too
> >well behaved. 
> >
> >I can confirm the issue appeared in the major patch set that was
> >applied as part of the 6.15 kernel. So during the more complete pages
> >to folios switch and that nothing has changed in the bug behaviour
> >since then. I did have a look at all the diffs from 6.14 to 6.18 on
> >addr.c and didn't see any changes post 6.15 that looked like they
> >would impact the bug behavior. 
> >
> >Again, I'm not super familiar with the CephFS code but to hazard a
> >guess, but I think that the web download workload triggers things
> >faster suggests that unaligned writes might make things worse. But
> >again, I'm not 100% sure. I can't find a reproducer as fast as
> >downloading a dataset. Rsync of lots and lots of tiny files is a tad
> >faster than the dd case.
> >
> >I did see some changes in ceph_check_page_before_write where the
> >previous code unlocked pages and then continued where as the changed
> >folio code just returns ENODATA and doesn't unlock anything with most
> >of the rest of the logic unchanged. This might be perfectly fine, but
> >in my, admittedly limited, reading of the code I couldn't figure out
> >where anything that was locked prior to this being called would get
> >unlocked like it did prior to the change. Again, I could be miles off
> >here and one of the bulk reclaim/unlock passes that was added might
> >be cleaning this up correctly or some other functional change might
> >take care of this, but it looks to be potentially in the code path
> >I'm excising and it has had some unlock logic changed. 
> >
> >I've spent most of my time trying to find a solid quick reproducer.
> >Not that it takes long to start leaking folios, but I wanted
> >something that aggressively triggered it so a small vm would oom
> >quickly and when combined with crash_on_oom it could potentially be
> >used for regression testing by way of "did vm crash?".
> >
> >I'm not sure if it will super help, but I'll provide what details I
> >can about the actual workload that really sets it off. It's a python
> >based tool for downloading datasets. Datasets are split into N
> >chunks and the tool downloads them in parallel 100 at a time until
> >all N chunks are down. The compressed dataset is then unpacked and
> >reassembled for use with workloads. 
> >
> >This is replicating a common home folder usecase in HPC. CephFS is
> >very attractive for home folders due to it's "NFS-like" utility and
> >performance. And many tools use a similar method for fetching large
> >datasets. Tools are frequently written in python or go. 
> >
> >None of my customers have hit this yet, not have any enterprise
> >customers as none have moved to a new enough kernel yet due to slow
> >upgrade cycles. Even Proxmox have only just started testing on a
> >kernel version > 6.14. 
> >
> >I'm more than happy to help however I can with testing. I can run
> >instrumented kernels or test patches or whatever you need. I am
> >sorry I haven't been able to produce a super clean, fast reproducer
> >(my test cluster at home is all spinners and only 500TB usable). But
> >I figured I needed to get the word out asap as distros and soon
> >customers are going to be moving past 6.12-6.14 kernels as the 5-7
> >year update cycle marches on. Especially those wanting to take full
> >advantage of CacheFS and encryption functionality. 
> >
> >Again thanks for looking at this and do reach out if I can help in
> >anyway. I am in the ceph slack if it's faster to reach out that way.
> >
> >Regards
> >
> >Mal Haak
RE: RRe: Possible memory leak in 6.17.7
Posted by Viacheslav Dubeyko 6 hours ago
On Tue, 2025-12-16 at 11:26 +1000, Mal Haak wrote:
> On Mon, 15 Dec 2025 19:42:56 +0000
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Hi Mal,
> > 
> <SNIP> 
> > 
> > Thanks a lot for reporting the issue. Finally, I can see the
> > discussion in email list. :) Are you working on the patch with the
> > fix? Should we wait for the fix or I need to start the issue
> > reproduction and investigation? I am simply trying to avoid patches
> > collision and, also, I have multiple other issues for the fix in
> > CephFS kernel client. :)
> > 
> > Thanks,
> > Slava.
> 
> Hello,
> 
> Unfortunately creating a patch is just outside my comfort zone, I've
> lived too long in Lustre land.
> 
> I've have been trying to narrow down a consistent reproducer that's as
> fast as my production workload. (It crashes a 32GB VM in 2hrs) And I
> haven't got it quite as fast. I think the dd workload is too well
> behaved. 
> 
> I can confirm the issue appeared in the major patch set that was
> applied as part of the 6.15 kernel. So during the more complete pages
> to folios switch and that nothing has changed in the bug behaviour since
> then. I did have a look at all the diffs from 6.14 to 6.18 on addr.c
> and didn't see any changes post 6.15 that looked like they would impact
> the bug behavior. 
> 
> Again, I'm not super familiar with the CephFS code but to hazard a
> guess, but I think that the web download workload triggers things faster
> suggests that unaligned writes might make things worse. But again, I'm
> not 100% sure. I can't find a reproducer as fast as downloading a
> dataset. Rsync of lots and lots of tiny files is a tad faster than the
> dd case.
> 
> I did see some changes in ceph_check_page_before_write where the
> previous code unlocked pages and then continued where as the changed
> folio code just returns ENODATA and doesn't unlock anything with most
> of the rest of the logic unchanged. This might be perfectly fine, but
> in my, admittedly limited, reading of the code I couldn't figure out
> where anything that was locked prior to this being called would get
> unlocked like it did prior to the change. Again, I could be miles off
> here and one of the bulk reclaim/unlock passes that was added might be
> cleaning this up correctly or some other functional change might take
> care of this, but it looks to be potentially in the code path I'm
> excising and it has had some unlock logic changed. 
> 
> I've spent most of my time trying to find a solid quick reproducer. Not
> that it takes long to start leaking folios, but I wanted something that
> aggressively triggered it so a small vm would oom quickly and when
> combined with crash_on_oom it could potentially be used for regression
> testing by way of "did vm crash?".
> 
> I'm not sure if it will super help, but I'll provide what details I can
> about the actual workload that really sets it off. It's a python based
> tool for downloading datasets. Datasets are split into N chunks and the
> tool downloads them in parallel 100 at a time until all N chunks are
> down. The compressed dataset is then unpacked and reassembled for
> use with workloads. 
> 
> This is replicating a common home folder usecase in HPC. CephFS is very
> attractive for home folders due to it's "NFS-like" utility and
> performance. And many tools use a similar method for fetching large
> datasets. Tools are frequently written in python or go. 
> 
> None of my customers have hit this yet, not have any enterprise
> customers as none have moved to a new enough kernel yet due to slow
> upgrade cycles. Even Proxmox have only just started testing on a kernel
> version > 6.14. 
> 
> I'm more than happy to help however I can with testing. I can run
> instrumented kernels or test patches or whatever you need. I am sorry I
> haven't been able to produce a super clean, fast reproducer (my test
> cluster at home is all spinners and only 500TB usable). But I figured I
> needed to get the word out asap as distros and soon customers are going
> to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
> marches on. Especially those wanting to take full advantage of CacheFS
> and encryption functionality. 
> 
> Again thanks for looking at this and do reach out if I can help in
> anyway. I am in the ceph slack if it's faster to reach out that way.
> 
> 

Thanks a lot for all of your efforts; I hope it will help a lot. Let me start
reproducing the issue. I'll let you know if I need additional details, and I'll
share my progress and any troubles in the ticket that you've created.

Thanks,
Slava.