Re: Possible memory leak in 6.17.7

David Wang posted 1 patch 1 month, 4 weeks ago
Re: Possible memory leak in 6.17.7
Posted by David Wang 1 month, 4 weeks ago


At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
>"David Wang" <00107082@163.com> wrote:
>
>> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:
>> >On Mon,  8 Dec 2025 19:08:29 +0800
>> >David Wang <00107082@163.com> wrote:
>> >  
>> >> On Mon, 10 Nov 2025 18:20:08 +1000
>> >> Mal Haak <malcolm@haak.id.au> wrote:  
>> >> > Hello,
>> >> > 
>> >> > I have found a memory leak in 6.17.7 but I am unsure how to
>> >> > track it down effectively.
>> >> > 
>> >> >     
>> >> 
>> >> I think the `memory allocation profiling` feature can help.
>> >> https://docs.kernel.org/mm/allocation-profiling.html
>> >> 
>> >> You would need to build a kernel with 
>> >> CONFIG_MEM_ALLOC_PROFILING=y
>> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
>> >> 
>> >> And check /proc/allocinfo for the suspicious allocations which take
>> >> more memory than expected.
>> >> 
>> >> (I once caught a nvidia driver memory leak.)
>> >> 
>> >> 
>> >> FYI
>> >> David
>> >>   
>> >
>> >Thank you for your suggestion. I have some results.
>> >
>> >Ran the rsync workload for about 9 hours. It started to look like it
>> >was happening.
>> ># smem -pw
>> >Area                           Used      Cache   Noncache 
>> >firmware/hardware             0.00%      0.00%      0.00% 
>> >kernel image                  0.00%      0.00%      0.00% 
>> >kernel dynamic memory        80.46%     65.80%     14.66% 
>> >userspace memory              0.35%      0.16%      0.19% 
>> >free memory                  19.19%     19.19%      0.00% 
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> >         22M     5609 mm/memory.c:1190 func:folio_prealloc 
>> >         23M     1932 fs/xfs/xfs_buf.c:226 [xfs]
>> >func:xfs_buf_alloc_backing_mem 
>> >         24M    24135 fs/xfs/xfs_icache.c:97 [xfs]
>> > func:xfs_inode_alloc 27M     6693 mm/memory.c:1192
>> > func:folio_prealloc 58M    14784 mm/page_ext.c:271
>> > func:alloc_page_ext 258M      129 mm/khugepaged.c:1069
>> > func:alloc_charge_folio 430M   770788 lib/xarray.c:378
>> > func:xas_alloc 545M    36444 mm/slub.c:3059 func:alloc_slab_page 
>> >        9.8G  2563617 mm/readahead.c:189 func:ractl_alloc_folio 
>> >         20G  5164004 mm/filemap.c:2012 func:__filemap_get_folio 
>> >
>> >
>> >So I stopped the workload and dropped caches to confirm.
>> >
>> ># echo 3 > /proc/sys/vm/drop_caches
>> ># smem -pw
>> >Area                           Used      Cache   Noncache 
>> >firmware/hardware             0.00%      0.00%      0.00% 
>> >kernel image                  0.00%      0.00%      0.00% 
>> >kernel dynamic memory        33.45%      0.09%     33.36% 
>> >userspace memory              0.36%      0.16%      0.19% 
>> >free memory                  66.20%     66.20%      0.00% 
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> >         12M     2987 mm/execmem.c:41 func:execmem_vmalloc 
>> >         12M        3 kernel/dma/pool.c:96 func:atomic_pool_expand 
>> >         13M      751 mm/slub.c:3061 func:alloc_slab_page 
>> >         16M        8 mm/khugepaged.c:1069 func:alloc_charge_folio 
>> >         18M     4355 mm/memory.c:1190 func:folio_prealloc 
>> >         24M     6119 mm/memory.c:1192 func:folio_prealloc 
>> >         58M    14784 mm/page_ext.c:271 func:alloc_page_ext 
>> >         61M    15448 mm/readahead.c:189 func:ractl_alloc_folio 
>> >         79M     6726 mm/slub.c:3059 func:alloc_slab_page 
>> >         11G  2674488 mm/filemap.c:2012 func:__filemap_get_folio

Maybe narrowing down the "Noncache" caller of __filemap_get_folio would help clarify things.
(It could be working as designed and just need a release path other than dropping caches; that is only a guess....)
If you want, you can modify the code to split the accounting for __filemap_get_folio according to its callers.

The following is a draft patch (based on v6.18):

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 09b581c1d878..ba8c659a6ae3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
 }
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+
+#define __filemap_get_folio(...)			\
+	alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
+
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 024b71da5224..e1c1c26d7cb3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
@@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			err = -ENOMEM;
 			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
-			folio = filemap_alloc_folio(alloc_gfp, order);
+			folio = filemap_alloc_folio_noprof(alloc_gfp, order);
 			if (!folio)
 				continue;
 
@@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_clear_dropbehind(folio);
 	return folio;
 }
-EXPORT_SYMBOL(__filemap_get_folio);
+EXPORT_SYMBOL(__filemap_get_folio_noprof);
 
 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 		xa_mark_t mark)
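
With this applied, the single mm/filemap.c:2012 entry in /proc/allocinfo
should split into one entry per call site of __filemap_get_folio, which
would show which user of the page cache is holding the memory. (Just a
sketch of how I would check it; if profiling is not already enabled by
default in your build, I believe it can be switched on at runtime via
the vm.mem_profiling sysctl.)

# sysctl -w vm.mem_profiling=1
# sort -g /proc/allocinfo|tail|numfmt --to=iec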




FYI
David

>> >
>> >So if I'm reading this correctly something is causing folios collect
>> >and not be able to be freed?  
>> 
>> CC cephfs, maybe someone could have an easy reading out of those
>> folio usage
>> 
>> 
>> >
>> >Also it's clear that some of the folio's are counting as cache and
>> >some aren't. 
>> >
>> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm now
>> >going to manually walk through previous kernel releases and find
>> >where it first starts happening purely because I'm having issues
>> >building earlier kernels due to rust stuff and other python
>> >incompatibilities making doing a git-bisect a bit fun.
>> >
>> >I'll do it the packages way until I get closer, then solve the build
>> >issues. 
>> >
>> >Thanks,
>> >Mal
>> >  
>Thanks David.
>
>I've contacted the ceph developers as well. 
>
>There was a suggestion it was due to the change from, to quote:
>folio.free() to folio.put() or something like this.
>
>The change happened around 6.14/6.15
>
>I've found an easier reproducer. 
>
>There has been a suggestion that perhaps the ceph team might not fix
>this as "you can just reboot before the machine becomes unstable" and
>"Nobody else has encountered this bug"
>
>I'll leave that to other people to make a call on but I'd assume the
>lack of reports is due to the fact that most stable distros are still
>on a a far too early kernel and/or are using the fuse driver with k8s.
>
>Anyway, thanks for your assistance.
Re: Possible memory leak in 6.17.7
Posted by Mal Haak 1 month, 4 weeks ago
On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
> >"David Wang" <00107082@163.com> wrote:
> >  
> >> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:  
> >> >On Mon,  8 Dec 2025 19:08:29 +0800
> >> >David Wang <00107082@163.com> wrote:
> >> >    
> >> >> On Mon, 10 Nov 2025 18:20:08 +1000
> >> >> Mal Haak <malcolm@haak.id.au> wrote:    
> >> >> > Hello,
> >> >> > 
> >> >> > I have found a memory leak in 6.17.7 but I am unsure how to
> >> >> > track it down effectively.
> >> >> > 
> >> >> >       
> >> >> 
> >> >> I think the `memory allocation profiling` feature can help.
> >> >> https://docs.kernel.org/mm/allocation-profiling.html
> >> >> 
> >> >> You would need to build a kernel with 
> >> >> CONFIG_MEM_ALLOC_PROFILING=y
> >> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> >> >> 
> >> >> And check /proc/allocinfo for the suspicious allocations which
> >> >> take more memory than expected.
> >> >> 
> >> >> (I once caught a nvidia driver memory leak.)
> >> >> 
> >> >> 
> >> >> FYI
> >> >> David
> >> >>     
> >> >
> >> >Thank you for your suggestion. I have some results.
> >> >
> >> >Ran the rsync workload for about 9 hours. It started to look like
> >> >it was happening.
> >> ># smem -pw
> >> >Area                           Used      Cache   Noncache 
> >> >firmware/hardware             0.00%      0.00%      0.00% 
> >> >kernel image                  0.00%      0.00%      0.00% 
> >> >kernel dynamic memory        80.46%     65.80%     14.66% 
> >> >userspace memory              0.35%      0.16%      0.19% 
> >> >free memory                  19.19%     19.19%      0.00% 
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> >         22M     5609 mm/memory.c:1190 func:folio_prealloc 
> >> >         23M     1932 fs/xfs/xfs_buf.c:226 [xfs]
> >> >func:xfs_buf_alloc_backing_mem 
> >> >         24M    24135 fs/xfs/xfs_icache.c:97 [xfs]
> >> > func:xfs_inode_alloc 27M     6693 mm/memory.c:1192
> >> > func:folio_prealloc 58M    14784 mm/page_ext.c:271
> >> > func:alloc_page_ext 258M      129 mm/khugepaged.c:1069
> >> > func:alloc_charge_folio 430M   770788 lib/xarray.c:378
> >> > func:xas_alloc 545M    36444 mm/slub.c:3059 func:alloc_slab_page 
> >> >        9.8G  2563617 mm/readahead.c:189 func:ractl_alloc_folio 
> >> >         20G  5164004 mm/filemap.c:2012 func:__filemap_get_folio 
> >> >
> >> >
> >> >So I stopped the workload and dropped caches to confirm.
> >> >
> >> ># echo 3 > /proc/sys/vm/drop_caches
> >> ># smem -pw
> >> >Area                           Used      Cache   Noncache 
> >> >firmware/hardware             0.00%      0.00%      0.00% 
> >> >kernel image                  0.00%      0.00%      0.00% 
> >> >kernel dynamic memory        33.45%      0.09%     33.36% 
> >> >userspace memory              0.36%      0.16%      0.19% 
> >> >free memory                  66.20%     66.20%      0.00% 
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> >         12M     2987 mm/execmem.c:41 func:execmem_vmalloc 
> >> >         12M        3 kernel/dma/pool.c:96
> >> > func:atomic_pool_expand 13M      751 mm/slub.c:3061
> >> > func:alloc_slab_page 16M        8 mm/khugepaged.c:1069
> >> > func:alloc_charge_folio 18M     4355 mm/memory.c:1190
> >> > func:folio_prealloc 24M     6119 mm/memory.c:1192
> >> > func:folio_prealloc 58M    14784 mm/page_ext.c:271
> >> > func:alloc_page_ext 61M    15448 mm/readahead.c:189
> >> > func:ractl_alloc_folio 79M     6726 mm/slub.c:3059
> >> > func:alloc_slab_page 11G  2674488 mm/filemap.c:2012
> >> > func:__filemap_get_folio  
> 
> Maybe narrowing down the "Noncache" caller of __filemap_get_folio
> would help clarify things. (It could be designed that way, and  needs
> other route than dropping-cache to release the memory, just
> guess....) If you want, you can modify code to split the accounting
> for __filemap_get_folio according to its callers.


Thanks again, I'll add this patch in and see where I end up. 

The issue is that nothing will cause the memory to be freed. Dropping caches
doesn't work, memory pressure doesn't work, unmounting the filesystems
doesn't work. Removing the cephfs and netfs kernel modules doesn't work
either.

This is why I feel it's a ref_count (or similar) issue. 

I've also found that it seems to be a fixed amount leaked each time, per
file. Simply doing lots of I/O on one large file doesn't leak as fast as
lots of "small" files (greater than 10MB but less than 100MB seems to be
a sweet spot).

Also, dropping caches while the workload is running actually amplifies
the issue. So it very much feels like something is wrong in the reclaim
code.

Anyway I'll get this patch applied and see where I end up. 

I now have crash dumps (after enabling panic_on_oom), so I'm going to
try to find these structures and see what state they are in.
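
For the crash-dump side, this is roughly what I plan to poke at (a
sketch only, assuming the crash utility and a matching vmlinux with
debuginfo; the folio address is a placeholder I would pull from a
suspect mapping):

$ crash vmlinux vmcore
crash> kmem -i                       # overall memory usage summary
crash> kmem <folio address>          # locate the page/folio and its owner
crash> struct folio <folio address>  # dump refcount, mapping and flags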

Thanks again. 

Mal


> Following is a draft patch: (based on v6.18)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 09b581c1d878..ba8c659a6ae3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
>  }
>  
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t
> index); -struct folio *__filemap_get_folio(struct address_space
> *mapping, pgoff_t index, +
> +#define __filemap_get_folio(...)			\
> +	alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
> +
> +struct folio *__filemap_get_folio_noprof(struct address_space
> *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp);
>  struct page *pagecache_get_page(struct address_space *mapping,
> pgoff_t index, fgf_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 024b71da5224..e1c1c26d7cb3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space
> *mapping, pgoff_t index) *
>   * Return: The found folio or an ERR_PTR() otherwise.
>   */
> -struct folio *__filemap_get_folio(struct address_space *mapping,
> pgoff_t index, +struct folio *__filemap_get_folio_noprof(struct
> address_space *mapping, pgoff_t index, fgf_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
> @@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct
> address_space *mapping, pgoff_t index, err = -ENOMEM;
>  			if (order > min_order)
>  				alloc_gfp |= __GFP_NORETRY |
> __GFP_NOWARN;
> -			folio = filemap_alloc_folio(alloc_gfp,
> order);
> +			folio =
> filemap_alloc_folio_noprof(alloc_gfp, order); if (!folio)
>  				continue;
>  
> @@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct
> address_space *mapping, pgoff_t index, folio_clear_dropbehind(folio);
>  	return folio;
>  }
> -EXPORT_SYMBOL(__filemap_get_folio);
> +EXPORT_SYMBOL(__filemap_get_folio_noprof);
>  
>  static inline struct folio *find_get_entry(struct xa_state *xas,
> pgoff_t max, xa_mark_t mark)
> 
> 
> 
> 
> FYI
> David
> 
> >> >
> >> >So if I'm reading this correctly something is causing folios
> >> >collect and not be able to be freed?    
> >> 
> >> CC cephfs, maybe someone could have an easy reading out of those
> >> folio usage
> >> 
> >>   
> >> >
> >> >Also it's clear that some of the folio's are counting as cache and
> >> >some aren't. 
> >> >
> >> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm
> >> >now going to manually walk through previous kernel releases and
> >> >find where it first starts happening purely because I'm having
> >> >issues building earlier kernels due to rust stuff and other python
> >> >incompatibilities making doing a git-bisect a bit fun.
> >> >
> >> >I'll do it the packages way until I get closer, then solve the
> >> >build issues. 
> >> >
> >> >Thanks,
> >> >Mal
> >> >    
> >Thanks David.
> >
> >I've contacted the ceph developers as well. 
> >
> >There was a suggestion it was due to the change from, to quote:
> >folio.free() to folio.put() or something like this.
> >
> >The change happened around 6.14/6.15
> >
> >I've found an easier reproducer. 
> >
> >There has been a suggestion that perhaps the ceph team might not fix
> >this as "you can just reboot before the machine becomes unstable" and
> >"Nobody else has encountered this bug"
> >
> >I'll leave that to other people to make a call on but I'd assume the
> >lack of reports is due to the fact that most stable distros are still
> >on a a far too early kernel and/or are using the fuse driver with
> >k8s.
> >
> >Anyway, thanks for your assistance.
RE: Possible memory leak in 6.17.7
Posted by Viacheslav Dubeyko 1 month, 3 weeks ago
Hi Mal,

On Thu, 2025-12-11 at 14:23 +1000, Mal Haak wrote:
> On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
> "David Wang" <00107082@163.com> wrote:
> 
> > At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
> > > On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
> > > "David Wang" <00107082@163.com> wrote:
> > >  
> > > > At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:  
> > > > > On Mon,  8 Dec 2025 19:08:29 +0800
> > > > > David Wang <00107082@163.com> wrote:
> > > > >    
> > > > > > On Mon, 10 Nov 2025 18:20:08 +1000
> > > > > > Mal Haak <malcolm@haak.id.au> wrote:    
> > > > > > > Hello,
> > > > > > > 
> > > > > > > I have found a memory leak in 6.17.7 but I am unsure how to
> > > > > > > track it down effectively.
> > > > > > > 
> > > > > > >       
> > > > > > 
> > > > > > I think the `memory allocation profiling` feature can help.
> > > > > > https://docs.kernel.org/mm/allocation-profiling.html  
> > > > > > 
> > > > > > You would need to build a kernel with 
> > > > > > CONFIG_MEM_ALLOC_PROFILING=y
> > > > > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> > > > > > 
> > > > > > And check /proc/allocinfo for the suspicious allocations which
> > > > > > take more memory than expected.
> > > > > > 
> > > > > > (I once caught a nvidia driver memory leak.)
> > > > > > 
> > > > > > 
> > > > > > FYI
> > > > > > David
> > > > > >     
> > > > > 
> > > > > Thank you for your suggestion. I have some results.
> > > > > 
> > > > > Ran the rsync workload for about 9 hours. It started to look like
> > > > > it was happening.
> > > > > # smem -pw
> > > > > Area                           Used      Cache   Noncache 
> > > > > firmware/hardware             0.00%      0.00%      0.00% 
> > > > > kernel image                  0.00%      0.00%      0.00% 
> > > > > kernel dynamic memory        80.46%     65.80%     14.66% 
> > > > > userspace memory              0.35%      0.16%      0.19% 
> > > > > free memory                  19.19%     19.19%      0.00% 
> > > > > # sort -g /proc/allocinfo|tail|numfmt --to=iec
> > > > >         22M     5609 mm/memory.c:1190 func:folio_prealloc 
> > > > >         23M     1932 fs/xfs/xfs_buf.c:226 [xfs]
> > > > > func:xfs_buf_alloc_backing_mem 
> > > > >         24M    24135 fs/xfs/xfs_icache.c:97 [xfs]
> > > > > func:xfs_inode_alloc 27M     6693 mm/memory.c:1192
> > > > > func:folio_prealloc 58M    14784 mm/page_ext.c:271
> > > > > func:alloc_page_ext 258M      129 mm/khugepaged.c:1069
> > > > > func:alloc_charge_folio 430M   770788 lib/xarray.c:378
> > > > > func:xas_alloc 545M    36444 mm/slub.c:3059 func:alloc_slab_page 
> > > > >        9.8G  2563617 mm/readahead.c:189 func:ractl_alloc_folio 
> > > > >         20G  5164004 mm/filemap.c:2012 func:__filemap_get_folio 
> > > > > 
> > > > > 
> > > > > So I stopped the workload and dropped caches to confirm.
> > > > > 
> > > > > # echo 3 > /proc/sys/vm/drop_caches
> > > > > # smem -pw
> > > > > Area                           Used      Cache   Noncache 
> > > > > firmware/hardware             0.00%      0.00%      0.00% 
> > > > > kernel image                  0.00%      0.00%      0.00% 
> > > > > kernel dynamic memory        33.45%      0.09%     33.36% 
> > > > > userspace memory              0.36%      0.16%      0.19% 
> > > > > free memory                  66.20%     66.20%      0.00% 
> > > > > # sort -g /proc/allocinfo|tail|numfmt --to=iec
> > > > >         12M     2987 mm/execmem.c:41 func:execmem_vmalloc 
> > > > >         12M        3 kernel/dma/pool.c:96
> > > > > func:atomic_pool_expand 13M      751 mm/slub.c:3061
> > > > > func:alloc_slab_page 16M        8 mm/khugepaged.c:1069
> > > > > func:alloc_charge_folio 18M     4355 mm/memory.c:1190
> > > > > func:folio_prealloc 24M     6119 mm/memory.c:1192
> > > > > func:folio_prealloc 58M    14784 mm/page_ext.c:271
> > > > > func:alloc_page_ext 61M    15448 mm/readahead.c:189
> > > > > func:ractl_alloc_folio 79M     6726 mm/slub.c:3059
> > > > > func:alloc_slab_page 11G  2674488 mm/filemap.c:2012
> > > > > func:__filemap_get_folio  
> > 
> > Maybe narrowing down the "Noncache" caller of __filemap_get_folio
> > would help clarify things. (It could be designed that way, and  needs
> > other route than dropping-cache to release the memory, just
> > guess....) If you want, you can modify code to split the accounting
> > for __filemap_get_folio according to its callers.
> 
> 
> Thanks again, I'll add this patch in and see where I end up. 
> 
> The issue is nothing will cause the memory to be freed. Dropping caches
> doesn't work, memory pressure doesn't work, unmounting the filesystems
> doesn't work. Removing the cephfs and netfs kernel modules also doesn't
> work. 
> 
> This is why I feel it's a ref_count (or similar) issue. 
> 
> I've also found it seems to be a fixed amount leaked each time, per
> file. Simply doing lots of IO on one large file doesn't leak as fast as
> lots of "small" (greater than 10MB less than 100MB seems to be a sweet
> spot) 
> 
> Also, dropping caches while the workload is running actually amplifies
> the issue. So it very much feels like something is wrong in the reclaim
> code.
> 
> Anyway I'll get this patch applied and see where I end up. 
> 
> I now have crash dumps (after enabling crash_on_oom) so I'm going to
> try and see if I can find these structures and see what state they are
> in
> 
> 

Thanks a lot for reporting the issue. Finally, I can see the discussion on the
mailing list. :) Are you working on a patch with the fix? Should we wait for the
fix, or should I start reproducing and investigating the issue? I am simply
trying to avoid patch collisions, and I also have multiple other issues to fix
in the CephFS kernel client. :)

Thanks,
Slava.
Re: Possible memory leak in 6.17.7
Posted by Mal Haak 1 month, 3 weeks ago
On Mon, 15 Dec 2025 19:42:56 +0000
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Hi Mal,
> 
<SNIP> 
> 
> Thanks a lot for reporting the issue. Finally, I can see the
> discussion in email list. :) Are you working on the patch with the
> fix? Should we wait for the fix or I need to start the issue
> reproduction and investigation? I am simply trying to avoid patches
> collision and, also, I have multiple other issues for the fix in
> CephFS kernel client. :)
> 
> Thanks,
> Slava.

Hello,

Unfortunately, creating a patch is just outside my comfort zone; I've
lived too long in Lustre land.

I have been trying to narrow down a consistent reproducer that's as
fast as my production workload (it crashes a 32GB VM in 2 hours), and I
haven't got it quite as fast. I think the dd workload is too well
behaved.

I can confirm the issue appeared in the major patch set that was applied
as part of the 6.15 kernel, i.e. during the more complete pages-to-folios
switch, and that nothing in the bug's behaviour has changed since then.
I did have a look at all the diffs from 6.14 to 6.18 on addr.c and didn't
see any changes post-6.15 that looked like they would impact the bug
behaviour.

Again, I'm not super familiar with the CephFS code, but to hazard a
guess, the fact that the web-download workload triggers things faster
suggests that unaligned writes might make things worse. But again, I'm
not 100% sure. I can't find a reproducer as fast as downloading a
dataset; rsync of lots and lots of tiny files is a tad faster than the
dd case.

I did see some changes in ceph_check_page_before_write where the
previous code unlocked pages and then continued, whereas the changed
folio code just returns ENODATA and doesn't unlock anything, with most
of the rest of the logic unchanged. This might be perfectly fine, but
in my, admittedly limited, reading of the code I couldn't figure out
where anything that was locked prior to this being called would get
unlocked like it did prior to the change. Again, I could be miles off
here, and one of the bulk reclaim/unlock passes that was added might be
cleaning this up correctly, or some other functional change might take
care of this, but it looks to be potentially in the code path I'm
exercising and it has had some unlock logic changed.

I've spent most of my time trying to find a solid, quick reproducer. Not
that it takes long to start leaking folios, but I wanted something that
aggressively triggered it so a small VM would OOM quickly; combined with
panic_on_oom it could potentially be used for regression testing by way
of "did the VM crash?".
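
Something like the following is what I have in mind for that harness (a
rough sketch; kdump still has to be configured separately to actually
capture a vmcore):

# sysctl -w vm.panic_on_oom=1
# <run the reproducer workload on the small VM>
# if the VM panics and kdump saves a vmcore, the leak is present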

I'm not sure how much it will help, but I'll provide what details I can
about the actual workload that really sets it off. It's a Python-based
tool for downloading datasets. Datasets are split into N chunks and the
tool downloads them in parallel, 100 at a time, until all N chunks are
downloaded. The compressed dataset is then unpacked and reassembled for
use with workloads.

This is replicating a common home-folder use case in HPC. CephFS is very
attractive for home folders due to its "NFS-like" utility and
performance, and many tools use a similar method for fetching large
datasets. Tools are frequently written in Python or Go.

None of my customers have hit this yet, nor have any enterprise
customers, as none have moved to a new enough kernel yet due to slow
upgrade cycles. Even Proxmox have only just started testing on a kernel
version > 6.14.

I'm more than happy to help however I can with testing. I can run
instrumented kernels or test patches or whatever you need. I am sorry I
haven't been able to produce a super clean, fast reproducer (my test
cluster at home is all spinners and only 500TB usable), but I figured I
needed to get the word out ASAP as distros, and soon customers, are
going to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
marches on, especially those wanting to take full advantage of CacheFS
and encryption functionality.

Again, thanks for looking at this, and do reach out if I can help in
any way. I am in the Ceph Slack if it's faster to reach out that way.

Regards

Mal Haak
Re: Possible memory leak in 6.17.7
Posted by David Wang 1 month, 3 weeks ago
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Mon, 15 Dec 2025 19:42:56 +0000
>Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
>> Hi Mal,
>> 
><SNIP> 
>> 
>> Thanks a lot for reporting the issue. Finally, I can see the
>> discussion in email list. :) Are you working on the patch with the
>> fix? Should we wait for the fix or I need to start the issue
>> reproduction and investigation? I am simply trying to avoid patches
>> collision and, also, I have multiple other issues for the fix in
>> CephFS kernel client. :)
>> 
>> Thanks,
>> Slava.
>
>Hello,
>
>Unfortunately creating a patch is just outside my comfort zone, I've
>lived too long in Lustre land.
>
>I've have been trying to narrow down a consistent reproducer that's as
>fast as my production workload. (It crashes a 32GB VM in 2hrs) And I
>haven't got it quite as fast. I think the dd workload is too well
>behaved. 
>
>I can confirm the issue appeared in the major patch set that was
>applied as part of the 6.15 kernel. So during the more complete pages
>to folios switch and that nothing has changed in the bug behaviour since
>then. I did have a look at all the diffs from 6.14 to 6.18 on addr.c
>and didn't see any changes post 6.15 that looked like they would impact
>the bug behavior. 

Hi,
Just a suggestion, in case you run out of ideas for further investigation.
I think you can bisect *manually*, targeting changes to fs/ceph between 6.14 and 6.15.


$ git log  --pretty='format:%h %an' v6.14..v6.15 fs/ceph
349b7d77f5a1 Linus Torvalds
b261d2222063 Eric Biggers
f452a2204614 David Howells
e63046adefc0 Linus Torvalds
59b59a943177 Matthew Wilcox (Oracle)  <-----------3
efbdd92ed9f6 Matthew Wilcox (Oracle)
d1b452673af4 Matthew Wilcox (Oracle)
ad49fe2b3d54 Matthew Wilcox (Oracle)
a55cf4fd8fae Matthew Wilcox (Oracle)
15fdaf2fd60d Matthew Wilcox (Oracle)
62171c16da60 Matthew Wilcox (Oracle)
baff9740bc8f Matthew Wilcox (Oracle)
f9707a8b5b9d Matthew Wilcox (Oracle)
88a59bda3f37 Matthew Wilcox (Oracle)
19a288110435 Matthew Wilcox (Oracle)
fd7449d937e7 Viacheslav Dubeyko  <---------2
1551ec61dc55 Viacheslav Dubeyko
ce80b76dd327 Viacheslav Dubeyko
f08068df4aa4 Viacheslav Dubeyko
3f92c7b57687 NeilBrown               <-----------1
88d5baf69082 NeilBrown

There were 3 major patch sets (grouped by author), so the suspect could be narrowed down further.


(Bisecting, even over a short range of patches, is quite an unpleasant experience though....)
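
For example, something like this (just a sketch; the build/boot/test
cycle is the painful part):

$ git bisect start v6.15 v6.14 -- fs/ceph
$ # build, boot, run the reproducer, then mark the result:
$ git bisect good    # or: git bisect bad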

FYI
David

>
>Again, I'm not super familiar with the CephFS code but to hazard a
>guess, but I think that the web download workload triggers things faster
>suggests that unaligned writes might make things worse. But again, I'm
>not 100% sure. I can't find a reproducer as fast as downloading a
>dataset. Rsync of lots and lots of tiny files is a tad faster than the
>dd case.
>
>I did see some changes in ceph_check_page_before_write where the
>previous code unlocked pages and then continued where as the changed
>folio code just returns ENODATA and doesn't unlock anything with most
>of the rest of the logic unchanged. This might be perfectly fine, but
>in my, admittedly limited, reading of the code I couldn't figure out
>where anything that was locked prior to this being called would get
>unlocked like it did prior to the change. Again, I could be miles off
>here and one of the bulk reclaim/unlock passes that was added might be
>cleaning this up correctly or some other functional change might take
>care of this, but it looks to be potentially in the code path I'm
>excising and it has had some unlock logic changed. 
>
>I've spent most of my time trying to find a solid quick reproducer. Not
>that it takes long to start leaking folios, but I wanted something that
>aggressively triggered it so a small vm would oom quickly and when
>combined with crash_on_oom it could potentially be used for regression
>testing by way of "did vm crash?".
>
>I'm not sure if it will super help, but I'll provide what details I can
>about the actual workload that really sets it off. It's a python based
>tool for downloading datasets. Datasets are split into N chunks and the
>tool downloads them in parallel 100 at a time until all N chunks are
>down. The compressed dataset is then unpacked and reassembled for
>use with workloads. 
>
>This is replicating a common home folder usecase in HPC. CephFS is very
>attractive for home folders due to it's "NFS-like" utility and
>performance. And many tools use a similar method for fetching large
>datasets. Tools are frequently written in python or go. 
>
>None of my customers have hit this yet, not have any enterprise
>customers as none have moved to a new enough kernel yet due to slow
>upgrade cycles. Even Proxmox have only just started testing on a kernel
>version > 6.14. 
>
>I'm more than happy to help however I can with testing. I can run
>instrumented kernels or test patches or whatever you need. I am sorry I
>haven't been able to produce a super clean, fast reproducer (my test
>cluster at home is all spinners and only 500TB usable). But I figured I
>needed to get the word out asap as distros and soon customers are going
>to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
>marches on. Especially those wanting to take full advantage of CacheFS
>and encryption functionality. 
>
>Again thanks for looking at this and do reach out if I can help in
>anyway. I am in the ceph slack if it's faster to reach out that way.
>
>Regards
>
>Mal Haak
Re: Possible memory leak in 6.17.7
Posted by Mal Haak 1 month, 3 weeks ago
On Wed, 17 Dec 2025 13:59:47 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Mon, 15 Dec 2025 19:42:56 +0000
> >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> >  
> >> Hi Mal,
> >>   
> ><SNIP>   
> >> 
> >> Thanks a lot for reporting the issue. Finally, I can see the
> >> discussion in email list. :) Are you working on the patch with the
> >> fix? Should we wait for the fix or I need to start the issue
> >> reproduction and investigation? I am simply trying to avoid patches
> >> collision and, also, I have multiple other issues for the fix in
> >> CephFS kernel client. :)
> >> 
> >> Thanks,
> >> Slava.  
> >
> >Hello,
> >
> >Unfortunately creating a patch is just outside my comfort zone, I've
> >lived too long in Lustre land.
> >
> >I've have been trying to narrow down a consistent reproducer that's
> >as fast as my production workload. (It crashes a 32GB VM in 2hrs)
> >And I haven't got it quite as fast. I think the dd workload is too
> >well behaved. 
> >
> >I can confirm the issue appeared in the major patch set that was
> >applied as part of the 6.15 kernel. So during the more complete pages
> >to folios switch and that nothing has changed in the bug behaviour
> >since then. I did have a look at all the diffs from 6.14 to 6.18 on
> >addr.c and didn't see any changes post 6.15 that looked like they
> >would impact the bug behavior.   
> 
> Hi,
> Just a suggestion, in case you run out of idea for further
> investigation. I think you can bisect *manually*  targeting changes
> of fs/cephfs between 6.14 and 6.15
> 
> 
> $ git log  --pretty='format:%h %an' v6.14..v6.15 fs/ceph
> 349b7d77f5a1 Linus Torvalds
> b261d2222063 Eric Biggers
> f452a2204614 David Howells
> e63046adefc0 Linus Torvalds
> 59b59a943177 Matthew Wilcox (Oracle)  <-----------3
> efbdd92ed9f6 Matthew Wilcox (Oracle)
> d1b452673af4 Matthew Wilcox (Oracle)
> ad49fe2b3d54 Matthew Wilcox (Oracle)
> a55cf4fd8fae Matthew Wilcox (Oracle)
> 15fdaf2fd60d Matthew Wilcox (Oracle)
> 62171c16da60 Matthew Wilcox (Oracle)
> baff9740bc8f Matthew Wilcox (Oracle)
> f9707a8b5b9d Matthew Wilcox (Oracle)
> 88a59bda3f37 Matthew Wilcox (Oracle)
> 19a288110435 Matthew Wilcox (Oracle)
> fd7449d937e7 Viacheslav Dubeyko  <---------2
> 1551ec61dc55 Viacheslav Dubeyko
> ce80b76dd327 Viacheslav Dubeyko
> f08068df4aa4 Viacheslav Dubeyko
> 3f92c7b57687 NeilBrown               <-----------1
> 88d5baf69082 NeilBrown
> 
> There were 3 major patch set (group by author),  the suspect could be
> narrowed down further.
> 
> 
> (Bisect, even over a short range of patch, is quite an unpleasant
> experience though....)
> 
> FYI
> David
> 
> >
<SNIP>

Yeah, I don't think it's a small patch that is the cause of the issue. 

It looks like there was a patch set that migrated cephfs off handling
individual pages and onto folios to enable wider use of netfs features
like local caching and encryption. I'm not sure that set can be broken
up and still result in a working cephfs module, which limits the
utility of a git-bisect. I'm pretty sure the issue is in addr.c
somewhere, and most of the changes in there are one patch. That said,
after I get this crash dump I'll probably do it anyway.

What I really need to do is get a crash dump to look at what state the
folios and their tracking are in, assuming I can grok what I'm looking
at. This is the bit I'm most apprehensive about. I'm hoping I can find
a list of folios used by the reclaim machinery that is missing a bunch
of folios, or a bunch with inflated refcounts or something.

Something is going awry, but it's not fast. I thought I had a quick
reproducer. I was wrong: I sized the dd workload incorrectly and
triggered the panic_on_oom due to that, not the bug.

I'm re-running the reproducer now on a VM with 2GB of RAM; it's been
running for around 3 hours and I think it's leaked possibly
100MB-150MB of RAM at most. (It was averaging 190-200MB of noncache
usage. It's now averaging 290-340MB.)

It does accelerate: the more folios that are in the weird state, the
more end up in the weird state, which feels like fragmentation side
effects, but I'm just speculating.

Anyway, one of the wonderful Ceph developers is looking into it. I just
hope I can do enough to help them locate the issue. They were having
trouble reproducing it last I heard from them, but they might have been
expecting a slightly faster reproducer.

I have, however, recreated it on a physical host, not just a VM, so I
feel I can rule out virtualization as a cause.

Anyway thanks for your continued assistance!

Mal
Re: Possible memory leak in 6.17.7
Posted by David Wang 1 month, 3 weeks ago
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Mon, 15 Dec 2025 19:42:56 +0000
>Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
>> Hi Mal,
>> 
><SNIP> 
>> 
>> Thanks a lot for reporting the issue. Finally, I can see the
>> discussion in email list. :) Are you working on the patch with the
>> fix? Should we wait for the fix or I need to start the issue
>> reproduction and investigation? I am simply trying to avoid patches
>> collision and, also, I have multiple other issues for the fix in
>> CephFS kernel client. :)
>> 
>> Thanks,
>> Slava.
>
>Hello,
>
>Unfortunately creating a patch is just outside my comfort zone, I've
>lived too long in Lustre land.

Hi, just out of curiosity, have you narrowed down the caller of __filemap_get_folio
causing the memory problem? Or did you have trouble applying the debug patch for
memory allocation profiling?

David 

>
>I've have been trying to narrow down a consistent reproducer that's as
>fast as my production workload. (It crashes a 32GB VM in 2hrs) And I
>haven't got it quite as fast. I think the dd workload is too well
>behaved. 
>
>I can confirm the issue appeared in the major patch set that was
>applied as part of the 6.15 kernel. So during the more complete pages
>to folios switch and that nothing has changed in the bug behaviour since
>then. I did have a look at all the diffs from 6.14 to 6.18 on addr.c
>and didn't see any changes post 6.15 that looked like they would impact
>the bug behavior. 
>
>Again, I'm not super familiar with the CephFS code but to hazard a
>guess, but I think that the web download workload triggers things faster
>suggests that unaligned writes might make things worse. But again, I'm
>not 100% sure. I can't find a reproducer as fast as downloading a
>dataset. Rsync of lots and lots of tiny files is a tad faster than the
>dd case.
>
>I did see some changes in ceph_check_page_before_write where the
>previous code unlocked pages and then continued where as the changed
>folio code just returns ENODATA and doesn't unlock anything with most
>of the rest of the logic unchanged. This might be perfectly fine, but
>in my, admittedly limited, reading of the code I couldn't figure out
>where anything that was locked prior to this being called would get
>unlocked like it did prior to the change. Again, I could be miles off
>here and one of the bulk reclaim/unlock passes that was added might be
>cleaning this up correctly or some other functional change might take
>care of this, but it looks to be potentially in the code path I'm
>excising and it has had some unlock logic changed. 
>
>I've spent most of my time trying to find a solid quick reproducer. Not
>that it takes long to start leaking folios, but I wanted something that
>aggressively triggered it so a small vm would oom quickly and when
>combined with crash_on_oom it could potentially be used for regression
>testing by way of "did vm crash?".
>
>I'm not sure if it will super help, but I'll provide what details I can
>about the actual workload that really sets it off. It's a python based
>tool for downloading datasets. Datasets are split into N chunks and the
>tool downloads them in parallel 100 at a time until all N chunks are
>down. The compressed dataset is then unpacked and reassembled for
>use with workloads. 
>
>This is replicating a common home folder usecase in HPC. CephFS is very
>attractive for home folders due to it's "NFS-like" utility and
>performance. And many tools use a similar method for fetching large
>datasets. Tools are frequently written in python or go. 
>
>None of my customers have hit this yet, not have any enterprise
>customers as none have moved to a new enough kernel yet due to slow
>upgrade cycles. Even Proxmox have only just started testing on a kernel
>version > 6.14. 
>
>I'm more than happy to help however I can with testing. I can run
>instrumented kernels or test patches or whatever you need. I am sorry I
>haven't been able to produce a super clean, fast reproducer (my test
>cluster at home is all spinners and only 500TB usable). But I figured I
>needed to get the word out asap as distros and soon customers are going
>to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
>marches on. Especially those wanting to take full advantage of CacheFS
>and encryption functionality. 
>
>Again thanks for looking at this and do reach out if I can help in
>anyway. I am in the ceph slack if it's faster to reach out that way.
>
>Regards
>
>Mal Haak
Re: Possible memory leak in 6.17.7
Posted by Mal Haak 1 month, 3 weeks ago
On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Mon, 15 Dec 2025 19:42:56 +0000
> >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> >  
> >> Hi Mal,
> >>   
> ><SNIP>   
> >> 
> >> Thanks a lot for reporting the issue. Finally, I can see the
> >> discussion in email list. :) Are you working on the patch with the
> >> fix? Should we wait for the fix or I need to start the issue
> >> reproduction and investigation? I am simply trying to avoid patches
> >> collision and, also, I have multiple other issues for the fix in
> >> CephFS kernel client. :)
> >> 
> >> Thanks,
> >> Slava.  
> >
> >Hello,
> >
> >Unfortunately creating a patch is just outside my comfort zone, I've
> >lived too long in Lustre land.  
> 
> Hi, just out of curiosity, have you narrowed down the caller of
> __filemap_get_folio causing the memory problem? Or do you have
> trouble applying the debug patch for memory allocation profiling?
> 
> David 
> 
Hi David,

I hadn't yet, as I had tested XFS and NFS to see if they replicated the
behaviour, and they did not.

But actually this could speed things up considerably. I will do that
now and see what I get.

Thanks

Mal

> >
> >I've have been trying to narrow down a consistent reproducer that's
> >as fast as my production workload. (It crashes a 32GB VM in 2hrs)
> >And I haven't got it quite as fast. I think the dd workload is too
> >well behaved. 
> >
> >I can confirm the issue appeared in the major patch set that was
> >applied as part of the 6.15 kernel. So during the more complete pages
> >to folios switch and that nothing has changed in the bug behaviour
> >since then. I did have a look at all the diffs from 6.14 to 6.18 on
> >addr.c and didn't see any changes post 6.15 that looked like they
> >would impact the bug behavior. 
> >
> >Again, I'm not super familiar with the CephFS code but to hazard a
> >guess, but I think that the web download workload triggers things
> >faster suggests that unaligned writes might make things worse. But
> >again, I'm not 100% sure. I can't find a reproducer as fast as
> >downloading a dataset. Rsync of lots and lots of tiny files is a tad
> >faster than the dd case.
> >
> >I did see some changes in ceph_check_page_before_write where the
> >previous code unlocked pages and then continued where as the changed
> >folio code just returns ENODATA and doesn't unlock anything with most
> >of the rest of the logic unchanged. This might be perfectly fine, but
> >in my, admittedly limited, reading of the code I couldn't figure out
> >where anything that was locked prior to this being called would get
> >unlocked like it did prior to the change. Again, I could be miles off
> >here and one of the bulk reclaim/unlock passes that was added might
> >be cleaning this up correctly or some other functional change might
> >take care of this, but it looks to be potentially in the code path
> >I'm excising and it has had some unlock logic changed. 
> >
> >I've spent most of my time trying to find a solid quick reproducer.
> >Not that it takes long to start leaking folios, but I wanted
> >something that aggressively triggered it so a small vm would oom
> >quickly and when combined with crash_on_oom it could potentially be
> >used for regression testing by way of "did vm crash?".
> >
> >I'm not sure if it will super help, but I'll provide what details I
> >can about the actual workload that really sets it off. It's a python
> >based tool for downloading datasets. Datasets are split into N
> >chunks and the tool downloads them in parallel 100 at a time until
> >all N chunks are down. The compressed dataset is then unpacked and
> >reassembled for use with workloads. 
> >
> >This is replicating a common home folder usecase in HPC. CephFS is
> >very attractive for home folders due to it's "NFS-like" utility and
> >performance. And many tools use a similar method for fetching large
> >datasets. Tools are frequently written in python or go. 
> >
> >None of my customers have hit this yet, not have any enterprise
> >customers as none have moved to a new enough kernel yet due to slow
> >upgrade cycles. Even Proxmox have only just started testing on a
> >kernel version > 6.14. 
> >
> >I'm more than happy to help however I can with testing. I can run
> >instrumented kernels or test patches or whatever you need. I am
> >sorry I haven't been able to produce a super clean, fast reproducer
> >(my test cluster at home is all spinners and only 500TB usable). But
> >I figured I needed to get the word out asap as distros and soon
> >customers are going to be moving past 6.12-6.14 kernels as the 5-7
> >year update cycle marches on. Especially those wanting to take full
> >advantage of CacheFS and encryption functionality. 
> >
> >Again thanks for looking at this and do reach out if I can help in
> >anyway. I am in the ceph slack if it's faster to reach out that way.
> >
> >Regards
> >
> >Mal Haak
Re: Possible memory leak in 6.17.7
Posted by Mal Haak 1 month, 3 weeks ago
On Tue, 16 Dec 2025 17:09:18 +1000
Mal Haak <malcolm@haak.id.au> wrote:

> On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
> "David Wang" <00107082@163.com> wrote:
> 
> > At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:  
> > >On Mon, 15 Dec 2025 19:42:56 +0000
> > >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > >    
> > >> Hi Mal,
> > >>     
> > ><SNIP>     
> > >> 
> > >> Thanks a lot for reporting the issue. Finally, I can see the
> > >> discussion in email list. :) Are you working on the patch with
> > >> the fix? Should we wait for the fix or I need to start the issue
> > >> reproduction and investigation? I am simply trying to avoid
> > >> patches collision and, also, I have multiple other issues for
> > >> the fix in CephFS kernel client. :)
> > >> 
> > >> Thanks,
> > >> Slava.    
> > >
> > >Hello,
> > >
> > >Unfortunately creating a patch is just outside my comfort zone,
> > >I've lived too long in Lustre land.    
> > 
> > Hi, just out of curiosity, have you narrowed down the caller of
> > __filemap_get_folio causing the memory problem? Or do you have
> > trouble applying the debug patch for memory allocation profiling?
> > 
> > David 
> >   
> Hi David,
> 
> I hadn't yet as I did test XFS and NFS to see if it replicated the
> behaviour and it did not. 
> 
> But actually this could speed things up considerably. I will do that
> now and see what I get.
> 
> Thanks
> 
> Mal
> 
I did just give it a blast. 

Unfortunately it returned exactly what I expected, that is, the calls
are all coming from netfs.

Which makes sense for cephfs. 

# sort -g /proc/allocinfo|tail|numfmt --to=iec
         10M     2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc 
         12M     3001 mm/execmem.c:41 func:execmem_vmalloc 
         12M     3605 kernel/fork.c:311 func:alloc_thread_stack_node 
         16M      992 mm/slub.c:3061 func:alloc_slab_page 
         20M    35544 lib/xarray.c:378 func:xas_alloc 
         31M     7704 mm/memory.c:1192 func:folio_prealloc 
         69M    17562 mm/memory.c:1190 func:folio_prealloc 
        104M     8212 mm/slub.c:3059 func:alloc_slab_page 
        124M    30075 mm/readahead.c:189 func:ractl_alloc_folio 
        2.6G   661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin 

So, unfortunately, it doesn't reveal the true source. But it was worth a
shot! Thanks again.

Mal


> > >
> > >I've have been trying to narrow down a consistent reproducer that's
> > >as fast as my production workload. (It crashes a 32GB VM in 2hrs)
> > >And I haven't got it quite as fast. I think the dd workload is too
> > >well behaved. 
> > >
> > >I can confirm the issue appeared in the major patch set that was
> > >applied as part of the 6.15 kernel. So during the more complete
> > >pages to folios switch and that nothing has changed in the bug
> > >behaviour since then. I did have a look at all the diffs from 6.14
> > >to 6.18 on addr.c and didn't see any changes post 6.15 that looked
> > >like they would impact the bug behavior. 
> > >
> > >Again, I'm not super familiar with the CephFS code but to hazard a
> > >guess, but I think that the web download workload triggers things
> > >faster suggests that unaligned writes might make things worse. But
> > >again, I'm not 100% sure. I can't find a reproducer as fast as
> > >downloading a dataset. Rsync of lots and lots of tiny files is a
> > >tad faster than the dd case.
> > >
> > >I did see some changes in ceph_check_page_before_write where the
> > >previous code unlocked pages and then continued where as the
> > >changed folio code just returns ENODATA and doesn't unlock
> > >anything with most of the rest of the logic unchanged. This might
> > >be perfectly fine, but in my, admittedly limited, reading of the
> > >code I couldn't figure out where anything that was locked prior to
> > >this being called would get unlocked like it did prior to the
> > >change. Again, I could be miles off here and one of the bulk
> > >reclaim/unlock passes that was added might be cleaning this up
> > >correctly or some other functional change might take care of this,
> > >but it looks to be potentially in the code path I'm excising and
> > >it has had some unlock logic changed. 
> > >
> > >I've spent most of my time trying to find a solid quick reproducer.
> > >Not that it takes long to start leaking folios, but I wanted
> > >something that aggressively triggered it so a small vm would oom
> > >quickly and when combined with crash_on_oom it could potentially be
> > >used for regression testing by way of "did vm crash?".
> > >
> > >I'm not sure if it will super help, but I'll provide what details I
> > >can about the actual workload that really sets it off. It's a
> > >python based tool for downloading datasets. Datasets are split
> > >into N chunks and the tool downloads them in parallel 100 at a
> > >time until all N chunks are down. The compressed dataset is then
> > >unpacked and reassembled for use with workloads. 
> > >
> > >This is replicating a common home folder usecase in HPC. CephFS is
> > >very attractive for home folders due to it's "NFS-like" utility and
> > >performance. And many tools use a similar method for fetching large
> > >datasets. Tools are frequently written in python or go. 
> > >
> > >None of my customers have hit this yet, not have any enterprise
> > >customers as none have moved to a new enough kernel yet due to slow
> > >upgrade cycles. Even Proxmox have only just started testing on a
> > >kernel version > 6.14. 
> > >
> > >I'm more than happy to help however I can with testing. I can run
> > >instrumented kernels or test patches or whatever you need. I am
> > >sorry I haven't been able to produce a super clean, fast reproducer
> > >(my test cluster at home is all spinners and only 500TB usable).
> > >But I figured I needed to get the word out asap as distros and soon
> > >customers are going to be moving past 6.12-6.14 kernels as the 5-7
> > >year update cycle marches on. Especially those wanting to take full
> > >advantage of CacheFS and encryption functionality. 
> > >
> > >Again thanks for looking at this and do reach out if I can help in
> > >anyway. I am in the ceph slack if it's faster to reach out that
> > >way.
> > >
> > >Regards
> > >
> > >Mal Haak    
>
Re: Possible memory leak in 6.17.7
Posted by David Wang 1 month, 3 weeks ago
At 2025-12-16 19:55:27, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Tue, 16 Dec 2025 17:09:18 +1000
>Mal Haak <malcolm@haak.id.au> wrote:
>
>> On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
>> "David Wang" <00107082@163.com> wrote:
>> 
>> > At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:  
>> > >On Mon, 15 Dec 2025 19:42:56 +0000
>> > >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>> > >    
>> > >> Hi Mal,
>> > >>     
>> > ><SNIP>     
>> > >> 
>> > >> Thanks a lot for reporting the issue. Finally, I can see the
>> > >> discussion in email list. :) Are you working on the patch with
>> > >> the fix? Should we wait for the fix or I need to start the issue
>> > >> reproduction and investigation? I am simply trying to avoid
>> > >> patches collision and, also, I have multiple other issues for
>> > >> the fix in CephFS kernel client. :)
>> > >> 
>> > >> Thanks,
>> > >> Slava.    
>> > >
>> > >Hello,
>> > >
>> > >Unfortunately creating a patch is just outside my comfort zone,
>> > >I've lived too long in Lustre land.    
>> > 
>> > Hi, just out of curiosity, have you narrowed down the caller of
>> > __filemap_get_folio causing the memory problem? Or do you have
>> > trouble applying the debug patch for memory allocation profiling?
>> > 
>> > David 
>> >   
>> Hi David,
>> 
>> I hadn't yet as I did test XFS and NFS to see if it replicated the
>> behaviour and it did not. 
>> 
>> But actually this could speed things up considerably. I will do that
>> now and see what I get.
>> 
>> Thanks
>> 
>> Mal
>> 
>I did just give it a blast. 
>
>Unfortunately it returned exactly what I expected, that is the calls
>are all coming from netfs.
>
>Which makes sense for cephfs. 
>
># sort -g /proc/allocinfo|tail|numfmt --to=iec
>         10M     2541 drivers/block/zram/zram_drv.c:1597 [zram]
>func:zram_meta_alloc 12M     3001 mm/execmem.c:41 func:execmem_vmalloc 
>         12M     3605 kernel/fork.c:311 func:alloc_thread_stack_node 
>         16M      992 mm/slub.c:3061 func:alloc_slab_page 
>         20M    35544 lib/xarray.c:378 func:xas_alloc 
>         31M     7704 mm/memory.c:1192 func:folio_prealloc 
>         69M    17562 mm/memory.c:1190 func:folio_prealloc 
>        104M     8212 mm/slub.c:3059 func:alloc_slab_page 
>        124M    30075 mm/readahead.c:189 func:ractl_alloc_folio 
>        2.6G   661392 fs/netfs/buffered_read.c:635 [netfs]
>func:netfs_write_begin 
>
>So, unfortunately it doesn't reveal the true source. But was worth a
>shot! So thanks again

Oh,  at least cephfs could be ruled out, right?

CC netfs folks then. :)
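
If it helps to go one level further, the netfs_write_begin entry could
itself be split by caller, either with the same kind of
_noprof/alloc_hooks change inside netfs, or more cheaply by sampling
call stacks. A sketch, assuming bpftrace is available and the netfs
module is loaded:

$ bpftrace -e 'kprobe:netfs_write_begin { @stacks[kstack(4)] = count(); }'

(This only shows who allocates, not who fails to release, but it could
confirm which path dominates.)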


>
>Mal
>
>
>> > >
>> > >I've have been trying to narrow down a consistent reproducer that's
>> > >as fast as my production workload. (It crashes a 32GB VM in 2hrs)
>> > >And I haven't got it quite as fast. I think the dd workload is too
>> > >well behaved. 
>> > >
>> > >I can confirm the issue appeared in the major patch set that was
>> > >applied as part of the 6.15 kernel. So during the more complete
>> > >pages to folios switch and that nothing has changed in the bug
>> > >behaviour since then. I did have a look at all the diffs from 6.14
>> > >to 6.18 on addr.c and didn't see any changes post 6.15 that looked
>> > >like they would impact the bug behavior. 
>> > >
>> > >Again, I'm not super familiar with the CephFS code but to hazard a
>> > >guess, but I think that the web download workload triggers things
>> > >faster suggests that unaligned writes might make things worse. But
>> > >again, I'm not 100% sure. I can't find a reproducer as fast as
>> > >downloading a dataset. Rsync of lots and lots of tiny files is a
>> > >tad faster than the dd case.
>> > >
>> > >I did see some changes in ceph_check_page_before_write where the
>> > >previous code unlocked pages and then continued where as the
>> > >changed folio code just returns ENODATA and doesn't unlock
>> > >anything with most of the rest of the logic unchanged. This might
>> > >be perfectly fine, but in my, admittedly limited, reading of the
>> > >code I couldn't figure out where anything that was locked prior to
>> > >this being called would get unlocked like it did prior to the
>> > >change. Again, I could be miles off here and one of the bulk
>> > >reclaim/unlock passes that was added might be cleaning this up
>> > >correctly or some other functional change might take care of this,
>> > >but it looks to be potentially in the code path I'm excising and
>> > >it has had some unlock logic changed. 
>> > >
>> > >I've spent most of my time trying to find a solid quick reproducer.
>> > >Not that it takes long to start leaking folios, but I wanted
>> > >something that aggressively triggered it so a small vm would oom
>> > >quickly and when combined with crash_on_oom it could potentially be
>> > >used for regression testing by way of "did vm crash?".
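
(For reference, the knob behind that idea is vm.panic_on_oom; a minimal sketch
of the "did the VM crash?" harness, assuming kdump or a serial console is set
up to catch the panic:)

    echo 1 > /proc/sys/vm/panic_on_oom     # panic instead of OOM-killing a task
    echo 30 > /proc/sys/kernel/panic       # auto-reboot 30 seconds after the panic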
>> > >
>> > >I'm not sure if it will super help, but I'll provide what details I
>> > >can about the actual workload that really sets it off. It's a
>> > >python based tool for downloading datasets. Datasets are split
>> > >into N chunks and the tool downloads them in parallel 100 at a
>> > >time until all N chunks are down. The compressed dataset is then
>> > >unpacked and reassembled for use with workloads. 
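
(A rough approximation of that access pattern, in case it is a useful starting
point; a sketch only, where the mount point is a placeholder and the odd block
size is deliberate given the unaligned-write guess above:)

    # 100 parallel writers with a deliberately unaligned block size
    seq 1 100 | xargs -P 100 -I{} \
        dd if=/dev/urandom of=/mnt/cephfs/scratch/chunk.{} bs=1048577 count=256 status=none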
>> > >
>> > >This is replicating a common home folder usecase in HPC. CephFS is
>> > >very attractive for home folders due to its "NFS-like" utility and
>> > >performance. And many tools use a similar method for fetching large
>> > >datasets. Tools are frequently written in python or go. 
>> > >
>> > >None of my customers have hit this yet, nor have any enterprise
>> > >customers, as none have moved to a new enough kernel yet due to slow
>> > >upgrade cycles. Even Proxmox have only just started testing on a
>> > >kernel version > 6.14. 
>> > >
>> > >I'm more than happy to help however I can with testing. I can run
>> > >instrumented kernels or test patches or whatever you need. I am
>> > >sorry I haven't been able to produce a super clean, fast reproducer
>> > >(my test cluster at home is all spinners and only 500TB usable).
>> > >But I figured I needed to get the word out asap as distros and soon
>> > >customers are going to be moving past 6.12-6.14 kernels as the 5-7
>> > >year update cycle marches on. Especially those wanting to take full
>> > >advantage of CacheFS and encryption functionality. 
>> > >
>> > >Again thanks for looking at this and do reach out if I can help in
>> > >any way. I am in the ceph slack if it's faster to reach out that
>> > >way.
>> > >
>> > >Regards
>> > >
>> > >Mal Haak    
>> 
Re: Possible memory leak in 6.17.7
Posted by David Wang 1 month, 3 weeks ago
At 2025-12-16 20:18:11, "David Wang" <00107082@163.com> wrote:
>
>At 2025-12-16 19:55:27, "Mal Haak" <malcolm@haak.id.au> wrote:
>>On Tue, 16 Dec 2025 17:09:18 +1000
>>Mal Haak <malcolm@haak.id.au> wrote:
>>
>>> On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
>>> "David Wang" <00107082@163.com> wrote:
>>> 
>>> > At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:  
>>> > >On Mon, 15 Dec 2025 19:42:56 +0000
>>> > >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>>> > >    
>>> > >> Hi Mal,
>>> > >>     
>>> > ><SNIP>     
>>> > >> 
>>> > >> Thanks a lot for reporting the issue. Finally, I can see the
>>> > >> discussion in email list. :) Are you working on the patch with
>>> > >> the fix? Should we wait for the fix or I need to start the issue
>>> > >> reproduction and investigation? I am simply trying to avoid
>>> > >> patches collision and, also, I have multiple other issues for
>>> > >> the fix in CephFS kernel client. :)
>>> > >> 
>>> > >> Thanks,
>>> > >> Slava.    
>>> > >
>>> > >Hello,
>>> > >
>>> > >Unfortunately creating a patch is just outside my comfort zone,
>>> > >I've lived too long in Lustre land.    
>>> > 
>>> > Hi, just out of curiosity, have you narrowed down the caller of
>>> > __filemap_get_folio causing the memory problem? Or do you have
>>> > trouble applying the debug patch for memory allocation profiling?
>>> > 
>>> > David 
>>> >   
>>> Hi David,
>>> 
>>> I hadn't yet, as I was testing XFS and NFS to see if they replicated the
>>> behaviour, and they did not. 
>>> 
>>> But actually this could speed things up considerably. I will do that
>>> now and see what I get.
>>> 
>>> Thanks
>>> 
>>> Mal
>>> 
>>I did just give it a blast. 
>>
>>Unfortunately it returned exactly what I expected, that is the calls
>>are all coming from netfs.
>>
>>Which makes sense for cephfs. 
>>
>># sort -g /proc/allocinfo|tail|numfmt --to=iec
>>         10M     2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc 
>>         12M     3001 mm/execmem.c:41 func:execmem_vmalloc 
>>         12M     3605 kernel/fork.c:311 func:alloc_thread_stack_node 
>>         16M      992 mm/slub.c:3061 func:alloc_slab_page 
>>         20M    35544 lib/xarray.c:378 func:xas_alloc 
>>         31M     7704 mm/memory.c:1192 func:folio_prealloc 
>>         69M    17562 mm/memory.c:1190 func:folio_prealloc 
>>        104M     8212 mm/slub.c:3059 func:alloc_slab_page 
>>        124M    30075 mm/readahead.c:189 func:ractl_alloc_folio 
>>        2.6G   661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin 
>>
>>So, unfortunately it doesn't reveal the true source. But was worth a
>>shot! So thanks again
>
>Oh, at least cephfs could be ruled out, right?
Ehh... I think I could be wrong about this...

>
>CC netfs folks then. :)

>
>
>>
>>Mal
>>
>>
>>> > ><SNIP>
RE: Possible memory leak in 6.17.7
Posted by Viacheslav Dubeyko 1 month, 3 weeks ago
Hi Mal,

On Tue, 2025-12-16 at 20:42 +0800, David Wang wrote:
> At 2025-12-16 20:18:11, "David Wang" <00107082@163.com> wrote:
> > 
> > 

<skipped>

> > > > > > <SNIP>

Could you please add your CephFS kernel client's mount options into the ticket
[1]?

Thanks a lot,
Slava.

[1] https://tracker.ceph.com/issues/74156 
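
In case it is handy, the client-side options can be read straight off the
mount table, e.g.:

    # the fourth field of each line is the option string the kernel client is using
    grep ' ceph ' /proc/mounts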
Re: Possible memory leak in 6.17.7
Posted by Mal Haak 1 month, 3 weeks ago
On Wed, 17 Dec 2025 01:56:52 +0000
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Hi Mal,
> 
> On Tue, 2025-12-16 at 20:42 +0800, David Wang wrote:
> > At 2025-12-16 20:18:11, "David Wang" <00107082@163.com> wrote:  
> > > 
> > >   
> 
> <skipped>
> 
> > > > > > > <SNIP>
> 
> Could you please add your CephFS kernel client's mount options into
> the ticket [1]?
> 
> Thanks a lot,
> Slava.
> 
> [1] https://tracker.ceph.com/issues/74156 

I've updated the ticket. 

I am curious about the differences between your test setup and my
actual setup in terms of capacity and hardware. 

I can provide crash dumps if it is helpful.

Thanks 

Mal
RE: RRe: Possible memory leak in 6.17.7
Posted by Viacheslav Dubeyko 1 month, 3 weeks ago
On Tue, 2025-12-16 at 11:26 +1000, Mal Haak wrote:
> On Mon, 15 Dec 2025 19:42:56 +0000
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Hi Mal,
> > 
> <SNIP> 
> > 
> > Thanks a lot for reporting the issue. Finally, I can see the
> > discussion in email list. :) Are you working on the patch with the
> > fix? Should we wait for the fix or I need to start the issue
> > reproduction and investigation? I am simply trying to avoid patches
> > collision and, also, I have multiple other issues for the fix in
> > CephFS kernel client. :)
> > 
> > Thanks,
> > Slava.
> 
> Hello,
> 
> Unfortunately creating a patch is just outside my comfort zone, I've
> lived too long in Lustre land.
> 
> I've been trying to narrow down a consistent reproducer that's as
> fast as my production workload. (It crashes a 32GB VM in 2hrs) And I
> haven't got it quite as fast. I think the dd workload is too well
> behaved. 
> 
> I can confirm the issue appeared in the major patch set that was
> applied as part of the 6.15 kernel, i.e. during the more complete
> pages-to-folios switch, and that nothing has changed in the bug
> behaviour since then. I did have a look at all the diffs from 6.14 to
> 6.18 on addr.c and didn't see any changes post 6.15 that looked like
> they would impact the bug behaviour. 
> 
> Again, I'm not super familiar with the CephFS code, but to hazard a
> guess, the fact that the web download workload triggers things faster
> suggests that unaligned writes might make things worse. But again, I'm
> not 100% sure. I can't find a reproducer as fast as downloading a
> dataset. Rsync of lots and lots of tiny files is a tad faster than the
> dd case.
> 
> I did see some changes in ceph_check_page_before_write where the
> previous code unlocked pages and then continued, whereas the changed
> folio code just returns ENODATA and doesn't unlock anything with most
> of the rest of the logic unchanged. This might be perfectly fine, but
> in my, admittedly limited, reading of the code I couldn't figure out
> where anything that was locked prior to this being called would get
> unlocked like it did prior to the change. Again, I could be miles off
> here and one of the bulk reclaim/unlock passes that was added might be
> cleaning this up correctly or some other functional change might take
> care of this, but it looks to be potentially in the code path I'm
> exercising and it has had some unlock logic changed. 
> 
> I've spent most of my time trying to find a solid quick reproducer. Not
> that it takes long to start leaking folios, but I wanted something that
> aggressively triggered it so a small vm would oom quickly and when
> combined with crash_on_oom it could potentially be used for regression
> testing by way of "did vm crash?".
> 
> I'm not sure if it will super help, but I'll provide what details I can
> about the actual workload that really sets it off. It's a python based
> tool for downloading datasets. Datasets are split into N chunks and the
> tool downloads them in parallel 100 at a time until all N chunks are
> down. The compressed dataset is then unpacked and reassembled for
> use with workloads. 
> 
> This is replicating a common home folder usecase in HPC. CephFS is very
> attractive for home folders due to its "NFS-like" utility and
> performance. And many tools use a similar method for fetching large
> datasets. Tools are frequently written in python or go. 
> 
> None of my customers have hit this yet, nor have any enterprise
> customers, as none have moved to a new enough kernel yet due to slow
> upgrade cycles. Even Proxmox have only just started testing on a kernel
> version > 6.14. 
> 
> I'm more than happy to help however I can with testing. I can run
> instrumented kernels or test patches or whatever you need. I am sorry I
> haven't been able to produce a super clean, fast reproducer (my test
> cluster at home is all spinners and only 500TB usable). But I figured I
> needed to get the word out asap as distros and soon customers are going
> to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
> marches on. Especially those wanting to take full advantage of CacheFS
> and encryption functionality. 
> 
> Again thanks for looking at this and do reach out if I can help in
> any way. I am in the ceph slack if it's faster to reach out that way.
> 
> 

Thanks a lot for all of your efforts. I hope it will help a lot. Let me start
reproducing the issue. I'll let you know if I need additional details. I'll share
my progress and any potential troubles in the ticket that you've created.

Thanks,
Slava.