At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
>"David Wang" <00107082@163.com> wrote:
>
>> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:
>> >On Mon, 8 Dec 2025 19:08:29 +0800
>> >David Wang <00107082@163.com> wrote:
>> >
>> >> On Mon, 10 Nov 2025 18:20:08 +1000
>> >> Mal Haak <malcolm@haak.id.au> wrote:
>> >> > Hello,
>> >> >
>> >> > I have found a memory leak in 6.17.7 but I am unsure how to
>> >> > track it down effectively.
>> >> >
>> >> >
>> >>
>> >> I think the `memory allocation profiling` feature can help.
>> >> https://docs.kernel.org/mm/allocation-profiling.html
>> >>
>> >> You would need to build a kernel with
>> >> CONFIG_MEM_ALLOC_PROFILING=y
>> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
>> >>
>> >> And check /proc/allocinfo for the suspicious allocations which take
>> >> more memory than expected.
>> >>
>> >> (I once caught a nvidia driver memory leak.)
>> >>
>> >>
>> >> FYI
>> >> David
>> >>
>> >
>> >Thank you for your suggestion. I have some results.
>> >
>> >Ran the rsync workload for about 9 hours. It started to look like it
>> >was happening.
>> ># smem -pw
>> >Area Used Cache Noncache
>> >firmware/hardware 0.00% 0.00% 0.00%
>> >kernel image 0.00% 0.00% 0.00%
>> >kernel dynamic memory 80.46% 65.80% 14.66%
>> >userspace memory 0.35% 0.16% 0.19%
>> >free memory 19.19% 19.19% 0.00%
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 22M 5609 mm/memory.c:1190 func:folio_prealloc
>> > 23M 1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
>> > 24M 24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
>> > 27M 6693 mm/memory.c:1192 func:folio_prealloc
>> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> > 258M 129 mm/khugepaged.c:1069 func:alloc_charge_folio
>> > 430M 770788 lib/xarray.c:378 func:xas_alloc
>> > 545M 36444 mm/slub.c:3059 func:alloc_slab_page
>> > 9.8G 2563617 mm/readahead.c:189 func:ractl_alloc_folio
>> > 20G 5164004 mm/filemap.c:2012 func:__filemap_get_folio
>> >
>> >
>> >So I stopped the workload and dropped caches to confirm.
>> >
>> ># echo 3 > /proc/sys/vm/drop_caches
>> ># smem -pw
>> >Area Used Cache Noncache
>> >firmware/hardware 0.00% 0.00% 0.00%
>> >kernel image 0.00% 0.00% 0.00%
>> >kernel dynamic memory 33.45% 0.09% 33.36%
>> >userspace memory 0.36% 0.16% 0.19%
>> >free memory 66.20% 66.20% 0.00%
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
>> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
>> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
>> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
>> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
>> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
>> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
>> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
>> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
Maybe narrowing down which caller of __filemap_get_folio is behind the "Noncache" memory would help clarify things.
(It could be designed that way, and need a route other than dropping caches to release the memory; just a guess....)
If you want, you can modify the code to split the accounting for __filemap_get_folio according to its callers.
Following is a draft patch: (based on v6.18)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 09b581c1d878..ba8c659a6ae3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
 }
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+
+#define __filemap_get_folio(...) \
+	alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
+
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 024b71da5224..e1c1c26d7cb3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
@@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			err = -ENOMEM;
 			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
-			folio = filemap_alloc_folio(alloc_gfp, order);
+			folio = filemap_alloc_folio_noprof(alloc_gfp, order);
 			if (!folio)
 				continue;
 
@@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 	folio_clear_dropbehind(folio);
 	return folio;
 }
-EXPORT_SYMBOL(__filemap_get_folio);
+EXPORT_SYMBOL(__filemap_get_folio_noprof);
 
 static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 		xa_mark_t mark)
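With this applied, the single mm/filemap.c:2012 entry should split into
one /proc/allocinfo line per call site of __filemap_get_folio (the
netfs/ceph write paths, the generic buffered-write path, and so on), so
re-running the earlier commands along these lines should show which
caller is holding the memory (the grep pattern is only a guess at where
the interesting callers live):

# sort -g /proc/allocinfo | tail -n 20 | numfmt --to=iec
# grep -E 'fs/(ceph|netfs)' /proc/allocinfo | sort -g | numfmt --to=iec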
FYI
David
>> >
>> >So if I'm reading this correctly, something is causing folios to
>> >collect and not be able to be freed?
>>
>> CC cephfs, maybe someone could have an easy reading out of those
>> folio usage
>>
>>
>> >
>> >Also it's clear that some of the folio's are counting as cache and
>> >some aren't.
>> >
>> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm now
>> >going to manually walk through previous kernel releases and find
>> >where it first starts happening purely because I'm having issues
>> >building earlier kernels due to rust stuff and other python
>> >incompatibilities making doing a git-bisect a bit fun.
>> >
>> >I'll do it the packages way until I get closer, then solve the build
>> >issues.
>> >
>> >Thanks,
>> >Mal
>> >
>Thanks David.
>
>I've contacted the ceph developers as well.
>
>There was a suggestion it was due to the change from, to quote:
>folio.free() to folio.put() or something like this.
>
>The change happened around 6.14/6.15
>
>I've found an easier reproducer.
>
>There has been a suggestion that perhaps the ceph team might not fix
>this as "you can just reboot before the machine becomes unstable" and
>"Nobody else has encountered this bug"
>
>I'll leave that to other people to make a call on but I'd assume the
>lack of reports is due to the fact that most stable distros are still
>on a far too early kernel and/or are using the fuse driver with k8s.
>
>Anyway, thanks for your assistance.
On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
"David Wang" <00107082@163.com> wrote:
> At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
> >"David Wang" <00107082@163.com> wrote:
> >
> >> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:
> >> >On Mon, 8 Dec 2025 19:08:29 +0800
> >> >David Wang <00107082@163.com> wrote:
> >> >
> >> >> On Mon, 10 Nov 2025 18:20:08 +1000
> >> >> Mal Haak <malcolm@haak.id.au> wrote:
> >> >> > Hello,
> >> >> >
> >> >> > I have found a memory leak in 6.17.7 but I am unsure how to
> >> >> > track it down effectively.
> >> >> >
> >> >> >
> >> >>
> >> >> I think the `memory allocation profiling` feature can help.
> >> >> https://docs.kernel.org/mm/allocation-profiling.html
> >> >>
> >> >> You would need to build a kernel with
> >> >> CONFIG_MEM_ALLOC_PROFILING=y
> >> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> >> >>
> >> >> And check /proc/allocinfo for the suspicious allocations which
> >> >> take more memory than expected.
> >> >>
> >> >> (I once caught a nvidia driver memory leak.)
> >> >>
> >> >>
> >> >> FYI
> >> >> David
> >> >>
> >> >
> >> >Thank you for your suggestion. I have some results.
> >> >
> >> >Ran the rsync workload for about 9 hours. It started to look like
> >> >it was happening.
> >> ># smem -pw
> >> >Area Used Cache Noncache
> >> >firmware/hardware 0.00% 0.00% 0.00%
> >> >kernel image 0.00% 0.00% 0.00%
> >> >kernel dynamic memory 80.46% 65.80% 14.66%
> >> >userspace memory 0.35% 0.16% 0.19%
> >> >free memory 19.19% 19.19% 0.00%
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> > 22M 5609 mm/memory.c:1190 func:folio_prealloc
> >> > 23M 1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
> >> > 24M 24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
> >> > 27M 6693 mm/memory.c:1192 func:folio_prealloc
> >> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
> >> > 258M 129 mm/khugepaged.c:1069 func:alloc_charge_folio
> >> > 430M 770788 lib/xarray.c:378 func:xas_alloc
> >> > 545M 36444 mm/slub.c:3059 func:alloc_slab_page
> >> > 9.8G 2563617 mm/readahead.c:189 func:ractl_alloc_folio
> >> > 20G 5164004 mm/filemap.c:2012 func:__filemap_get_folio
> >> >
> >> >
> >> >So I stopped the workload and dropped caches to confirm.
> >> >
> >> ># echo 3 > /proc/sys/vm/drop_caches
> >> ># smem -pw
> >> >Area Used Cache Noncache
> >> >firmware/hardware 0.00% 0.00% 0.00%
> >> >kernel image 0.00% 0.00% 0.00%
> >> >kernel dynamic memory 33.45% 0.09% 33.36%
> >> >userspace memory 0.36% 0.16% 0.19%
> >> >free memory 66.20% 66.20% 0.00%
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
> >> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
> >> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
> >> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
> >> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
> >> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
> >> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
> >> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
> >> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
> >> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
>
> Maybe narrowing down which caller of __filemap_get_folio is behind the
> "Noncache" memory would help clarify things. (It could be designed
> that way, and need a route other than dropping caches to release the
> memory; just a guess....) If you want, you can modify the code to
> split the accounting for __filemap_get_folio according to its callers.
Thanks again, I'll add this patch in and see where I end up.
The issue is nothing will cause the memory to be freed. Dropping caches
doesn't work, memory pressure doesn't work, unmounting the filesystems
doesn't work. Removing the cephfs and netfs kernel modules also doesn't
work.
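To be concrete, the sort of sequence that fails to shrink the counter
looks roughly like this (a sketch rather than a transcript; stress-ng
here just stands in for any convenient way of applying memory pressure):

# grep 'func:__filemap_get_folio' /proc/allocinfo | numfmt --to=iec
# echo 3 > /proc/sys/vm/drop_caches
# echo 1 > /proc/sys/vm/compact_memory
# stress-ng --vm 4 --vm-bytes 75% --timeout 120s
# grep 'func:__filemap_get_folio' /proc/allocinfo | numfmt --to=iec

The before and after numbers stay essentially the same.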
This is why I feel it's a ref_count (or similar) issue.
I've also found it seems to be a fixed amount leaked each time, per
file. Simply doing lots of IO on one large file doesn't leak as fast as
doing it across lots of "small" files (greater than 10MB but less than
100MB seems to be a sweet spot).
Also, dropping caches while the workload is running actually amplifies
the issue. So it very much feels like something is wrong in the reclaim
code.
Anyway I'll get this patch applied and see where I end up.
I now have crash dumps (after enabling crash_on_oom), so I'm going to
try to see if I can find these structures and check what state they
are in.
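Roughly what I have in mind for the dumps (a sketch only; the
vmlinux/vmcore paths are whatever kdump wrote on this box, and the
commands are the stock crash(8) ones):

# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/vmcore
crash> kmem -i
crash> kmem -p | head
crash> struct folio <address-of-a-suspect-folio>

kmem -i gives the overall memory picture at the time of the OOM,
kmem -p walks the page structs (mapping, flags, reference counts), and
struct folio on a suspect address should show whether those folios are
still carrying references that keep them from being reclaimed.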
Thanks again.
Mal
> Following is a draft patch: (based on v6.18)
>
> <SNIP>
>
> FYI
> David
>
> >> >
> >> >So if I'm reading this correctly, something is causing folios to
> >> >collect and not be able to be freed?
> >>
> >> CC cephfs, maybe someone could have an easy reading out of those
> >> folio usage
> >>
> >>
> >> >
> >> >Also it's clear that some of the folio's are counting as cache and
> >> >some aren't.
> >> >
> >> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm
> >> >now going to manually walk through previous kernel releases and
> >> >find where it first starts happening purely because I'm having
> >> >issues building earlier kernels due to rust stuff and other python
> >> >incompatibilities making doing a git-bisect a bit fun.
> >> >
> >> >I'll do it the packages way until I get closer, then solve the
> >> >build issues.
> >> >
> >> >Thanks,
> >> >Mal
> >> >
> >Thanks David.
> >
> >I've contacted the ceph developers as well.
> >
> >There was a suggestion it was due to the change from, to quote:
> >folio.free() to folio.put() or something like this.
> >
> >The change happened around 6.14/6.15
> >
> >I've found an easier reproducer.
> >
> >There has been a suggestion that perhaps the ceph team might not fix
> >this as "you can just reboot before the machine becomes unstable" and
> >"Nobody else has encountered this bug"
> >
> >I'll leave that to other people to make a call on but I'd assume the
> >lack of reports is due to the fact that most stable distros are still
> >on a far too early kernel and/or are using the fuse driver with
> >k8s.
> >
> >Anyway, thanks for your assistance.
Hi Mal,

On Thu, 2025-12-11 at 14:23 +1000, Mal Haak wrote:
> <SNIP>
> I now have crash dumps (after enabling crash_on_oom), so I'm going to
> try to see if I can find these structures and check what state they
> are in.

Thanks a lot for reporting the issue. Finally, I can see the discussion
in the email list. :) Are you working on a patch with the fix? Should we
wait for the fix, or do I need to start reproducing and investigating
the issue? I am simply trying to avoid patch collisions and, also, I
have multiple other issues to fix in the CephFS kernel client. :)

Thanks,
Slava.
On Mon, 15 Dec 2025 19:42:56 +0000
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Hi Mal,
>
<SNIP>
>
> Thanks a lot for reporting the issue. Finally, I can see the
> discussion in the email list. :) Are you working on a patch with the
> fix? Should we wait for the fix, or do I need to start reproducing and
> investigating the issue? I am simply trying to avoid patch collisions
> and, also, I have multiple other issues to fix in the CephFS kernel
> client. :)
>
> Thanks,
> Slava.

Hello,

Unfortunately creating a patch is just outside my comfort zone, I've
lived too long in Lustre land.

I have been trying to narrow down a consistent reproducer that's as
fast as my production workload (which crashes a 32GB VM in 2hrs), and I
haven't got it quite as fast yet. I think the dd workload is too well
behaved.

I can confirm the issue appeared in the major patch set that went into
the 6.15 kernel, i.e. during the more complete pages-to-folios switch,
and that nothing in the bug's behaviour has changed since then. I did
have a look at all the diffs from 6.14 to 6.18 on addr.c and didn't see
any changes post 6.15 that looked like they would impact the bug
behaviour.

Again, I'm not super familiar with the CephFS code, but to hazard a
guess, the fact that the web download workload triggers things faster
suggests that unaligned writes might make things worse. But again, I'm
not 100% sure. I can't find a reproducer as fast as downloading a
dataset. Rsync of lots and lots of tiny files is a tad faster than the
dd case.

I did see some changes in ceph_check_page_before_write where the
previous code unlocked pages and then continued, whereas the changed
folio code just returns ENODATA and doesn't unlock anything, with most
of the rest of the logic unchanged. This might be perfectly fine, but
in my, admittedly limited, reading of the code I couldn't figure out
where anything that was locked prior to this being called would get
unlocked like it did prior to the change. Again, I could be miles off
here, and one of the bulk reclaim/unlock passes that was added might be
cleaning this up correctly, or some other functional change might take
care of this, but it looks to be potentially in the code path I'm
exercising and it has had some unlock logic changed.

I've spent most of my time trying to find a solid quick reproducer. Not
that it takes long to start leaking folios, but I wanted something that
aggressively triggered it so a small VM would OOM quickly and, when
combined with crash_on_oom, could potentially be used for regression
testing by way of "did the VM crash?".

I'm not sure how much it will help, but I'll provide what details I can
about the actual workload that really sets it off. It's a Python-based
tool for downloading datasets. Datasets are split into N chunks and the
tool downloads them in parallel, 100 at a time, until all N chunks are
down. The compressed dataset is then unpacked and reassembled for use
with workloads.

This is replicating a common home-folder use case in HPC. CephFS is
very attractive for home folders due to its "NFS-like" utility and
performance, and many tools use a similar method for fetching large
datasets. Tools are frequently written in Python or Go.

None of my customers have hit this yet, nor have any enterprise
customers, as none have moved to a new enough kernel yet due to slow
upgrade cycles. Even Proxmox have only just started testing on a kernel
version > 6.14.

I'm more than happy to help however I can with testing. I can run
instrumented kernels or test patches or whatever you need. I am sorry I
haven't been able to produce a super clean, fast reproducer (my test
cluster at home is all spinners and only 500TB usable). But I figured I
needed to get the word out ASAP, as distros and soon customers are
going to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
marches on, especially those wanting to take full advantage of CacheFS
and encryption functionality.

Again, thanks for looking at this, and do reach out if I can help in
any way. I am in the Ceph Slack if it's faster to reach out that way.

Regards

Mal Haak
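For anyone who wants to poke at this without the download tooling, a
rough sketch of the shape of the workload (many parallel writers of
mid-sized files on a kernel-client CephFS mount; the mount point, file
count and sizes below are placeholders, and as noted above plain dd
seems to trigger it more slowly than the real parallel downloads):

# MNT=/mnt/cephfs/leaktest; mkdir -p "$MNT"
# seq 1 1000 | xargs -P 100 -I{} dd if=/dev/urandom of="$MNT/chunk-{}" bs=1M count=64 status=none
# sync; echo 3 > /proc/sys/vm/drop_caches
# grep 'func:__filemap_get_folio' /proc/allocinfo | numfmt --to=iec

If the leak is being hit, that last counter should stay up in the
gigabytes after the cache drop, much like the 11G figure earlier in the
thread, rather than collapsing along with the rest of the page cache.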
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
<SNIP>
>Unfortunately creating a patch is just outside my comfort zone, I've
>lived too long in Lustre land.

Hi, just out of curiosity, have you narrowed down the caller of
__filemap_get_folio causing the memory problem? Or do you have trouble
applying the debug patch for memory allocation profiling?

David
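If it is the latter, a quick sanity check that the profiling feature is
actually built in and switched on (names as in the allocation-profiling
document linked earlier in the thread; the config file path assumes a
distro-style /boot/config):

# grep MEM_ALLOC_PROFILING /boot/config-$(uname -r)
# cat /proc/sys/vm/mem_profiling
# sort -g /proc/allocinfo | tail | numfmt --to=iec

The sysctl should read 1 on a kernel built with
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y; if it reads 0 it can
be flipped at runtime before starting the workload.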
On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

<SNIP>
> Hi, just out of curiosity, have you narrowed down the caller of
> __filemap_get_folio causing the memory problem? Or do you have
> trouble applying the debug patch for memory allocation profiling?
>
> David

Hi David,

I hadn't yet, as I did test XFS and NFS to see if they replicated the
behaviour and they did not. But actually this could speed things up
considerably. I will do that now and see what I get.

Thanks

Mal
On Tue, 2025-12-16 at 11:26 +1000, Mal Haak wrote:
<SNIP>
> I'm more than happy to help however I can with testing. I can run
> instrumented kernels or test patches or whatever you need.
<SNIP>

Thanks a lot for all of your efforts. I hope it will help a lot. Let me
start to reproduce the issue. I'll let you know if I need additional
details. I'll share my progress and potential troubles in the ticket
that you've created.

Thanks,
Slava.