At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
>"David Wang" <00107082@163.com> wrote:
>
>> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:
>> >On Mon, 8 Dec 2025 19:08:29 +0800
>> >David Wang <00107082@163.com> wrote:
>> >
>> >> On Mon, 10 Nov 2025 18:20:08 +1000
>> >> Mal Haak <malcolm@haak.id.au> wrote:
>> >> > Hello,
>> >> >
>> >> > I have found a memory leak in 6.17.7 but I am unsure how to
>> >> > track it down effectively.
>> >> >
>> >> >
>> >>
>> >> I think the `memory allocation profiling` feature can help.
>> >> https://docs.kernel.org/mm/allocation-profiling.html
>> >>
>> >> You would need to build a kernel with
>> >> CONFIG_MEM_ALLOC_PROFILING=y
>> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
>> >>
>> >> And check /proc/allocinfo for the suspicious allocations which take
>> >> more memory than expected.
>> >>
>> >> (I once caught a nvidia driver memory leak.)
>> >>
>> >>
>> >> FYI
>> >> David
>> >>
>> >
>> >Thank you for your suggestion. I have some results.
>> >
>> >Ran the rsync workload for about 9 hours. It started to look like it
>> >was happening.
>> ># smem -pw
>> >Area Used Cache Noncache
>> >firmware/hardware 0.00% 0.00% 0.00%
>> >kernel image 0.00% 0.00% 0.00%
>> >kernel dynamic memory 80.46% 65.80% 14.66%
>> >userspace memory 0.35% 0.16% 0.19%
>> >free memory 19.19% 19.19% 0.00%
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 22M 5609 mm/memory.c:1190 func:folio_prealloc
>> > 23M 1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
>> > 24M 24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
>> > 27M 6693 mm/memory.c:1192 func:folio_prealloc
>> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> > 258M 129 mm/khugepaged.c:1069 func:alloc_charge_folio
>> > 430M 770788 lib/xarray.c:378 func:xas_alloc
>> > 545M 36444 mm/slub.c:3059 func:alloc_slab_page
>> > 9.8G 2563617 mm/readahead.c:189 func:ractl_alloc_folio
>> > 20G 5164004 mm/filemap.c:2012 func:__filemap_get_folio
>> >
>> >
>> >So I stopped the workload and dropped caches to confirm.
>> >
>> ># echo 3 > /proc/sys/vm/drop_caches
>> ># smem -pw
>> >Area Used Cache Noncache
>> >firmware/hardware 0.00% 0.00% 0.00%
>> >kernel image 0.00% 0.00% 0.00%
>> >kernel dynamic memory 33.45% 0.09% 33.36%
>> >userspace memory 0.36% 0.16% 0.19%
>> >free memory 66.20% 66.20% 0.00%
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
>> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
>> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
>> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
>> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
>> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
>> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
>> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
>> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
Maybe narrowing down the "Noncache" caller of __filemap_get_folio would help clarify things.
(It could be designed that way and need a route other than dropping caches to release the memory; just a guess....)
If you want, you can modify the code to split the accounting for __filemap_get_folio according to its callers.
Following is a draft patch: (based on v6.18)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 09b581c1d878..ba8c659a6ae3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
}
void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+
+#define __filemap_get_folio(...) \
+ alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
+
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp);
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 024b71da5224..e1c1c26d7cb3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
*
* Return: The found folio or an ERR_PTR() otherwise.
*/
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
+struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
fgf_t fgp_flags, gfp_t gfp)
{
struct folio *folio;
@@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
err = -ENOMEM;
if (order > min_order)
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
- folio = filemap_alloc_folio(alloc_gfp, order);
+ folio = filemap_alloc_folio_noprof(alloc_gfp, order);
if (!folio)
continue;
@@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio_clear_dropbehind(folio);
return folio;
}
-EXPORT_SYMBOL(__filemap_get_folio);
+EXPORT_SYMBOL(__filemap_get_folio_noprof);
static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
xa_mark_t mark)
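
(For reference, alloc_hooks() is what makes the split work: roughly speaking, it
creates an allocation tag at the macro expansion site and makes it current
around the call, so the pages get charged to the caller's file:line. A
simplified sketch of the idea, using my own macro name; the real macro lives in
include/linux/alloc_tag.h and differs in detail:)

/* Simplified sketch of the idea behind alloc_hooks(); not the exact
 * kernel definition. */
#define my_alloc_hooks(call)						\
({									\
	DEFINE_ALLOC_TAG(_tag);		/* tag = this file:line */	\
	struct alloc_tag *_old = alloc_tag_save(&_tag);			\
	typeof(call) _res = (call);	/* allocations charge _tag */	\
	alloc_tag_restore(&_tag, _old);					\
	_res;								\
})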
FYI
David
>> >
>> >So if I'm reading this correctly something is causing folios to
>> >collect and not be able to be freed?
>>
>> CC cephfs, maybe someone there can make sense of that folio usage
>>
>>
>> >
>> >Also it's clear that some of the folios are counting as cache and
>> >some aren't.
>> >
>> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm now
>> >going to manually walk through previous kernel releases and find
>> >where it first starts happening purely because I'm having issues
>> >building earlier kernels due to rust stuff and other python
>> >incompatibilities making doing a git-bisect a bit fun.
>> >
>> >I'll do it the packages way until I get closer, then solve the build
>> >issues.
>> >
>> >Thanks,
>> >Mal
>> >
>Thanks David.
>
>I've contacted the ceph developers as well.
>
>There was a suggestion it was due to the change from, to quote:
>folio.free() to folio.put() or something like this.
>
>The change happened around 6.14/6.15
>
>I've found an easier reproducer.
>
>There has been a suggestion that perhaps the ceph team might not fix
>this as "you can just reboot before the machine becomes unstable" and
>"Nobody else has encountered this bug"
>
>I'll leave that to other people to make a call on but I'd assume the
>lack of reports is due to the fact that most stable distros are still
>> >on a far too early kernel and/or are using the fuse driver with k8s.
>
>Anyway, thanks for your assistance.
On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
"David Wang" <00107082@163.com> wrote:
> At 2025-12-10 21:43:18, "Mal Haak" <malcolm@haak.id.au> wrote:
> >On Tue, 9 Dec 2025 12:40:21 +0800 (CST)
> >"David Wang" <00107082@163.com> wrote:
> >
> >> At 2025-12-09 07:08:31, "Mal Haak" <malcolm@haak.id.au> wrote:
> >> >On Mon, 8 Dec 2025 19:08:29 +0800
> >> >David Wang <00107082@163.com> wrote:
> >> >
> >> >> On Mon, 10 Nov 2025 18:20:08 +1000
> >> >> Mal Haak <malcolm@haak.id.au> wrote:
> >> >> > Hello,
> >> >> >
> >> >> > I have found a memory leak in 6.17.7 but I am unsure how to
> >> >> > track it down effectively.
> >> >> >
> >> >> >
> >> >>
> >> >> I think the `memory allocation profiling` feature can help.
> >> >> https://docs.kernel.org/mm/allocation-profiling.html
> >> >>
> >> >> You would need to build a kernel with
> >> >> CONFIG_MEM_ALLOC_PROFILING=y
> >> >> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> >> >>
> >> >> And check /proc/allocinfo for the suspicious allocations which
> >> >> take more memory than expected.
> >> >>
> >> >> (I once caught a nvidia driver memory leak.)
> >> >>
> >> >>
> >> >> FYI
> >> >> David
> >> >>
> >> >
> >> >Thank you for your suggestion. I have some results.
> >> >
> >> >Ran the rsync workload for about 9 hours. It started to look like
> >> >it was happening.
> >> ># smem -pw
> >> >Area Used Cache Noncache
> >> >firmware/hardware 0.00% 0.00% 0.00%
> >> >kernel image 0.00% 0.00% 0.00%
> >> >kernel dynamic memory 80.46% 65.80% 14.66%
> >> >userspace memory 0.35% 0.16% 0.19%
> >> >free memory 19.19% 19.19% 0.00%
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> > 22M 5609 mm/memory.c:1190 func:folio_prealloc
> >> > 23M 1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
> >> > 24M 24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
> >> > 27M 6693 mm/memory.c:1192 func:folio_prealloc
> >> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
> >> > 258M 129 mm/khugepaged.c:1069 func:alloc_charge_folio
> >> > 430M 770788 lib/xarray.c:378 func:xas_alloc
> >> > 545M 36444 mm/slub.c:3059 func:alloc_slab_page
> >> > 9.8G 2563617 mm/readahead.c:189 func:ractl_alloc_folio
> >> > 20G 5164004 mm/filemap.c:2012 func:__filemap_get_folio
> >> >
> >> >
> >> >So I stopped the workload and dropped caches to confirm.
> >> >
> >> ># echo 3 > /proc/sys/vm/drop_caches
> >> ># smem -pw
> >> >Area Used Cache Noncache
> >> >firmware/hardware 0.00% 0.00% 0.00%
> >> >kernel image 0.00% 0.00% 0.00%
> >> >kernel dynamic memory 33.45% 0.09% 33.36%
> >> >userspace memory 0.36% 0.16% 0.19%
> >> >free memory 66.20% 66.20% 0.00%
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
> >> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
> >> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
> >> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
> >> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
> >> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
> >> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
> >> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
> >> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
> >> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
>
> Maybe narrowing down the "Noncache" caller of __filemap_get_folio
> would help clarify things. (It could be designed that way and need a
> route other than dropping caches to release the memory; just a
> guess....) If you want, you can modify the code to split the
> accounting for __filemap_get_folio according to its callers.
Thanks again, I'll add this patch in and see where I end up.
The issue is nothing will cause the memory to be freed. Dropping caches
doesn't work, memory pressure doesn't work, unmounting the filesystems
doesn't work. Removing the cephfs and netfs kernel modules also doesn't
work.
This is why I feel it's a ref_count (or similar) issue.
I've also found it seems to be a fixed amount leaked each time, per
file. Simply doing lots of IO on one large file doesn't leak as fast as
lots of "small" files (greater than 10MB but less than 100MB seems to be
a sweet spot).
Also, dropping caches while the workload is running actually amplifies
the issue. So it very much feels like something is wrong in the reclaim
code.
Anyway I'll get this patch applied and see where I end up.
I now have crash dumps (after enabling crash_on_oom) so I'm going to
try and see if I can find these structures and see what state they are
in.
Thanks again.
Mal
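
P.S. When I get into the dump, what I'll mostly be looking for is folios
whose reference counts are higher than the page cache alone would
explain. Something along these lines is roughly what I have in mind as a
throwaway debug helper (a rough sketch only; the helper and the refcount
threshold are hypothetical, and mappings/writeback add legitimate extra
references):

/* Hypothetical debug-only helper, not existing kernel code: walk an
 * inode's page cache and report folios holding more references than the
 * cache itself would explain. The threshold is a rough heuristic. */
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/printk.h>

static void report_pinned_folios(struct address_space *mapping)
{
	struct folio_batch fbatch;
	pgoff_t index = 0;
	unsigned int i;

	folio_batch_init(&fbatch);
	while (filemap_get_folios(mapping, &index, (pgoff_t)-1, &fbatch)) {
		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];
			/* one ref from this lookup plus the page cache's refs */
			long expected = folio_nr_pages(folio) + 1;

			if (folio_ref_count(folio) > expected)
				pr_warn("pinned folio idx=%lu refs=%d expected=%ld\n",
					folio->index, folio_ref_count(folio),
					expected);
		}
		folio_batch_release(&fbatch);
	}
}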
> Following is a draft patch: (based on v6.18)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 09b581c1d878..ba8c659a6ae3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -753,7 +753,11 @@ static inline fgf_t fgf_set_order(size_t size)
> }
>
> void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
> -struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> +
> +#define __filemap_get_folio(...) \
> +	alloc_hooks(__filemap_get_folio_noprof(__VA_ARGS__))
> +
> +struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
> 	fgf_t fgp_flags, gfp_t gfp);
> struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
> 	fgf_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 024b71da5224..e1c1c26d7cb3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1938,7 +1938,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
>  *
>  * Return: The found folio or an ERR_PTR() otherwise.
>  */
> -struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> +struct folio *__filemap_get_folio_noprof(struct address_space *mapping, pgoff_t index,
> 	fgf_t fgp_flags, gfp_t gfp)
> {
> 	struct folio *folio;
> @@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> 	err = -ENOMEM;
> 	if (order > min_order)
> 		alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> -	folio = filemap_alloc_folio(alloc_gfp, order);
> +	folio = filemap_alloc_folio_noprof(alloc_gfp, order);
> 	if (!folio)
> 		continue;
>
> @@ -2056,7 +2056,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> 	folio_clear_dropbehind(folio);
> 	return folio;
> }
> -EXPORT_SYMBOL(__filemap_get_folio);
> +EXPORT_SYMBOL(__filemap_get_folio_noprof);
>
> static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
> 	xa_mark_t mark)
>
>
>
>
> FYI
> David
>
> >> >
> >> >So if I'm reading this correctly something is causing folios to
> >> >collect and not be able to be freed?
> >>
> >> CC cephfs, maybe someone there can make sense of that folio usage
> >>
> >>
> >> >
> >> >Also it's clear that some of the folios are counting as cache and
> >> >some aren't.
> >> >
> >> >Like I said 6.17 and 6.18 both have the issue. 6.12 does not. I'm
> >> >now going to manually walk through previous kernel releases and
> >> >find where it first starts happening purely because I'm having
> >> >issues building earlier kernels due to rust stuff and other python
> >> >incompatibilities making doing a git-bisect a bit fun.
> >> >
> >> >I'll do it the packages way until I get closer, then solve the
> >> >build issues.
> >> >
> >> >Thanks,
> >> >Mal
> >> >
> >Thanks David.
> >
> >I've contacted the ceph developers as well.
> >
> >There was a suggestion it was due to the change from, to quote:
> >folio.free() to folio.put() or something like this.
> >
> >The change happened around 6.14/6.15
> >
> >I've found an easier reproducer.
> >
> >There has been a suggestion that perhaps the ceph team might not fix
> >this as "you can just reboot before the machine becomes unstable" and
> >"Nobody else has encountered this bug"
> >
> >I'll leave that to other people to make a call on but I'd assume the
> >lack of reports is due to the fact that most stable distros are still
> >on a far too early kernel and/or are using the fuse driver with
> >k8s.
> >
> >Anyway, thanks for your assistance.
Hi Mal,

On Thu, 2025-12-11 at 14:23 +1000, Mal Haak wrote:
> On Thu, 11 Dec 2025 11:28:21 +0800 (CST)
> "David Wang" <00107082@163.com> wrote:
>
<SNIP>
>
> Thanks again, I'll add this patch in and see where I end up.
>
> The issue is nothing will cause the memory to be freed. Dropping caches
> doesn't work, memory pressure doesn't work, unmounting the filesystems
> doesn't work. Removing the cephfs and netfs kernel modules also doesn't
> work.
>
> This is why I feel it's a ref_count (or similar) issue.
>
> I've also found it seems to be a fixed amount leaked each time, per
> file. Simply doing lots of IO on one large file doesn't leak as fast as
> lots of "small" files (greater than 10MB but less than 100MB seems to be
> a sweet spot).
>
> Also, dropping caches while the workload is running actually amplifies
> the issue. So it very much feels like something is wrong in the reclaim
> code.
>
> Anyway I'll get this patch applied and see where I end up.
>
> I now have crash dumps (after enabling crash_on_oom) so I'm going to
> try and see if I can find these structures and see what state they are
> in.
>

Thanks a lot for reporting the issue. Finally, I can see the discussion
in the email list. :) Are you working on a patch for the fix? Should we
wait for the fix, or do I need to start reproducing and investigating
the issue? I am simply trying to avoid patch collisions and, also, I
have multiple other issues to fix in the CephFS kernel client. :)

Thanks,
Slava.
On Mon, 15 Dec 2025 19:42:56 +0000
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Hi Mal,
>
<SNIP>
>
> Thanks a lot for reporting the issue. Finally, I can see the discussion
> in the email list. :) Are you working on a patch for the fix? Should we
> wait for the fix, or do I need to start reproducing and investigating
> the issue? I am simply trying to avoid patch collisions and, also, I
> have multiple other issues to fix in the CephFS kernel client. :)
>
> Thanks,
> Slava.

Hello,

Unfortunately creating a patch is just outside my comfort zone; I've
lived too long in Lustre land.

I have been trying to narrow down a consistent reproducer that's as
fast as my production workload (it crashes a 32GB VM in 2hrs), and I
haven't got it quite as fast. I think the dd workload is too well
behaved.

I can confirm the issue appeared in the major patch set that went into
the 6.15 kernel, i.e. during the more complete pages-to-folios switch,
and that nothing has changed in the bug behaviour since then. I did
have a look at all the diffs from 6.14 to 6.18 on addr.c and didn't see
any changes post 6.15 that looked like they would impact the bug
behaviour.

Again, I'm not super familiar with the CephFS code, but to hazard a
guess, the fact that the web download workload triggers things faster
suggests that unaligned writes might make things worse. But again, I'm
not 100% sure. I can't find a reproducer as fast as downloading a
dataset. Rsync of lots and lots of tiny files is a tad faster than the
dd case.

I did see some changes in ceph_check_page_before_write where the
previous code unlocked pages and then continued, whereas the changed
folio code just returns ENODATA and doesn't unlock anything, with most
of the rest of the logic unchanged. This might be perfectly fine, but
in my, admittedly limited, reading of the code I couldn't figure out
where anything that was locked prior to this being called would get
unlocked like it did prior to the change. Again, I could be miles off
here, and one of the bulk reclaim/unlock passes that was added might be
cleaning this up correctly, or some other functional change might take
care of this, but it looks to be potentially in the code path I'm
exercising and it has had some unlock logic changed.

I've spent most of my time trying to find a solid quick reproducer. Not
that it takes long to start leaking folios, but I wanted something that
aggressively triggered it so a small VM would OOM quickly and, when
combined with crash_on_oom, could potentially be used for regression
testing by way of "did the VM crash?".

I'm not sure if it will help much, but I'll provide what details I can
about the actual workload that really sets it off. It's a Python-based
tool for downloading datasets. Datasets are split into N chunks and the
tool downloads them in parallel, 100 at a time, until all N chunks are
down. The compressed dataset is then unpacked and reassembled for use
with workloads.

This is replicating a common home folder use case in HPC. CephFS is
very attractive for home folders due to its "NFS-like" utility and
performance, and many tools use a similar method for fetching large
datasets. Tools are frequently written in Python or Go.

None of my customers have hit this yet, nor have any enterprise
customers, as none have moved to a new enough kernel yet due to slow
upgrade cycles. Even Proxmox have only just started testing on a kernel
version > 6.14.

I'm more than happy to help however I can with testing. I can run
instrumented kernels or test patches or whatever you need. I am sorry I
haven't been able to produce a super clean, fast reproducer (my test
cluster at home is all spinners and only 500TB usable). But I figured I
needed to get the word out asap as distros, and soon customers, are
going to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
marches on. Especially those wanting to take full advantage of CacheFS
and encryption functionality.

Again, thanks for looking at this, and do reach out if I can help in
any way. I am in the Ceph Slack if it's faster to reach out that way.

Regards

Mal Haak
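
P.S. To spell out the unlock concern in code form: the pattern I'm
worried about is an early-bailout path that leaves a folio locked and
referenced, which would pin it forever and show up as exactly this kind
of unreclaimable "noncache" memory. A purely hypothetical sketch to
illustrate the idea, not the actual ceph/netfs code:

/* Hypothetical illustration only, not ceph_check_page_before_write(). */
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int check_folio_before_write(struct folio *folio)
{
	if (!folio_test_dirty(folio)) {
		/*
		 * If an early return like this forgets the unlock/put
		 * that the pre-folio code used to do, the folio stays
		 * pinned and can never be reclaimed, even by
		 * drop_caches.
		 */
		folio_unlock(folio);
		folio_put(folio);
		return -ENODATA;
	}
	return 0;	/* caller writes the folio back and releases it */
}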
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Mon, 15 Dec 2025 19:42:56 +0000
>Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
<SNIP>
>
>I can confirm the issue appeared in the major patch set that went into
>the 6.15 kernel, i.e. during the more complete pages-to-folios switch,
>and that nothing has changed in the bug behaviour since then. I did
>have a look at all the diffs from 6.14 to 6.18 on addr.c and didn't see
>any changes post 6.15 that looked like they would impact the bug
>behaviour.

Hi,
Just a suggestion, in case you run out of ideas for further
investigation. I think you can bisect *manually*, targeting the changes
to fs/ceph between 6.14 and 6.15:

$ git log --pretty='format:%h %an' v6.14..v6.15 fs/ceph
349b7d77f5a1 Linus Torvalds
b261d2222063 Eric Biggers
f452a2204614 David Howells
e63046adefc0 Linus Torvalds
59b59a943177 Matthew Wilcox (Oracle)   <----------- 3
efbdd92ed9f6 Matthew Wilcox (Oracle)
d1b452673af4 Matthew Wilcox (Oracle)
ad49fe2b3d54 Matthew Wilcox (Oracle)
a55cf4fd8fae Matthew Wilcox (Oracle)
15fdaf2fd60d Matthew Wilcox (Oracle)
62171c16da60 Matthew Wilcox (Oracle)
baff9740bc8f Matthew Wilcox (Oracle)
f9707a8b5b9d Matthew Wilcox (Oracle)
88a59bda3f37 Matthew Wilcox (Oracle)
19a288110435 Matthew Wilcox (Oracle)
fd7449d937e7 Viacheslav Dubeyko        <----------- 2
1551ec61dc55 Viacheslav Dubeyko
ce80b76dd327 Viacheslav Dubeyko
f08068df4aa4 Viacheslav Dubeyko
3f92c7b57687 NeilBrown                 <----------- 1
88d5baf69082 NeilBrown

There were 3 major patch sets (grouped by author), so the suspect could
be narrowed down further.

(Bisect, even over a short range of patches, is quite an unpleasant
experience though....)

FYI
David

>
>Again, I'm not super familiar with the CephFS code, but to hazard a
>guess, the fact that the web download workload triggers things faster
>suggests that unaligned writes might make things worse. But again, I'm
>not 100% sure. I can't find a reproducer as fast as downloading a
>dataset. Rsync of lots and lots of tiny files is a tad faster than the
>dd case.
>
<SNIP>
On Wed, 17 Dec 2025 13:59:47 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> Hi,
> Just a suggestion, in case you run out of ideas for further
> investigation. I think you can bisect *manually*, targeting the changes
> to fs/ceph between 6.14 and 6.15
>
<SNIP>
>
> There were 3 major patch sets (grouped by author), so the suspect could
> be narrowed down further.
>
> (Bisect, even over a short range of patches, is quite an unpleasant
> experience though....)
>
> FYI
> David

Yeah, I don't think it's a small patch that is the cause of the issue.
It looks like there was a patch set that migrated cephfs off handling
individual pages and onto folios to enable wider use of netfs features
like local caching and encryption, as examples. I'm not sure that set
can be broken up and still result in a working cephfs module, which
limits the utility of a git-bisect. I'm pretty sure the issue is in
addr.c somewhere, and most of the changes in there are one patch. That
said, after I get this crash dump I'll probably do it anyway.

What I really need to do is get a crash dump and look at what state the
folios and their tracking are in. Assuming I can grok what I'm looking
at; this is the bit I'm most apprehensive of. I'm hoping I can find a
list of folios used by the reclaim machinery that is missing a bunch of
folios, or a bunch with inflated refcounts or something. Something is
going awry, but it's not fast.

I thought I had a quick reproducer. I was wrong; I sized the dd workload
incorrectly and triggered the panic_on_oom due to that, not the bug. I'm
re-running the reproducer now on a VM with 2GB of RAM. It's been running
for around 3hrs and I think it has leaked possibly 100MB-150MB of RAM at
most (it was averaging 190-200MB of noncache usage; it's now averaging
290-340MB). It does accelerate: the more folios that are in the weird
state, the more end up in the weird state. Which feels like
fragmentation side effects, but I'm just speculating.

Anyway, one of the wonderful ceph developers is looking into it. I just
hope I can do enough to help them locate the issue. They were having
trouble reproducing last I heard from them, but they might have been
expecting a slightly faster reproducer. I have, however, recreated it on
a physical host, not just a VM, so I feel like I can rule out the VM as
a cause.

Anyway, thanks for your continued assistance!

Mal
At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Mon, 15 Dec 2025 19:42:56 +0000
>Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
<SNIP>
>
>Hello,
>
>Unfortunately creating a patch is just outside my comfort zone; I've
>lived too long in Lustre land.

Hi, just out of curiosity, have you narrowed down the caller of
__filemap_get_folio causing the memory problem? Or do you have trouble
applying the debug patch for memory allocation profiling?

David

>
>I have been trying to narrow down a consistent reproducer that's as
>fast as my production workload (it crashes a 32GB VM in 2hrs), and I
>haven't got it quite as fast. I think the dd workload is too well
>behaved.
>
<SNIP>
On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
"David Wang" <00107082@163.com> wrote:

> At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
> >
<SNIP>
>
> Hi, just out of curiosity, have you narrowed down the caller of
> __filemap_get_folio causing the memory problem? Or do you have trouble
> applying the debug patch for memory allocation profiling?
>
> David

Hi David,

I hadn't yet, as I did test XFS and NFS to see if they replicated the
behaviour and they did not.

But actually this could speed things up considerably. I will do that
now and see what I get.

Thanks

Mal

<SNIP>
On Tue, 16 Dec 2025 17:09:18 +1000
Mal Haak <malcolm@haak.id.au> wrote:
> On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
> "David Wang" <00107082@163.com> wrote:
>
> > At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
> > >On Mon, 15 Dec 2025 19:42:56 +0000
> > >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > >
> > >> Hi Mal,
> > >>
> > ><SNIP>
> > >>
> > >> Thanks a lot for reporting the issue. Finally, I can see the
> > >> discussion in email list. :) Are you working on the patch with
> > >> the fix? Should we wait for the fix or I need to start the issue
> > >> reproduction and investigation? I am simply trying to avoid
> > >> patches collision and, also, I have multiple other issues for
> > >> the fix in CephFS kernel client. :)
> > >>
> > >> Thanks,
> > >> Slava.
> > >
> > >Hello,
> > >
> > >Unfortunately creating a patch is just outside my comfort zone,
> > >I've lived too long in Lustre land.
> >
> > Hi, just out of curiosity, have you narrowed down the caller of
> > __filemap_get_folio causing the memory problem? Or do you have
> > trouble applying the debug patch for memory allocation profiling?
> >
> > David
> >
> Hi David,
>
> I hadn't yet as I did test XFS and NFS to see if it replicated the
> behaviour and it did not.
>
> But actually this could speed things up considerably. I will do that
> now and see what I get.
>
> Thanks
>
> Mal
>
I did just give it a blast.
Unfortunately it returned exactly what I expected, that is, the calls
are all coming from netfs.
Which makes sense for cephfs.
# sort -g /proc/allocinfo|tail|numfmt --to=iec
10M 2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
12M 3605 kernel/fork.c:311 func:alloc_thread_stack_node
16M 992 mm/slub.c:3061 func:alloc_slab_page
20M 35544 lib/xarray.c:378 func:xas_alloc
31M 7704 mm/memory.c:1192 func:folio_prealloc
69M 17562 mm/memory.c:1190 func:folio_prealloc
104M 8212 mm/slub.c:3059 func:alloc_slab_page
124M 30075 mm/readahead.c:189 func:ractl_alloc_folio
2.6G 661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
So, unfortunately it doesn't reveal the true source. But it was worth a
shot! So thanks again
Mal
> > >
> > >I have been trying to narrow down a consistent reproducer that's
> > >as fast as my production workload. (It crashes a 32GB VM in 2hrs)
> > >And I haven't got it quite as fast. I think the dd workload is too
> > >well behaved.
> > >
> > >I can confirm the issue appeared in the major patch set that was
> > >applied as part of the 6.15 kernel. So during the more complete
> > >pages to folios switch and that nothing has changed in the bug
> > >behaviour since then. I did have a look at all the diffs from 6.14
> > >to 6.18 on addr.c and didn't see any changes post 6.15 that looked
> > >like they would impact the bug behavior.
> > >
> > >Again, I'm not super familiar with the CephFS code but to hazard a
> > >guess, but I think that the web download workload triggers things
> > >faster suggests that unaligned writes might make things worse. But
> > >again, I'm not 100% sure. I can't find a reproducer as fast as
> > >downloading a dataset. Rsync of lots and lots of tiny files is a
> > >tad faster than the dd case.
> > >
> > >I did see some changes in ceph_check_page_before_write where the
> > >previous code unlocked pages and then continued where as the
> > >changed folio code just returns ENODATA and doesn't unlock
> > >anything with most of the rest of the logic unchanged. This might
> > >be perfectly fine, but in my, admittedly limited, reading of the
> > >code I couldn't figure out where anything that was locked prior to
> > >this being called would get unlocked like it did prior to the
> > >change. Again, I could be miles off here and one of the bulk
> > >reclaim/unlock passes that was added might be cleaning this up
> > >correctly or some other functional change might take care of this,
> > >but it looks to be potentially in the code path I'm exercising and
> > >it has had some unlock logic changed.
> > >
> > >I've spent most of my time trying to find a solid quick reproducer.
> > >Not that it takes long to start leaking folios, but I wanted
> > >something that aggressively triggered it so a small vm would oom
> > >quickly and when combined with crash_on_oom it could potentially be
> > >used for regression testing by way of "did vm crash?".
> > >
> > >I'm not sure if it will super help, but I'll provide what details I
> > >can about the actual workload that really sets it off. It's a
> > >python based tool for downloading datasets. Datasets are split
> > >into N chunks and the tool downloads them in parallel 100 at a
> > >time until all N chunks are down. The compressed dataset is then
> > >unpacked and reassembled for use with workloads.
> > >
> > >This is replicating a common home folder usecase in HPC. CephFS is
> > >very attractive for home folders due to it's "NFS-like" utility and
> > >performance. And many tools use a similar method for fetching large
> > >datasets. Tools are frequently written in python or go.
> > >
> > >None of my customers have hit this yet, not have any enterprise
> > >customers as none have moved to a new enough kernel yet due to slow
> > >upgrade cycles. Even Proxmox have only just started testing on a
> > >kernel version > 6.14.
> > >
> > >I'm more than happy to help however I can with testing. I can run
> > >instrumented kernels or test patches or whatever you need. I am
> > >sorry I haven't been able to produce a super clean, fast reproducer
> > >(my test cluster at home is all spinners and only 500TB usable).
> > >But I figured I needed to get the word out asap as distros and soon
> > >customers are going to be moving past 6.12-6.14 kernels as the 5-7
> > >year update cycle marches on. Especially those wanting to take full
> > >advantage of CacheFS and encryption functionality.
> > >
> > >Again thanks for looking at this and do reach out if I can help in
> > >anyway. I am in the ceph slack if it's faster to reach out that
> > >way.
> > >
> > >Regards
> > >
> > >Mal Haak
>
At 2025-12-16 19:55:27, "Mal Haak" <malcolm@haak.id.au> wrote:
>On Tue, 16 Dec 2025 17:09:18 +1000
>Mal Haak <malcolm@haak.id.au> wrote:
>
<SNIP>
>
>I did just give it a blast.
>
>Unfortunately it returned exactly what I expected, that is, the calls
>are all coming from netfs.
>
>Which makes sense for cephfs.
>
># sort -g /proc/allocinfo|tail|numfmt --to=iec
> 10M 2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
> 12M 3001 mm/execmem.c:41 func:execmem_vmalloc
> 12M 3605 kernel/fork.c:311 func:alloc_thread_stack_node
> 16M 992 mm/slub.c:3061 func:alloc_slab_page
> 20M 35544 lib/xarray.c:378 func:xas_alloc
> 31M 7704 mm/memory.c:1192 func:folio_prealloc
> 69M 17562 mm/memory.c:1190 func:folio_prealloc
> 104M 8212 mm/slub.c:3059 func:alloc_slab_page
> 124M 30075 mm/readahead.c:189 func:ractl_alloc_folio
> 2.6G 661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
>
>So, unfortunately it doesn't reveal the true source. But it was worth a
>shot! So thanks again

Oh, at least cephfs could be ruled out, right?

CC netfs folks then. :)

>
>Mal
>
<SNIP>
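
(A follow-up thought: if it does come back around to the filesystem, the
same trick from the earlier patch could be pushed one level up, so the
allocation tag is charged at netfs_write_begin's call sites rather than
inside netfs. A rough, untested sketch on top of the
__filemap_get_folio_noprof change; the real prototype lives in
include/linux/netfs.h and may differ:)

/* Sketch only, assuming the __filemap_get_folio_noprof patch above is
 * applied: rename the implementation in fs/netfs/buffered_read.c to
 * netfs_write_begin_noprof(), have it call __filemap_get_folio_noprof()
 * internally so the inner tag does not override the outer one, export
 * the _noprof symbol, and re-introduce the old name as a macro so each
 * caller (e.g. fs/ceph) gets its own line in /proc/allocinfo. */
#include <linux/alloc_tag.h>

#define netfs_write_begin(...) \
	alloc_hooks(netfs_write_begin_noprof(__VA_ARGS__))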
At 2025-12-16 20:18:11, "David Wang" <00107082@163.com> wrote:
>
>At 2025-12-16 19:55:27, "Mal Haak" <malcolm@haak.id.au> wrote:
>>On Tue, 16 Dec 2025 17:09:18 +1000
>>Mal Haak <malcolm@haak.id.au> wrote:
>>
>>> On Tue, 16 Dec 2025 15:00:43 +0800 (CST)
>>> "David Wang" <00107082@163.com> wrote:
>>>
>>> > At 2025-12-16 09:26:47, "Mal Haak" <malcolm@haak.id.au> wrote:
>>> > >On Mon, 15 Dec 2025 19:42:56 +0000
>>> > >Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>>> > >
>>> > >> Hi Mal,
>>> > >>
>>> > ><SNIP>
>>> > >>
>>> > >> Thanks a lot for reporting the issue. Finally, I can see the
>>> > >> discussion in the email list. :) Are you working on the patch
>>> > >> with the fix? Should we wait for the fix, or do I need to start
>>> > >> the issue reproduction and investigation? I am simply trying to
>>> > >> avoid patch collisions and, also, I have multiple other issues
>>> > >> to fix in the CephFS kernel client. :)
>>> > >>
>>> > >> Thanks,
>>> > >> Slava.
>>> > >
>>> > >Hello,
>>> > >
>>> > >Unfortunately creating a patch is just outside my comfort zone;
>>> > >I've lived too long in Lustre land.
>>> >
>>> > Hi, just out of curiosity, have you narrowed down the caller of
>>> > __filemap_get_folio causing the memory problem? Or do you have
>>> > trouble applying the debug patch for memory allocation profiling?
>>> >
>>> > David
>>> >
>>> Hi David,
>>>
>>> I hadn't yet, as I was testing XFS and NFS to see if they replicated
>>> the behaviour, and they did not.
>>>
>>> But actually this could speed things up considerably. I will do that
>>> now and see what I get.
>>>
>>> Thanks
>>>
>>> Mal
>>>
>>I did just give it a blast.
>>
>>Unfortunately it returned exactly what I expected: the calls are all
>>coming from netfs.
>>
>>Which makes sense for cephfs.
>>
>># sort -g /proc/allocinfo|tail|numfmt --to=iec
>>  10M     2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
>>  12M     3001 mm/execmem.c:41 func:execmem_vmalloc
>>  12M     3605 kernel/fork.c:311 func:alloc_thread_stack_node
>>  16M      992 mm/slub.c:3061 func:alloc_slab_page
>>  20M    35544 lib/xarray.c:378 func:xas_alloc
>>  31M     7704 mm/memory.c:1192 func:folio_prealloc
>>  69M    17562 mm/memory.c:1190 func:folio_prealloc
>> 104M     8212 mm/slub.c:3059 func:alloc_slab_page
>> 124M    30075 mm/readahead.c:189 func:ractl_alloc_folio
>> 2.6G   661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
>>
>>So unfortunately it doesn't reveal the true source. But it was worth a
>>shot! So thanks again
>
>Oh, at least cephfs could be ruled out, right?

ehh...., I think I could be wrong about this.....

>
>CC netfs folks then. :)
>
>
>>
>>Mal
>>
>>
>>> > >
>>> > ><SNIP>
>>> > >
>>> > >Regards
>>> > >
>>> > >Mal Haak
>>>
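A note on reading the output above: memory allocation profiling attributes
each folio to the code location that called the annotated allocator, and in
6.15+ kernels the CephFS client's buffered-write path goes through the netfs
helpers, so a leak driven by cephfs is still expected to show up under
fs/netfs/buffered_read.c func:netfs_write_begin rather than under fs/ceph.
The result therefore narrows the leak to the netfs write-begin path without
ruling cephfs out. To track that single site over time, a small userspace
helper along the following lines can be used; this is only a sketch (a
hypothetical watch_allocinfo.c, not an existing tool), and it assumes the
/proc/allocinfo line format shown in the output above.

/*
 * watch_allocinfo.c: hypothetical helper, sketched for illustration only.
 * Assumes the /proc/allocinfo line format shown above:
 *   "<size> <calls> <file>:<line> [module] func:<name>"
 * Prints every line matching the given substring, re-sampling periodically.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *needle = (argc > 1) ? argv[1] : "netfs_write_begin";
	char line[512];

	for (;;) {
		FILE *f = fopen("/proc/allocinfo", "r");

		if (!f) {
			perror("/proc/allocinfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (strstr(line, needle))
				fputs(line, stdout);	/* matching site */
		fclose(f);
		fflush(stdout);
		sleep(10);	/* sample interval */
	}
}

Running it while the rsync or download workload is active, and again after
echo 3 > /proc/sys/vm/drop_caches, shows whether the size accounted to
netfs_write_begin keeps growing and survives the cache drop.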
Hi Mal,

On Tue, 2025-12-16 at 20:42 +0800, David Wang wrote:
> At 2025-12-16 20:18:11, "David Wang" <00107082@163.com> wrote:
> >
> 
> <skipped>
> 
> > > > > > <SNIP>
> 

Could you please add your CephFS kernel client's mount options into
the ticket [1]?

Thanks a lot,
Slava.

[1] https://tracker.ceph.com/issues/74156
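The mount options Slava asks for can be read directly from
/proc/self/mounts. The snippet below is a small illustrative sketch
(hypothetical helper, not an existing tool) that prints every ceph-type
mount together with its options via the standard getmntent() interface;
the same information can of course be obtained by simply looking at
/proc/self/mounts.

/*
 * Illustrative sketch: list ceph mounts and their options, i.e. the
 * information requested for the tracker ticket.
 */
#include <mntent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *mounts = setmntent("/proc/self/mounts", "r");
	struct mntent *m;

	if (!mounts) {
		perror("/proc/self/mounts");
		return 1;
	}
	while ((m = getmntent(mounts)) != NULL) {
		if (strcmp(m->mnt_type, "ceph") == 0)	/* kernel cephfs mounts */
			printf("%s on %s type %s (%s)\n",
			       m->mnt_fsname, m->mnt_dir,
			       m->mnt_type, m->mnt_opts);
	}
	endmntent(mounts);
	return 0;
}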
On Wed, 17 Dec 2025 01:56:52 +0000
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Hi Mal,
> 
> On Tue, 2025-12-16 at 20:42 +0800, David Wang wrote:
> > At 2025-12-16 20:18:11, "David Wang" <00107082@163.com> wrote:
> > >
> > 
> > <skipped>
> > 
> > > > > > > <SNIP>
> 
> Could you please add your CephFS kernel client's mount options into
> the ticket [1]?
> 
> Thanks a lot,
> Slava.
> 
> [1] https://tracker.ceph.com/issues/74156

I've updated the ticket.

I am curious about the differences between your test setup and my
actual setup in terms of capacity and hardware. I can provide crash
dumps if it is helpful.

Thanks

Mal
On Tue, 2025-12-16 at 11:26 +1000, Mal Haak wrote:
> On Mon, 15 Dec 2025 19:42:56 +0000
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Hi Mal,
> > 
> <SNIP>
> > 
> > Thanks a lot for reporting the issue. Finally, I can see the
> > discussion in the email list. :) Are you working on the patch with
> > the fix? Should we wait for the fix, or do I need to start the issue
> > reproduction and investigation? I am simply trying to avoid patch
> > collisions and, also, I have multiple other issues to fix in the
> > CephFS kernel client. :)
> > 
> > Thanks,
> > Slava.
> 
> Hello,
> 
> Unfortunately creating a patch is just outside my comfort zone; I've
> lived too long in Lustre land.
> 
> I have been trying to narrow down a consistent reproducer that's as
> fast as my production workload (it crashes a 32GB VM in 2hrs), and I
> haven't got one quite as fast. I think the dd workload is too well
> behaved.
> 
> I can confirm the issue appeared in the major patch set that was
> applied as part of the 6.15 kernel, i.e. during the more complete
> pages-to-folios switch, and that nothing has changed in the bug
> behaviour since then. I did have a look at all the diffs from 6.14 to
> 6.18 on addr.c and didn't see any changes post 6.15 that looked like
> they would affect the bug behaviour.
> 
> Again, I'm not super familiar with the CephFS code, but to hazard a
> guess, the fact that the web download workload triggers things faster
> suggests that unaligned writes might make things worse. But again, I'm
> not 100% sure. I can't find a reproducer as fast as downloading a
> dataset. Rsync of lots and lots of tiny files is a tad faster than the
> dd case.
> 
> I did see some changes in ceph_check_page_before_write, where the
> previous code unlocked pages and then continued, whereas the changed
> folio code just returns ENODATA and doesn't unlock anything, with most
> of the rest of the logic unchanged. This might be perfectly fine, but
> in my, admittedly limited, reading of the code I couldn't figure out
> where anything that was locked prior to this being called would get
> unlocked like it did prior to the change. Again, I could be miles off
> here, and one of the bulk reclaim/unlock passes that was added might
> be cleaning this up correctly, or some other functional change might
> take care of it, but it looks to be potentially in the code path I'm
> exercising and it has had some unlock logic changed.
> 
> I've spent most of my time trying to find a solid quick reproducer.
> Not that it takes long to start leaking folios, but I wanted something
> that aggressively triggered it so a small VM would OOM quickly and,
> when combined with crash_on_oom, it could potentially be used for
> regression testing by way of "did the VM crash?".
> 
> I'm not sure if it will help much, but I'll provide what details I can
> about the actual workload that really sets it off. It's a Python-based
> tool for downloading datasets. Datasets are split into N chunks and
> the tool downloads them in parallel, 100 at a time, until all N chunks
> are downloaded. The compressed dataset is then unpacked and
> reassembled for use with workloads.
> 
> This replicates a common home-folder use case in HPC. CephFS is very
> attractive for home folders due to its "NFS-like" utility and
> performance, and many tools use a similar method for fetching large
> datasets. Tools are frequently written in Python or Go.
> 
> None of my customers have hit this yet, nor have any enterprise
> customers, as none have moved to a new enough kernel yet due to slow
> upgrade cycles. Even Proxmox have only just started testing on a
> kernel version > 6.14.
> 
> I'm more than happy to help however I can with testing. I can run
> instrumented kernels or test patches or whatever you need. I am sorry
> I haven't been able to produce a super clean, fast reproducer (my test
> cluster at home is all spinners and only 500TB usable). But I figured
> I needed to get the word out asap, as distros and soon customers are
> going to be moving past 6.12-6.14 kernels as the 5-7 year update cycle
> marches on, especially those wanting to take full advantage of CacheFS
> and encryption functionality.
> 
> Again, thanks for looking at this, and do reach out if I can help in
> any way. I am in the ceph slack if it's faster to reach out that way.
> 

Thanks a lot for all of your efforts. I hope it will help a lot.

Let me start to reproduce the issue. I'll let you know if I need
additional details. I'll share my progress and potential troubles in
the ticket that you've created.

Thanks,
Slava.
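To make the suspected unlock imbalance easier to follow, here is a minimal
kernel-style sketch of the pattern Mal describes. The function names are
hypothetical and this is not the actual ceph_check_page_before_write code;
it only illustrates that if a check helper returns -ENODATA while the caller
still holds the folio locked and referenced, and the early-return path omits
the folio_unlock()/folio_put() pair, the folio stays pinned in the page
cache where neither reclaim nor drop_caches can free it. That failure mode
is consistent with the drop_caches-resistant growth of kernel dynamic memory
reported earlier in the thread.

/*
 * Illustrative sketch only: hypothetical helpers, not the actual ceph or
 * netfs code. It shows the lock/unlock pattern being discussed.
 */
#include <linux/errno.h>
#include <linux/pagemap.h>

/*
 * Hypothetical check helper mirroring "return ENODATA and don't unlock":
 * the folio is still locked and referenced when this returns an error.
 */
static int check_folio_before_write(struct folio *folio)
{
	if (!folio_test_dirty(folio))
		return -ENODATA;
	return 0;
}

static void write_one_folio(struct address_space *mapping, pgoff_t index)
{
	struct folio *folio = __filemap_get_folio(mapping, index,
						  FGP_LOCK | FGP_CREAT,
						  mapping_gfp_mask(mapping));
	if (IS_ERR(folio))
		return;

	if (check_folio_before_write(folio) < 0) {
		/*
		 * The old page-based code unlocked here and continued. If a
		 * folio conversion drops this unlock/put pair, the folio
		 * stays locked with an extra reference, so reclaim and
		 * drop_caches can never evict it and the page cache grows
		 * without bound under the workload.
		 */
		folio_unlock(folio);
		folio_put(folio);
		return;
	}

	/* ... start writeback; unlock and put when the write completes ... */
	folio_unlock(folio);
	folio_put(folio);
}

Whether the real 6.15+ code actually misses such a pair on some path is
exactly what needs to be confirmed against fs/ceph/addr.c and the netfs
writeback helpers; the sketch only shows why a single missed unlock on a
common early return would produce the observed symptom.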