[PATCH v1 0/2] mm/show_mem: Bug fix for print mem alloc info

Yueyang Pan posted 2 patches 1 month ago
There is a newer version of this series
mm/show_mem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
[PATCH v1 0/2] mm/show_mem: Bug fix for print mem alloc info
Posted by Yueyang Pan 1 month ago
This patch set fixes two issues we saw in production rollout. 

The first issue is that we saw all zero output of memory allocation 
profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING 
is set and sysctl.vm.mem_profiling=0. In this case, the behaviour 
should be the same as when CONFIG_MEM_ALLOC_PROFILING is unset, 
where show_mem prints nothing about the information. This will make 
further parse easier as we don't have to differentiate what a all 
zero line actually means (Does it mean  0 bytes are allocated 
or simply memory allocation profiling is disabled).

The second issue is that multiple entities can call show_mem() 
which messed up the allocation info in dmesg. We saw outputs like this:  
```
    327 MiB    83635 mm/compaction.c:1880 func:compaction_alloc
   48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc
   7.48 GiB    10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd
    298 MiB    95216 kernel/fork.c:318 func:alloc_thread_stack_node
    250 MiB    63901 mm/zsmalloc.c:987 func:alloc_zspage
    1.42 GiB   372527 mm/memory.c:1063 func:folio_prealloc
    1.17 GiB    95693 mm/slub.c:2424 func:alloc_slab_page
     651 MiB   166732 mm/readahead.c:270 func:page_cache_ra_unbounded
     419 MiB   107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow
     404 MiB   103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one
```
The above example is because one kthread invokes show_mem() 
from __alloc_pages_slowpath while kernel itself calls 
oom_kill_process()

Yueyang Pan (2):
  mm/show_mem: No print when not mem_alloc_profiling_enabled()
  mm/show_mem: Add trylock while printing alloc info

 mm/show_mem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

-- 
2.47.3
Re: [PATCH v1 0/2] mm/show_mem: Bug fix for print mem alloc info
Posted by Vishal Moola (Oracle) 1 month ago
On Wed, Aug 27, 2025 at 11:34:21AM -0700, Yueyang Pan wrote:
> This patch set fixes two issues we saw in production rollout. 
> 
> The first issue is that we saw all zero output of memory allocation 
> profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING 
> is set and sysctl.vm.mem_profiling=0. In this case, the behaviour 
> should be the same as when CONFIG_MEM_ALLOC_PROFILING is unset, 

Did you mean to say when sysctl.vm.mem_profiling=never?

My understanding is that setting the sysctl=0 Pauses memory allocation
profiling, while 1 Resumes it. When the sysctl=never should be the same
as when the config is unset, but I suspect we might still want the info
when set to 0.

> where show_mem prints nothing about the information. This will make 
> further parse easier as we don't have to differentiate what a all 
> zero line actually means (Does it mean  0 bytes are allocated 
> or simply memory allocation profiling is disabled).
> 
> The second issue is that multiple entities can call show_mem() 
> which messed up the allocation info in dmesg. We saw outputs like this:  
> ```
>     327 MiB    83635 mm/compaction.c:1880 func:compaction_alloc
>    48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc
>    7.48 GiB    10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd
>     298 MiB    95216 kernel/fork.c:318 func:alloc_thread_stack_node
>     250 MiB    63901 mm/zsmalloc.c:987 func:alloc_zspage
>     1.42 GiB   372527 mm/memory.c:1063 func:folio_prealloc
>     1.17 GiB    95693 mm/slub.c:2424 func:alloc_slab_page
>      651 MiB   166732 mm/readahead.c:270 func:page_cache_ra_unbounded
>      419 MiB   107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow
>      404 MiB   103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one
> ```
> The above example is because one kthread invokes show_mem() 
> from __alloc_pages_slowpath while kernel itself calls 
> oom_kill_process()

I'm not familiar with show_mem(). Could you spell out what's wrong with
the output above?

> Yueyang Pan (2):
>   mm/show_mem: No print when not mem_alloc_profiling_enabled()
>   mm/show_mem: Add trylock while printing alloc info
> 
>  mm/show_mem.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> -- 
> 2.47.3
>
Re: [PATCH v1 0/2] mm/show_mem: Bug fix for print mem alloc info
Posted by Yueyang Pan 1 month ago
On Wed, Aug 27, 2025 at 12:51:17PM -0700, Vishal Moola (Oracle) wrote:
> On Wed, Aug 27, 2025 at 11:34:21AM -0700, Yueyang Pan wrote:
> > This patch set fixes two issues we saw in production rollout. 
> > 
> > The first issue is that we saw all zero output of memory allocation 
> > profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING 
> > is set and sysctl.vm.mem_profiling=0. In this case, the behaviour 
> > should be the same as when CONFIG_MEM_ALLOC_PROFILING is unset, 
> 
> Did you mean to say when sysctl.vm.mem_profiling=never?
> 
> My understanding is that setting the sysctl=0 Pauses memory allocation
> profiling, while 1 Resumes it. When the sysctl=never should be the same
> as when the config is unset, but I suspect we might still want the info
> when set to 0.

Thanks for your feedback Vishal. Here I mean for both =0 and =never. 
In both cases, now __show_mem() will print all 0s, which both is redundant 
and also makes differentiate hard. IMO when __show_mem() prints something 
the output should be useful at least. 

> 
> > where show_mem prints nothing about the information. This will make 
> > further parse easier as we don't have to differentiate what a all 
> > zero line actually means (Does it mean  0 bytes are allocated 
> > or simply memory allocation profiling is disabled).
> > 
> > The second issue is that multiple entities can call show_mem() 
> > which messed up the allocation info in dmesg. We saw outputs like this:  
> > ```
> >     327 MiB    83635 mm/compaction.c:1880 func:compaction_alloc
> >    48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc
> >    7.48 GiB    10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd
> >     298 MiB    95216 kernel/fork.c:318 func:alloc_thread_stack_node
> >     250 MiB    63901 mm/zsmalloc.c:987 func:alloc_zspage
> >     1.42 GiB   372527 mm/memory.c:1063 func:folio_prealloc
> >     1.17 GiB    95693 mm/slub.c:2424 func:alloc_slab_page
> >      651 MiB   166732 mm/readahead.c:270 func:page_cache_ra_unbounded
> >      419 MiB   107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow
> >      404 MiB   103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one
> > ```
> > The above example is because one kthread invokes show_mem() 
> > from __alloc_pages_slowpath while kernel itself calls 
> > oom_kill_process()
> 
> I'm not familiar with show_mem(). Could you spell out what's wrong with
> the output above?

So here in the normal case, the output should be sorted by size. Here 
two print happen at the same time so they interleave with each other, 
making further parse harder (need to sort again and dedup).

> 
> > Yueyang Pan (2):
> >   mm/show_mem: No print when not mem_alloc_profiling_enabled()
> >   mm/show_mem: Add trylock while printing alloc info
> > 
> >  mm/show_mem.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > -- 
> > 2.47.3
> > 

Thanks,
Pan
Re: [PATCH v1 0/2] mm/show_mem: Bug fix for print mem alloc info
Posted by Vishal Moola (Oracle) 1 month ago
On Thu, Aug 28, 2025 at 01:29:08AM -0700, Yueyang Pan wrote:
> On Wed, Aug 27, 2025 at 12:51:17PM -0700, Vishal Moola (Oracle) wrote:
> > On Wed, Aug 27, 2025 at 11:34:21AM -0700, Yueyang Pan wrote:
> > > This patch set fixes two issues we saw in production rollout. 
> > > 
> > > The first issue is that we saw all zero output of memory allocation 
> > > profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING 
> > > is set and sysctl.vm.mem_profiling=0. In this case, the behaviour 
> > > should be the same as when CONFIG_MEM_ALLOC_PROFILING is unset, 
> > 
> > Did you mean to say when sysctl.vm.mem_profiling=never?
> > 
> > My understanding is that setting the sysctl=0 Pauses memory allocation
> > profiling, while 1 Resumes it. When the sysctl=never should be the same
> > as when the config is unset, but I suspect we might still want the info
> > when set to 0.
> 
> Thanks for your feedback Vishal. Here I mean for both =0 and =never. 
> In both cases, now __show_mem() will print all 0s, which both is redundant 
> and also makes differentiate hard. IMO when __show_mem() prints something 
> the output should be useful at least. 

If differentiating between 0 allocations vs disabled is the primary
concern, I think prefacing the dump with the status of the tool is
better than treating =0 and =never as the same.

The way I see it, the {0,1,never} tristate offers a level of versatility
that I'm not sure we need to eliminate.

I'm thinking about cases where we may temporarily set =1 to track some
allocations, then back to =0 'pause' on that exact period of time. Memory
allocation profiling still has those allocations tracked while set to =0
(we can still see them in /proc/allocinfo at least). If a user decided to
do that just before an oom, could they see something useful from
show_mem() even when =0?

> > 
> > > where show_mem prints nothing about the information. This will make 
> > > further parse easier as we don't have to differentiate what a all 
> > > zero line actually means (Does it mean  0 bytes are allocated 
> > > or simply memory allocation profiling is disabled).
> > > 
> > > The second issue is that multiple entities can call show_mem() 
> > > which messed up the allocation info in dmesg. We saw outputs like this:  
> > > ```
> > >     327 MiB    83635 mm/compaction.c:1880 func:compaction_alloc
> > >    48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc
> > >    7.48 GiB    10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd
> > >     298 MiB    95216 kernel/fork.c:318 func:alloc_thread_stack_node
> > >     250 MiB    63901 mm/zsmalloc.c:987 func:alloc_zspage
> > >     1.42 GiB   372527 mm/memory.c:1063 func:folio_prealloc
> > >     1.17 GiB    95693 mm/slub.c:2424 func:alloc_slab_page
> > >      651 MiB   166732 mm/readahead.c:270 func:page_cache_ra_unbounded
> > >      419 MiB   107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow
> > >      404 MiB   103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one
> > > ```
> > > The above example is because one kthread invokes show_mem() 
> > > from __alloc_pages_slowpath while kernel itself calls 
> > > oom_kill_process()
> > 
> > I'm not familiar with show_mem(). Could you spell out what's wrong with
> > the output above?
> 
> So here in the normal case, the output should be sorted by size. Here 
> two print happen at the same time so they interleave with each other, 
> making further parse harder (need to sort again and dedup).

Gotcha.

> > 
> > > Yueyang Pan (2):
> > >   mm/show_mem: No print when not mem_alloc_profiling_enabled()
> > >   mm/show_mem: Add trylock while printing alloc info
> > > 
> > >  mm/show_mem.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > -- 
> > > 2.47.3
> > > 
> 
> Thanks,
> Pan
Re: [PATCH v1 0/2] mm/show_mem: Bug fix for print mem alloc info
Posted by Yueyang Pan 1 month ago
On Thu, Aug 28, 2025 at 10:05:18AM -0700, Vishal Moola (Oracle) wrote:
> On Thu, Aug 28, 2025 at 01:29:08AM -0700, Yueyang Pan wrote:
> > On Wed, Aug 27, 2025 at 12:51:17PM -0700, Vishal Moola (Oracle) wrote:
> > > On Wed, Aug 27, 2025 at 11:34:21AM -0700, Yueyang Pan wrote:
> > > > This patch set fixes two issues we saw in production rollout. 
> > > > 
> > > > The first issue is that we saw all zero output of memory allocation 
> > > > profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING 
> > > > is set and sysctl.vm.mem_profiling=0. In this case, the behaviour 
> > > > should be the same as when CONFIG_MEM_ALLOC_PROFILING is unset, 
> > > 
> > > Did you mean to say when sysctl.vm.mem_profiling=never?
> > > 
> > > My understanding is that setting the sysctl=0 Pauses memory allocation
> > > profiling, while 1 Resumes it. When the sysctl=never should be the same
> > > as when the config is unset, but I suspect we might still want the info
> > > when set to 0.
> > 
> > Thanks for your feedback Vishal. Here I mean for both =0 and =never. 
> > In both cases, now __show_mem() will print all 0s, which both is redundant 
> > and also makes differentiate hard. IMO when __show_mem() prints something 
> > the output should be useful at least. 
> 
> If differentiating between 0 allocations vs disabled is the primary
> concern, I think prefacing the dump with the status of the tool is
> better than treating =0 and =never as the same.
> 
> The way I see it, the {0,1,never} tristate offers a level of versatility
> that I'm not sure we need to eliminate.
> 
> I'm thinking about cases where we may temporarily set =1 to track some
> allocations, then back to =0 'pause' on that exact period of time. Memory
> allocation profiling still has those allocations tracked while set to =0
> (we can still see them in /proc/allocinfo at least). If a user decided to
> do that just before an oom, could they see something useful from
> show_mem() even when =0?

This is a good point. I agree with your suggestion about adding the state 
to print. I am still unsure about if we want to print it when =0. The first 
reason is that memory allocation profiler does not support runtime enabling 
now. We have to set it via boot cmdline. It will make more sense if we 
have this feature. Second is because memory allocation profiling is quite 
light-weighted, I would assume user really don't need this feature when 
they set =0.

The original reason why I tried to disable this is because in our production 
table we see a lot of 0Bs coming from the machines where the changes in boot 
cmdline have not been pushed to. If we have state info, we could possibly 
filter this info out before sending it to the table. So I agree upon adding 
the state to print.

Maybe others also have thoughts about this?

> 
> > > 
> > > > where show_mem prints nothing about the information. This will make 
> > > > further parse easier as we don't have to differentiate what a all 
> > > > zero line actually means (Does it mean  0 bytes are allocated 
> > > > or simply memory allocation profiling is disabled).
> > > > 
> > > > The second issue is that multiple entities can call show_mem() 
> > > > which messed up the allocation info in dmesg. We saw outputs like this:  
> > > > ```
> > > >     327 MiB    83635 mm/compaction.c:1880 func:compaction_alloc
> > > >    48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc
> > > >    7.48 GiB    10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd
> > > >     298 MiB    95216 kernel/fork.c:318 func:alloc_thread_stack_node
> > > >     250 MiB    63901 mm/zsmalloc.c:987 func:alloc_zspage
> > > >     1.42 GiB   372527 mm/memory.c:1063 func:folio_prealloc
> > > >     1.17 GiB    95693 mm/slub.c:2424 func:alloc_slab_page
> > > >      651 MiB   166732 mm/readahead.c:270 func:page_cache_ra_unbounded
> > > >      419 MiB   107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow
> > > >      404 MiB   103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one
> > > > ```
> > > > The above example is because one kthread invokes show_mem() 
> > > > from __alloc_pages_slowpath while kernel itself calls 
> > > > oom_kill_process()
> > > 
> > > I'm not familiar with show_mem(). Could you spell out what's wrong with
> > > the output above?
> > 
> > So here in the normal case, the output should be sorted by size. Here 
> > two print happen at the same time so they interleave with each other, 
> > making further parse harder (need to sort again and dedup).
> 
> Gotcha.
> 
> > > 
> > > > Yueyang Pan (2):
> > > >   mm/show_mem: No print when not mem_alloc_profiling_enabled()
> > > >   mm/show_mem: Add trylock while printing alloc info
> > > > 
> > > >  mm/show_mem.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > -- 
> > > > 2.47.3
> > > > 
> > 
> > Thanks,
> > Pan