On some large machines with a high number of CPUs running a 64K kernel,
we found that the 'RES' field displayed by the top command is always 0
for some processes, which causes a lot of confusion for users.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite large
on these machines, so a significant amount of the count is cached per-CPU,
ever since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter") converted mm's rss stats into percpu_counter. Intuitively,
the batch size should be tuned down, but on some paths performance may take
precedence over statistical accuracy. Therefore, introduce a new interface
that sums up the percpu counts and display that to users, which removes the
confusion. In addition, this change is not expected to be on a
performance-critical path, so the modification should be acceptable.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
fs/proc/task_mmu.c | 14 +++++++-------
include/linux/mm.h | 5 +++++
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f629e6526935 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
unsigned long text, lib, swap, anon, file, shmem;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
- anon = get_mm_counter(mm, MM_ANONPAGES);
- file = get_mm_counter(mm, MM_FILEPAGES);
- shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+ anon = get_mm_counter_sum(mm, MM_ANONPAGES);
+ file = get_mm_counter_sum(mm, MM_FILEPAGES);
+ shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
text = min(text, mm->exec_vm << PAGE_SHIFT);
lib = (mm->exec_vm << PAGE_SHIFT) - text;
- swap = get_mm_counter(mm, MM_SWAPENTS);
+ swap = get_mm_counter_sum(mm, MM_SWAPENTS);
SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
unsigned long *shared, unsigned long *text,
unsigned long *data, unsigned long *resident)
{
- *shared = get_mm_counter(mm, MM_FILEPAGES) +
- get_mm_counter(mm, MM_SHMEMPAGES);
+ *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
+ get_mm_counter_sum(mm, MM_SHMEMPAGES);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->data_vm + mm->stack_vm;
- *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+ *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
return mm->total_vm;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 185424858f23..15ec5cfe9515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
return percpu_counter_read_positive(&mm->rss_stat[member]);
}
+static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
+{
+ return percpu_counter_sum_positive(&mm->rss_stat[member]);
+}
+
void mm_trace_rss_stat(struct mm_struct *mm, int member);
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
--
2.43.5
On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
What does 64K kernel mean?
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
On Fri, 23 May 2025 10:20:29 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
> > On some large machines with a high number of CPUs running a 64K kernel,
>
> What does 64K kernel means?
>
> > we found that the 'RES' field is always 0 displayed by the top command
> > for some processes, which will cause a lot of confusion for users.
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> > 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
> >
> > The main reason is that the batch size of the percpu counter is quite large
> > on these machines, caching a significant percpu value, since converting mm's
> > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > stats into percpu_counter").
Forgot to ask this, sorry. Should we add a Fixes: tag and Cc stable@?
Thanks,
SJ
[...]
On 2025/5/24 01:23, SeongJae Park wrote:
> On Fri, 23 May 2025 10:20:29 -0700 SeongJae Park <sj@kernel.org> wrote:
>
>> On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>
>>> On some large machines with a high number of CPUs running a 64K kernel,
>>
>> What does 64K kernel means?
Sorry for not being clear. I mean a 64K pagesize kernel on Arm servers.
>>> we found that the 'RES' field is always 0 displayed by the top command
>>> for some processes, which will cause a lot of confusion for users.
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>>
>>> The main reason is that the batch size of the percpu counter is quite large
>>> on these machines, caching a significant percpu value, since converting mm's
>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>> stats into percpu_counter").
>
> Forgot asking this, sorry. Should we add Fixes: tag and Cc stable@?
Yes, will add the Fixes tag in next version. Thanks for reviewing.
CC Mathieu
On Fri, May 23, 2025 at 11:16:13AM +0800, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Hi Baolin, this seems reasonable. For the long term, Mathieu is planning
to fix this with a newer hierarchical percpu counter; until then this
looks good.
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
On 2025/5/23 22:11, Shakeel Butt wrote:
> CC Mathieu
>
> On Fri, May 23, 2025 at 11:16:13AM +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Hi Baolin, this seems reasonale. For long term Mathieu is planning to
> fix this with newer hierarchical percpu counter until then this looks
> good.
OK. Good.
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Thanks.
On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> fs/proc/task_mmu.c | 14 +++++++-------
> include/linux/mm.h | 5 +++++
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f629e6526935 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> unsigned long text, lib, swap, anon, file, shmem;
> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
> - anon = get_mm_counter(mm, MM_ANONPAGES);
> - file = get_mm_counter(mm, MM_FILEPAGES);
> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
> + file = get_mm_counter_sum(mm, MM_FILEPAGES);
> + shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>
> /*
> * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> text = min(text, mm->exec_vm << PAGE_SHIFT);
> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>
> - swap = get_mm_counter(mm, MM_SWAPENTS);
> + swap = get_mm_counter_sum(mm, MM_SWAPENTS);
> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *shared, unsigned long *text,
> unsigned long *data, unsigned long *resident)
> {
> - *shared = get_mm_counter(mm, MM_FILEPAGES) +
> - get_mm_counter(mm, MM_SHMEMPAGES);
> + *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> + get_mm_counter_sum(mm, MM_SHMEMPAGES);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->data_vm + mm->stack_vm;
> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> + *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
> return mm->total_vm;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 185424858f23..15ec5cfe9515 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> return percpu_counter_read_positive(&mm->rss_stat[member]);
> }
>
> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> +{
> + return percpu_counter_sum_positive(&mm->rss_stat[member]);
> +}
> +
> void mm_trace_rss_stat(struct mm_struct *mm, int member);
>
> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
Hi Baolin,
This patch looks good to me. We observed a similar issue where the
generic mm selftest split_huge_page_test failed due to outdated RssAnon
values reported in /proc/[pid]/status.
...
Without Patch:
# ./split_huge_page_test
TAP version 13
1..34
Bail out! No RssAnon is allocated before split
# Planned tests != run tests (34 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
...
With Patch:
# ./split_huge_page_test
TAP version 13
1..34
...
# Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0
...
While this change may introduce some lock contention, it only affects
the task_mem function which is invoked only when reading
/proc/[pid]/status. Since this is not on a performance critical path,
it will be good to have this change in order to get accurate memory
stats.
This fix resolves the issue we've seen with split_huge_page_test.
Thanks!
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
On Fri 23-05-25 15:44:37, Aboorva Devarajan wrote:
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.
This particular function might not be performance critical, but you are
exposing a lock contention to userspace that could be abused, causing
contention controlled by an unprivileged user. I do not think we want
that without any control. Or is the pcp lock not really affecting any
actual kernel code path? So while precision is nice, it should be
weighed against potential side effects.
--
Michal Hocko
SUSE Labs
On 2025/5/23 18:14, Aboorva Devarajan wrote:
> On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> fs/proc/task_mmu.c | 14 +++++++-------
>> include/linux/mm.h | 5 +++++
>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> unsigned long text, lib, swap, anon, file, shmem;
>> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>>
>> - anon = get_mm_counter(mm, MM_ANONPAGES);
>> - file = get_mm_counter(mm, MM_FILEPAGES);
>> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
>> + file = get_mm_counter_sum(mm, MM_FILEPAGES);
>> + shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>>
>> /*
>> * Note: to minimize their overhead, mm maintains hiwater_vm and
>> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> text = min(text, mm->exec_vm << PAGE_SHIFT);
>> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>>
>> - swap = get_mm_counter(mm, MM_SWAPENTS);
>> + swap = get_mm_counter_sum(mm, MM_SWAPENTS);
>> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
>> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
>> unsigned long *shared, unsigned long *text,
>> unsigned long *data, unsigned long *resident)
>> {
>> - *shared = get_mm_counter(mm, MM_FILEPAGES) +
>> - get_mm_counter(mm, MM_SHMEMPAGES);
>> + *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
>> + get_mm_counter_sum(mm, MM_SHMEMPAGES);
>> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> >> PAGE_SHIFT;
>> *data = mm->data_vm + mm->stack_vm;
>> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
>> + *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
>> return mm->total_vm;
>> }
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 185424858f23..15ec5cfe9515 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>> return percpu_counter_read_positive(&mm->rss_stat[member]);
>> }
>>
>> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
>> +{
>> + return percpu_counter_sum_positive(&mm->rss_stat[member]);
>> +}
>> +
>> void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>
>> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>
> Hi Baolin,
>
> This patch looks good to me. We observed a similar issue where the
> generic mm selftest split_huge_page_test failed due to outdated RssAnon
> values reported in /proc/[pid]/status.
>
> ...
>
> Without Patch:
>
> # ./split_huge_page_test
> TAP version 13
> 1..34
> Bail out! No RssAnon is allocated before split
> # Planned tests != run tests (34 != 0)
> # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> ...
>
> With Patch:
>
> # ./split_huge_page_test
> # ./split_huge_page_test
> TAP version 13
> 1..34
> ...
> # Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0
>
> ...
>
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.
Agree.
>
> This fix resolves the issue we've seen with split_huge_page_test.
>
> Thanks!
>
>
> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>
Thanks for reviewing and testing.
On 5/23/25 8:46 AM, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> fs/proc/task_mmu.c | 14 +++++++-------
> include/linux/mm.h | 5 +++++
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f629e6526935 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> unsigned long text, lib, swap, anon, file, shmem;
> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
> - anon = get_mm_counter(mm, MM_ANONPAGES);
> - file = get_mm_counter(mm, MM_FILEPAGES);
> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
Hi Baolin Wang,
We also observed the same issue where the RSS value in /proc/PID/status
was 0 on machines with a high number of CPUs. With this patch, the issue
got fixed.
Rss value without this patch
----------------------------
# cat /proc/87406/status
.....
VmRSS: 0 kB
RssAnon: 0 kB
RssFile: 0 kB
Rss values with this patch
--------------------------
# cat /proc/3055/status
VmRSS: 2176 kB
RssAnon: 512 kB
RssFile: 1664 kB
RssShmem: 0 kB
Tested-by: Donet Tom <donettom@linux.ibm.com>
> + file = get_mm_counter_sum(mm, MM_FILEPAGES);
> + shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>
> /*
> * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> text = min(text, mm->exec_vm << PAGE_SHIFT);
> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>
> - swap = get_mm_counter(mm, MM_SWAPENTS);
> + swap = get_mm_counter_sum(mm, MM_SWAPENTS);
> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *shared, unsigned long *text,
> unsigned long *data, unsigned long *resident)
> {
> - *shared = get_mm_counter(mm, MM_FILEPAGES) +
> - get_mm_counter(mm, MM_SHMEMPAGES);
> + *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> + get_mm_counter_sum(mm, MM_SHMEMPAGES);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->data_vm + mm->stack_vm;
> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> + *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
> return mm->total_vm;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 185424858f23..15ec5cfe9515 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> return percpu_counter_read_positive(&mm->rss_stat[member]);
> }
>
> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> +{
> + return percpu_counter_sum_positive(&mm->rss_stat[member]);
> +}
> +
> void mm_trace_rss_stat(struct mm_struct *mm, int member);
>
> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
On 2025/5/23 13:25, Donet Tom wrote:
>
> On 5/23/25 8:46 AM, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> fs/proc/task_mmu.c | 14 +++++++-------
>> include/linux/mm.h | 5 +++++
>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> unsigned long text, lib, swap, anon, file, shmem;
>> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>> - anon = get_mm_counter(mm, MM_ANONPAGES);
>> - file = get_mm_counter(mm, MM_FILEPAGES);
>> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
>
>
> Hi Baolin Wang,
>
> We also observed the same issue where the RSS value in /proc/PID/status
> was 0 on machines with a high number of CPUs. With this patch, the issue
> got fixedl
Yes, we also observed this issue.
> Rss value without this patch
> ----------------------------
> # cat /proc/87406/status
> .....
> VmRSS: 0 kB
> RssAnon: 0 kB
> RssFile: 0 k
>
>
> Rss values with this patch
> --------------------------
> # cat /proc/3055/status
> VmRSS: 2176 kB
> RssAnon: 512 kB
> RssFile: 1664 kB
> RssShmem: 0 kB
>
> Tested-by: Donet Tom <donettom@linux.ibm.com>
Thanks for testing.