On some large machines with a high number of CPUs running a 64K kernel,
we found that the 'RES' field displayed by the top command is always 0
for some processes, which causes a lot of confusion for users.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite large
on these machines, so a significant amount of the count is cached per-CPU,
ever since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter") converted mm's rss stats into percpu_counter. Intuitively,
the batch size should be tuned down, but on some paths performance may take
precedence over statistical accuracy. Therefore, introduce a new interface
that sums up the percpu counts and display that to users, which removes the
confusion. In addition, this change is not expected to be on a
performance-critical path, so the modification should be acceptable.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
fs/proc/task_mmu.c | 14 +++++++-------
include/linux/mm.h | 5 +++++
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f629e6526935 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
unsigned long text, lib, swap, anon, file, shmem;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
- anon = get_mm_counter(mm, MM_ANONPAGES);
- file = get_mm_counter(mm, MM_FILEPAGES);
- shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+ anon = get_mm_counter_sum(mm, MM_ANONPAGES);
+ file = get_mm_counter_sum(mm, MM_FILEPAGES);
+ shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
text = min(text, mm->exec_vm << PAGE_SHIFT);
lib = (mm->exec_vm << PAGE_SHIFT) - text;
- swap = get_mm_counter(mm, MM_SWAPENTS);
+ swap = get_mm_counter_sum(mm, MM_SWAPENTS);
SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
unsigned long *shared, unsigned long *text,
unsigned long *data, unsigned long *resident)
{
- *shared = get_mm_counter(mm, MM_FILEPAGES) +
- get_mm_counter(mm, MM_SHMEMPAGES);
+ *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
+ get_mm_counter_sum(mm, MM_SHMEMPAGES);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->data_vm + mm->stack_vm;
- *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+ *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
return mm->total_vm;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 185424858f23..15ec5cfe9515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
return percpu_counter_read_positive(&mm->rss_stat[member]);
}
+static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
+{
+ return percpu_counter_sum_positive(&mm->rss_stat[member]);
+}
+
void mm_trace_rss_stat(struct mm_struct *mm, int member);
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
--
2.43.5
On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
What does 64K kernel mean?
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
On Fri, 23 May 2025 10:20:29 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
> > On some large machines with a high number of CPUs running a 64K kernel,
>
> What does 64K kernel means?
>
> > we found that the 'RES' field is always 0 displayed by the top command
> > for some processes, which will cause a lot of confusion for users.
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> > 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
> >
> > The main reason is that the batch size of the percpu counter is quite large
> > on these machines, caching a significant percpu value, since converting mm's
> > rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> > stats into percpu_counter").
Forgot to ask this, sorry. Should we add a Fixes: tag and Cc stable@?
Thanks,
SJ
[...]
On 2025/5/24 01:23, SeongJae Park wrote:
> On Fri, 23 May 2025 10:20:29 -0700 SeongJae Park <sj@kernel.org> wrote:
>
>> On Fri, 23 May 2025 11:16:13 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>>
>>> On some large machines with a high number of CPUs running a 64K kernel,
>>
>> What does 64K kernel means?
Sorry for not being clear. I mean a 64K pagesize kernel on Arm servers.
>>> we found that the 'RES' field is always 0 displayed by the top command
>>> for some processes, which will cause a lot of confusion for users.
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>>
>>> The main reason is that the batch size of the percpu counter is quite large
>>> on these machines, caching a significant percpu value, since converting mm's
>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>> stats into percpu_counter").
>
> Forgot asking this, sorry. Should we add Fixes: tag and Cc stable@?
Yes, will add the Fixes tag in next version. Thanks for reviewing.
CC Mathieu
On Fri, May 23, 2025 at 11:16:13AM +0800, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Hi Baolin, this seems reasonable. For the long term, Mathieu is planning
to fix this with a newer hierarchical percpu counter; until then this
looks good.
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
On 2025/5/23 22:11, Shakeel Butt wrote:
> CC Mathieu
>
> On Fri, May 23, 2025 at 11:16:13AM +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>
> Hi Baolin, this seems reasonale. For long term Mathieu is planning to
> fix this with newer hierarchical percpu counter until then this looks
> good.
OK. Good.
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Thanks.
On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> fs/proc/task_mmu.c | 14 +++++++-------
> include/linux/mm.h | 5 +++++
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f629e6526935 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> unsigned long text, lib, swap, anon, file, shmem;
> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
> - anon = get_mm_counter(mm, MM_ANONPAGES);
> - file = get_mm_counter(mm, MM_FILEPAGES);
> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
> + file = get_mm_counter_sum(mm, MM_FILEPAGES);
> + shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>
> /*
> * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> text = min(text, mm->exec_vm << PAGE_SHIFT);
> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>
> - swap = get_mm_counter(mm, MM_SWAPENTS);
> + swap = get_mm_counter_sum(mm, MM_SWAPENTS);
> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *shared, unsigned long *text,
> unsigned long *data, unsigned long *resident)
> {
> - *shared = get_mm_counter(mm, MM_FILEPAGES) +
> - get_mm_counter(mm, MM_SHMEMPAGES);
> + *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> + get_mm_counter_sum(mm, MM_SHMEMPAGES);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->data_vm + mm->stack_vm;
> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> + *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
> return mm->total_vm;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 185424858f23..15ec5cfe9515 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> return percpu_counter_read_positive(&mm->rss_stat[member]);
> }
>
> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> +{
> + return percpu_counter_sum_positive(&mm->rss_stat[member]);
> +}
> +
> void mm_trace_rss_stat(struct mm_struct *mm, int member);
>
> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
Hi Baolin,
This patch looks good to me. We observed a similar issue where the
generic mm selftest split_huge_page_test failed due to outdated RssAnon
values reported in /proc/[pid]/status.
...
Without Patch:
# ./split_huge_page_test
TAP version 13
1..34
Bail out! No RssAnon is allocated before split
# Planned tests != run tests (34 != 0)
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
...
With Patch:
# ./split_huge_page_test
TAP version 13
1..34
...
# Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0
...
While this change may introduce some lock contention, it only affects
the task_mem function which is invoked only when reading
/proc/[pid]/status. Since this is not on a performance critical path,
it will be good to have this change in order to get accurate memory
stats.
This fix resolves the issue we've seen with split_huge_page_test.
Thanks!
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
On Fri 23-05-25 15:44:37, Aboorva Devarajan wrote:
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.
This particular function might not be performance critical, but you are
exposing a lock contention to userspace that could be abused, causing
contention controlled by an unprivileged user. I do not think we want
that without any control. Or is the pcp lock not really affecting any
actual kernel code path? So while precision is nice, it should be
weighed against potential side effects.
--
Michal Hocko
SUSE Labs
On 2025/5/23 18:14, Aboorva Devarajan wrote:
> On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> fs/proc/task_mmu.c | 14 +++++++-------
>> include/linux/mm.h | 5 +++++
>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> unsigned long text, lib, swap, anon, file, shmem;
>> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>>
>> - anon = get_mm_counter(mm, MM_ANONPAGES);
>> - file = get_mm_counter(mm, MM_FILEPAGES);
>> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
>> + file = get_mm_counter_sum(mm, MM_FILEPAGES);
>> + shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>>
>> /*
>> * Note: to minimize their overhead, mm maintains hiwater_vm and
>> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> text = min(text, mm->exec_vm << PAGE_SHIFT);
>> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>>
>> - swap = get_mm_counter(mm, MM_SWAPENTS);
>> + swap = get_mm_counter_sum(mm, MM_SWAPENTS);
>> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
>> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
>> unsigned long *shared, unsigned long *text,
>> unsigned long *data, unsigned long *resident)
>> {
>> - *shared = get_mm_counter(mm, MM_FILEPAGES) +
>> - get_mm_counter(mm, MM_SHMEMPAGES);
>> + *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
>> + get_mm_counter_sum(mm, MM_SHMEMPAGES);
>> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> >> PAGE_SHIFT;
>> *data = mm->data_vm + mm->stack_vm;
>> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
>> + *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
>> return mm->total_vm;
>> }
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 185424858f23..15ec5cfe9515 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>> return percpu_counter_read_positive(&mm->rss_stat[member]);
>> }
>>
>> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
>> +{
>> + return percpu_counter_sum_positive(&mm->rss_stat[member]);
>> +}
>> +
>> void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>
>> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>
> Hi Baolin,
>
> This patch looks good to me. We observed a similar issue where the
> generic mm selftest split_huge_page_test failed due to outdated RssAnon
> values reported in /proc/[pid]/status.
>
> ...
>
> Without Patch:
>
> # ./split_huge_page_test
> TAP version 13
> 1..34
> Bail out! No RssAnon is allocated before split
> # Planned tests != run tests (34 != 0)
> # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> ...
>
> With Patch:
>
> # ./split_huge_page_test
> # ./split_huge_page_test
> TAP version 13
> 1..34
> ...
> # Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0
>
> ...
>
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.
Agree.
>
> This fix resolves the issue we've seen with split_huge_page_test.
>
> Thanks!
>
>
> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
>
Thanks for reviewing and testing.
On 5/23/25 8:46 AM, Baolin Wang wrote:
> On some large machines with a high number of CPUs running a 64K kernel,
> we found that the 'RES' field is always 0 displayed by the top command
> for some processes, which will cause a lot of confusion for users.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>
> The main reason is that the batch size of the percpu counter is quite large
> on these machines, caching a significant percpu value, since converting mm's
> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
> stats into percpu_counter"). Intuitively, the batch number should be optimized,
> but on some paths, performance may take precedence over statistical accuracy.
> Therefore, introducing a new interface to add the percpu statistical count
> and display it to users, which can remove the confusion. In addition, this
> change is not expected to be on a performance-critical path, so the modification
> should be acceptable.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> fs/proc/task_mmu.c | 14 +++++++-------
> include/linux/mm.h | 5 +++++
> 2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f629e6526935 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> unsigned long text, lib, swap, anon, file, shmem;
> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>
> - anon = get_mm_counter(mm, MM_ANONPAGES);
> - file = get_mm_counter(mm, MM_FILEPAGES);
> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
Hi Baolin Wang,
We also observed the same issue where the RSS value in /proc/PID/status
was 0 on machines with a high number of CPUs. With this patch, the issue
got fixed.
Rss value without this patch
----------------------------
# cat /proc/87406/status
.....
VmRSS: 0 kB
RssAnon: 0 kB
RssFile: 0 kB
Rss values with this patch
--------------------------
# cat /proc/3055/status
VmRSS: 2176 kB
RssAnon: 512 kB
RssFile: 1664 kB
RssShmem: 0 kB
Tested-by: Donet Tom <donettom@linux.ibm.com>
> + file = get_mm_counter_sum(mm, MM_FILEPAGES);
> + shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>
> /*
> * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> text = min(text, mm->exec_vm << PAGE_SHIFT);
> lib = (mm->exec_vm << PAGE_SHIFT) - text;
>
> - swap = get_mm_counter(mm, MM_SWAPENTS);
> + swap = get_mm_counter_sum(mm, MM_SWAPENTS);
> SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
> SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
> SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
> unsigned long *shared, unsigned long *text,
> unsigned long *data, unsigned long *resident)
> {
> - *shared = get_mm_counter(mm, MM_FILEPAGES) +
> - get_mm_counter(mm, MM_SHMEMPAGES);
> + *shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
> + get_mm_counter_sum(mm, MM_SHMEMPAGES);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->data_vm + mm->stack_vm;
> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> + *resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
> return mm->total_vm;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 185424858f23..15ec5cfe9515 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> return percpu_counter_read_positive(&mm->rss_stat[member]);
> }
>
> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
> +{
> + return percpu_counter_sum_positive(&mm->rss_stat[member]);
> +}
> +
> void mm_trace_rss_stat(struct mm_struct *mm, int member);
>
> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
On 2025/5/23 13:25, Donet Tom wrote:
>
> On 5/23/25 8:46 AM, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K kernel,
>> we found that the 'RES' field is always 0 displayed by the top command
>> for some processes, which will cause a lot of confusion for users.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
>> 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
>>
>> The main reason is that the batch size of the percpu counter is quite large
>> on these machines, caching a significant percpu value, since converting mm's
>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>> but on some paths, performance may take precedence over statistical accuracy.
>> Therefore, introducing a new interface to add the percpu statistical count
>> and display it to users, which can remove the confusion. In addition, this
>> change is not expected to be on a performance-critical path, so the modification
>> should be acceptable.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> fs/proc/task_mmu.c | 14 +++++++-------
>> include/linux/mm.h | 5 +++++
>> 2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>> unsigned long text, lib, swap, anon, file, shmem;
>> unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>> - anon = get_mm_counter(mm, MM_ANONPAGES);
>> - file = get_mm_counter(mm, MM_FILEPAGES);
>> - shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> + anon = get_mm_counter_sum(mm, MM_ANONPAGES);
>
>
> Hi Baolin Wang,
>
> We also observed the same issue where the RSS value in /proc/PID/status
> was 0 on machines with a high number of CPUs. With this patch, the issue
> got fixedl
Yes, we also observed this issue.
> Rss value without this patch
> ----------------------------
> # cat /proc/87406/status
> .....
> VmRSS: 0 kB
> RssAnon: 0 kB
> RssFile: 0 k
>
>
> Rss values with this patch
> --------------------------
> # cat /proc/3055/status
> VmRSS: 2176 kB
> RssAnon: 512 kB
> RssFile: 1664 kB
> RssShmem: 0 kB
>
> Tested-by: Donet Tom <donettom@linux.ibm.com>
Thanks for testing.