mm: make pcp_decay_high working better with NOHZ full

[PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full

Posted by mengensun88@gmail.com 1 year, 3 months ago

From: MengEn Sun <mengensun@tencent.com>

When a cpu entring NOHZ full, quiet_vmstat may flush percpu
zonestats and nodestats.

The vmstat_shepherd only check percpu zonestats and nodestats
to determine whether it is necessary to fire vmstat_update on
the target cpu for now.

If a process on a certain CPU allocates a large amount of memory,
then frees that memory, and subsequently the CPU enters NOHZ, and
the process not freeing and allocating memory anymore,the
vmstat_update not being executed on the cpu. Because
vmstat_shepherd may not see zonestats and nodestats of the cpu
changed, so may resulting in vmstat_update on the cpu not fired
for a long time.

While, This seems to be fine:
- if freeing and allocating memory occur later, it may the
  high_max may be adjust automatically
- If memory is tight, the memory reclamation process will
  release the pcp

Whatever, we make vmstat_shepherd to checking whether we need
decay pcp high_max, and fire pcp_decay_high early if we need.

Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
Signed-off-by: MengEn Sun <mengensun@tencent.com>
---
changelog:
v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
v2: Make the commit message clearer by adding some comments.
---
 mm/vmstat.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1917c034c045..07b494b06872 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
 		struct per_cpu_nodestat *n;
 
+		/* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
+		 * entering NOHZ full, see quiet_vmstat. so, we check pcp
+		 * high_{min,max} to determine whether it is necessary to run
+		 * decay_pcp_high on the corresponding CPU
+		 */
+		if (pcp->high_max > pcp->high_min)
+			return true;
+
 		/*
 		 * The fast way of checking if there are any vmstat diffs.
 		 */
-- 
2.43.5

Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full

Posted by Huang, Ying 1 year, 3 months ago

Hi, Mengen,

mengensun88@gmail.com writes:

> From: MengEn Sun <mengensun@tencent.com>
>
> When a cpu entring NOHZ full, quiet_vmstat may flush percpu
> zonestats and nodestats.
>
> The vmstat_shepherd only check percpu zonestats and nodestats
> to determine whether it is necessary to fire vmstat_update on
> the target cpu for now.
>
> If a process on a certain CPU allocates a large amount of memory,
> then frees that memory, and subsequently the CPU enters NOHZ, and
> the process not freeing and allocating memory anymore,the
> vmstat_update not being executed on the cpu. Because
> vmstat_shepherd may not see zonestats and nodestats of the cpu
> changed, so may resulting in vmstat_update on the cpu not fired
> for a long time.

The issue description is too long, can you make it a little shorter?
And can you correct your grammar with some tool?  Something like chatgpt
is good for that, e.g., "fix the grammar of the following text: ...".
To make variable and function name distinct, I personally prefer to add
"()" after the function name.

Have verified the issue with some test?  If not, I suggest you to do
that.

> While, This seems to be fine:
> - if freeing and allocating memory occur later, it may the
>   high_max may be adjust automatically
> - If memory is tight, the memory reclamation process will
>   release the pcp

This could be a real issue for me.

> Whatever, we make vmstat_shepherd to checking whether we need
> decay pcp high_max, and fire pcp_decay_high early if we need.
>
> Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
> Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
> Signed-off-by: MengEn Sun <mengensun@tencent.com>
> ---
> changelog:
> v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
> v2: Make the commit message clearer by adding some comments.
> ---
>  mm/vmstat.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1917c034c045..07b494b06872 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
>  
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> +		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>  		struct per_cpu_nodestat *n;
>  
> +		/* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
> +		 * entering NOHZ full, see quiet_vmstat. so, we check pcp
> +		 * high_{min,max} to determine whether it is necessary to run
> +		 * decay_pcp_high on the corresponding CPU
> +		 */

Please follow the comments coding style.

                /*
                 * comments line 1
                 * comments line 2
                 */

> +		if (pcp->high_max > pcp->high_min)
> +			return true;
> +

We don't tune pcp->high_max/min in fact.  Instead, we tune pcp->high.
Your code may make need_update() return true in most cases.

>  		/*
>  		 * The fast way of checking if there are any vmstat diffs.
>  		 */

--
Best Regards,
Huang, Ying

Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full

Posted by MengEn Sun 1 year, 3 months ago

Thank you for your suggestion. I understand and am ready to make
some changes  

> 
> Have verified the issue with some test?  If not, I suggest you to do
> that.
> 

I have conducted tests:
Applying this patch or not does not have a significant impact on the test results.
perhaps my testing was not thorough enough. #^_^

But, the logic of the code is like following:
CPU0                              CPUx
----                              -----
                                  T0: vmstat_work is pending
T1: vmstat_shepherd
    check vmstat_work
    and do nothing
                                  T2: vmstat_work is in unpending state.

                                  T3: alloc many pages
                                  T4: free all the pages allocated at T3
                                  T5: entry NOHZ, flushing all zonestats
                                      and nodestats
T6: next vmstat_shepherd fired

In my opinion, there are indeed some issues. I'm not sure if there's
something I haven't understood?


By the way,
There are two other questions for me:
Q1:
Vmstat_work is a **deferreable work** So, It may be delayed for a long time
by NOHZ. As a result, "vmstat_update() may not be executed once every
second in the above scenario. Therefore, I'm not sure if using a deferrable
work to reduce pcp->high is appropriate. In my tests, if I don't use
deferrable work, it takes about a minute to reduce high to high_min, but
using deferrable work may take several minutes to reduce high to high_min.

Q2:
On a big machine, for example, with 1TB of memory, the default maximum
memory on PCP can be 1TB * 0.125.
This portion of memory is not accounted for in MemFree in /proc/meminfo.
Users can see this portion of memory from /proc/zoneinfo, but the memory
reported by the `free` command is reduced.
can we include the PCP memory in the MemFree statistic in /proc/meminfo?

> > While, This seems to be fine:
> > - if freeing and allocating memory occur later, it may the
> >   high_max may be adjust automatically
> > - If memory is tight, the memory reclamation process will
> >   release the pcp
> 
> This could be a real issue for me.

Thanks, I will test more carefully for those issue

> 
> > Whatever, we make vmstat_shepherd to checking whether we need
> > decay pcp high_max, and fire pcp_decay_high early if we need.
> >
> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
> > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
> > Signed-off-by: MengEn Sun <mengensun@tencent.com>
> > ---
> > changelog:
> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
> > v2: Make the commit message clearer by adding some comments.
> > ---
> >  mm/vmstat.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 1917c034c045..07b494b06872 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
> >  
> >  	for_each_populated_zone(zone) {
> >  		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> > +		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> >  		struct per_cpu_nodestat *n;
> >  
> > +		/* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
> > +		 * entering NOHZ full, see quiet_vmstat. so, we check pcp
> > +		 * high_{min,max} to determine whether it is necessary to run
> > +		 * decay_pcp_high on the corresponding CPU
> > +		 */
> 
> Please follow the comments coding style.
> 
>                 /*
>                  * comments line 1
>                  * comments line 2
>                  */
> 

Thank you for your suggestion. I understand and am ready to make
some changes

> > +		if (pcp->high_max > pcp->high_min)
> > +			return true;
> > +
> 
> We don't tune pcp->high_max/min in fact.  Instead, we tune pcp->high.
> Your code may make need_update() return true in most cases.

You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min?

> 
> >  		/*
> >  		 * The fast way of checking if there are any vmstat diffs.
> >  		 */
> 
> --
> Best Regards,
> Huang, Ying

Best Regards,
MengEn, Sun

Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full

Posted by Huang, Ying 1 year, 3 months ago

MengEn Sun <mengensun88@gmail.com> writes:

> Thank you for your suggestion. I understand and am ready to make
> some changes  
>
>> 
>> Have verified the issue with some test?  If not, I suggest you to do
>> that.
>> 
>
> I have conducted tests:
> Applying this patch or not does not have a significant impact on the
> test results.

I don't expect some measurable performance difference with the patch.
If we can observe that the PCP size isn't tuned down to high_min before
and is after, that should be a valid test result to show the value of
the patch.  Can you try that?  The PCP size can be observed via
/proc/zoneinfo.

> perhaps my testing was not thorough enough. #^_^
>
> But, the logic of the code is like following:
> CPU0                              CPUx
> ----                              -----
>                                   T0: vmstat_work is pending
> T1: vmstat_shepherd
>     check vmstat_work
>     and do nothing
>                                   T2: vmstat_work is in unpending state.
>
>                                   T3: alloc many pages
>                                   T4: free all the pages allocated at T3
>                                   T5: entry NOHZ, flushing all zonestats
>                                       and nodestats
> T6: next vmstat_shepherd fired
>
> In my opinion, there are indeed some issues. I'm not sure if there's
> something I haven't understood?
>
>
> By the way,
> There are two other questions for me:
> Q1:
> Vmstat_work is a **deferreable work** So, It may be delayed for a long time
> by NOHZ. As a result, "vmstat_update() may not be executed once every
> second in the above scenario. Therefore, I'm not sure if using a deferrable
> work to reduce pcp->high is appropriate. In my tests, if I don't use
> deferrable work, it takes about a minute to reduce high to high_min, but
> using deferrable work may take several minutes to reduce high to high_min.

It's not a big issue to take longer time to decay pcp->high.

> Q2:
> On a big machine, for example, with 1TB of memory, the default maximum
> memory on PCP can be 1TB * 0.125.
> This portion of memory is not accounted for in MemFree in /proc/meminfo.
> Users can see this portion of memory from /proc/zoneinfo, but the memory
> reported by the `free` command is reduced.
> can we include the PCP memory in the MemFree statistic in /proc/meminfo?

This has been discussed before.

https://lore.kernel.org/linux-mm/20220816084426.135528-1-wangkefeng.wang@huawei.com/
https://lore.kernel.org/linux-mm/20240830014453.3070909-1-mawupeng1@huawei.com/

>> > While, This seems to be fine:
>> > - if freeing and allocating memory occur later, it may the
>> >   high_max may be adjust automatically
>> > - If memory is tight, the memory reclamation process will
>> >   release the pcp
>> 
>> This could be a real issue for me.
>
> Thanks, I will test more carefully for those issue
>
>> 
>> > Whatever, we make vmstat_shepherd to checking whether we need
>> > decay pcp high_max, and fire pcp_decay_high early if we need.
>> >
>> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
>> > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
>> > Signed-off-by: MengEn Sun <mengensun@tencent.com>
>> > ---
>> > changelog:
>> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
>> > v2: Make the commit message clearer by adding some comments.
>> > ---
>> >  mm/vmstat.c | 9 +++++++++
>> >  1 file changed, 9 insertions(+)
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index 1917c034c045..07b494b06872 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
>> >  
>> >  	for_each_populated_zone(zone) {
>> >  		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>> > +		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> >  		struct per_cpu_nodestat *n;
>> >  
>> > +		/* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
>> > +		 * entering NOHZ full, see quiet_vmstat. so, we check pcp
>> > +		 * high_{min,max} to determine whether it is necessary to run
>> > +		 * decay_pcp_high on the corresponding CPU
>> > +		 */
>> 
>> Please follow the comments coding style.
>> 
>>                 /*
>>                  * comments line 1
>>                  * comments line 2
>>                  */
>> 
>
> Thank you for your suggestion. I understand and am ready to make
> some changes
>
>> > +		if (pcp->high_max > pcp->high_min)
>> > +			return true;
>> > +
>> 
>> We don't tune pcp->high_max/min in fact.  Instead, we tune pcp->high.
>> Your code may make need_update() return true in most cases.
>
> You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min?
>
>> 
>> >  		/*
>> >  		 * The fast way of checking if there are any vmstat diffs.
>> >  		 */

--
Best Regards,
Huang, Ying