mm/vmstat.c | 9 +++++++++ 1 file changed, 9 insertions(+)
From: MengEn Sun <mengensun@tencent.com>
When a cpu entring NOHZ full, quiet_vmstat may flush percpu
zonestats and nodestats.
The vmstat_shepherd only check percpu zonestats and nodestats
to determine whether it is necessary to fire vmstat_update on
the target cpu for now.
If a process on a certain CPU allocates a large amount of memory,
then frees that memory, and subsequently the CPU enters NOHZ, and
the process not freeing and allocating memory anymore,the
vmstat_update not being executed on the cpu. Because
vmstat_shepherd may not see zonestats and nodestats of the cpu
changed, so may resulting in vmstat_update on the cpu not fired
for a long time.
While, This seems to be fine:
- if freeing and allocating memory occur later, it may the
high_max may be adjust automatically
- If memory is tight, the memory reclamation process will
release the pcp
Whatever, we make vmstat_shepherd to checking whether we need
decay pcp high_max, and fire pcp_decay_high early if we need.
Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
Signed-off-by: MengEn Sun <mengensun@tencent.com>
---
changelog:
v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
v2: Make the commit message clearer by adding some comments.
---
mm/vmstat.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1917c034c045..07b494b06872 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
for_each_populated_zone(zone) {
struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+ struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
struct per_cpu_nodestat *n;
+ /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
+ * entering NOHZ full, see quiet_vmstat. so, we check pcp
+ * high_{min,max} to determine whether it is necessary to run
+ * decay_pcp_high on the corresponding CPU
+ */
+ if (pcp->high_max > pcp->high_min)
+ return true;
+
/*
* The fast way of checking if there are any vmstat diffs.
*/
--
2.43.5
Hi, Mengen, mengensun88@gmail.com writes: > From: MengEn Sun <mengensun@tencent.com> > > When a cpu entring NOHZ full, quiet_vmstat may flush percpu > zonestats and nodestats. > > The vmstat_shepherd only check percpu zonestats and nodestats > to determine whether it is necessary to fire vmstat_update on > the target cpu for now. > > If a process on a certain CPU allocates a large amount of memory, > then frees that memory, and subsequently the CPU enters NOHZ, and > the process not freeing and allocating memory anymore,the > vmstat_update not being executed on the cpu. Because > vmstat_shepherd may not see zonestats and nodestats of the cpu > changed, so may resulting in vmstat_update on the cpu not fired > for a long time. The issue description is too long, can you make it a little shorter? And can you correct your grammar with some tool? Something like chatgpt is good for that, e.g., "fix the grammar of the following text: ...". To make variable and function name distinct, I personally prefer to add "()" after the function name. Have verified the issue with some test? If not, I suggest you to do that. > While, This seems to be fine: > - if freeing and allocating memory occur later, it may the > high_max may be adjust automatically > - If memory is tight, the memory reclamation process will > release the pcp This could be a real issue for me. > Whatever, we make vmstat_shepherd to checking whether we need > decay pcp high_max, and fire pcp_decay_high early if we need. > > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically") > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com> > Signed-off-by: MengEn Sun <mengensun@tencent.com> > --- > changelog: > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t > v2: Make the commit message clearer by adding some comments. > --- > mm/vmstat.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 1917c034c045..07b494b06872 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu) > > for_each_populated_zone(zone) { > struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); > + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); > struct per_cpu_nodestat *n; > > + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu > + * entering NOHZ full, see quiet_vmstat. so, we check pcp > + * high_{min,max} to determine whether it is necessary to run > + * decay_pcp_high on the corresponding CPU > + */ Please follow the comments coding style. /* * comments line 1 * comments line 2 */ > + if (pcp->high_max > pcp->high_min) > + return true; > + We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high. Your code may make need_update() return true in most cases. > /* > * The fast way of checking if there are any vmstat diffs. > */ -- Best Regards, Huang, Ying
Thank you for your suggestion. I understand and am ready to make some changes > > Have verified the issue with some test? If not, I suggest you to do > that. > I have conducted tests: Applying this patch or not does not have a significant impact on the test results. perhaps my testing was not thorough enough. #^_^ But, the logic of the code is like following: CPU0 CPUx ---- ----- T0: vmstat_work is pending T1: vmstat_shepherd check vmstat_work and do nothing T2: vmstat_work is in unpending state. T3: alloc many pages T4: free all the pages allocated at T3 T5: entry NOHZ, flushing all zonestats and nodestats T6: next vmstat_shepherd fired In my opinion, there are indeed some issues. I'm not sure if there's something I haven't understood? By the way, There are two other questions for me: Q1: Vmstat_work is a **deferreable work** So, It may be delayed for a long time by NOHZ. As a result, "vmstat_update() may not be executed once every second in the above scenario. Therefore, I'm not sure if using a deferrable work to reduce pcp->high is appropriate. In my tests, if I don't use deferrable work, it takes about a minute to reduce high to high_min, but using deferrable work may take several minutes to reduce high to high_min. Q2: On a big machine, for example, with 1TB of memory, the default maximum memory on PCP can be 1TB * 0.125. This portion of memory is not accounted for in MemFree in /proc/meminfo. Users can see this portion of memory from /proc/zoneinfo, but the memory reported by the `free` command is reduced. can we include the PCP memory in the MemFree statistic in /proc/meminfo? > > While, This seems to be fine: > > - if freeing and allocating memory occur later, it may the > > high_max may be adjust automatically > > - If memory is tight, the memory reclamation process will > > release the pcp > > This could be a real issue for me. Thanks, I will test more carefully for those issue > > > Whatever, we make vmstat_shepherd to checking whether we need > > decay pcp high_max, and fire pcp_decay_high early if we need. > > > > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically") > > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com> > > Signed-off-by: MengEn Sun <mengensun@tencent.com> > > --- > > changelog: > > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t > > v2: Make the commit message clearer by adding some comments. > > --- > > mm/vmstat.c | 9 +++++++++ > > 1 file changed, 9 insertions(+) > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > index 1917c034c045..07b494b06872 100644 > > --- a/mm/vmstat.c > > +++ b/mm/vmstat.c > > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu) > > > > for_each_populated_zone(zone) { > > struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); > > + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); > > struct per_cpu_nodestat *n; > > > > + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu > > + * entering NOHZ full, see quiet_vmstat. so, we check pcp > > + * high_{min,max} to determine whether it is necessary to run > > + * decay_pcp_high on the corresponding CPU > > + */ > > Please follow the comments coding style. > > /* > * comments line 1 > * comments line 2 > */ > Thank you for your suggestion. I understand and am ready to make some changes > > + if (pcp->high_max > pcp->high_min) > > + return true; > > + > > We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high. > Your code may make need_update() return true in most cases. You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min? > > > /* > > * The fast way of checking if there are any vmstat diffs. > > */ > > -- > Best Regards, > Huang, Ying Best Regards, MengEn, Sun
MengEn Sun <mengensun88@gmail.com> writes: > Thank you for your suggestion. I understand and am ready to make > some changes > >> >> Have verified the issue with some test? If not, I suggest you to do >> that. >> > > I have conducted tests: > Applying this patch or not does not have a significant impact on the > test results. I don't expect some measurable performance difference with the patch. If we can observe that the PCP size isn't tuned down to high_min before and is after, that should be a valid test result to show the value of the patch. Can you try that? The PCP size can be observed via /proc/zoneinfo. > perhaps my testing was not thorough enough. #^_^ > > But, the logic of the code is like following: > CPU0 CPUx > ---- ----- > T0: vmstat_work is pending > T1: vmstat_shepherd > check vmstat_work > and do nothing > T2: vmstat_work is in unpending state. > > T3: alloc many pages > T4: free all the pages allocated at T3 > T5: entry NOHZ, flushing all zonestats > and nodestats > T6: next vmstat_shepherd fired > > In my opinion, there are indeed some issues. I'm not sure if there's > something I haven't understood? > > > By the way, > There are two other questions for me: > Q1: > Vmstat_work is a **deferreable work** So, It may be delayed for a long time > by NOHZ. As a result, "vmstat_update() may not be executed once every > second in the above scenario. Therefore, I'm not sure if using a deferrable > work to reduce pcp->high is appropriate. In my tests, if I don't use > deferrable work, it takes about a minute to reduce high to high_min, but > using deferrable work may take several minutes to reduce high to high_min. It's not a big issue to take longer time to decay pcp->high. > Q2: > On a big machine, for example, with 1TB of memory, the default maximum > memory on PCP can be 1TB * 0.125. > This portion of memory is not accounted for in MemFree in /proc/meminfo. > Users can see this portion of memory from /proc/zoneinfo, but the memory > reported by the `free` command is reduced. > can we include the PCP memory in the MemFree statistic in /proc/meminfo? This has been discussed before. https://lore.kernel.org/linux-mm/20220816084426.135528-1-wangkefeng.wang@huawei.com/ https://lore.kernel.org/linux-mm/20240830014453.3070909-1-mawupeng1@huawei.com/ >> > While, This seems to be fine: >> > - if freeing and allocating memory occur later, it may the >> > high_max may be adjust automatically >> > - If memory is tight, the memory reclamation process will >> > release the pcp >> >> This could be a real issue for me. > > Thanks, I will test more carefully for those issue > >> >> > Whatever, we make vmstat_shepherd to checking whether we need >> > decay pcp high_max, and fire pcp_decay_high early if we need. >> > >> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically") >> > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com> >> > Signed-off-by: MengEn Sun <mengensun@tencent.com> >> > --- >> > changelog: >> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t >> > v2: Make the commit message clearer by adding some comments. >> > --- >> > mm/vmstat.c | 9 +++++++++ >> > 1 file changed, 9 insertions(+) >> > >> > diff --git a/mm/vmstat.c b/mm/vmstat.c >> > index 1917c034c045..07b494b06872 100644 >> > --- a/mm/vmstat.c >> > +++ b/mm/vmstat.c >> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu) >> > >> > for_each_populated_zone(zone) { >> > struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); >> > + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); >> > struct per_cpu_nodestat *n; >> > >> > + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu >> > + * entering NOHZ full, see quiet_vmstat. so, we check pcp >> > + * high_{min,max} to determine whether it is necessary to run >> > + * decay_pcp_high on the corresponding CPU >> > + */ >> >> Please follow the comments coding style. >> >> /* >> * comments line 1 >> * comments line 2 >> */ >> > > Thank you for your suggestion. I understand and am ready to make > some changes > >> > + if (pcp->high_max > pcp->high_min) >> > + return true; >> > + >> >> We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high. >> Your code may make need_update() return true in most cases. > > You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min? > >> >> > /* >> > * The fast way of checking if there are any vmstat diffs. >> > */ -- Best Regards, Huang, Ying
© 2016 - 2024 Red Hat, Inc.