[PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting

Posted by Li Zhijian 3 months, 2 weeks ago
Goto-san reported confusing pgpromote statistics where
the pgpromote_success count significantly exceeded pgpromote_candidate.

On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
 # Enable demotion only
 echo 1 > /sys/kernel/mm/numa/demotion_enabled
 numactl -m 0-1 memhog -r200 3500M >/dev/null &
 pid=$!
 sleep 2
 numactl memhog -r100 2500M >/dev/null &
 sleep 10
 kill -9 $pid # terminate the 1st memhog
 # Enable promotion
 echo 2 > /proc/sys/kernel/numa_balancing

After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:
$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 0

In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, triggering promotion.
However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS,
not in PGPROMOTE_CANDIDATE.

This patch increments PGPROMOTE_CANDIDATE in the free-space branch when a
promotion decision is made. Note that this changes the rate-limit behavior:
because free-space promotions now count toward the candidate total, the
rate limit is reached more easily than before.

For example:
Rate Limit = 100 pages/sec
Scenario:
  T0: 90 free-space migrations
  T0+100ms: 20-page migration request

Before:
  Rate limit is *not* reached: 0 + 20 = 20 < 100
  PGPROMOTE_CANDIDATE: 20
After:
  Rate limit is reached: 90 + 20 = 110 > 100
  PGPROMOTE_CANDIDATE: 110

Because the rate-limit window is recalculated every second, in theory only
a transition from pgdat_free_space_enough() to !pgdat_free_space_enough()
in top-tier memory within a single one-second window can be affected.

Moreover, promotions triggered by pgdat_free_space_enough() within that
one-second window were previously not restricted by the rate limit at all,
which in theory made it easier to introduce application latency.

With this modification, the rate limit is enforced more consistently across
a transition from pgdat_free_space_enough() to !pgdat_free_space_enough()
within one second.
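
To make the window arithmetic above concrete, the rate-limit check can be
modeled in userspace roughly as follows. This is a simplified sketch, not
the kernel code: the struct and field names echo kernel/sched/fair.c, but
the cmpxchg-based window handoff and vmstat accounting are omitted.

```c
#include <stdbool.h>

/* Simplified per-node rate-limit state (cf. struct pglist_data). */
struct node_model {
	unsigned long nbp_rl_start;	/* start of current 1s window (ms) */
	unsigned long nbp_rl_nr_cand;	/* candidate count at window start */
	unsigned long nr_candidate;	/* running PGPROMOTE_CANDIDATE count */
};

/*
 * Return true when the candidate pages accumulated in the current
 * one-second window already exceed rate_limit, i.e. promotion should
 * be throttled.  With this patch, free-space promotions also bump
 * nr_candidate, so they count against the window budget.
 */
static bool promotion_rate_limited(struct node_model *nm, unsigned long now_ms,
				   unsigned long rate_limit, unsigned long nr)
{
	nm->nr_candidate += nr;
	if (now_ms - nm->nbp_rl_start > 1000) {
		/* window expired: restart accounting from here */
		nm->nbp_rl_start = now_ms;
		nm->nbp_rl_nr_cand = nm->nr_candidate;
	}
	return nm->nr_candidate - nm->nbp_rl_nr_cand > rate_limit;
}
```

With rate_limit = 100, 90 free-space candidate pages at T0 followed by a
20-page request at T0+100ms now trips the limit (90 + 20 = 110 > 100),
matching the example above.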

Cc: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
V2:
Fix compiling error # Reported by LKP

As Ying suggested, we need to assess whether this change causes regression.
However, considering the stringent conditions this patch involves,
properly evaluating it may be challenging, as the outcomes depend on your
perspective. Much like in a zero-sum game, if someone benefits, another
might lose.

If there are subsequent results, I will update them here.

Cc: lkp@intel.com
Here, I hope to leverage the existing LKP benchmark to evaluate the
potential impacts. The ideal evaluation conditions are:
1. Installed with DRAM + NVDIMM (which can be simulated).
2. NVDIMM is used as system RAM (configurable via daxctl).
3. Promotion is enabled (`echo 2 > /proc/sys/kernel/numa_balancing`).

Alternative:
We can indeed eliminate the potential impact within
pgdat_free_space_enough(), so that the rate limit behavior remains as
before.

For instance, consider the following change:
                if (pgdat_free_space_enough(pgdat)) {
                        /* workload changed, reset hot threshold */
                        pgdat->nbp_threshold = 0;
+                        pgdat->nbp_rl_nr_cand += nr;
                        mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
                        return true;
                }

RFC:
I am uncertain whether we originally intended for this discrepancy or if
it was overlooked.

However, the current situation where pgpromote_candidate < pgpromote_success
is indeed confusing when interpreted literally.
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..505b40f8897a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
+		long nr = folio_nr_pages(folio);
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
+			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
 			return true;
 		}
 
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		if (latency >= th)
 			return false;
 
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
+		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
-- 
2.41.0
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Zhijian Li (Fujitsu) 3 months, 1 week ago
Hi,


On 25/06/2025 10:13, Li Zhijian wrote:
> V2:
> Fix compiling error # Reported by LKP
> 
> As Ying suggested, we need to assess whether this change causes regression.
> However, considering the stringent conditions this patch involves,
> properly evaluating it may be challenging, as the outcomes depend on your
> perspective. Much like in a zero-sum game, if someone benefits, another
> might lose.
> 
> If there are subsequent results, I will update them here.

I ran memhog + pmbench to evaluate the impact of the patch (3 runs [1] for each kernel).

The results show an approximate 4% performance increase in pmbench after applying this patch.

Average     pmbench-access            max-promotion-rate
Before:     7956805 pages/sec                168301 pages/sec
After:      8313666 pages/sec (+4.4%)        207149 pages/sec

The detailed logs are available at [2].

[1] https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/reproduce.sh
[2] https://github.com/zhijianli88/misc/tree/main/20250627/promotion-evaluation

Thanks
Zhijian
> 
> Cc:lkp@intel.com
> Here, I hope to leverage the existing LKP benchmark to evaluate the
> potential impacts. The ideal evaluation conditions are:
> 1. Installed with DRAM + NVDIMM (which can be simulated).
> 2. NVDIMM is used as system RAM (configurable via daxctl).
> 3. Promotion is enabled (`echo 2 > /proc/sys/kernel/numa_balancing`).
> 
> Alternative:
> We can indeed eliminate the potential impact within
> pgdat_free_space_enough(), so that the rate limit behavior remains as
> before.
> 
> For instance, consider the following change:
>                  if (pgdat_free_space_enough(pgdat)) {
>                          /* workload changed, reset hot threshold */
>                          pgdat->nbp_threshold = 0;
> +                        pgdat->nbp_rl_nr_cand += nr;
>                          mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>                          return true;
>                  }
> 
> RFC:
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Huang, Ying 3 months ago
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> Hi,
>
>
> On 25/06/2025 10:13, Li Zhijian wrote:
>> [snip]
>
> I ran memhog + pmbench to evaluate the impact of the patch(3 runs [1] for each kernel).
>
> The results show an approximate 4% performance increase in pmbench after applying this patch.
>
> Average     pmbench-access            max-promotion-rate
> Before:     7956805 pages/sec                168301 pages/sec
> After:      8313666 pages/sec (+4.4%)        207149 pages/sec

It's hard for me to understand why performance increases despite the
higher promotion rate, when the expected behavior is more promotion
rate limiting.

> The detailed logs are available at [2].
>
> [1] https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/reproduce.sh
> [2] https://github.com/zhijianli88/misc/tree/main/20250627/promotion-evaluation

[snip]

---
Best Regards,
Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Zhijian Li (Fujitsu) 3 months ago

On 08/07/2025 09:14, Huang, Ying wrote:
> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> 
>> Hi,
>>
>>
>> On 25/06/2025 10:13, Li Zhijian wrote:
>>> [snip]
>>
>> I ran memhog + pmbench to evaluate the impact of the patch(3 runs [1] for each kernel).
>>
>> The results show an approximate 4% performance increase in pmbench after applying this patch.
>>
>> Average     pmbench-access            max-promotion-rate
>> Before:     7956805 pages/sec                168301 pages/sec
>> After:      8313666 pages/sec (+4.4%)        207149 pages/sec
> 
> It's hard for me to understand why performance increases because of
> higher promotion rate, while the expected behavior is more promotion
> rate limiting.

Good question.

The max-promotion-rate above is the maximum rate observed during the WHOLE pmbench
run; it does not indicate the total number of promoted pages.

Allow me to present each sample [0] recorded per second during the pmbench duration, as exemplified below:


             |       AFTER             |VS |           BEFORE       |
------------+-------------------------+++++------------------------|
| Timestamp |  pgprom/s   |  pgdem/s  |   |  pgprom/s  |  pgdem/s  |
|-----------|-------------|-----------|---|------------|-----------|
|     1     |   122977    |     0     |   |   123051   |     0     |
|     2     |   50171     |     0     |   |   50159    |     0     |
|     3     |     18      |     0     |   |     28     |     0     |
|     4     |   16647     |     0     |   |     0      |     0     |
|     5     | 207149.5    |     0     |   |   78895    |     0     |
|     6     | 193411      | 161521    |   |  168301    |   8702    |
|     7     |  52464      |  53989    |   |   42294    |  39108    |
|     8     |   5133      |   2627    |   |     0      |     0     |
|     9     |     24      |     8     |   |   3875     |   6213    |
|    10     |     0       |     0     |   |  45513     |  43260    |
|    11     |     0       |     0     |   |  36600     |  44982    |
|    12     |     0       |     0     |   |  21091     |  11631    |
|    13     |     0       |     0     |   |  12276     |  10719    |
|    14     |     0       |     0     |   | 149699     | 149400    |
|    15     |     0       |     0     |   |   4026     |   4933    |
|    16     |     0       |     0     |   |   3780     |     0     |
|    17     |     0       |     0     |   |     2      |     0     |
|    18     |     0       |     0     |   |     0      |     0     |
|    19     |     0       |     0     |   |     0      |     0     |
|    20     |     0       |     0     |   |     0      |     0     |
|    21     |     0       |     0     |   |    62      |     0     |
|    22     |     0       |     0     |   |   2016     |     0     |
|    23     |     0       |     0     |   |     0      |     0     |
|    24     |     0       |     0     |   |    62      |     0     |
|    25     |   8308      |     0     |   |     1      |     0     |
|    26     |   220       |     0     |   |     0      |     0     |
|    27     |     0       |     0     |   |  1995.05   |     0     |
|    28     |     0       |     0     |   |     1      |     0     |
|    29     |   5791      |     0     |   |     0      |     0     |
|    30     |     0       |     0     |   |    62      |     0     |
------------+-------------------------+++++------------------------|
|   total   | 662313.5    | 218145    |   | 743789.05  | 318948    |
|    max    | 207149.5    | 161521    |   |  168301    | 149400    |
------------+-------------------------+++++------------------------|
|   pmbench |        8416250          |VS |        8079500         |


As far as I can tell, the higher pmbench score with the patch applied may be attributed
to a reduction in the total number of pages promoted over the entire pmbench run.
(Similar behavior was observed in the results of the other tests conducted.)



[0]
before:
https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/without-patch/pmbench-1750988862.log
https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/without-patch/sar-1750988862.log
after:
https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/with-patch/pmbench-1750988291.log
https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/with-patch/sar-1750988291.log


Thanks
Zhijian

> 
>> The detailed logs are available at [2].
>>
>> [1] https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/reproduce.sh
>> [2] https://github.com/zhijianli88/misc/tree/main/20250627/promotion-evaluation
> 
> [snip]
> 
> ---
> Best Regards,
> Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Huang, Ying 3 months ago
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 08/07/2025 09:14, Huang, Ying wrote:
>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>> 
>>> Hi,
>>>
>>>
>>> On 25/06/2025 10:13, Li Zhijian wrote:
>>>> [snip]
>>>
>>> I ran memhog + pmbench to evaluate the impact of the patch(3 runs [1] for each kernel).
>>>
>>> The results show an approximate 4% performance increase in pmbench after applying this patch.
>>>
>>> Average     pmbench-access            max-promotion-rate
>>> Before:     7956805 pages/sec                168301 pages/sec
>>> After:      8313666 pages/sec (+4.4%)        207149 pages/sec
>> 
>> It's hard for me to understand why performance increases because of
>> higher promotion rate, while the expected behavior is more promotion
>> rate limiting.
>
> Good question.
>
> Above max-promotion-rate means the maximum rate during the WHOLE pmbench period which
> can not indicate the total promoted pages.
>
> Allow me to present each sample [0] recorded per second during the pmbench duration, as exemplified below:
>
>
> [snip]
>
>
> As far as I can tell, the higher pmbench scores applied-patch may be attributed to
> a reduction in the total number of promoted pages in the entire pmbench execution period.
> (Similar circumstances were observed in the results of other tests conducted)
>
>
>
> [snip]
>

Check the usage of PGPROMOTE_CANDIDATE again.  It is used not only for
rate limiting but also for promotion threshold adjustment; please take a
look at numa_promotion_adjust_threshold(), which may have a larger
influence on performance.

After checking the threshold adjustment code, I think the changes in
this patch may confuse threshold adjustment.

[snip]

---
Best Regards,
Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Zhijian Li (Fujitsu) 3 months ago

On 08/07/2025 10:47, Huang, Ying wrote:
> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> 
>> On 08/07/2025 09:14, Huang, Ying wrote:
>>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>>>
>>>> Hi,
>>>>
>>>>
>>>> On 25/06/2025 10:13, Li Zhijian wrote:
>>>>> [snip]
>>>>
>>>> I ran memhog + pmbench to evaluate the impact of the patch(3 runs [1] for each kernel).
>>>>
>>>> The results show an approximate 4% performance increase in pmbench after applying this patch.
>>>>
>>>> Average     pmbench-access            max-promotion-rate
>>>> Before:     7956805 pages/sec                168301 pages/sec
>>>> After:      8313666 pages/sec (+4.4%)        207149 pages/sec
>>>
>>> It's hard for me to understand why performance increases because of
>>> higher promotion rate, while the expected behavior is more promotion
>>> rate limiting.
>>
>> Good question.
>>
>> Above max-promotion-rate means the maximum rate during the WHOLE pmbench period which
>> can not indicate the total promoted pages.
>>
>> Allow me to present each sample [0] recorded per second during the pmbench duration, as exemplified below:
>>
>>
>> [snip]
>>
>>
>> As far as I can tell, the higher pmbench scores applied-patch may be attributed to
>> a reduction in the total number of promoted pages in the entire pmbench execution period.
>> (Similar circumstances were observed in the results of other tests conducted)
>>
>>
>>
>> [snip]
>>
> 
> Check the usage of PGPROMOTE_CANDIDATE again.  It is used not only by
> rate limiting, but also promotion threshold adjustment, please take a
> look at numa_promotion_adjust_threshold().  Which may have larger
> influence on performance.
> 
> After checking the threshold adjustment code, I think the changes in
> this patch may confuse threshold adjustment.


Indeed, I misunderstood the comment in the previous code:
/* workload changed, reset hot threshold */.

Originally, this logic only reset the threshold for the current interval.
For the next cycle (60 seconds by default), the threshold is
re-evaluated based on the historical PGPROMOTE_CANDIDATE counts.
Therefore, the current change may affect threshold adjustment in subsequent cycles.
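
To make the concern concrete: the per-cycle adjustment compares the candidate
pages accumulated in the cycle against a budget derived from the rate limit,
so counting free-space promotions in PGPROMOTE_CANDIDATE pushes the threshold
down in later cycles. A rough userspace model follows (my approximation of
numa_promotion_adjust_threshold(); the comparison bands and step size are
assumptions from my reading of the code, not authoritative):

```c
/*
 * Rough model of numa_promotion_adjust_threshold(): th is the current
 * hot threshold, ref_th the default threshold, diff_cand the candidate
 * pages accumulated in this adjustment cycle, and ref_cand the
 * per-cycle budget derived from the rate limit.
 */
static unsigned int adjust_threshold(unsigned int th, unsigned int ref_th,
				     unsigned long diff_cand,
				     unsigned long ref_cand)
{
	unsigned int unit_th = ref_th / 8;	/* adjustment step (model) */

	if (diff_cand > ref_cand * 11 / 10)	/* too many candidates: */
		th = th > 2 * unit_th ? th - unit_th : unit_th;	/* stricter */
	else if (diff_cand < ref_cand * 9 / 10)	/* too few: relax */
		th = th + unit_th < 2 * ref_th ? th + unit_th : 2 * ref_th;
	return th;
}
```

E.g. with ref_th = 1000 and a 100-page budget, counting 100 extra free-space
pages doubles diff_cand and lowers the threshold by one step, making
promotion in the next cycle stricter than it would otherwise have been.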


Do you think there's still a case to push for this patch?

For example, by collecting more data with longer pmbench runs (spanning two
threshold cycles), or by explicitly compensating nbp_rl_nr_cand and
nbp_th_nr_cand to preserve the existing behavior of both the rate-limit and
threshold logic? Something like:

if (pgdat_free_space_enough(pgdat)) {
     /* workload changed, reset hot threshold */
     pgdat->nbp_threshold = 0;
     
     mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
     // compensation for rate limit and threshold
     pgdat->nbp_rl_nr_cand += nr;
     pgdat->nbp_th_nr_cand += nr;
     
     return true;
}

Thanks
Zhijian

> 
> [snip]
> 
> ---
> Best Regards,
> Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Huang, Ying 3 months ago
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 08/07/2025 10:47, Huang, Ying wrote:
>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>> 
>>> On 08/07/2025 09:14, Huang, Ying wrote:
>>>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> On 25/06/2025 10:13, Li Zhijian wrote:
>>>>>> [snip]
>>>>>
>>>>> I ran memhog + pmbench to evaluate the impact of the patch(3 runs [1] for each kernel).
>>>>>
>>>>> The results show an approximate 4% performance increase in pmbench after applying this patch.
>>>>>
>>>>> Average     pmbench-access            max-promotion-rate
>>>>> Before:     7956805 pages/sec                168301 pages/sec
>>>>> After:      8313666 pages/sec (+4.4%)        207149 pages/sec
>>>>
>>>> It's hard for me to understand why performance increases because of
>>>> higher promotion rate, while the expected behavior is more promotion
>>>> rate limiting.
>>>
>>> Good question.
>>>
>>> Above max-promotion-rate means the maximum rate during the WHOLE pmbench period which
>>> can not indicate the total promoted pages.
>>>
>>> Allow me to present each sample [0] recorded per second during the pmbench duration, as exemplified below:
>>>
>>>
>>> [snip]
>>>
>>>
>>> As far as I can tell, the higher pmbench scores applied-patch may be attributed to
>>> a reduction in the total number of promoted pages in the entire pmbench execution period.
>>> (Similar circumstances were observed in the results of other tests conducted)
>>>
>>>
>>>
>>> [snip]
>>>
>> 
>> Check the usage of PGPROMOTE_CANDIDATE again.  It is used not only for
>> rate limiting but also for promotion threshold adjustment; please take a
>> look at numa_promotion_adjust_threshold(), which may have a larger
>> influence on performance.
>> 
>> After checking the threshold adjustment code, I think the changes in
>> this patch may confuse threshold adjustment.
>
>
> Indeed, I misunderstood the comment in the previous code:
> /* workload changed, reset hot threshold */.
>
> Originally, this logic only reset the threshold for the current interval.
> For the next cycle (60 seconds by default), the threshold is
> re-evaluated based on the historical PGPROMOTE_CANDIDATE counts.
> Therefore, the current change may affect threshold adjustment in subsequent cycles.
>
>
> Do you think there's still a case to push for this patch?
>
> For example, by collecting more data with longer pmbench runs (spanning two
> threshold cycles), or by explicitly compensating nbp_rl_nr_cand and
> nbp_th_nr_cand to preserve the existing behavior of both the rate limit and
> the threshold logic? Something like:
>
> if (pgdat_free_space_enough(pgdat)) {
>      /* workload changed, reset hot threshold */
>      pgdat->nbp_threshold = 0;
>      
>      mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>      // compensation for rate limit and threshold
>      pgdat->nbp_rl_nr_cand += nr;
>      pgdat->nbp_th_nr_cand += nr;
>      
>      return true;
> }

I don't think it's necessary to make the algorithm harder to
understand.

If you think that the original stat is really confusing, I guess
we can add a new stat (say, PGPROMOTE_CANDIDATE_OTHER).

---
Best Regards,
Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Zhijian Li (Fujitsu) 3 months ago

On 08/07/2025 16:56, Huang, Ying wrote:
> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
> 
>> On 08/07/2025 10:47, Huang, Ying wrote:
>>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>>>
>>>> On 08/07/2025 09:14, Huang, Ying wrote:
>>>>> "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On 25/06/2025 10:13, Li Zhijian wrote:
>>>>>>> V2:
>>>>>>> Fix compiling error # Reported by LKP
>>>>>>>
>>>>>>> As Ying suggested, we need to assess whether this change causes regression.
>>>>>>> However, considering the stringent conditions this patch involves,
>>>>>>> properly evaluating it may be challenging, as the outcomes depend on your
>>>>>>> perspective. Much like in a zero-sum game, if someone benefits, another
>>>>>>> might lose.
>>>>>>>
>>>>>>> If there are subsequent results, I will update them here.
>>>>>>
>>>>>> I ran memhog + pmbench to evaluate the impact of the patch (3 runs [1] for each kernel).
>>>>>>
>>>>>> The results show an approximate 4% performance increase in pmbench after applying this patch.
>>>>>>
>>>>>> Average     pmbench-access            max-promotion-rate
>>>>>> Before:     7956805 pages/sec                168301 pages/sec
>>>>>> After:      8313666 pages/sec (+4.4%)        207149 pages/sec
>>>>>
>>>>> It's hard for me to understand why performance increases with a higher
>>>>> promotion rate, while the expected behavior is more promotion rate
>>>>> limiting.
>>>>
>>>> Good question.
>>>>
>>>> The max-promotion-rate above is the maximum rate observed during the WHOLE
>>>> pmbench run; it does not indicate the total number of promoted pages.
>>>>
>>>> Allow me to present the per-second samples [0] recorded during the pmbench
>>>> run, as shown below:
>>>>
>>>>
>>>>                |       AFTER             |VS |           BEFORE       |
>>>> ------------+-------------------------+++++------------------------|
>>>> | Timestamp |  pgprom/s   |  pgdem/s  |   |  pgprom/s  |  pgdem/s  |
>>>> |-----------|-------------|-----------|---|------------|-----------|
>>>> |     1     |   122977    |     0     |   |   123051   |     0     |
>>>> |     2     |   50171     |     0     |   |   50159    |     0     |
>>>> |     3     |     18      |     0     |   |     28     |     0     |
>>>> |     4     |   16647     |     0     |   |     0      |     0     |
>>>> |     5     | 207149.5    |     0     |   |   78895    |     0     |
>>>> |     6     | 193411      | 161521    |   |  168301    |   8702    |
>>>> |     7     |  52464      |  53989    |   |   42294    |  39108    |
>>>> |     8     |   5133      |   2627    |   |     0      |     0     |
>>>> |     9     |     24      |     8     |   |   3875     |   6213    |
>>>> |    10     |     0       |     0     |   |  45513     |  43260    |
>>>> |    11     |     0       |     0     |   |  36600     |  44982    |
>>>> |    12     |     0       |     0     |   |  21091     |  11631    |
>>>> |    13     |     0       |     0     |   |  12276     |  10719    |
>>>> |    14     |     0       |     0     |   | 149699     | 149400    |
>>>> |    15     |     0       |     0     |   |   4026     |   4933    |
>>>> |    16     |     0       |     0     |   |   3780     |     0     |
>>>> |    17     |     0       |     0     |   |     2      |     0     |
>>>> |    18     |     0       |     0     |   |     0      |     0     |
>>>> |    19     |     0       |     0     |   |     0      |     0     |
>>>> |    20     |     0       |     0     |   |     0      |     0     |
>>>> |    21     |     0       |     0     |   |    62      |     0     |
>>>> |    22     |     0       |     0     |   |   2016     |     0     |
>>>> |    23     |     0       |     0     |   |     0      |     0     |
>>>> |    24     |     0       |     0     |   |    62      |     0     |
>>>> |    25     |   8308      |     0     |   |     1      |     0     |
>>>> |    26     |   220       |     0     |   |     0      |     0     |
>>>> |    27     |     0       |     0     |   |  1995.05   |     0     |
>>>> |    28     |     0       |     0     |   |     1      |     0     |
>>>> |    29     |   5791      |     0     |   |     0      |     0     |
>>>> |    30     |     0       |     0     |   |    62      |     0     |
>>>> ------------+-------------------------+++++------------------------|
>>>> |   total   | 662313.5    | 218145    |   | 743789.05  | 318948    |
>>>> |    max    | 207149.5    | 161521    |   |  168301    | 149400    |
>>>> ------------+-------------------------+++++------------------------|
>>>> |   pmbench |        8416250          |VS |        8079500         |
>>>>
>>>>
>>>> As far as I can tell, the higher pmbench score with the patch applied may be
>>>> attributed to a reduction in the total number of pages promoted over the
>>>> entire pmbench run.
>>>> (Similar behavior was observed in the other tests conducted.)
>>>>
>>>>
>>>>
>>>> [0]
>>>> before:
>>>> https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/without-patch/pmbench-1750988862.log
>>>> https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/without-patch/sar-1750988862.log
>>>> after:
>>>> https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/with-patch/pmbench-1750988291.log
>>>> https://github.com/zhijianli88/misc/blob/main/20250627/promotion-evaluation/with-patch/sar-1750988291.log
>>>>
>>>
>>> Check the usage of PGPROMOTE_CANDIDATE again.  It is used not only for
>>> rate limiting but also for promotion threshold adjustment; please take a
>>> look at numa_promotion_adjust_threshold(), which may have a larger
>>> influence on performance.
>>>
>>> After checking the threshold adjustment code, I think the changes in
>>> this patch may confuse threshold adjustment.
>>
>>
>> Indeed, I misunderstood the comment in the previous code:
>> /* workload changed, reset hot threshold */.
>>
>> Originally, this logic only reset the threshold for the current interval.
>> For the next cycle (60 seconds by default), the threshold is
>> re-evaluated based on the historical PGPROMOTE_CANDIDATE counts.
>> Therefore, the current change may affect threshold adjustment in subsequent cycles.
>>
>>
>> Do you think there's still a case to push for this patch?
>>
>> For example, by collecting more data with longer pmbench runs (spanning two
>> threshold cycles), or by explicitly compensating nbp_rl_nr_cand and
>> nbp_th_nr_cand to preserve the existing behavior of both the rate limit and
>> the threshold logic? Something like:
>>
>> if (pgdat_free_space_enough(pgdat)) {
>>       /* workload changed, reset hot threshold */
>>       pgdat->nbp_threshold = 0;
>>       
>>       mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>>       // compensation for rate limit and threshold
>>       pgdat->nbp_rl_nr_cand += nr;
>>       pgdat->nbp_th_nr_cand += nr;
>>       
>>       return true;
>> }
> 
> I don't think it's necessary to make the algorithm harder to
> understand.

All right,

> 
> If you think that the original stat is really confusing, I guess
> we can add a new stat (say, PGPROMOTE_CANDIDATE_OTHER).

Actually, I would personally rather not introduce a new stat for this case.
In any case, we will discuss this approach further internally first.

Thank you !

Thanks
Zhijian


> 
> ---
> Best Regards,
> Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Huang, Ying 3 months, 2 weeks ago
Li Zhijian <lizhijian@fujitsu.com> writes:

> Goto-san reported confusing pgpromote statistics where
> the pgpromote_success count significantly exceeded pgpromote_candidate.
>
> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>  # Enable demotion only
>  echo 1 > /sys/kernel/mm/numa/demotion_enabled
>  numactl -m 0-1 memhog -r200 3500M >/dev/null &
>  pid=$!
>  sleep 2
>  numactl memhog -r100 2500M >/dev/null &
>  sleep 10
>  kill -9 $pid # terminate the 1st memhog
>  # Enable promotion
>  echo 2 > /proc/sys/kernel/numa_balancing
>
> After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:
> $ grep -e pgpromote /proc/vmstat
> pgpromote_success 2579
> pgpromote_candidate 0
>
> In this scenario, after terminating the first memhog, the conditions for
> pgdat_free_space_enough() are quickly met, triggering promotion.
> However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS,
> not in PGPROMOTE_CANDIDATE.
>
> This update increments PGPROMOTE_CANDIDATE within the free space branch
> when a promotion decision is made, which may alter the mechanism of the
> rate limit. Consequently, it becomes easier to reach the rate limit than
> it was previously.
>
> For example:
> Rate Limit = 100 pages/sec
> Scenario:
>   T0: 90 free-space migrations
>   T0+100ms: 20-page migration request
>
> Before:
>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>   PGPROMOTE_CANDIDATE: 20
> After:
>   Rate limit is reached: 90 + 20 = 110 > 100
>   PGPROMOTE_CANDIDATE: 110
>
> Because the rate limit is recalculated every second, in theory only a
> transition from pgdat_free_space_enough() to !pgdat_free_space_enough()
> in the top tier's remaining memory within that one second can be affected.
>
> Moreover, promotions taken on the pgdat_free_space_enough() path were
> previously not restricted by the rate limit within this one-second span,
> which in theory makes application latency spikes more likely.
>
> The current modification can better control the rate limit in cases of
> transition from pgdat_free_space_enough() to !pgdat_free_space_enough()
> within one second.
>
> Cc: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
> V2:
> Fix compiling error # Reported by LKP
>
> As Ying suggested, we need to assess whether this change causes regression.
> However, considering the stringent conditions this patch involves,
> properly evaluating it may be challenging, as the outcomes depend on your
> perspective. Much like in a zero-sum game, if someone benefits, another
> might lose.
>
> If there are subsequent results, I will update them here.

I understand that it's hard to identify all possible regressions.
However, could you at least run some tests to cover the common use cases?

> Cc: lkp@intel.com
> Here, I hope to leverage the existing LKP benchmark to evaluate the
> potential impacts. The ideal evaluation conditions are:
> 1. Installed with DRAM + NVDIMM (which can be simulated).
> 2. NVDIMM is used as system RAM (configurable via daxctl).
> 3. Promotion is enabled (`echo 2 > /proc/sys/kernel/numa_balancing`).
>
> Alternative:
> We can indeed eliminate the potential impact within
> pgdat_free_space_enough(), so that the rate limit behavior remains as
> before.
>
> For instance, consider the following change:
>                 if (pgdat_free_space_enough(pgdat)) {
>                         /* workload changed, reset hot threshold */
>                         pgdat->nbp_threshold = 0;
> +                        pgdat->nbp_rl_nr_cand += nr;
>                         mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>                         return true;
>                 }
>
> RFC:
> I am uncertain whether we originally intended for this discrepancy or if
> it was overlooked.
>
> However, the current situation where pgpromote_candidate < pgpromote_success
> is indeed confusing when interpreted literally.
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..505b40f8897a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		struct pglist_data *pgdat;
>  		unsigned long rate_limit;
>  		unsigned int latency, th, def_th;
> +		long nr = folio_nr_pages(folio);
>  
>  		pgdat = NODE_DATA(dst_nid);
>  		if (pgdat_free_space_enough(pgdat)) {
>  			/* workload changed, reset hot threshold */
>  			pgdat->nbp_threshold = 0;
> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>  			return true;
>  		}
>  
> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		if (latency >= th)
>  			return false;
>  
> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
> -						  folio_nr_pages(folio));
> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>  	}
>  
>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);

---
Best Regards,
Huang, Ying
Re: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Posted by Zhijian Li (Fujitsu) 3 months, 2 weeks ago

On 25/06/2025 14:11, Huang, Ying wrote:
>> V2:
>> Fix compiling error # Reported by LKP
>>
>> As Ying suggested, we need to assess whether this change causes regression.
>> However, considering the stringent conditions this patch involves,
>> properly evaluating it may be challenging, as the outcomes depend on your
>> perspective. Much like in a zero-sum game, if someone benefits, another
>> might lose.
>>
>> If there are subsequent results, I will update them here.
> I understand that it's hard to identify all possible regressions.
> However, could you at least run some tests to cover the common use cases?
> 

Yes, I'm working on it. I'm currently using `pmbench` for testing, following
the approach [1] you took when introducing the rate limit.

I also hope LKP can run its tests in parallel.


[1] https://lore.kernel.org/all/20220713083954.34196-1-ying.huang@intel.com/t/#u