sched/numa, mm: Skip page promotion if cpu pid is valid

[PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Donet Tom 1 week ago

If memory tiering is disabled, cpupid of slow memory pages may
contain a valid CPU and PID. If tiering is enabled at runtime,
there is a chance that in should_numa_migrate_memory(), this
valid CPU/PID is treated as a last access timestamp, leading
to unnecessary promotion.

Prevent this by skipping promotion when cpupid is valid.

Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b43809a3fb1..f5830a5a94d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		unsigned int latency, th, def_th;
 		long nr = folio_nr_pages(folio);
 
+		/* When tiering is enabled at runtime, last_cpupid may
+		 * hold a valid cpupid instead of an access timestamp.
+		 * If so, skip page promotion.
+		 */
+		if (cpupid_valid(folio_last_cpupid(folio)))
+			return false;
+
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
-- 
2.47.1

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by David Hildenbrand (Arm) 1 week ago

On 3/26/26 08:12, Donet Tom wrote:
> If memory tiering is disabled, cpupid of slow memory pages may
> contain a valid CPU and PID. If tiering is enabled at runtime,
> there is a chance that in should_numa_migrate_memory(), this
> valid CPU/PID is treated as a last access timestamp, leading
> to unnecessary promotion.

Is that measurable? Should we at least have a Fixes: ?

> 
> Prevent this by skipping promotion when cpupid is valid.
> 
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4b43809a3fb1..f5830a5a94d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		unsigned int latency, th, def_th;
>  		long nr = folio_nr_pages(folio);
>  

/*
 * When ...

> +		/* When tiering is enabled at runtime, last_cpupid may
> +		 * hold a valid cpupid instead of an access timestamp.
> +		 * If so, skip page promotion.
> +		 */
> +		if (cpupid_valid(folio_last_cpupid(folio)))
> +			return false;
> +

IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
we would no longer get false positives for cpupid_valid().
I suppose overflows are not a problem, correct?

So what we're saying is that folio_use_access_time()==true does not
imply that there is actually a valid time in there.

In numa_migrate_check() we could still use the valid cpuid I guess and
make that code a bit clearer?

diff --git a/mm/memory.c b/mm/memory.c
index 631205a384e1..ba68933a9e4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
         * For memory tiering mode, cpupid of slow memory page is used
         * to record page access time.  So use default value.
         */
-       if (folio_use_access_time(folio))
+       *last_cpupid = folio_last_cpupid(folio);
+       if (!cpupid_valid(*last_cpupid))
                *last_cpupid = (-1 & LAST_CPUPID_MASK);
-       else
-               *last_cpupid = folio_last_cpupid(folio);
 
        /* Record the current PID accessing VMA */
        vma_set_access_pid_bit(vma);


The change itself here looks reasonable to me.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Donet Tom 6 days, 1 hour ago

On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
> On 3/26/26 08:12, Donet Tom wrote:
>> If memory tiering is disabled, cpupid of slow memory pages may
>> contain a valid CPU and PID. If tiering is enabled at runtime,
>> there is a chance that in should_numa_migrate_memory(), this
>> valid CPU/PID is treated as a last access timestamp, leading
>> to unnecessary promotion.
> Is that measurable? Should we at least have a Fixes: ?
>
>> Prevent this by skipping promotion when cpupid is valid.
>>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   kernel/sched/fair.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4b43809a3fb1..f5830a5a94d5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>   		unsigned int latency, th, def_th;
>>   		long nr = folio_nr_pages(folio);
>>   
> /*
>   * When ...
>
>> +		/* When tiering is enabled at runtime, last_cpupid may
>> +		 * hold a valid cpupid instead of an access timestamp.
>> +		 * If so, skip page promotion.
>> +		 */
>> +		if (cpupid_valid(folio_last_cpupid(folio)))
>> +			return false;
>> +
> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
> we would no longer get false positives for cpupid_valid().
> I suppose overflows are not a problem, correct?

Thank you, David, for guiding me in the right direction.

I initially thought that overflows would not occur, and therefore
cpupid_valid() would not produce false positives. However,
after looking into it further, it appears that overflow can
happen when storing the access time.

The last_cpupid field is used to store the last access time.
 From the code, it appears that 21 bits are used for this
(#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).

With 21 bits, the maximum value that can be stored is
2097151ms (35Hrs) . If the access time exceeds this
range, it can overflow, which may lead to cpupid_valid()
returning false positives.

I think we need a reliable way to determine cpupid_valid() that
does not produce false positives.

-Donet


>
> So what we're saying is that folio_use_access_time()==true does not
> imply that there is actually a valid time in there.
>
> In numa_migrate_check() we could still use the valid cpuid I guess and
> make that code a bit clearer?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 631205a384e1..ba68933a9e4a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>           * For memory tiering mode, cpupid of slow memory page is used
>           * to record page access time.  So use default value.
>           */
> -       if (folio_use_access_time(folio))
> +       *last_cpupid = folio_last_cpupid(folio);
> +       if (!cpupid_valid(*last_cpupid))
>                  *last_cpupid = (-1 & LAST_CPUPID_MASK);
> -       else
> -               *last_cpupid = folio_last_cpupid(folio);
>   
>          /* Record the current PID accessing VMA */
>          vma_set_access_pid_bit(vma);
>
>
> The change itself here looks reasonable to me.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Huang, Ying 2 days, 12 hours ago

Hi, Donet,

Donet Tom <donettom@linux.ibm.com> writes:

> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>> On 3/26/26 08:12, Donet Tom wrote:
>>> If memory tiering is disabled, cpupid of slow memory pages may
>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>> there is a chance that in should_numa_migrate_memory(), this
>>> valid CPU/PID is treated as a last access timestamp, leading
>>> to unnecessary promotion.
>> Is that measurable? Should we at least have a Fixes: ?
>>
>>> Prevent this by skipping promotion when cpupid is valid.
>>>
>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>> ---
>>>   kernel/sched/fair.c | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>   		unsigned int latency, th, def_th;
>>>   		long nr = folio_nr_pages(folio);
>>>   
>> /*
>>   * When ...
>>
>>> +		/* When tiering is enabled at runtime, last_cpupid may
>>> +		 * hold a valid cpupid instead of an access timestamp.
>>> +		 * If so, skip page promotion.
>>> +		 */
>>> +		if (cpupid_valid(folio_last_cpupid(folio)))
>>> +			return false;
>>> +
>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>> we would no longer get false positives for cpupid_valid().
>> I suppose overflows are not a problem, correct?
>
> Thank you, David, for guiding me in the right direction.
>
> I initially thought that overflows would not occur, and therefore
> cpupid_valid() would not produce false positives. However,
> after looking into it further, it appears that overflow can
> happen when storing the access time.
>
> The last_cpupid field is used to store the last access time.
> From the code, it appears that 21 bits are used for this
> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>
> With 21 bits, the maximum value that can be stored is

It can be less than 21 bits, if CONFIG_NR_CPUS is small.

	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));

> 2097151ms (35Hrs) . If the access time exceeds this
> range, it can overflow, which may lead to cpupid_valid()
> returning false positives.
>
> I think we need a reliable way to determine cpupid_valid() that
> does not produce false positives.

Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
temporal performance issue at the cost of a longstanding performance
issue.  Right?

---
Best Regards,
Huang, Ying

>
>>
>> So what we're saying is that folio_use_access_time()==true does not
>> imply that there is actually a valid time in there.
>>
>> In numa_migrate_check() we could still use the valid cpuid I guess and
>> make that code a bit clearer?
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 631205a384e1..ba68933a9e4a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>>           * For memory tiering mode, cpupid of slow memory page is used
>>           * to record page access time.  So use default value.
>>           */
>> -       if (folio_use_access_time(folio))
>> +       *last_cpupid = folio_last_cpupid(folio);
>> +       if (!cpupid_valid(*last_cpupid))
>>                  *last_cpupid = (-1 & LAST_CPUPID_MASK);
>> -       else
>> -               *last_cpupid = folio_last_cpupid(folio);
>>            /* Record the current PID accessing VMA */
>>          vma_set_access_pid_bit(vma);
>>
>>
>> The change itself here looks reasonable to me.
>>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by David Hildenbrand (Arm) 2 days, 10 hours ago

On 3/31/26 10:33, Huang, Ying wrote:
> Hi, Donet,
> 
> Donet Tom <donettom@linux.ibm.com> writes:
> 
>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>> Is that measurable? Should we at least have a Fixes: ?
>>>
>>> /*
>>>   * When ...
>>>
>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>> we would no longer get false positives for cpupid_valid().
>>> I suppose overflows are not a problem, correct?
>>
>> Thank you, David, for guiding me in the right direction.
>>
>> I initially thought that overflows would not occur, and therefore
>> cpupid_valid() would not produce false positives. However,
>> after looking into it further, it appears that overflow can
>> happen when storing the access time.
>>
>> The last_cpupid field is used to store the last access time.
>> From the code, it appears that 21 bits are used for this
>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>
>> With 21 bits, the maximum value that can be stored is
> 
> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
> 
> 	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
> 
>> 2097151ms (35Hrs) . If the access time exceeds this
>> range, it can overflow, which may lead to cpupid_valid()
>> returning false positives.
>>
>> I think we need a reliable way to determine cpupid_valid() that
>> does not produce false positives.
> 
> Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
> temporal performance issue at the cost of a longstanding performance
> issue.  Right?


Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one
bit for time, to we care?

Would make it all easier to get ...

-- 
Cheers,

David

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Huang, Ying 1 day, 10 hours ago

"David Hildenbrand (Arm)" <david@kernel.org> writes:

> On 3/31/26 10:33, Huang, Ying wrote:
>> Hi, Donet,
>> 
>> Donet Tom <donettom@linux.ibm.com> writes:
>> 
>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>
>>>> /*
>>>>   * When ...
>>>>
>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>> we would no longer get false positives for cpupid_valid().
>>>> I suppose overflows are not a problem, correct?
>>>
>>> Thank you, David, for guiding me in the right direction.
>>>
>>> I initially thought that overflows would not occur, and therefore
>>> cpupid_valid() would not produce false positives. However,
>>> after looking into it further, it appears that overflow can
>>> happen when storing the access time.
>>>
>>> The last_cpupid field is used to store the last access time.
>>> From the code, it appears that 21 bits are used for this
>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>
>>> With 21 bits, the maximum value that can be stored is
>> 
>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>> 
>> 	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>> 
>>> 2097151ms (35Hrs) . If the access time exceeds this
>>> range, it can overflow, which may lead to cpupid_valid()
>>> returning false positives.
>>>
>>> I think we need a reliable way to determine cpupid_valid() that
>>> does not produce false positives.
>> 
>> Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
>> temporal performance issue at the cost of a longstanding performance
>> issue.  Right?
>
>
> Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one
> bit for time, to we care?

Do we need one more bit for time and cpupid?  However, page flags are
precious resources.

> Would make it all easier to get ...

---
Best Regards,
Huang, Ying

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Donet Tom 2 days, 5 hours ago

On 3/31/26 3:34 PM, David Hildenbrand (Arm) wrote:
> On 3/31/26 10:33, Huang, Ying wrote:
>> Hi, Donet,
>>
>> Donet Tom <donettom@linux.ibm.com> writes:
>>
>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>
>>>> /*
>>>>    * When ...
>>>>
>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>> we would no longer get false positives for cpupid_valid().
>>>> I suppose overflows are not a problem, correct?
>>> Thank you, David, for guiding me in the right direction.
>>>
>>> I initially thought that overflows would not occur, and therefore
>>> cpupid_valid() would not produce false positives. However,
>>> after looking into it further, it appears that overflow can
>>> happen when storing the access time.
>>>
>>> The last_cpupid field is used to store the last access time.
>>>  From the code, it appears that 21 bits are used for this
>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>
>>> With 21 bits, the maximum value that can be stored is
>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>
>> 	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>
>>> 2097151ms (35Hrs) . If the access time exceeds this
>>> range, it can overflow, which may lead to cpupid_valid()
>>> returning false positives.
>>>
>>> I think we need a reliable way to determine cpupid_valid() that
>>> does not produce false positives.
>> Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
>> temporal performance issue at the cost of a longstanding performance
>> issue.  Right?
>
> Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one
> bit for time, to we care?


Thank you, David, for the input. This sounds like a good idea—I'll give 
it a try.

-Donet


>
> Would make it all easier to get ...
>

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Donet Tom 2 days, 11 hours ago

Hi

On 3/31/26 2:03 PM, Huang, Ying wrote:
> Hi, Donet,
>
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>> On 3/26/26 08:12, Donet Tom wrote:
>>>> If memory tiering is disabled, cpupid of slow memory pages may
>>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>>> there is a chance that in should_numa_migrate_memory(), this
>>>> valid CPU/PID is treated as a last access timestamp, leading
>>>> to unnecessary promotion.
>>> Is that measurable? Should we at least have a Fixes: ?
>>>
>>>> Prevent this by skipping promotion when cpupid is valid.
>>>>
>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>> ---
>>>>    kernel/sched/fair.c | 7 +++++++
>>>>    1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>>    		unsigned int latency, th, def_th;
>>>>    		long nr = folio_nr_pages(folio);
>>>>    
>>> /*
>>>    * When ...
>>>
>>>> +		/* When tiering is enabled at runtime, last_cpupid may
>>>> +		 * hold a valid cpupid instead of an access timestamp.
>>>> +		 * If so, skip page promotion.
>>>> +		 */
>>>> +		if (cpupid_valid(folio_last_cpupid(folio)))
>>>> +			return false;
>>>> +
>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>> we would no longer get false positives for cpupid_valid().
>>> I suppose overflows are not a problem, correct?
>> Thank you, David, for guiding me in the right direction.
>>
>> I initially thought that overflows would not occur, and therefore
>> cpupid_valid() would not produce false positives. However,
>> after looking into it further, it appears that overflow can
>> happen when storing the access time.
>>
>> The last_cpupid field is used to store the last access time.
>>  From the code, it appears that 21 bits are used for this
>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>
>> With 21 bits, the maximum value that can be stored is
> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>
> 	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>
>> 2097151ms (35Hrs) . If the access time exceeds this
>> range, it can overflow, which may lead to cpupid_valid()
>> returning false positives.
>>
>> I think we need a reliable way to determine cpupid_valid() that
>> does not produce false positives.
> Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
> temporal performance issue at the cost of a longstanding performance
> issue.  Right?


I was trying to fix a functional issue. When memory tiering is

enabled at runtime, treating last_cpupid as access time is incorrect, right?

-Donet


> ---
> Best Regards,
> Huang, Ying
>
>>> So what we're saying is that folio_use_access_time()==true does not
>>> imply that there is actually a valid time in there.
>>>
>>> In numa_migrate_check() we could still use the valid cpuid I guess and
>>> make that code a bit clearer?
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 631205a384e1..ba68933a9e4a 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>>>            * For memory tiering mode, cpupid of slow memory page is used
>>>            * to record page access time.  So use default value.
>>>            */
>>> -       if (folio_use_access_time(folio))
>>> +       *last_cpupid = folio_last_cpupid(folio);
>>> +       if (!cpupid_valid(*last_cpupid))
>>>                   *last_cpupid = (-1 & LAST_CPUPID_MASK);
>>> -       else
>>> -               *last_cpupid = folio_last_cpupid(folio);
>>>             /* Record the current PID accessing VMA */
>>>           vma_set_access_pid_bit(vma);
>>>
>>>
>>> The change itself here looks reasonable to me.
>>>
>>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>>

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Huang, Ying 2 days, 11 hours ago

Donet Tom <donettom@linux.ibm.com> writes:

> Hi
>
> On 3/31/26 2:03 PM, Huang, Ying wrote:
>> Hi, Donet,
>>
>> Donet Tom <donettom@linux.ibm.com> writes:
>>
>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>> On 3/26/26 08:12, Donet Tom wrote:
>>>>> If memory tiering is disabled, cpupid of slow memory pages may
>>>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>>>> there is a chance that in should_numa_migrate_memory(), this
>>>>> valid CPU/PID is treated as a last access timestamp, leading
>>>>> to unnecessary promotion.
>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>
>>>>> Prevent this by skipping promotion when cpupid is valid.
>>>>>
>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>> ---
>>>>>    kernel/sched/fair.c | 7 +++++++
>>>>>    1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>>>    		unsigned int latency, th, def_th;
>>>>>    		long nr = folio_nr_pages(folio);
>>>>>    
>>>> /*
>>>>    * When ...
>>>>
>>>>> +		/* When tiering is enabled at runtime, last_cpupid may
>>>>> +		 * hold a valid cpupid instead of an access timestamp.
>>>>> +		 * If so, skip page promotion.
>>>>> +		 */
>>>>> +		if (cpupid_valid(folio_last_cpupid(folio)))
>>>>> +			return false;
>>>>> +
>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>> we would no longer get false positives for cpupid_valid().
>>>> I suppose overflows are not a problem, correct?
>>> Thank you, David, for guiding me in the right direction.
>>>
>>> I initially thought that overflows would not occur, and therefore
>>> cpupid_valid() would not produce false positives. However,
>>> after looking into it further, it appears that overflow can
>>> happen when storing the access time.
>>>
>>> The last_cpupid field is used to store the last access time.
>>>  From the code, it appears that 21 bits are used for this
>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>
>>> With 21 bits, the maximum value that can be stored is
>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>
>> 	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>
>>> 2097151ms (35Hrs) . If the access time exceeds this
>>> range, it can overflow, which may lead to cpupid_valid()
>>> returning false positives.
>>>
>>> I think we need a reliable way to determine cpupid_valid() that
>>> does not produce false positives.
>> Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
>> temporal performance issue at the cost of a longstanding performance
>> issue.  Right?
>
>
> I was trying to fix a functional issue. When memory tiering is
>
> enabled at runtime, treating last_cpupid as access time is incorrect, right?

I don't think that it's a functional issue.  It has only performance
impact.  Did you find any functionality bug?

---
Best Regards,
Huang, Ying

Re: [PATCH] sched/numa, mm: Skip page promotion if cpu pid is valid

Posted by Donet Tom 2 days, 10 hours ago

On 3/31/26 2:47 PM, Huang, Ying wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> Hi
>>
>> On 3/31/26 2:03 PM, Huang, Ying wrote:
>>> Hi, Donet,
>>>
>>> Donet Tom <donettom@linux.ibm.com> writes:
>>>
>>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote:
>>>>> On 3/26/26 08:12, Donet Tom wrote:
>>>>>> If memory tiering is disabled, cpupid of slow memory pages may
>>>>>> contain a valid CPU and PID. If tiering is enabled at runtime,
>>>>>> there is a chance that in should_numa_migrate_memory(), this
>>>>>> valid CPU/PID is treated as a last access timestamp, leading
>>>>>> to unnecessary promotion.
>>>>> Is that measurable? Should we at least have a Fixes: ?
>>>>>
>>>>>> Prevent this by skipping promotion when cpupid is valid.
>>>>>>
>>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>>>>>> ---
>>>>>>     kernel/sched/fair.c | 7 +++++++
>>>>>>     1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 4b43809a3fb1..f5830a5a94d5 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>>>>>     		unsigned int latency, th, def_th;
>>>>>>     		long nr = folio_nr_pages(folio);
>>>>>>     
>>>>> /*
>>>>>     * When ...
>>>>>
>>>>>> +		/* When tiering is enabled at runtime, last_cpupid may
>>>>>> +		 * hold a valid cpupid instead of an access timestamp.
>>>>>> +		 * If so, skip page promotion.
>>>>>> +		 */
>>>>>> +		if (cpupid_valid(folio_last_cpupid(folio)))
>>>>>> +			return false;
>>>>>> +
>>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
>>>>> we would no longer get false positives for cpupid_valid().
>>>>> I suppose overflows are not a problem, correct?
>>>> Thank you, David, for guiding me in the right direction.
>>>>
>>>> I initially thought that overflows would not occur, and therefore
>>>> cpupid_valid() would not produce false positives. However,
>>>> after looking into it further, it appears that overflow can
>>>> happen when storing the access time.
>>>>
>>>> The last_cpupid field is used to store the last access time.
>>>>   From the code, it appears that 21 bits are used for this
>>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)).
>>>>
>>>> With 21 bits, the maximum value that can be stored is
>>> It can be less than 21 bits, if CONFIG_NR_CPUS is small.
>>>
>>> 	DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS));
>>>
>>>> 2097151ms (35Hrs) . If the access time exceeds this
>>>> range, it can overflow, which may lead to cpupid_valid()
>>>> returning false positives.
>>>>
>>>> I think we need a reliable way to determine cpupid_valid() that
>>>> does not produce false positives.
>>> Yes.  IMHO, false positives is unavoidable.  So, the patch fixes a
>>> temporal performance issue at the cost of a longstanding performance
>>> issue.  Right?
>>
>> I was trying to fix a functional issue. When memory tiering is
>>
>> enabled at runtime, treating last_cpupid as access time is incorrect, right?
> I don't think that it's a functional issue.  It has only performance
> impact.  Did you find any functionality bug?
>

Thank you for the confirmation. I thought this was a functional issue. 
In that case, we can drop this patch.

-Donet


>
> ---
> Best Regards,
> Huang, Ying