kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+)
If memory tiering is disabled, cpupid of slow memory pages may
contain a valid CPU and PID. If tiering is enabled at runtime,
there is a chance that in should_numa_migrate_memory(), this
valid CPU/PID is treated as a last access timestamp, leading
to unnecessary promotion.
Prevent this by skipping promotion when cpupid is valid.
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b43809a3fb1..f5830a5a94d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
unsigned int latency, th, def_th;
long nr = folio_nr_pages(folio);
+ /* When tiering is enabled at runtime, last_cpupid may
+ * hold a valid cpupid instead of an access timestamp.
+ * If so, skip page promotion.
+ */
+ if (cpupid_valid(folio_last_cpupid(folio)))
+ return false;
+
pgdat = NODE_DATA(dst_nid);
if (pgdat_free_space_enough(pgdat)) {
/* workload changed, reset hot threshold */
--
2.47.1
On 3/26/26 08:12, Donet Tom wrote:
> If memory tiering is disabled, cpupid of slow memory pages may
> contain a valid CPU and PID. If tiering is enabled at runtime,
> there is a chance that in should_numa_migrate_memory(), this
> valid CPU/PID is treated as a last access timestamp, leading
> to unnecessary promotion.
Is that measurable? Should we at least have a Fixes: ?
>
> Prevent this by skipping promotion when cpupid is valid.
>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4b43809a3fb1..f5830a5a94d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> unsigned int latency, th, def_th;
> long nr = folio_nr_pages(folio);
>
/*
* When ...
> + /* When tiering is enabled at runtime, last_cpupid may
> + * hold a valid cpupid instead of an access timestamp.
> + * If so, skip page promotion.
> + */
> + if (cpupid_valid(folio_last_cpupid(folio)))
> + return false;
> +
IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup,
we would no longer get false positives for cpupid_valid().
I suppose overflows are not a problem, correct?
So what we're saying is that folio_use_access_time()==true does not
imply that there is actually a valid time in there.
In numa_migrate_check() we could still use the valid cpuid I guess and
make that code a bit clearer?
diff --git a/mm/memory.c b/mm/memory.c
index 631205a384e1..ba68933a9e4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
* For memory tiering mode, cpupid of slow memory page is used
* to record page access time. So use default value.
*/
- if (folio_use_access_time(folio))
+ *last_cpupid = folio_last_cpupid(folio);
+ if (!cpupid_valid(*last_cpupid))
*last_cpupid = (-1 & LAST_CPUPID_MASK);
- else
- *last_cpupid = folio_last_cpupid(folio);
/* Record the current PID accessing VMA */
vma_set_access_pid_bit(vma);
The change itself here looks reasonable to me.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: > On 3/26/26 08:12, Donet Tom wrote: >> If memory tiering is disabled, cpupid of slow memory pages may >> contain a valid CPU and PID. If tiering is enabled at runtime, >> there is a chance that in should_numa_migrate_memory(), this >> valid CPU/PID is treated as a last access timestamp, leading >> to unnecessary promotion. > Is that measurable? Should we at least have a Fixes: ? > >> Prevent this by skipping promotion when cpupid is valid. >> >> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >> --- >> kernel/sched/fair.c | 7 +++++++ >> 1 file changed, 7 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 4b43809a3fb1..f5830a5a94d5 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >> unsigned int latency, th, def_th; >> long nr = folio_nr_pages(folio); >> > /* > * When ... > >> + /* When tiering is enabled at runtime, last_cpupid may >> + * hold a valid cpupid instead of an access timestamp. >> + * If so, skip page promotion. >> + */ >> + if (cpupid_valid(folio_last_cpupid(folio))) >> + return false; >> + > IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, > we would no longer get false positives for cpupid_valid(). > I suppose overflows are not a problem, correct? Thank you, David, for guiding me in the right direction. I initially thought that overflows would not occur, and therefore cpupid_valid() would not produce false positives. However, after looking into it further, it appears that overflow can happen when storing the access time. The last_cpupid field is used to store the last access time. From the code, it appears that 21 bits are used for this (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). With 21 bits, the maximum value that can be stored is 2097151ms (35Hrs) . If the access time exceeds this range, it can overflow, which may lead to cpupid_valid() returning false positives. I think we need a reliable way to determine cpupid_valid() that does not produce false positives. -Donet > > So what we're saying is that folio_use_access_time()==true does not > imply that there is actually a valid time in there. > > In numa_migrate_check() we could still use the valid cpuid I guess and > make that code a bit clearer? > > diff --git a/mm/memory.c b/mm/memory.c > index 631205a384e1..ba68933a9e4a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, > * For memory tiering mode, cpupid of slow memory page is used > * to record page access time. So use default value. > */ > - if (folio_use_access_time(folio)) > + *last_cpupid = folio_last_cpupid(folio); > + if (!cpupid_valid(*last_cpupid)) > *last_cpupid = (-1 & LAST_CPUPID_MASK); > - else > - *last_cpupid = folio_last_cpupid(folio); > > /* Record the current PID accessing VMA */ > vma_set_access_pid_bit(vma); > > > The change itself here looks reasonable to me. > > Acked-by: David Hildenbrand (Arm) <david@kernel.org> >
Hi, Donet, Donet Tom <donettom@linux.ibm.com> writes: > On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >> On 3/26/26 08:12, Donet Tom wrote: >>> If memory tiering is disabled, cpupid of slow memory pages may >>> contain a valid CPU and PID. If tiering is enabled at runtime, >>> there is a chance that in should_numa_migrate_memory(), this >>> valid CPU/PID is treated as a last access timestamp, leading >>> to unnecessary promotion. >> Is that measurable? Should we at least have a Fixes: ? >> >>> Prevent this by skipping promotion when cpupid is valid. >>> >>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>> --- >>> kernel/sched/fair.c | 7 +++++++ >>> 1 file changed, 7 insertions(+) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 4b43809a3fb1..f5830a5a94d5 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>> unsigned int latency, th, def_th; >>> long nr = folio_nr_pages(folio); >>> >> /* >> * When ... >> >>> + /* When tiering is enabled at runtime, last_cpupid may >>> + * hold a valid cpupid instead of an access timestamp. >>> + * If so, skip page promotion. >>> + */ >>> + if (cpupid_valid(folio_last_cpupid(folio))) >>> + return false; >>> + >> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >> we would no longer get false positives for cpupid_valid(). >> I suppose overflows are not a problem, correct? > > Thank you, David, for guiding me in the right direction. > > I initially thought that overflows would not occur, and therefore > cpupid_valid() would not produce false positives. However, > after looking into it further, it appears that overflow can > happen when storing the access time. > > The last_cpupid field is used to store the last access time. > From the code, it appears that 21 bits are used for this > (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). > > With 21 bits, the maximum value that can be stored is It can be less than 21 bits, if CONFIG_NR_CPUS is small. DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); > 2097151ms (35Hrs) . If the access time exceeds this > range, it can overflow, which may lead to cpupid_valid() > returning false positives. > > I think we need a reliable way to determine cpupid_valid() that > does not produce false positives. Yes. IMHO, false positives is unavoidable. So, the patch fixes a temporal performance issue at the cost of a longstanding performance issue. Right? --- Best Regards, Huang, Ying > >> >> So what we're saying is that folio_use_access_time()==true does not >> imply that there is actually a valid time in there. >> >> In numa_migrate_check() we could still use the valid cpuid I guess and >> make that code a bit clearer? >> >> diff --git a/mm/memory.c b/mm/memory.c >> index 631205a384e1..ba68933a9e4a 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, >> * For memory tiering mode, cpupid of slow memory page is used >> * to record page access time. So use default value. >> */ >> - if (folio_use_access_time(folio)) >> + *last_cpupid = folio_last_cpupid(folio); >> + if (!cpupid_valid(*last_cpupid)) >> *last_cpupid = (-1 & LAST_CPUPID_MASK); >> - else >> - *last_cpupid = folio_last_cpupid(folio); >> /* Record the current PID accessing VMA */ >> vma_set_access_pid_bit(vma); >> >> >> The change itself here looks reasonable to me. >> >> Acked-by: David Hildenbrand (Arm) <david@kernel.org> >>
On 3/31/26 10:33, Huang, Ying wrote: > Hi, Donet, > > Donet Tom <donettom@linux.ibm.com> writes: > >> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>> Is that measurable? Should we at least have a Fixes: ? >>> >>> /* >>> * When ... >>> >>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>> we would no longer get false positives for cpupid_valid(). >>> I suppose overflows are not a problem, correct? >> >> Thank you, David, for guiding me in the right direction. >> >> I initially thought that overflows would not occur, and therefore >> cpupid_valid() would not produce false positives. However, >> after looking into it further, it appears that overflow can >> happen when storing the access time. >> >> The last_cpupid field is used to store the last access time. >> From the code, it appears that 21 bits are used for this >> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >> >> With 21 bits, the maximum value that can be stored is > > It can be less than 21 bits, if CONFIG_NR_CPUS is small. > > DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); > >> 2097151ms (35Hrs) . If the access time exceeds this >> range, it can overflow, which may lead to cpupid_valid() >> returning false positives. >> >> I think we need a reliable way to determine cpupid_valid() that >> does not produce false positives. > > Yes. IMHO, false positives is unavoidable. So, the patch fixes a > temporal performance issue at the cost of a longstanding performance > issue. Right? Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one bit for time, to we care? Would make it all easier to get ... -- Cheers, David
"David Hildenbrand (Arm)" <david@kernel.org> writes: > On 3/31/26 10:33, Huang, Ying wrote: >> Hi, Donet, >> >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>> Is that measurable? Should we at least have a Fixes: ? >>>> >>>> /* >>>> * When ... >>>> >>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>> we would no longer get false positives for cpupid_valid(). >>>> I suppose overflows are not a problem, correct? >>> >>> Thank you, David, for guiding me in the right direction. >>> >>> I initially thought that overflows would not occur, and therefore >>> cpupid_valid() would not produce false positives. However, >>> after looking into it further, it appears that overflow can >>> happen when storing the access time. >>> >>> The last_cpupid field is used to store the last access time. >>> From the code, it appears that 21 bits are used for this >>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>> >>> With 21 bits, the maximum value that can be stored is >> >> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >> >> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >> >>> 2097151ms (35Hrs) . If the access time exceeds this >>> range, it can overflow, which may lead to cpupid_valid() >>> returning false positives. >>> >>> I think we need a reliable way to determine cpupid_valid() that >>> does not produce false positives. >> >> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >> temporal performance issue at the cost of a longstanding performance >> issue. Right? > > > Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one > bit for time, to we care? Do we need one more bit for time and cpupid? However, page flags are precious resources. > Would make it all easier to get ... --- Best Regards, Huang, Ying
On 3/31/26 3:34 PM, David Hildenbrand (Arm) wrote: > On 3/31/26 10:33, Huang, Ying wrote: >> Hi, Donet, >> >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>> Is that measurable? Should we at least have a Fixes: ? >>>> >>>> /* >>>> * When ... >>>> >>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>> we would no longer get false positives for cpupid_valid(). >>>> I suppose overflows are not a problem, correct? >>> Thank you, David, for guiding me in the right direction. >>> >>> I initially thought that overflows would not occur, and therefore >>> cpupid_valid() would not produce false positives. However, >>> after looking into it further, it appears that overflow can >>> happen when storing the access time. >>> >>> The last_cpupid field is used to store the last access time. >>> From the code, it appears that 21 bits are used for this >>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>> >>> With 21 bits, the maximum value that can be stored is >> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >> >> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >> >>> 2097151ms (35Hrs) . If the access time exceeds this >>> range, it can overflow, which may lead to cpupid_valid() >>> returning false positives. >>> >>> I think we need a reliable way to determine cpupid_valid() that >>> does not produce false positives. >> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >> temporal performance issue at the cost of a longstanding performance >> issue. Right? > > Could we set aside a bit to indicate "cpuid vs. time" ? We'd lose one > bit for time, to we care? Thank you, David, for the input. This sounds like a good idea—I'll give it a try. -Donet > > Would make it all easier to get ... >
Hi On 3/31/26 2:03 PM, Huang, Ying wrote: > Hi, Donet, > > Donet Tom <donettom@linux.ibm.com> writes: > >> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>> On 3/26/26 08:12, Donet Tom wrote: >>>> If memory tiering is disabled, cpupid of slow memory pages may >>>> contain a valid CPU and PID. If tiering is enabled at runtime, >>>> there is a chance that in should_numa_migrate_memory(), this >>>> valid CPU/PID is treated as a last access timestamp, leading >>>> to unnecessary promotion. >>> Is that measurable? Should we at least have a Fixes: ? >>> >>>> Prevent this by skipping promotion when cpupid is valid. >>>> >>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>> --- >>>> kernel/sched/fair.c | 7 +++++++ >>>> 1 file changed, 7 insertions(+) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index 4b43809a3fb1..f5830a5a94d5 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>>> unsigned int latency, th, def_th; >>>> long nr = folio_nr_pages(folio); >>>> >>> /* >>> * When ... >>> >>>> + /* When tiering is enabled at runtime, last_cpupid may >>>> + * hold a valid cpupid instead of an access timestamp. >>>> + * If so, skip page promotion. >>>> + */ >>>> + if (cpupid_valid(folio_last_cpupid(folio))) >>>> + return false; >>>> + >>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>> we would no longer get false positives for cpupid_valid(). >>> I suppose overflows are not a problem, correct? >> Thank you, David, for guiding me in the right direction. >> >> I initially thought that overflows would not occur, and therefore >> cpupid_valid() would not produce false positives. However, >> after looking into it further, it appears that overflow can >> happen when storing the access time. >> >> The last_cpupid field is used to store the last access time. >> From the code, it appears that 21 bits are used for this >> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >> >> With 21 bits, the maximum value that can be stored is > It can be less than 21 bits, if CONFIG_NR_CPUS is small. > > DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); > >> 2097151ms (35Hrs) . If the access time exceeds this >> range, it can overflow, which may lead to cpupid_valid() >> returning false positives. >> >> I think we need a reliable way to determine cpupid_valid() that >> does not produce false positives. > Yes. IMHO, false positives is unavoidable. So, the patch fixes a > temporal performance issue at the cost of a longstanding performance > issue. Right? I was trying to fix a functional issue. When memory tiering is enabled at runtime, treating last_cpupid as access time is incorrect, right? -Donet > --- > Best Regards, > Huang, Ying > >>> So what we're saying is that folio_use_access_time()==true does not >>> imply that there is actually a valid time in there. >>> >>> In numa_migrate_check() we could still use the valid cpuid I guess and >>> make that code a bit clearer? >>> >>> diff --git a/mm/memory.c b/mm/memory.c >>> index 631205a384e1..ba68933a9e4a 100644 >>> --- a/mm/memory.c >>> +++ b/mm/memory.c >>> @@ -6119,10 +6119,9 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf, >>> * For memory tiering mode, cpupid of slow memory page is used >>> * to record page access time. So use default value. >>> */ >>> - if (folio_use_access_time(folio)) >>> + *last_cpupid = folio_last_cpupid(folio); >>> + if (!cpupid_valid(*last_cpupid)) >>> *last_cpupid = (-1 & LAST_CPUPID_MASK); >>> - else >>> - *last_cpupid = folio_last_cpupid(folio); >>> /* Record the current PID accessing VMA */ >>> vma_set_access_pid_bit(vma); >>> >>> >>> The change itself here looks reasonable to me. >>> >>> Acked-by: David Hildenbrand (Arm) <david@kernel.org> >>>
Donet Tom <donettom@linux.ibm.com> writes: > Hi > > On 3/31/26 2:03 PM, Huang, Ying wrote: >> Hi, Donet, >> >> Donet Tom <donettom@linux.ibm.com> writes: >> >>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>> On 3/26/26 08:12, Donet Tom wrote: >>>>> If memory tiering is disabled, cpupid of slow memory pages may >>>>> contain a valid CPU and PID. If tiering is enabled at runtime, >>>>> there is a chance that in should_numa_migrate_memory(), this >>>>> valid CPU/PID is treated as a last access timestamp, leading >>>>> to unnecessary promotion. >>>> Is that measurable? Should we at least have a Fixes: ? >>>> >>>>> Prevent this by skipping promotion when cpupid is valid. >>>>> >>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>>> --- >>>>> kernel/sched/fair.c | 7 +++++++ >>>>> 1 file changed, 7 insertions(+) >>>>> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>> index 4b43809a3fb1..f5830a5a94d5 100644 >>>>> --- a/kernel/sched/fair.c >>>>> +++ b/kernel/sched/fair.c >>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>>>> unsigned int latency, th, def_th; >>>>> long nr = folio_nr_pages(folio); >>>>> >>>> /* >>>> * When ... >>>> >>>>> + /* When tiering is enabled at runtime, last_cpupid may >>>>> + * hold a valid cpupid instead of an access timestamp. >>>>> + * If so, skip page promotion. >>>>> + */ >>>>> + if (cpupid_valid(folio_last_cpupid(folio))) >>>>> + return false; >>>>> + >>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>> we would no longer get false positives for cpupid_valid(). >>>> I suppose overflows are not a problem, correct? >>> Thank you, David, for guiding me in the right direction. >>> >>> I initially thought that overflows would not occur, and therefore >>> cpupid_valid() would not produce false positives. However, >>> after looking into it further, it appears that overflow can >>> happen when storing the access time. >>> >>> The last_cpupid field is used to store the last access time. >>> From the code, it appears that 21 bits are used for this >>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>> >>> With 21 bits, the maximum value that can be stored is >> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >> >> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >> >>> 2097151ms (35Hrs) . If the access time exceeds this >>> range, it can overflow, which may lead to cpupid_valid() >>> returning false positives. >>> >>> I think we need a reliable way to determine cpupid_valid() that >>> does not produce false positives. >> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >> temporal performance issue at the cost of a longstanding performance >> issue. Right? > > > I was trying to fix a functional issue. When memory tiering is > > enabled at runtime, treating last_cpupid as access time is incorrect, right? I don't think that it's a functional issue. It has only performance impact. Did you find any functionality bug? --- Best Regards, Huang, Ying
On 3/31/26 2:47 PM, Huang, Ying wrote: > Donet Tom <donettom@linux.ibm.com> writes: > >> Hi >> >> On 3/31/26 2:03 PM, Huang, Ying wrote: >>> Hi, Donet, >>> >>> Donet Tom <donettom@linux.ibm.com> writes: >>> >>>> On 3/26/26 3:59 PM, David Hildenbrand (Arm) wrote: >>>>> On 3/26/26 08:12, Donet Tom wrote: >>>>>> If memory tiering is disabled, cpupid of slow memory pages may >>>>>> contain a valid CPU and PID. If tiering is enabled at runtime, >>>>>> there is a chance that in should_numa_migrate_memory(), this >>>>>> valid CPU/PID is treated as a last access timestamp, leading >>>>>> to unnecessary promotion. >>>>> Is that measurable? Should we at least have a Fixes: ? >>>>> >>>>>> Prevent this by skipping promotion when cpupid is valid. >>>>>> >>>>>> Signed-off-by: Donet Tom <donettom@linux.ibm.com> >>>>>> --- >>>>>> kernel/sched/fair.c | 7 +++++++ >>>>>> 1 file changed, 7 insertions(+) >>>>>> >>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>>> index 4b43809a3fb1..f5830a5a94d5 100644 >>>>>> --- a/kernel/sched/fair.c >>>>>> +++ b/kernel/sched/fair.c >>>>>> @@ -2001,6 +2001,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, >>>>>> unsigned int latency, th, def_th; >>>>>> long nr = folio_nr_pages(folio); >>>>>> >>>>> /* >>>>> * When ... >>>>> >>>>>> + /* When tiering is enabled at runtime, last_cpupid may >>>>>> + * hold a valid cpupid instead of an access timestamp. >>>>>> + * If so, skip page promotion. >>>>>> + */ >>>>>> + if (cpupid_valid(folio_last_cpupid(folio))) >>>>>> + return false; >>>>>> + >>>>> IIUC, as timestamp we use jiffies_to_msecs(). So, soon after bootup, >>>>> we would no longer get false positives for cpupid_valid(). >>>>> I suppose overflows are not a problem, correct? >>>> Thank you, David, for guiding me in the right direction. >>>> >>>> I initially thought that overflows would not occur, and therefore >>>> cpupid_valid() would not produce false positives. However, >>>> after looking into it further, it appears that overflow can >>>> happen when storing the access time. >>>> >>>> The last_cpupid field is used to store the last access time. >>>> From the code, it appears that 21 bits are used for this >>>> (#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT + LAST__CPU_SHIFT)). >>>> >>>> With 21 bits, the maximum value that can be stored is >>> It can be less than 21 bits, if CONFIG_NR_CPUS is small. >>> >>> DEFINE(NR_CPUS_BITS, order_base_2(CONFIG_NR_CPUS)); >>> >>>> 2097151ms (35Hrs) . If the access time exceeds this >>>> range, it can overflow, which may lead to cpupid_valid() >>>> returning false positives. >>>> >>>> I think we need a reliable way to determine cpupid_valid() that >>>> does not produce false positives. >>> Yes. IMHO, false positives is unavoidable. So, the patch fixes a >>> temporal performance issue at the cost of a longstanding performance >>> issue. Right? >> >> I was trying to fix a functional issue. When memory tiering is >> >> enabled at runtime, treating last_cpupid as access time is incorrect, right? > I don't think that it's a functional issue. It has only performance > impact. Did you find any functionality bug? > Thank you for the confirmation. I thought this was a functional issue. In that case, we can drop this patch. -Donet > > --- > Best Regards, > Huang, Ying
© 2016 - 2026 Red Hat, Inc.