When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.
Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c798d27952431..ec4749a7be33a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3315,6 +3315,13 @@ static void task_numa_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
+ /*
+ * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+ * no page can be migrated.
+ */
+ if (nodes_weight(cpuset_current_mems_allowed) == 1)
+ return;
+
if (!mm->numa_next_scan) {
mm->numa_next_scan = now +
msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
--
2.43.5
On Tue, Mar 11, 2025 at 09:04:47AM -0700, Libo Chen wrote:
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
It's been a while since I looked at all this, but if we don't scan these
pages, then it will not account for these pages, and the pinned memory
will not become an attractor for the tasks that use this memory, right?
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c798d27952431..ec4749a7be33a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3315,6 +3315,13 @@ static void task_numa_work(struct callback_head *work)
> if (p->flags & PF_EXITING)
> return;
>
> + /*
> + * Memory is pinned to only one NUMA node via cpuset.mems, naturally
> + * no page can be migrated.
> + */
> + if (nodes_weight(cpuset_current_mems_allowed) == 1)
> + return;
> +
> if (!mm->numa_next_scan) {
> mm->numa_next_scan = now +
> msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> --
> 2.43.5
>
On 3/12/25 01:01, Peter Zijlstra wrote:
> On Tue, Mar 11, 2025 at 09:04:47AM -0700, Libo Chen wrote:
>> When the memory of the current task is pinned to one NUMA node by cgroup,
>> there is no point in continuing the rest of VMA scanning and hinting page
>> faults as they will just be overhead. With this change, there will be no
>> more unnecessary PTE updates or page faults in this scenario.
> It's been a while since I looked at all this, but if we don't scan these
> pages, then it will not account for these pages, and the pinned memory
> will not become an attractor for the tasks that use this memory, right?
Hi Peter,

Yes, you are absolutely right. It will skip change_prot_numa(), which marks a
VMA as inaccessible (not PROT_NONE though) by setting the MM_CP_PROT_NUMA flag;
without that there will be no NUMA hinting faults on those pages, hence no task
migrations towards those pages. But that is how similar cases are already
handled, such as hugetlb or MPOL_BIND pages: if you look at the
!vma_migratable() || !vma_policy_mof() checks, etc., the rest of the loop is
skipped when one of them is true. I am not sure why it is handled that way,
maybe there are some old discussions I am not aware of; personally I think task
migrations should be allowed even if the pages cannot be migrated. Consider
that in most cases, if cpuset.mems is set to one node, the same node (assuming
it is not a memory-only node) is probably set in cpuset.cpus as well. I think I
can add an additional check on CPU affinity, which probably needs a helper
cpumask_to_nodemask(), so something like below?
+ cpumask_to_nodemask(&current->cpus_mask, &allow_cpu_nodes);
+ if (nodes_weight(cpuset_current_mems_allowed) == 1 && nodes_weight(allow_cpu_nodes) == 1)
+ return;
Libo
>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
>> ---
>> kernel/sched/fair.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c798d27952431..ec4749a7be33a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3315,6 +3315,13 @@ static void task_numa_work(struct callback_head *work)
>> if (p->flags & PF_EXITING)
>> return;
>>
>> + /*
>> + * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>> + * no page can be migrated.
>> + */
>> + if (nodes_weight(cpuset_current_mems_allowed) == 1)
>> + return;
>> +
>> if (!mm->numa_next_scan) {
>> mm->numa_next_scan = now +
>> msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>> --
>> 2.43.5
>>
On 3/11/25 12:04 PM, Libo Chen wrote:
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
>
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c798d27952431..ec4749a7be33a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3315,6 +3315,13 @@ static void task_numa_work(struct callback_head *work)
> if (p->flags & PF_EXITING)
> return;
>
> + /*
> + * Memory is pinned to only one NUMA node via cpuset.mems, naturally
> + * no page can be migrated.
> + */
> + if (nodes_weight(cpuset_current_mems_allowed) == 1)
> + return;
> +
> if (!mm->numa_next_scan) {
> mm->numa_next_scan = now +
> msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
Do you have any performance improvement data that can be included in the
commit log?
Cheers,
Longman
On 3/11/25 10:42, Waiman Long wrote:
> On 3/11/25 12:04 PM, Libo Chen wrote:
>> When the memory of the current task is pinned to one NUMA node by
>> cgroup,
>> there is no point in continuing the rest of VMA scanning and hinting
>> page
>> faults as they will just be overhead. With this change, there will be no
>> more unnecessary PTE updates or page faults in this scenario.
>>
>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
>> ---
>> kernel/sched/fair.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c798d27952431..ec4749a7be33a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3315,6 +3315,13 @@ static void task_numa_work(struct
>> callback_head *work)
>> if (p->flags & PF_EXITING)
>> return;
>> + /*
>> + * Memory is pinned to only one NUMA node via cpuset.mems,
>> naturally
>> + * no page can be migrated.
>> + */
>> + if (nodes_weight(cpuset_current_mems_allowed) == 1)
>> + return;
>> +
>> if (!mm->numa_next_scan) {
>> mm->numa_next_scan = now +
>> msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>
> Do you have any performance improvement data that can be included in
> the commit log?
>
Yes, will put out some numbers in v2.
Thanks,
Libo
> Cheers,
> Longman
>