[PATCH v2 1/2] sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Libo Chen 8 months, 3 weeks ago
When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing VMA scanning and taking NUMA hinting page
faults, as they are pure overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-core-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates a
30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e5807..6f405e00c9c7e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	/*
+	 * Memory is pinned to a single NUMA node via cpuset.mems, so
+	 * no page can be migrated and scanning is pure overhead.
+	 */
+	if (nodes_weight(cpuset_current_mems_allowed) == 1)
+		return;
+
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-- 
2.43.5
Re: [PATCH v2 1/2] sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Posted by Mel Gorman 8 months, 2 weeks ago
On Wed, Mar 26, 2025 at 05:23:51PM -0700, Libo Chen wrote:
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing VMA scanning and taking NUMA hinting page
> faults, as they are pure overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
> 
> We have seen up to a 6x improvement on a typical Java workload running on
> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
> AARCH64 system. With the same pinning, on an 18-core-per-socket Intel
> platform, we have seen a 20% improvement in a microbenchmark that creates a
> 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
> pages in a fixed number of loops.
> 
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e5807..6f405e00c9c7e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>  	if (p->flags & PF_EXITING)
>  		return;
>  
> +	/*
> +	 * Memory is pinned to a single NUMA node via cpuset.mems, so
> +	 * no page can be migrated and scanning is pure overhead.
> +	 */
> +	if (nodes_weight(cpuset_current_mems_allowed) == 1)
> +		return;
> +

Check cpusets_enabled() first?

-- 
Mel Gorman
SUSE Labs
Re: [PATCH v2 1/2] sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Posted by Libo Chen 8 months, 2 weeks ago

On 4/1/25 06:23, Mel Gorman wrote:
> On Wed, Mar 26, 2025 at 05:23:51PM -0700, Libo Chen wrote:
>> When the memory of the current task is pinned to one NUMA node by cgroup,
>> there is no point in continuing VMA scanning and taking NUMA hinting page
>> faults, as they are pure overhead. With this change, there will be no
>> more unnecessary PTE updates or page faults in this scenario.
>>
>> We have seen up to a 6x improvement on a typical Java workload running on
>> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
>> AARCH64 system. With the same pinning, on an 18-core-per-socket Intel
>> platform, we have seen a 20% improvement in a microbenchmark that creates a
>> 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
>> pages in a fixed number of loops.
>>
>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
>> ---
>>  kernel/sched/fair.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e43993a4e5807..6f405e00c9c7e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>>  	if (p->flags & PF_EXITING)
>>  		return;
>>  
>> +	/*
>> +	 * Memory is pinned to a single NUMA node via cpuset.mems, so
>> +	 * no page can be migrated and scanning is pure overhead.
>> +	 */
>> +	if (nodes_weight(cpuset_current_mems_allowed) == 1)
>> +		return;
>> +
> 
> Check cpusets_enabled() first?
> 

Hi Mel,

Yeah, I can add that, but isn't it a bit redundant? When !cpusets_enabled(), nodes_weight(cpuset_current_mems_allowed) just returns the number of possible nodes, which won't be 1 when there are >= 2 NUMA nodes.


Libo