[PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Libo Chen posted 2 patches 9 months, 3 weeks ago
[PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Posted by Libo Chen 9 months, 3 weeks ago
When the memory of the current task is pinned to one NUMA node by a cgroup,
there is no point in continuing the rest of the VMA scanning and NUMA
hinting page faults, as they will just be overhead. With this change, there
will be no more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical Java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbenchmark that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e5807..c9903b1b39487 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	/*
+	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
+	 * no page can be migrated.
+	 */
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+		return;
+
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
-- 
2.43.5
Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Posted by Chen, Yu C 9 months, 3 weeks ago
Hi Libo,

On 4/18/2025 3:15 AM, Libo Chen wrote:
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
> 
> We have seen up to a 6x improvement on a typical Java workload running on
> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
> platform, we have seen a 20% improvement in a microbenchmark that creates a
> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
> pages in a fixed number of loops.
> 
> Signed-off-by: Libo Chen <libo.chen@oracle.com>

I think this is a promising change: it lets us perform fine-grained NUMA
balancing control on a per-cgroup basis rather than system-wide NUMA
balancing for every task, which is costly.

> ---
>   kernel/sched/fair.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e5807..c9903b1b39487 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>   	if (p->flags & PF_EXITING)
>   		return;
>   
> +	/*
> +	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
> +	 * no page can be migrated.
> +	 */
> +	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
> +		return;
> +

I found that you had a proposal in v1 to address Peter's concern[1]:
allow the task to be migrated to its preferred node, even if the task's
memory policy is restricted to one node. In your previous proposal, the
NUMA balancing scan is skipped only if the task's cpumask is bound to the
same node as its memory policy node, because a cgroup usually binds its
tasks and its memory allocation policy to the same node. Not sure if that
could be turned into:

If the CPU mask of the task's memory policy node is a subset of the task's
cpumask, the NUMA balancing scan is allowed.

For example,
Suppose p's memory is only allocated on node0, which contains CPU2 and CPU3.
1. If p's CPU affinity is CPU0, CPU1, there is no need to do the NUMA
balancing scan, because node0's CPUs (CPU2, CPU3) are not in p's allowed
cpumask, so p can never run on the node its memory is on.
2. If p's CPU affinity is CPU3, there is no need to do the NUMA balancing
scan either; p already runs on its preferred node.
3. But if p's CPU affinity is CPU2, CPU3, CPU6, the NUMA balancing scan
should be allowed, because it is possible to migrate p from CPU6 to
either CPU2 or CPU3.

What I'm thinking of is something as follows (untested):
if (cpusets_enabled() &&
    nodes_weight(cpuset_current_mems_allowed) == 1 &&
    !cpumask_subset(cpumask_of_node(first_node(cpuset_current_mems_allowed)),
		    p->cpus_ptr))
	return;


I tested your patch on top of the latest sched/core, binding the task's
CPU affinity to Node1 and its memory allocation to Node1:
echo "8-15" > /sys/fs/cgroup/mytest/cpuset.cpus
echo "1" > /sys/fs/cgroup/mytest/cpuset.mems
cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor --config config-numa skip_scan

And it works as expected:
# bpftrace numa_trace.bt

@sched_skip_cpuset_numa: 133
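
numa_trace.bt here is essentially just a counter on the new tracepoint; a
minimal sketch (assuming the event from patch 2/2 shows up as
sched:sched_skip_cpuset_numa, as the map name above suggests) would be:

tracepoint:sched:sched_skip_cpuset_numa
{
	/* count how often task_numa_work() bails out due to cpuset.mems pinning */
	@sched_skip_cpuset_numa = count();
}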


thanks,
Chenyu

[1] https://lore.kernel.org/lkml/cde7af54-5481-499e-8a42-0111f555f2b1@oracle.com/


>   	if (!mm->numa_next_scan) {
>   		mm->numa_next_scan = now +
>   			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Posted by Libo Chen 9 months, 3 weeks ago
Hi Yu

On 4/19/25 04:16, Chen, Yu C wrote:
> Hi Libo,
> 
> On 4/18/2025 3:15 AM, Libo Chen wrote:
>> When the memory of the current task is pinned to one NUMA node by cgroup,
>> there is no point in continuing the rest of VMA scanning and hinting page
>> faults as they will just be overhead. With this change, there will be no
>> more unnecessary PTE updates or page faults in this scenario.
>>
>> We have seen up to a 6x improvement on a typical Java workload running on
>> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
>> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
>> platform, we have seen a 20% improvement in a microbenchmark that creates a
>> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
>> pages in a fixed number of loops.
>>
>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> 
> I think this is a promising change: it lets us perform fine-grained NUMA
> balancing control on a per-cgroup basis rather than system-wide NUMA
> balancing for every task, which is costly.
> 

Yes indeed, the cost, from what we have seen, can be quite astonishing.

>> ---
>>   kernel/sched/fair.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e43993a4e5807..c9903b1b39487 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>>       if (p->flags & PF_EXITING)
>>           return;
>>   +    /*
>> +     * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>> +     * no page can be migrated.
>> +     */
>> +    if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
>> +        return;
>> +
> 
> I found that you had a proposal in v1 to address Peter's concern[1]:
> allow the task to be migrated to its preferred node, even if the task's
> memory policy is restricted to one node. In your previous proposal, the NUMA balancing scan is skipped only if the task's cpumask is bound to the same node as its memory policy node, because a cgroup usually binds its tasks and its memory allocation policy to the same node. Not sure if that could be turned into:
> 
> If the CPU mask of the task's memory policy node is a subset of the task's cpumask, the NUMA balancing scan is allowed.
> 

I guess the fundamental question is: is this really worth it? Do the benefits of NUMA task migrations alone outweigh the overheads of VMA scanning, PTE updates, page faults, etc.? I suppose this is workload-dependent, but what does the best-case scenario look like? I think we probably need more data. Also, if we do that, we would need to do the same for the other VMA-skipping scenarios.

Thanks,
Libo 

> For example,
> Suppose p's memory is only allocated on node0, which contains CPU2, CPU3.
> 1. If p's CPU affinity is CPU0, CPU1, there is no need to do the NUMA balancing scan, because node0's CPUs (CPU2, CPU3) are not in p's allowed cpumask, so p can never run on the node its memory is on.
> 2. If p's CPU affinity is CPU3, there is no need to do the NUMA balancing scan either; p already runs on its preferred node.
> 3. But if p's CPU affinity is CPU2, CPU3, CPU6, the NUMA balancing scan should be allowed, because it is possible to migrate p from CPU6 to either CPU2 or CPU3.
> 
> What I'm thinking of is something as follows (untested):
> if (cpusets_enabled() &&
>     nodes_weight(cpuset_current_mems_allowed) == 1 &&
>     !cpumask_subset(cpumask_of_node(first_node(cpuset_current_mems_allowed)),
> 		    p->cpus_ptr))
> 	return;
> 
> 
> I tested your patch on top of the latest sched/core,
> binding task CPU affinity to Node1 and memory allocation node on
> Node1:
> echo "8-15" > /sys/fs/cgroup/mytest/cpuset.cpus
> echo "1" > /sys/fs/cgroup/mytest/cpuset.mems
> cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor --config config-numa skip_scan
> 
> And it works as expected:
> # bpftrace numa_trace.bt
> 
> @sched_skip_cpuset_numa: 133
> 
> 
> thanks,
> Chenyu
> 
> [1] https://lore.kernel.org/lkml/cde7af54-5481-499e-8a42-0111f555f2b1@oracle.com/
> 
>>       if (!mm->numa_next_scan) {
>>           mm->numa_next_scan = now +
>>               msecs_to_jiffies(sysctl_numa_balancing_scan_delay);

Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Posted by Chen, Yu C 9 months, 3 weeks ago
On 4/23/2025 6:20 AM, Libo Chen wrote:
> Hi Yu
> 
> On 4/19/25 04:16, Chen, Yu C wrote:
>> Hi Libo,
>>
>> On 4/18/2025 3:15 AM, Libo Chen wrote:
>>> When the memory of the current task is pinned to one NUMA node by cgroup,
>>> there is no point in continuing the rest of VMA scanning and hinting page
>>> faults as they will just be overhead. With this change, there will be no
>>> more unnecessary PTE updates or page faults in this scenario.
>>>
>>> We have seen up to a 6x improvement on a typical Java workload running on
>>> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
>>> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
>>> platform, we have seen a 20% improvement in a microbenchmark that creates a
>>> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
>>> pages in a fixed number of loops.
>>>
>>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
>>
>> I think this is a promising change: it lets us perform fine-grained NUMA
>> balancing control on a per-cgroup basis rather than system-wide NUMA
>> balancing for every task, which is costly.
>>
> 
> Yes indeed, the cost, from what we have seen, can be quite astonishing.
> 
>>> ---
>>>    kernel/sched/fair.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e43993a4e5807..c9903b1b39487 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>>>        if (p->flags & PF_EXITING)
>>>            return;
>>>    +    /*
>>> +     * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>>> +     * no page can be migrated.
>>> +     */
>>> +    if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
>>> +        return;
>>> +
>>
>> I found that you had a proposal in v1 to address Peter's concern[1]:
>> allow the task to be migrated to its preferred node, even if the task's
>> memory policy is restricted to one node. In your previous proposal, the NUMA balancing scan is skipped only if the task's cpumask is bound to the same node as its memory policy node, because a cgroup usually binds its tasks and its memory allocation policy to the same node. Not sure if that could be turned into:
>>
>> If the CPU mask of the task's memory policy node is a subset of the task's cpumask, the NUMA balancing scan is allowed.
>>
> 
> I guess the fundamental question is: is this really worth it? Do the benefits of NUMA task migrations alone outweigh the overheads of VMA scanning, PTE updates, page faults, etc.? I suppose this is workload-dependent, but what does the best-case scenario look like? I think we probably need more data. Also, if we do that, we would need to do the same for the other VMA-skipping scenarios.
> 

Overall, that can be future work. I agree that for now this patch is
simple enough, so feel free to add:

Tested-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu