sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

[PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Libo Chen 9 months, 2 weeks ago

v1->v2:
1. add perf improvement numbers in commit log. Yet to find perf diff on
will-it-scale, so not included here. Plan to run more workloads.
2. add tracepoint.
3. To peterz's comment, this will make it impossible to attract tasks to
that memory, just like the other VMA skips. This is the current
implementation; I think we can improve that in the future, but at the
moment it's probably better to keep it consistent.

v2->v3:
1. add enable_cpuset() based on Mel's suggestion but again I think it's
redundant.
2. print out nodemask with %*p.. format in the tracepoint.

v3->v4:
1. fix an unsafe dereference of a pointer to content not on ring buffer,
namely mem_allowed_ptr in the tracepoint.

v4->v5:
1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
changes (particularly in size) in nodemask_t.
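
To illustrate the v3->v4 and v4->v5 items: a record that stores only a pointer can be read after the pointed-to data has changed or gone away, so one common fix is to copy the data into the record itself and pin the copy size with a compile-time check. The sketch below is a userspace analogy, not the code from the patch; `toy_nodemask_t`, `toy_trace_entry`, and `toy_fast_assign()` are hypothetical names, and `static_assert` merely stands in for the role `BUILD_BUG_ON()` plays in the kernel.

```c
#include <assert.h>
#include <string.h>

#define TOY_MASK_LONGS 4

/* Hypothetical stand-in for nodemask_t: a fixed-size bitmap of NUMA nodes. */
typedef struct {
    unsigned long bits[TOY_MASK_LONGS];
} toy_nodemask_t;

/* Hypothetical trace record: the mask is stored by value, not by pointer. */
struct toy_trace_entry {
    int pid;
    unsigned long mem_allowed[TOY_MASK_LONGS];
};

/* Compile-time guard: break the build if the mask type changes size. */
static_assert(sizeof(((struct toy_trace_entry *)0)->mem_allowed) ==
              sizeof(toy_nodemask_t),
              "trace entry no longer matches toy_nodemask_t");

/* Copy the mask contents into the record at assignment time. */
static void toy_fast_assign(struct toy_trace_entry *entry, int pid,
                            const toy_nodemask_t *mask)
{
    entry->pid = pid;
    memcpy(entry->mem_allowed, mask->bits, sizeof(entry->mem_allowed));
}
```

Because the record owns its copy, later changes to the source mask do not affect what a reader of the record sees.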

Libo Chen (2):
  sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
    cpuset.mems
  sched/numa: Add tracepoint that tracks the skipping of numa balancing
    due to cpuset memory pinning

 include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  9 +++++++++
 2 files changed, 42 insertions(+)

-- 
2.43.5
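
For readers without the patch at hand, the idea of the first patch can be sketched in userspace C: if cpusets are in use and the task's allowed-memory mask contains exactly one node, there is nowhere to migrate pages to, so scanning can be skipped. This is a minimal sketch under stated assumptions, not the kernel code; `toy_nodemask_t`, `toy_nodes_weight()`, and `toy_should_skip_numa_scan()` are hypothetical names, and a single 64-bit word stands in for the real node mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the node mask: one bit per NUMA node. */
typedef struct {
    uint64_t bits;
} toy_nodemask_t;

/* Count the set bits, i.e. how many nodes the task may allocate from. */
static unsigned int toy_nodes_weight(toy_nodemask_t mask)
{
    unsigned int n = 0;

    for (uint64_t b = mask.bits; b; b &= b - 1)
        n++;
    return n;
}

/*
 * Skip scanning only when cpusets are enabled and memory is pinned to
 * exactly one node; in every other case scanning proceeds as usual.
 */
static bool toy_should_skip_numa_scan(bool cpusets_enabled,
                                      toy_nodemask_t mems_allowed)
{
    return cpusets_enabled && toy_nodes_weight(mems_allowed) == 1;
}
```

A caller would test this once at the top of the scan path and return early when it is true, which is where the saved PTE updates and hinting faults would come from.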

Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Venkat Rao Bagalkote 9 months, 2 weeks ago

On 24/04/25 8:15 am, Libo Chen wrote:
> v1->v2:
> 1. add perf improvement numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> that memory, just like the other VMA skips. This is the current
> implementation; I think we can improve that in the future, but at the
> moment it's probably better to keep it consistent.
>
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
>
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
>
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
>
> Libo Chen (2):
>    sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>      cpuset.mems
>    sched/numa: Add tracepoint that tracks the skipping of numa balancing
>      due to cpuset memory pinning
>
>   include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>   kernel/sched/fair.c          |  9 +++++++++
>   2 files changed, 42 insertions(+)
>
Tested the above patch on top of next-20250424, and it fixes the boot
warning on an IBM Power server. Hence,


Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>


Regards,

Venkat.

Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Aithal, Srikanth 9 months, 2 weeks ago

On 4/24/2025 8:15 AM, Libo Chen wrote:
> v1->v2:
> 1. add perf improvement numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> that memory, just like the other VMA skips. This is the current
> implementation; I think we can improve that in the future, but at the
> moment it's probably better to keep it consistent.
> 
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
> 
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
> 
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
> 
> Libo Chen (2):
>    sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>      cpuset.mems
>    sched/numa: Add tracepoint that tracks the skipping of numa balancing
>      due to cpuset memory pinning
> 
>   include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>   kernel/sched/fair.c          |  9 +++++++++
>   2 files changed, 42 insertions(+)
> 

Tested on top of next-20250424. The boot warning[1] is fixed with this 
version.

Tested-by: Srikanth Aithal <sraithal@amd.com>


[1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/

Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Venkat Rao Bagalkote 9 months, 2 weeks ago

On 24/04/25 8:15 am, Libo Chen wrote:
> v1->v2:
> 1. add perf improvement numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> that memory, just like the other VMA skips. This is the current
> implementation; I think we can improve that in the future, but at the
> moment it's probably better to keep it consistent.
>
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
>
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
>
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
>
> Libo Chen (2):
>    sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>      cpuset.mems
>    sched/numa: Add tracepoint that tracks the skipping of numa balancing
>      due to cpuset memory pinning
>
>   include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>   kernel/sched/fair.c          |  9 +++++++++
>   2 files changed, 42 insertions(+)
>
Hello Libo,


For some reason I am not able to apply this patch. I am trying to test 
the boot warning[1].

I am trying to apply on top of next-20250423. Below is the error. Am I 
missing anything?

[1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/

Error:

git am -i v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
Commit Body is:
--------------------------
sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbench that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--------------------------
Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
error: patch failed: kernel/sched/fair.c:3329
error: kernel/sched/fair.c: patch does not apply
Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems


Regards,

Venkat.

Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Libo Chen 9 months, 2 weeks ago


On 4/24/25 00:05, Venkat Rao Bagalkote wrote:
> 
> On 24/04/25 8:15 am, Libo Chen wrote:
>> v1->v2:
>> 1. add perf improvement numbers in commit log. Yet to find perf diff on
>> will-it-scale, so not included here. Plan to run more workloads.
>> 2. add tracepoint.
>> 3. To peterz's comment, this will make it impossible to attract tasks to
>> that memory, just like the other VMA skips. This is the current
>> implementation; I think we can improve that in the future, but at the
>> moment it's probably better to keep it consistent.
>>
>> v2->v3:
>> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
>> redundant.
>> 2. print out nodemask with %*p.. format in the tracepoint.
>>
>> v3->v4:
>> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
>> namely mem_allowed_ptr in the tracepoint.
>>
>> v4->v5:
>> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
>> changes (particularly in size) in nodemask_t.
>>
>> Libo Chen (2):
>>    sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>>      cpuset.mems
>>    sched/numa: Add tracepoint that tracks the skipping of numa balancing
>>      due to cpuset memory pinning
>>
>>   include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>>   kernel/sched/fair.c          |  9 +++++++++
>>   2 files changed, 42 insertions(+)
>>
> Hello Libo,
> 
> 
> For some reason I am not able to apply this patch. I am trying to test the boot warning[1].
> 
> I am trying to apply on top of next-20250423. Below is the error. Am I missing anything?
> 
> [1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/
> Error:
> 
> git am -i v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
> Commit Body is:
> --------------------------
> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
> 
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
> 
> We have seen up to a 6x improvement on a typical java workload running on
> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
> platform, we have seen a 20% improvement in a microbench that creates a
> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
> pages in a fixed number of loops.
> 
> Signed-off-by: Libo Chen <libo.chen@oracle.com>
> Tested-by: Chen Yu <yu.c.chen@intel.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> --------------------------
> Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
> Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
> error: patch failed: kernel/sched/fair.c:3329
> error: kernel/sched/fair.c: patch does not apply
> Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
> 
> 
Hi Venkat,

I just ran git am -i t.mbox on top of next-20250423. I'm not sure why, but the second patch came ahead of
the first one in the apply order. Can you check that the second patch was not applied before the first one?

- Libo
> Regards,
> 
> Venkat.
> 
> 
>

Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by Venkat Rao Bagalkote 9 months, 2 weeks ago

On 24/04/25 1:16 pm, Libo Chen wrote:
>
> On 4/24/25 00:05, Venkat Rao Bagalkote wrote:
>> On 24/04/25 8:15 am, Libo Chen wrote:
>>> v1->v2:
>>> 1. add perf improvement numbers in commit log. Yet to find perf diff on
>>> will-it-scale, so not included here. Plan to run more workloads.
>>> 2. add tracepoint.
>>> 3. To peterz's comment, this will make it impossible to attract tasks to
>>> that memory, just like the other VMA skips. This is the current
>>> implementation; I think we can improve that in the future, but at the
>>> moment it's probably better to keep it consistent.
>>>
>>> v2->v3:
>>> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
>>> redundant.
>>> 2. print out nodemask with %*p.. format in the tracepoint.
>>>
>>> v3->v4:
>>> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
>>> namely mem_allowed_ptr in the tracepoint.
>>>
>>> v4->v5:
>>> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
>>> changes (particularly in size) in nodemask_t.
>>>
>>> Libo Chen (2):
>>>     sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>>>       cpuset.mems
>>>     sched/numa: Add tracepoint that tracks the skipping of numa balancing
>>>       due to cpuset memory pinning
>>>
>>>    include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>>>    kernel/sched/fair.c          |  9 +++++++++
>>>    2 files changed, 42 insertions(+)
>>>
>> Hello Libo,
>>
>>
>> For some reason I am not able to apply this patch. I am trying to test the boot warning[1].
>>
>> I am trying to apply on top of next-20250423. Below is the error. Am I missing anything?
>>
>> [1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/ 
>> Error:
>>
>> git am -i v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
>> Commit Body is:
>> --------------------------
>> sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>>
>> When the memory of the current task is pinned to one NUMA node by cgroup,
>> there is no point in continuing the rest of VMA scanning and hinting page
>> faults as they will just be overhead. With this change, there will be no
>> more unnecessary PTE updates or page faults in this scenario.
>>
>> We have seen up to a 6x improvement on a typical java workload running on
>> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
>> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
>> platform, we have seen a 20% improvement in a microbench that creates a
>> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
>> pages in a fixed number of loops.
>>
>> Signed-off-by: Libo Chen <libo.chen@oracle.com>
>> Tested-by: Chen Yu <yu.c.chen@intel.com>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> --------------------------
>> Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
>> Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>> error: patch failed: kernel/sched/fair.c:3329
>> error: kernel/sched/fair.c: patch does not apply
>> Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
>>
>>
> Hi Venkat,
>
> I just did git am -i t.mbox on top of next-20250423, not sure why but the second patch was ahead of the
> first patch in apply order, have you made sure the second patch was not applied before the first one?
>
> - Libo


Hi Libo,

Apologies!

I tried again with a fresh clone and it worked. So, please ignore my
earlier mail.


Regards,

Venkat.

>> Regards,
>>
>> Venkat.
>>
>>
>>
>

Re: [PATCH v5 0/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Posted by K Prateek Nayak 9 months, 2 weeks ago

Hello Libo,

On 4/24/2025 8:15 AM, Libo Chen wrote:
> v1->v2:
> 1. add perf improvement numbers in commit log. Yet to find perf diff on
> will-it-scale, so not included here. Plan to run more workloads.
> 2. add tracepoint.
> 3. To peterz's comment, this will make it impossible to attract tasks to
> that memory, just like the other VMA skips. This is the current
> implementation; I think we can improve that in the future, but at the
> moment it's probably better to keep it consistent.

I tested the series with hackbench running on a dual socket system with
memory pinned to one node and I could see the skip_cpuset_numa traces
being logged:

  sched-messaging-9430    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9430 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9640    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9640 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9645    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9645 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9637    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9637 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9629    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9629 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9639    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9639 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9630    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9630 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9487    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9487 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9635    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9635 tgid=9007 ngid=0 mem_nodes_allowed=0
  sched-messaging-9647    ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9647 tgid=9007 ngid=0 mem_nodes_allowed=0
  ...
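
For anyone who wants to post-process lines like these, the sketch below pulls the fields out of one trace line with `sscanf()`. It assumes the field layout is exactly as shown in the output above and that `comm` contains no spaces; `struct skip_event` and `parse_skip_event()` are hypothetical names, not part of any kernel or tracing tool. Reading `mem_nodes_allowed=0` as "node 0 only" is also an assumption based on the pinning described above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the fields of one sched_skip_cpuset_numa line. */
struct skip_event {
    char comm[32];
    int pid;
    int tgid;
    int ngid;
    char mem_nodes_allowed[64];   /* kept as text, e.g. "0" */
};

/* Return true and fill *ev if the line is a sched_skip_cpuset_numa event. */
static bool parse_skip_event(const char *line, struct skip_event *ev)
{
    static const char tag[] = "sched_skip_cpuset_numa: ";
    const char *p = strstr(line, tag);

    if (!p)
        return false;
    p += strlen(tag);

    /* All five fields must be present for the line to count as parsed. */
    return sscanf(p, "comm=%31s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%63s",
                  ev->comm, &ev->pid, &ev->tgid, &ev->ngid,
                  ev->mem_nodes_allowed) == 5;
}
```

Counting parsed events per tgid would be one simple way to confirm that every thread of a pinned workload is hitting the skip path.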

Feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek

> 
> v2->v3:
> 1. add enable_cpuset() based on Mel's suggestion but again I think it's
> redundant.
> 2. print out nodemask with %*p.. format in the tracepoint.
> 
> v3->v4:
> 1. fix an unsafe dereference of a pointer to content not on ring buffer,
> namely mem_allowed_ptr in the tracepoint.
> 
> v4->v5:
> 1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
> changes (particularly in size) in nodemask_t.
> 
> Libo Chen (2):
>    sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
>      cpuset.mems
>    sched/numa: Add tracepoint that tracks the skipping of numa balancing
>      due to cpuset memory pinning
> 
>   include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
>   kernel/sched/fair.c          |  9 +++++++++
>   2 files changed, 42 insertions(+)
>