v1->v2:
1. add perf improvement numbers in commit log. Yet to find a perf diff on
will-it-scale, so not included here. Plan to run more workloads.
2. add tracepoint.
3. To peterz's comment: this will make it impossible to attract tasks to
that memory, just like other VMA skippings. This is the current
implementation; I think we can improve it in the future, but at the
moment it's probably better to keep it consistent.
v2->v3:
1. add enable_cpuset() based on Mel's suggestion but again I think it's
redundant.
2. print out nodemask with %*p.. format in the tracepoint.
v3->v4:
1. fix an unsafe dereference of a pointer to content not on the ring
buffer, namely mem_allowed_ptr in the tracepoint.
v4->v5:
1. add BUILD_BUG_ON() in TP_fast_assign() to guard against future
changes (particularly in size) in nodemask_t.
Libo Chen (2):
sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems
sched/numa: Add tracepoint that tracks the skipping of numa balancing
due to cpuset memory pinning
include/trace/events/sched.h | 33 +++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 9 +++++++++
2 files changed, 42 insertions(+)
--
2.43.5
On 24/04/25 8:15 am, Libo Chen wrote:
> [ cover letter snipped ]

Tested the above patch on top of next-20250424 and it fixes the boot
warning on IBM Power server. Hence,

Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>

Regards,
Venkat.
On 4/24/2025 8:15 AM, Libo Chen wrote:
> [ cover letter snipped ]

Tested on top of next-20250424. The boot warning[1] is fixed with this
version.

Tested-by: Srikanth Aithal <sraithal@amd.com>

[1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/
On 24/04/25 8:15 am, Libo Chen wrote:
> [ cover letter snipped ]

Hello Libo,

For some reason I am not able to apply this patch. I am trying to test
the boot warning[1]. I am trying to apply on top of next-20250423. Below
is the error. Am I missing anything?

[1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/

Error:

git am -i v5_20250423_libo_chen_sched_numa_skip_vma_scanning_on_memory_pinned_to_one_numa_node_via_cpuset_mems.mbx
Commit Body is:
--------------------------
sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbench that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--------------------------
Apply? [y]es/[n]o/[e]dit/[v]iew patch/[a]ccept all: a
Applying: sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
error: patch failed: kernel/sched/fair.c:3329
error: kernel/sched/fair.c: patch does not apply
Patch failed at 0001 sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems

Regards,
Venkat.
On 4/24/25 00:05, Venkat Rao Bagalkote wrote:
> [ cover letter snipped ]
>
> Hello Libo,
>
> For some reason I am not able to apply this patch. I am trying to test
> the boot warning[1]. I am trying to apply on top of next-20250423.
> Below is the error. Am I missing anything?
>
> [1]: https://lore.kernel.org/all/20250422205740.02c4893a@canb.auug.org.au/
>
> [ git am error log snipped ]

Hi Venkat,

I just did git am -i t.mbox on top of next-20250423, not sure why but
the second patch was ahead of the first patch in apply order. Have you
made sure the second patch was not applied before the first one?

- Libo
On 24/04/25 1:16 pm, Libo Chen wrote:
> [ earlier quotes snipped ]
>
> Hi Venkat,
>
> I just did git am -i t.mbox on top of next-20250423, not sure why but
> the second patch was ahead of the first patch in apply order. Have you
> made sure the second patch was not applied before the first one?
>
> - Libo

Hi Libo,

Apologies! I freshly cloned and tried, and it works now. So please
ignore my earlier mail.

Regards,
Venkat.
Hello Libo,

On 4/24/2025 8:15 AM, Libo Chen wrote:
> [ cover letter snipped ]

I tested the series with hackbench running on a dual socket system with
memory pinned to one node and I could see the skip_cpuset_numa traces
being logged:

sched-messaging-9430 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9430 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9640 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9640 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9645 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9645 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9637 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9637 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9629 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9629 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9639 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9639 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9630 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9630 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9487 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9487 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9635 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9635 tgid=9007 ngid=0 mem_nodes_allowed=0
sched-messaging-9647 ...: sched_skip_cpuset_numa: comm=sched-messaging pid=9647 tgid=9007 ngid=0 mem_nodes_allowed=0
...

Feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek