On systems with NUMA balancing enabled, it has been found
that tracking task activities resulting from NUMA balancing
is beneficial. NUMA balancing employs two mechanisms for task
migration: one is to migrate a task to an idle CPU within its
preferred node, and the other is to swap tasks located on
different nodes when they are on each other's preferred nodes.

The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
it lacks statistics regarding task migration and swapping.
Therefore, relevant counts for task migration and swapping should
be added.

The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
and /proc/vmstat

Introducing both per-task and per-memory cgroup (memcg) NUMA
balancing statistics facilitates a rapid evaluation of the
performance and resource utilization of the target workload.
For instance, users can first identify the container with high
NUMA balancing activity and then further pinpoint a specific
task within that group, and subsequently adjust the memory policy
for that task. In short, although it is possible to iterate through
/proc/$pid/sched to locate the problematic task, the introduction
of aggregated NUMA balancing activity for tasks within each memcg
can assist users in identifying the task more efficiently through
a divide-and-conquer approach.

As Libo Chen pointed out, the memcg event relies on the text
names in vmstat_text, and /proc/vmstat generates corresponding items
based on vmstat_text. Thus, the relevant task migration and swapping
events introduced in vmstat_text also need to be populated by
count_vm_numa_event(), otherwise these values are zero in
/proc/vmstat.
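
For illustration only (not part of this patch), a minimal user-space sketch
that picks the two new counters out of /proc/vmstat could look as follows;
the same field names appear in memory.stat, so only the path would change:

/* Illustrative sketch, not part of this patch: print the two new NUMA
 * balancing counters from /proc/vmstat. The same field names appear in
 * /sys/fs/cgroup/{GROUP}/memory.stat.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *fp = fopen("/proc/vmstat", "r");

	if (!fp) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		if (!strncmp(line, "numa_task_migrated ", 19) ||
		    !strncmp(line, "numa_task_swapped ", 18))
			fputs(line, stdout);
	}
	fclose(fp);
	return 0;
}
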
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
v4->v5:
no change.
v3->v4:
Populate the /proc/vmstat otherwise the items are all zero.
(Libo)
v2->v3:
Remove unnecessary p->mm check because kernel threads are
not supported by Numa Balancing. (Libo Chen)
v1->v2:
Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
---
Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
include/linux/sched.h | 4 ++++
include/linux/vm_event_item.h | 2 ++
kernel/sched/core.c | 9 +++++++--
kernel/sched/debug.c | 4 ++++
mm/memcontrol.c | 2 ++
mm/vmstat.c | 2 ++
7 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 1a16ce68a4d7..d346f3235945 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1670,6 +1670,12 @@ The following nested keys are defined.
numa_hint_faults (npn)
Number of NUMA hinting faults.
+ numa_task_migrated (npn)
+ Number of task migrations by NUMA balancing.
+
+ numa_task_swapped (npn)
+ Number of task swaps by NUMA balancing.
+
pgdemote_kswapd
Number of pages demoted by kswapd.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..1c50e30b5c01 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -549,6 +549,10 @@ struct sched_statistics {
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
u64 nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+ u64 numa_task_migrated;
+ u64 numa_task_swapped;
+#endif
u64 nr_wakeups;
u64 nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..91a3ce9a2687 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
NUMA_PAGE_MIGRATE,
+ NUMA_TASK_MIGRATE,
+ NUMA_TASK_SWAP,
#endif
#ifdef CONFIG_MIGRATION
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba0..62b033199e9c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3352,6 +3352,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
#ifdef CONFIG_NUMA_BALANCING
static void __migrate_swap_task(struct task_struct *p, int cpu)
{
+ __schedstat_inc(p->stats.numa_task_swapped);
+ count_vm_numa_event(NUMA_TASK_SWAP);
+ count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
+
if (task_on_rq_queued(p)) {
struct rq *src_rq, *dst_rq;
struct rq_flags srf, drf;
@@ -7953,8 +7957,9 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
return -EINVAL;
- /* TODO: This is not properly updating schedstats */
-
+ __schedstat_inc(p->stats.numa_task_migrated);
+ count_vm_numa_event(NUMA_TASK_MIGRATE);
+ count_memcg_event_mm(p->mm, NUMA_TASK_MIGRATE);
trace_sched_move_numa(p, curr_cpu, target_cpu);
return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..f971c2af7912 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+ P_SCHEDSTAT(numa_task_migrated);
+ P_SCHEDSTAT(numa_task_swapped);
+#endif
P_SCHEDSTAT(nr_wakeups);
P_SCHEDSTAT(nr_wakeups_sync);
P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf5..cdaab8a957f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
NUMA_PAGE_MIGRATE,
NUMA_PTE_UPDATES,
NUMA_HINT_FAULTS,
+ NUMA_TASK_MIGRATE,
+ NUMA_TASK_SWAP,
#endif
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4c268ce39ff2..ed08bb384ae4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
"numa_hint_faults",
"numa_hint_faults_local",
"numa_pages_migrated",
+ "numa_task_migrated",
+ "numa_task_swapped",
#endif
#ifdef CONFIG_MIGRATION
"pgmigrate_success",
--
2.25.1
On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
> On systems with NUMA balancing enabled, it has been found
> that tracking task activities resulting from NUMA balancing
> is beneficial. NUMA balancing employs two mechanisms for task
> migration: one is to migrate a task to an idle CPU within its
> preferred node, and the other is to swap tasks located on
> different nodes when they are on each other's preferred nodes.
>
> The kernel already provides NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
> it lacks statistics regarding task migration and swapping.
> Therefore, relevant counts for task migration and swapping should
> be added.
>
> The following two new fields:
>
> numa_task_migrated
> numa_task_swapped
>
> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
> and /proc/vmstat
Hmm these are scheduler events, how are these relevant to memory cgroup
or vmstat? Any reason to not expose these in cpu.stat?
On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> Hmm these are scheduler events, how are these relevant to memory cgroup
> or vmstat? Any reason to not expose these in cpu.stat?

Good point. If I take it further -- this functionality needs neither
memory controller (CONFIG_MEMCG) nor CPU controller (CONFIG_CGROUP_SCHED),
so it might be technically calculated and exposed in _any_ cgroup (which
would be same technical solution how cpu time is counted in cpu.stat
regardless of CPU controller, cpu_stat_show()).

Michal
On 5/26/2025 9:35 PM, Michal Koutný wrote:
> On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> Hmm these are scheduler events, how are these relevant to memory cgroup
>> or vmstat? Any reason to not expose these in cpu.stat?
>
> Good point. If I take it further -- this functionality needs neither
> memory controller (CONFIG_MEMCG) nor CPU controller
> (CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
> in _any_ cgroup (which would be same technical solution how cpu time is
> counted in cpu.stat regardless of CPU controller, cpu_stat_show()).
>
Yes, we can add it to cpu.stat. However, this might make it more difficult
for users to locate related events. Some statistics about NUMA page
migrations/faults are recorded in memory.stat, while others about NUMA task
migrations (triggered periodically by NUMA faults) are stored in cpu.stat.
Do you recommend extending the struct cgroup_base_stat to include counters
for task_migrate/task_swap? Additionally, should we enhance
cgroup_base_stat_cputime_show() to parse task_migrate/task_swap in a manner
similar to cputime?
Alternatively, as Shakeel previously mentioned, could we reuse
"count_memcg_event_mm()" and related infrastructure while exposing these
statistics/events in cpu.stat? I assume Shakeel was referring to the
following
approach:
1. Skip task migration/swap in memory.stat:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdaab8a957f3..b8eea3eca46f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1529,6 +1529,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
if (memcg_vm_event_stat[i] == PGPGIN ||
memcg_vm_event_stat[i] == PGPGOUT)
continue;
+#endif
+#ifdef CONFIG_NUMA_BALANCING
+ if (memcg_vm_event_stat[i] == NUMA_TASK_MIGRATE ||
+ memcg_vm_event_stat[i] == NUMA_TASK_SWAP)
+ continue;
#endif
2.Skip task migration/swap in /proc/vmstat
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ed08bb384ae4..ea8a8ae1cdac 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1912,6 +1912,10 @@ static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
(*pos)++;
if (*pos >= NR_VMSTAT_ITEMS)
return NULL;
+#ifdef CONFIG_NUMA_BALANCING
+ if (*pos == NUMA_TASK_MIGRATE || *pos == NUMA_TASK_SWAP)
+ return NULL;
+#endif
3. Display task migration/swap events in cpu.stat:
	seq_buf_printf(&s, "%s %lu\n",
+		       vm_event_name(memcg_vm_event_stat[NUMA_TASK_MIGRATE]),
+		       memcg_events(memcg, memcg_vm_event_stat[NUMA_TASK_MIGRATE]));
It looks like more code is needed. Michal, Shakeel, could you please advise
which strategy is preferred, or should we keep the current version?
Thanks,
Chenyu
On Tue, May 27, 2025 at 05:20:54PM +0800, Chen, Yu C wrote:
> On 5/26/2025 9:35 PM, Michal Koutný wrote:
> > On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Hmm these are scheduler events, how are these relevant to memory cgroup
> > > or vmstat? Any reason to not expose these in cpu.stat?
> >
> > Good point. If I take it further -- this functionality needs neither
> > memory controller (CONFIG_MEMCG) nor CPU controller
> > (CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
> > in _any_ cgroup (which would be same technical solution how cpu time is
> > counted in cpu.stat regardless of CPU controller, cpu_stat_show()).
>
> [...]
>
> 3. Display task migration/swap events in cpu.stat:
> 	seq_buf_printf(&s, "%s %lu\n",
> +		       vm_event_name(memcg_vm_event_stat[NUMA_TASK_MIGRATE]),
> +		       memcg_events(memcg, memcg_vm_event_stat[NUMA_TASK_MIGRATE]));

You would need to use memcg_events() and you will need to flush the
memcg rstat trees as well.

> It looks like more code is needed. Michal, Shakeel, could you please advise
> which strategy is preferred, or should we keep the current version?

I am now more inclined to keep these new stats in memory.stat as the
current version is doing because:

1. Relevant stats are exposed through the same interface and we already
have numa balancing stats in memory.stat.

2. There is no single good home for these new stats and exposing them in
cpu.stat would require more code and even if we reuse memcg infra, we
would still need to flush the memcg stats, so why not just expose in
the memory.stat.

3. Though a bit far fetched, I think we may add more stats which sit at
the boundary of sched and mm in future. Numa balancing is one
concrete example of such stats. I am envisioning for reliable memory
reclaim or overcommit, there might be some useful events as well.
Anyways it is still unbaked atm.

Michal, let me know your thought on this.
On Tue, May 27, 2025 at 11:15:33AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> I am now more inclined to keep these new stats in memory.stat as the
> current version is doing because:
>
> 1. Relevant stats are exposed through the same interface and we already
> have numa balancing stats in memory.stat.
>
> 2. There is no single good home for these new stats and exposing them in
> cpu.stat would require more code and even if we reuse memcg infra, we
> would still need to flush the memcg stats, so why not just expose in
> the memory.stat.
>
> 3. Though a bit far fetched, I think we may add more stats which sit at
> the boundary of sched and mm in future. Numa balancing is one
> concrete example of such stats. I am envisioning for reliable memory
> reclaim or overcommit, there might be some useful events as well.
> Anyways it is still unbaked atm.
>
> Michal, let me know your thought on this.

I reckon users may be little bit more likely to look that info in
memory.stat.

Which would be OK unless threaded subtrees are considered (e.g. cpuset
(NUMA affinity) has thread granularity) and these migration stats are
potentially per-thread relevant.

I was also pondering why cannot be misplaced container found by existing
NUMA stats. Chen has explained task vs page migration in NUMA balancing.
I guess mere page migration number (especially when stagnating) may not
point to the misplaced container. OK.

Second thing is what is the "misplaced" container. Is it because of
wrong set_mempolicy(2) or cpuset configuration? If it's the former (i.e.
it requires enabled cpuset controller), it'd justify exposing this info
in cpuset.stat, if it's the latter, the cgroup aggregation is not that
relevant (hence /proc/<PID>/sched) is sufficient. Or is there another
meaning of a misplaced container? Chen, could you please clarify?

Because memory controller doesn't control NUMA, it needn't be enabled
to have this statistics and it cannot be enabled in threaded groups, I'm
having some doubts whether memory.stat is a good home for this field.

Regards,
Michal
Hi Michal,

On 6/3/2025 12:53 AM, Michal Koutný wrote:
> On Tue, May 27, 2025 at 11:15:33AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> I am now more inclined to keep these new stats in memory.stat as the
>> current version is doing because:
>> [...]
>>
>> Michal, let me know your thought on this.
>
> I reckon users may be little bit more likely to look that info in
> memory.stat.
>
> Which would be OK unless threaded subtrees are considered (e.g. cpuset
> (NUMA affinity) has thread granularity) and these migration stats are
> potentially per-thread relevant.
>
> I was also pondering why cannot be misplaced container found by existing
> NUMA stats. Chen has explained task vs page migration in NUMA balancing.
> I guess mere page migration number (especially when stagnating) may not
> point to the misplaced container. OK.
>
> Second thing is what is the "misplaced" container. Is it because of
> wrong set_mempolicy(2) or cpuset configuration? If it's the former (i.e.
> it requires enabled cpuset controller), it'd justify exposing this info
> in cpuset.stat, if it's the latter, the cgroup aggregation is not that
> relevant (hence /proc/<PID>/sched) is sufficient. Or is there another
> meaning of a misplaced container? Chen, could you please clarify?

My understanding is that the "misplaced" container is not strictly tied
to set_mempolicy or cpuset configuration, but is mainly caused by the
scheduler's generic load balancer. The generic load balancer spreads
tasks across different nodes to fully utilize idle CPUs, while NUMA
balancing tries to pull misplaced tasks/pages back to honor NUMA
locality.

Regarding the threaded subtrees mode, I was previously unfamiliar with
it and have been trying to understand it better. If I understand
correctly, if threads within a single process are placed in different
cgroups via cpuset, we might need to scan /proc/<PID>/sched to collect
NUMA task migration/swap statistics. If threaded subtrees are disabled
for that process, we can query memory.stat.

I agree with your prior point that NUMA balancing task activity is not
directly associated with either the Memory controller or the CPU
controller. Although showing this data in cpu.stat might seem more
appropriate, we expose it in memory.stat due to the following
trade-offs (or as an exception for NUMA balancing):

1. It aligns with existing NUMA-related metrics already present in
memory.stat.

2. It simplifies code implementation.

thanks,
Chenyu

> Because memory controller doesn't control NUMA, it needn't be enabled
> to have this statistics and it cannot be enabled in threaded groups, I'm
> having some doubts whether memory.stat is a good home for this field.
>
> Regards,
> Michal
On Tue, Jun 03, 2025 at 10:46:06PM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
> My understanding is that the "misplaced" container is not strictly tied
> to set_mempolicy or cpuset configuration, but is mainly caused by the
> scheduler's generic load balancer.

You are convincing me with this that, cpu.stat fits the concept better.
Doesn't that sound like that to you?

> Regarding the threaded subtrees mode, I was previously unfamiliar with
> it and have been trying to understand it better.

No problem.

> If I understand correctly, if threads within a single process are
> placed in different cgroups via cpuset, we might need to scan
> /proc/<PID>/sched to collect NUMA task migration/swap statistics.

The premise of your series was that you didn't want to do that :-)

> I agree with your prior point that NUMA balancing task activity is not
> directly associated with either the Memory controller or the CPU
> controller. Although showing this data in cpu.stat might seem more
> appropriate, we expose it in memory.stat due to the following
> trade-offs (or as an exception for NUMA balancing):
>
> 1. It aligns with existing NUMA-related metrics already present in
> memory.stat.

That one I'd buy into. OTOH, I'd hope this could be overcome with
documentation.

> 2. It simplifies code implementation.

I'd say that only applies when accepting memory.stat as the better
place. I think the appropriately matching API should be picked first and
implementation is only secondary to that.

From your reasoning above, I think that the concept is closer to be in
cpu.stat ¯\_(ツ)_/¯

Michal
On 6/17/2025 5:30 PM, Michal Koutný wrote:
> On Tue, Jun 03, 2025 at 10:46:06PM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
>> My understanding is that the "misplaced" container is not strictly tied
>> to set_mempolicy or cpuset configuration, but is mainly caused by the
>> scheduler's generic load balancer.
>
> You are convincing me with this that, cpu.stat fits the concept better.
> Doesn't that sound like that to you?
>
> [...]
>
>> 2. It simplifies code implementation.
>
> I'd say that only applies when accepting memory.stat as the better
> place. I think the appropriately matching API should be picked first and
> implementation is only secondary to that.

Thanks for this guidance.

> From your reasoning above, I think that the concept is closer to be in
> cpu.stat ¯\_(ツ)_/¯

OK. Since this change has already been addressed in upstream kernel,
I can update the numa_task_migrated/numa_task_swapped fields in
Documentation/admin-guide/cgroup-v2.rst to mention that, these
activities are not memory related but put here because they are
closer to numa balance's page statistics.

Or do you want me to submit a patch to move the items from
memory.stat to cpu.stat?

thanks,
Chenyu

> Michal
On Thu, Jun 19, 2025 at 09:03:55PM +0800, "Chen, Yu C" <yu.c.chen@intel.com> wrote:
> OK. Since this change has already been addressed in upstream kernel,

Oh, I missed that. (Otherwise I wouldn't have bothered responding
anymore in this case.)

> I can update the numa_task_migrated/numa_task_swapped fields in
> Documentation/admin-guide/cgroup-v2.rst to mention that, these
> activities are not memory related but put here because they are
> closer to numa balance's page statistics.
> Or do you want me to submit a patch to move the items from
> memory.stat to cpu.stat?

I leave it up to you. (It's become sunk cost for me.)

Michal
Hi Shakeel,
On 5/24/2025 7:42 AM, Shakeel Butt wrote:
> On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
>> On systems with NUMA balancing enabled, it has been found
>> that tracking task activities resulting from NUMA balancing
>> is beneficial. NUMA balancing employs two mechanisms for task
>> migration: one is to migrate a task to an idle CPU within its
>> preferred node, and the other is to swap tasks located on
>> different nodes when they are on each other's preferred nodes.
>>
>> The kernel already provides NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
>> it lacks statistics regarding task migration and swapping.
>> Therefore, relevant counts for task migration and swapping should
>> be added.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
>> and /proc/vmstat
>
> Hmm these are scheduler events, how are these relevant to memory cgroup
> or vmstat?
> Any reason to not expose these in cpu.stat?
>
I understand that in theory they are scheduling activities.
The reason for including these statistics here was mainly that
I assumed there is a close relationship between page migration
and task migration in Numa Balance. Specifically, task migration
is triggered when page migration fails.
Placing these statistics closer to the existing Numa Balance page
statistics in /sys/fs/cgroup/{GROUP}/memory.stat and /proc/vmstat
may help users query relevant data from a single file, avoiding
the need to search through scattered files.
Notably, these events are associated with a task’s working set
(footprint) rather than pure CPU cycles IMO. I took a look at
the cpu_cfs_stat_show() for cpu.stat, it seems that a lot of
code is needed if we want to expose them in cpu.stat, while
reusing existing interface of count_memcg_event_mm() is simpler.
thanks,
Chenyu
On Sat, May 24, 2025 at 2:07 AM Chen, Yu C <yu.c.chen@intel.com> wrote:
>
> Hi Shakeel,
>
> On 5/24/2025 7:42 AM, Shakeel Butt wrote:
> > On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
> >> On systems with NUMA balancing enabled, it has been found
> >> that tracking task activities resulting from NUMA balancing
> >> is beneficial. NUMA balancing employs two mechanisms for task
> >> migration: one is to migrate a task to an idle CPU within its
> >> preferred node, and the other is to swap tasks located on
> >> different nodes when they are on each other's preferred nodes.
> >>
> >> The kernel already provides NUMA page migration statistics in
> >> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
> >> it lacks statistics regarding task migration and swapping.
> >> Therefore, relevant counts for task migration and swapping should
> >> be added.
> >>
> >> The following two new fields:
> >>
> >> numa_task_migrated
> >> numa_task_swapped
> >>
> >> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
> >> and /proc/vmstat
> >
> > Hmm these are scheduler events, how are these relevant to memory cgroup
> > or vmstat?
> > Any reason to not expose these in cpu.stat?
> >
>
> I understand that in theory they are scheduling activities.
> The reason for including these statistics here was mainly that
> I assumed there is a close relationship between page migration
> and task migration in Numa Balance. Specifically, task migration
> is triggered when page migration fails.
> Placing these statistics closer to the existing Numa Balance page
> statistics in /sys/fs/cgroup/{GROUP}/memory.stat and /proc/vmstat
> may help users query relevant data from a single file, avoiding
> the need to search through scattered files.
> Notably, these events are associated with a task’s working set
> (footprint) rather than pure CPU cycles IMO. I took a look at
> the cpu_cfs_stat_show() for cpu.stat, it seems that a lot of
> code is needed if we want to expose them in cpu.stat, while
> reusing existing interface of count_memcg_event_mm() is simpler.
Let me address two of your points first:
(1) cpu.stat currently contains cpu cycles stats. I don't see an issue
adding these new events in it as you can see memory.stat exposes stats
and events as well.
(2) You can still use count_memcg_event_mm() and related infra while
exposing the stats/events in cpu.stat.
Now your point on having related stats within a single interface is
more convincing. Let me ask you couple of simple questions:
I am not well versed with numa migration, can you expand a bit more on
these two events (numa_task_migrated & numa_task_swapped)? How are
these related to numa memory migration? You mentioned these events
happen on page migration failure, can you please give an end-to-end
flow/story of all these events happening on a timeline.
Beside that, do you think there might be some other scheduling events
(maybe unrelated to numa balancing) which might be suitable for
memory.stat? Basically I am trying to find if having sched events in
memory.stat be an exception for numa balancing or more general.
thanks,
Shakeel
On 5/25/2025 1:32 AM, Shakeel Butt wrote:
> On Sat, May 24, 2025 at 2:07 AM Chen, Yu C <yu.c.chen@intel.com> wrote:
>>
>> Hi Shakeel,
>>
>> On 5/24/2025 7:42 AM, Shakeel Butt wrote:
>>> On Fri, May 23, 2025 at 08:51:15PM +0800, Chen Yu wrote:
>>>> On systems with NUMA balancing enabled, it has been found
>>>> that tracking task activities resulting from NUMA balancing
>>>> is beneficial. NUMA balancing employs two mechanisms for task
>>>> migration: one is to migrate a task to an idle CPU within its
>>>> preferred node, and the other is to swap tasks located on
>>>> different nodes when they are on each other's preferred nodes.
>>>>
>>>> The kernel already provides NUMA page migration statistics in
>>>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However,
>>>> it lacks statistics regarding task migration and swapping.
>>>> Therefore, relevant counts for task migration and swapping should
>>>> be added.
>>>>
>>>> The following two new fields:
>>>>
>>>> numa_task_migrated
>>>> numa_task_swapped
>>>>
>>>> will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
>>>> and /proc/vmstat
>>>
>>> Hmm these are scheduler events, how are these relevant to memory cgroup
>>> or vmstat?
>>> Any reason to not expose these in cpu.stat?
>>>
>>
>> I understand that in theory they are scheduling activities.
>> The reason for including these statistics here was mainly that
>> I assumed there is a close relationship between page migration
>> and task migration in Numa Balance. Specifically, task migration
>> is triggered when page migration fails.
>> Placing these statistics closer to the existing Numa Balance page
>> statistics in /sys/fs/cgroup/{GROUP}/memory.stat and /proc/vmstat
>> may help users query relevant data from a single file, avoiding
>> the need to search through scattered files.
>> Notably, these events are associated with a task’s working set
>> (footprint) rather than pure CPU cycles IMO. I took a look at
>> the cpu_cfs_stat_show() for cpu.stat, it seems that a lot of
>> code is needed if we want to expose them in cpu.stat, while
>> reusing existing interface of count_memcg_event_mm() is simpler.
>
> Let me address two of your points first:
>
> (1) cpu.stat currently contains cpu cycles stats. I don't see an issue
> adding these new events in it as you can see memory.stat exposes stats
> and events as well.
>
> (2) You can still use count_memcg_event_mm() and related infra while
> exposing the stats/events in cpu.stat.
>
Got it.
> Now your point on having related stats within a single interface is
> more convincing. Let me ask you couple of simple questions:
>
> I am not well versed with numa migration, can you expand a bit more on
> these two events (numa_task_migrated & numa_task_swapped)? How are
> these related to numa memory migration? You mentioned these events
> happen on page migration failure,
I double-checked the code, and it seems that task numa migration
occurs regardless of whether page migration fails or succeeds.
> can you please give an end-to-end
> flow/story of all these events happening on a timeline.
>
Yes, sure, let me have a try.
The goal of NUMA balancing is to co-locate a task and its
memory pages on the same NUMA node. There are two strategies:
migrate the pages to the task's node, or migrate the task to
the node where its pages reside.
Suppose a task p1 is running on Node 0, but its pages are
located on Node 1. NUMA page fault statistics for p1 reveal
its "page footprint" across nodes. If NUMA balancing detects
that most of p1's pages are on Node 1:
1.Page Migration Attempt:
The Numa balance first tries to migrate p1's pages to Node 0.
The numa_page_migrate counter increments.
2.Task Migration Strategies:
After the page migration finishes, Numa balance checks every
1 second to see if p1 can be migrated to Node 1.
Case 2.1: Idle CPU Available
If Node 1 has an idle CPU, p1 is directly scheduled there. This event is
logged as numa_task_migrated.
Case 2.2: No Idle CPU (Task Swap)
If all CPUs on Node1 are busy, direct migration could cause CPU
contention or load imbalance. Instead:
The Numa balance selects a candidate task p2 on Node 1 that prefers
Node 0 (e.g., due to its own page footprint).
p1 and p2 are swapped. This cross-node swap is recorded as
numa_task_swapped.
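
To connect this flow to the new counters, here is a heavily simplified,
illustrative sketch in plain user-space C (not the kernel's actual control
flow); node_has_idle_cpu() and pick_swap_candidate() are hypothetical
stand-ins, and only the two counter increments correspond to the hooks this
patch adds in migrate_task_to() and __migrate_swap_task():

/* Illustrative sketch only; NOT the kernel's real NUMA balancing code. */
#include <stdbool.h>
#include <stdio.h>

struct task {
	const char *name;
	int node;            /* node the task currently runs on */
	int preferred_node;  /* node holding most of its pages */
};

static unsigned long numa_task_migrated;
static unsigned long numa_task_swapped;

/* Hypothetical predicates standing in for the scheduler's real checks. */
static bool node_has_idle_cpu(int node) { return node == 1; }

static struct task *pick_swap_candidate(int node)
{
	/* A task currently on 'node' whose preferred node is p1's node. */
	static struct task p2 = { "p2", 1, 0 };

	p2.node = node;
	return &p2;
}

static void numa_balance_task(struct task *p1)
{
	if (p1->node == p1->preferred_node)
		return; /* already co-located with its pages */

	if (node_has_idle_cpu(p1->preferred_node)) {
		/* Case 2.1: move p1 onto an idle CPU of its preferred node. */
		p1->node = p1->preferred_node;
		numa_task_migrated++;
	} else {
		/* Case 2.2: swap p1 with a task that prefers p1's node. */
		struct task *p2 = pick_swap_candidate(p1->preferred_node);
		int tmp = p1->node;

		p1->node = p2->node;
		p2->node = tmp;
		numa_task_swapped++;
	}
}

int main(void)
{
	struct task p1 = { "p1", 0, 1 };

	numa_balance_task(&p1);
	printf("numa_task_migrated %lu\nnuma_task_swapped %lu\n",
	       numa_task_migrated, numa_task_swapped);
	return 0;
}
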
> Beside that, do you think there might be some other scheduling events
> (maybe unrelated to numa balancing) which might be suitable for
> memory.stat? Basically I am trying to find if having sched events in
> memory.stat be an exception for numa balancing or more general.
If the criterion is a combination of task scheduling strategy and
page-based operations, I cannot find any other existing scheduling
events. For now, NUMA balancing seems to be the only case.
thanks,
Chenyu
>
> thanks,
> Shakeel
On Sun, May 25, 2025 at 08:35:24PM +0800, Chen, Yu C wrote:
> On 5/25/2025 1:32 AM, Shakeel Butt wrote:
[...]
> > can you please give an end-to-end
> > flow/story of all these events happening on a timeline.
>
> Yes, sure, let me have a try.
>
> The goal of NUMA balancing is to co-locate a task and its
> memory pages on the same NUMA node. There are two strategies:
> migrate the pages to the task's node, or migrate the task to
> the node where its pages reside.
>
> [...]
>
> Case 2.2: No Idle CPU (Task Swap)
> If all CPUs on Node1 are busy, direct migration could cause CPU contention
> or load imbalance. Instead:
> The Numa balance selects a candidate task p2 on Node 1 that prefers
> Node 0 (e.g., due to its own page footprint).
> p1 and p2 are swapped. This cross-node swap is recorded as
> numa_task_swapped.

Thanks for the explanation, this is really helpful and I would like this
to be included in the commit message.

> > Beside that, do you think there might be some other scheduling events
> > (maybe unrelated to numa balancing) which might be suitable for
> > memory.stat? Basically I am trying to find if having sched events in
> > memory.stat be an exception for numa balancing or more general.
>
> If the criterion is a combination of task scheduling strategy and
> page-based operations, I cannot find any other existing scheduling
> events. For now, NUMA balancing seems to be the only case.

Mainly I was looking if in future we need to add more sched events to
memory.stat file. Let me reply on the other email chain on what should
we do next.
On 5/28/2025 1:48 AM, Shakeel Butt wrote:
> On Sun, May 25, 2025 at 08:35:24PM +0800, Chen, Yu C wrote:
>> On 5/25/2025 1:32 AM, Shakeel Butt wrote:
> [...]
>>
>> Yes, sure, let me have a try.
>>
>> The goal of NUMA balancing is to co-locate a task and its
>> memory pages on the same NUMA node. There are two strategies:
>> migrate the pages to the task's node, or migrate the task to
>> the node where its pages reside.
>>
>> [...]
>
> Thanks for the explanation, this is really helpful and I would like this
> to be included in the commit message.

OK, just sent out a v6 with the commit message enhanced.

Thanks,
Chenyu