Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
tracepoint tracks a task whose memory is pinned via cpuset.mems and prints
its allowed memory node mask.
Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
include/trace/events/sched.h | 30 ++++++++++++++++++++++++++++++
kernel/sched/fair.c | 4 +++-
2 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c13..25ee542fa0063 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
__entry->vm_end,
__print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
);
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+ TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+ TP_ARGS(tsk, mem_allowed_ptr),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( pid_t, tgid )
+ __field( pid_t, ngid )
+ __field( nodemask_t *, mem_allowed_ptr )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = task_pid_nr(tsk);
+ __entry->tgid = task_tgid_nr(tsk);
+ __entry->ngid = task_numa_group_id(tsk);
+ __entry->mem_allowed_ptr = mem_allowed_ptr;
+ ),
+
+ TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
+ __entry->comm,
+ __entry->pid,
+ __entry->tgid,
+ __entry->ngid,
+ nodemask_pr_args(__entry->mem_allowed_ptr))
+);
#endif /* CONFIG_NUMA_BALANCING */
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9903b1b39487..cc892961ce157 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
* Memory is pinned to only one NUMA node via cpuset.mems, naturally
* no page can be migrated.
*/
- if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+ if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
+ trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
return;
+ }
if (!mm->numa_next_scan) {
mm->numa_next_scan = now +
--
2.43.5
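
For readers unfamiliar with the mem_nodes_allowed=%*pbl field in the patch
above: the "bl" bitmap form prints the set bits as a range list. The
following is a minimal userspace sketch of that output style, assuming a mask
that fits in a single unsigned long. The name format_node_list() is
hypothetical and this is not the kernel's implementation.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the set bits among the lowest `nbits` bits of
 * `mask` as a range list, e.g. bits {0,1,2,3,8} become "0-3,8".
 * Returns the number of characters written, excluding the trailing NUL.
 */
static size_t format_node_list(unsigned long mask, unsigned int nbits,
                               char *buf, size_t len)
{
    size_t off = 0;
    unsigned int bit = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    /* Never shift past the width of the mask. */
    if (nbits > sizeof(mask) * CHAR_BIT)
        nbits = sizeof(mask) * CHAR_BIT;

    while (bit < nbits) {
        unsigned int start, end;
        int n;

        if (!(mask & (1UL << bit))) {
            bit++;
            continue;
        }

        /* Extend the run while the next bit is also set. */
        start = bit;
        while (bit + 1 < nbits && (mask & (1UL << (bit + 1))))
            bit++;
        end = bit++;

        if (start == end)
            n = snprintf(buf + off, len - off, "%s%u",
                         off ? "," : "", start);
        else
            n = snprintf(buf + off, len - off, "%s%u-%u",
                         off ? "," : "", start, end);

        if (n < 0 || (size_t)n >= len - off) {
            buf[off] = '\0';    /* drop the partially written range */
            break;
        }
        off += (size_t)n;
    }
    return off;
}
```

With a single pinned node, which is the case the new tracepoint fires on, the
list is just that node's number.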
On Thu, 17 Apr 2025 12:15:43 -0700
Libo Chen <libo.chen@oracle.com> wrote:
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c13..25ee542fa0063 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
> __entry->vm_end,
> __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
> );
> +
> +TRACE_EVENT(sched_skip_cpuset_numa,
> +
> + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
> +
> + TP_ARGS(tsk, mem_allowed_ptr),
> +
> + TP_STRUCT__entry(
> + __array( char, comm, TASK_COMM_LEN )
> + __field( pid_t, pid )
> + __field( pid_t, tgid )
> + __field( pid_t, ngid )
> + __field( nodemask_t *, mem_allowed_ptr )
> + ),
> +
> + TP_fast_assign(
> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> + __entry->pid = task_pid_nr(tsk);
> + __entry->tgid = task_tgid_nr(tsk);
> + __entry->ngid = task_numa_group_id(tsk);
> + __entry->mem_allowed_ptr = mem_allowed_ptr;

This is a bug. You can't save random pointers in TP_fast_assign() and
dereference them later in TP_printk().

TP_fast_assign() is executed during the normal kernel workflow when the
tracepoint is triggered. The pointer is saved into the ring buffer.

> + ),
> +
> + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
> + __entry->comm,
> + __entry->pid,
> + __entry->tgid,
> + __entry->ngid,
> + nodemask_pr_args(__entry->mem_allowed_ptr))

TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
file, which could be literally months later.

nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that
was saved in the ring buffer, and the memory it points to could have
been freed days ago.

If that happens, then BOOM! Kernel goes bye-bye!

The trace event verifier is made to find bugs like this. And with the recent
update to handle "%*p" it found this bug. ;-)

-- Steve

> +);
> #endif /* CONFIG_NUMA_BALANCING */
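
The lifetime problem described in the review above can be modelled in plain
userspace C. This is only a sketch under stated assumptions: the types
model_nodemask, record_by_pointer and record_by_value and the two functions
are hypothetical stand-ins for a ring buffer record, not kernel code. In
tracepoint terms, the safe variant corresponds to reserving storage for the
mask inside the entry and copying the bits at assign time, so that print
time only touches memory owned by the record.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_MASK_LONGS 2      /* hypothetical mask size, in longs */

struct model_nodemask {
    unsigned long bits[MODEL_MASK_LONGS];
};

/* Unsafe shape: the record only remembers where the mask used to live. */
struct record_by_pointer {
    const struct model_nodemask *mask_ptr;
};

/* Safe shape: the record owns a snapshot of the mask contents. */
struct record_by_value {
    unsigned long mask_bits[MODEL_MASK_LONGS];
};

/* "Assign time": copy the bits while the source is known to be valid. */
static void record_assign(struct record_by_value *rec,
                          const struct model_nodemask *src)
{
    memcpy(rec->mask_bits, src->bits, sizeof(rec->mask_bits));
}

/*
 * Take a snapshot, free the source, then read the snapshot back.
 * Returns the first word of the recorded mask (0 if allocation failed).
 */
static unsigned long snapshot_survives_free(void)
{
    struct record_by_value rec;
    struct model_nodemask *live = malloc(sizeof(*live));

    if (!live)
        return 0;
    memset(live, 0, sizeof(*live));
    live->bits[0] = 0x4;        /* only node 2 allowed */

    record_assign(&rec, live);
    free(live);                 /* a record_by_pointer would now dangle */

    return rec.mask_bits[0];    /* "print time": reads the record only */
}
```

The cost of the safe shape is a larger record, since every event carries the
full mask rather than one pointer.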
> On Apr 23, 2025, at 8:34 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 17 Apr 2025 12:15:43 -0700
> Libo Chen <libo.chen@oracle.com> wrote:
>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 8994e97d86c13..25ee542fa0063 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
>> __entry->vm_end,
>> __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>> );
>> +
>> +TRACE_EVENT(sched_skip_cpuset_numa,
>> +
>> + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
>> +
>> + TP_ARGS(tsk, mem_allowed_ptr),
>> +
>> + TP_STRUCT__entry(
>> + __array( char, comm, TASK_COMM_LEN )
>> + __field( pid_t, pid )
>> + __field( pid_t, tgid )
>> + __field( pid_t, ngid )
>> + __field( nodemask_t *, mem_allowed_ptr )
>> + ),
>> +
>> + TP_fast_assign(
>> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> + __entry->pid = task_pid_nr(tsk);
>> + __entry->tgid = task_tgid_nr(tsk);
>> + __entry->ngid = task_numa_group_id(tsk);
>> + __entry->mem_allowed_ptr = mem_allowed_ptr;
>
> This is a bug. You can't save random pointers in the TP_fast_assign() and
> reference it later in the TP_printk().
>
Admittedly I was a bit nervous about dereferencing this pointer at TP_printk()
time. Will fix it!

Also wondering if we can fail the build in this scenario so it will be easier to
catch this bug at build time.

Thanks,
Libo
> The TP_fast_assign() is executed during the normal kernel workflow when the
> tracepoint is triggered. The pointer is saved into the ring buffer.
>
>> + ),
>> +
>> + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
>> + __entry->comm,
>> + __entry->pid,
>> + __entry->tgid,
>> + __entry->ngid,
>> + nodemask_pr_args(__entry->mem_allowed_ptr))
>
> The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
> file. Which could be literally months later.
>
> The nodemask_pr_args() will dereference the __entry->mem_allowed_ptr from
> what was saved in the ring buffer, which the content it points to could
> have been freed days ago.
>
> If that happens, then BOOM! Kernel goes bye-bye!
>
> The trace event verifier is made to find bugs like this. And with the recent
> update to handle "%*p" it found this bug. ;-)
>
> -- Steve
>
>
>> +);
>> #endif /* CONFIG_NUMA_BALANCING */
On Wed, 23 Apr 2025 16:05:44 +0000
Libo Chen <libo.chen@oracle.com> wrote:

> Also wondering if we can fail the build in this scenario so it will be easier to
> catch this bug at the build time.

return -EPONYS ;-)

I wish. It's hard enough to catch this at runtime.

-- Steve
> On Apr 23, 2025, at 9:12 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 23 Apr 2025 16:05:44 +0000
> Libo Chen <libo.chen@oracle.com> wrote:
>
>> Also wondering if we can fail the build in this scenario so it will be easier to
>> catch this bug at the build time.
>
> return -EPONYS ;-)
>
> I wish. It's hard enough to catch this at runtime.
>

Correct me if I'm wrong but can you disallow any passed-in pointers to be
dereferenced when TP_printk() is executed? This is something you can check
at the build time, right?

Libo

> -- Steve
On Wed, 23 Apr 2025 16:50:15 +0000
Libo Chen <libo.chen@oracle.com> wrote:

> Correct me if I'm wrong but can you disallow any passed-in pointers to be
> dereferenced when TP_printk() is executed? This is something you can check
> at the build time, right?

You can dereference if the pointer is to the content on the ring buffer.
For instance, you can have:

  "%p*h", &__entry->val

It dereferences to the content stored on the ring buffer.

What we can't have is:

  "%p*h", __entry->val

-- Steve
> On Apr 23, 2025, at 9:56 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 23 Apr 2025 16:50:15 +0000
> Libo Chen <libo.chen@oracle.com> wrote:
>
>> Correct me if I'm wrong but can you disallow any passed-in pointers to be
>> dereferenced when TP_printk() is executed? This is something you can check
>> at the build time, right?
>
> You can dereference if the pointer is to the content on the ring buffer.
> For instance, you can have:
>
> "%p*h", &__entry->val
>
> It dereferences to the content stored on the ring buffer.
>
> What we can't have is:
>
> "%p*h", __entry->val

Right, I was thinking of something stricter, such as disallowing pointer-type
fields in TP_STRUCT__entry {} to avoid direct assignment to a pointer-type
field, so there would be no chance of an unsafe dereference. But then I
realized C doesn't have a built-in mechanism to tell the various kinds of
pointers apart at compile time; maybe Rust can do that. Anyway, I give up.

Thanks,
Libo
>
> -- Steve
On Wed, 23 Apr 2025 12:56:59 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> For instance, you can have:
>
>   "%p*h", &__entry->val

That should have been:

  "%p*h", __entry->size, &__entry->val

-- Steve
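
The length-plus-address form in the correction above can be illustrated with
a small userspace model. The sketch assumes a hypothetical model_entry
record that stores both a byte count and the bytes themselves, and a
hypothetical hex_dump_entry() that prints only what lives inside the record.
(The kernel's documented spelling of the hex-dump specifier is "%*ph"; the
code below does not use it and only imitates the space-separated hex
output.)

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: both the length and the data live in the entry. */
struct model_entry {
    int size;               /* number of valid bytes in val[] */
    unsigned char val[8];
};

/*
 * Dump entry->size bytes of entry->val as space-separated hex.
 * Only the entry's own storage is read, never an external pointer.
 * Returns the number of characters written, excluding the trailing NUL.
 */
static size_t hex_dump_entry(const struct model_entry *entry,
                             char *buf, size_t len)
{
    size_t off = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (int i = 0; i < entry->size && i < (int)sizeof(entry->val); i++) {
        int n = snprintf(buf + off, len - off, "%s%02x",
                         i ? " " : "", entry->val[i]);

        if (n < 0 || (size_t)n >= len - off) {
            buf[off] = '\0';    /* drop the partially written byte */
            break;
        }
        off += (size_t)n;
    }
    return off;
}
```

Passing the stored size alongside the address is what bounds the read to the
record.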