[PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning

Libo Chen posted 2 patches 9 months, 3 weeks ago
Posted by Libo Chen 9 months, 3 weeks ago
Unlike the sched_skip_vma_numa tracepoint, which tracks skipped VMAs, this
tracepoint tracks the task that is subject to cpuset.mems pinning and
prints its allowed memory node mask.

Signed-off-by: Libo Chen <libo.chen@oracle.com>
---
 include/trace/events/sched.h | 30 ++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  4 +++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c13..25ee542fa0063 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
 		  __entry->vm_end,
 		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
 );
+
+TRACE_EVENT(sched_skip_cpuset_numa,
+
+	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
+
+	TP_ARGS(tsk, mem_allowed_ptr),
+
+	TP_STRUCT__entry(
+		__array( char,		comm,		TASK_COMM_LEN	)
+		__field( pid_t,		pid				)
+		__field( pid_t,		tgid				)
+		__field( pid_t,		ngid				)
+		__field( nodemask_t *,	mem_allowed_ptr			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		 = task_pid_nr(tsk);
+		__entry->tgid		 = task_tgid_nr(tsk);
+		__entry->ngid		 = task_numa_group_id(tsk);
+		__entry->mem_allowed_ptr = mem_allowed_ptr;
+	),
+
+	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
+		  __entry->comm,
+		  __entry->pid,
+		  __entry->tgid,
+		  __entry->ngid,
+		  nodemask_pr_args(__entry->mem_allowed_ptr))
+);
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9903b1b39487..cc892961ce157 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,8 +3333,10 @@ static void task_numa_work(struct callback_head *work)
 	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
 	 * no page can be migrated.
 	 */
-	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
+	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1) {
+		trace_sched_skip_cpuset_numa(current, &cpuset_current_mems_allowed);
 		return;
+	}
 
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
-- 
2.43.5
Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Steven Rostedt 9 months, 3 weeks ago
On Thu, 17 Apr 2025 12:15:43 -0700
Libo Chen <libo.chen@oracle.com> wrote:

> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c13..25ee542fa0063 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
>  		  __entry->vm_end,
>  		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>  );
> +
> +TRACE_EVENT(sched_skip_cpuset_numa,
> +
> +	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
> +
> +	TP_ARGS(tsk, mem_allowed_ptr),
> +
> +	TP_STRUCT__entry(
> +		__array( char,		comm,		TASK_COMM_LEN	)
> +		__field( pid_t,		pid				)
> +		__field( pid_t,		tgid				)
> +		__field( pid_t,		ngid				)
> +		__field( nodemask_t *,	mem_allowed_ptr			)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> +		__entry->pid		 = task_pid_nr(tsk);
> +		__entry->tgid		 = task_tgid_nr(tsk);
> +		__entry->ngid		 = task_numa_group_id(tsk);
> +		__entry->mem_allowed_ptr = mem_allowed_ptr;

This is a bug. You can't save random pointers in the TP_fast_assign() and
dereference them later in the TP_printk().

The TP_fast_assign() is executed during the normal kernel workflow when the
tracepoint is triggered. The pointer is saved into the ring buffer.

> +	),
> +
> +	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
> +		  __entry->comm,
> +		  __entry->pid,
> +		  __entry->tgid,
> +		  __entry->ngid,
> +		  nodemask_pr_args(__entry->mem_allowed_ptr))

The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
file, which could be literally months later.

The nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that
was saved in the ring buffer, and the memory it points to could have been
freed days ago.

If that happens, then BOOM! Kernel goes bye-bye!

The trace event verifier is made to find bugs like this. And with the recent
update to handle "%*p" it found this bug. ;-)

-- Steve


> +);
>  #endif /* CONFIG_NUMA_BALANCING */
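
A userspace sketch of the lifetime problem described above, not kernel code. It contrasts a record that stores only an address with a record that copies the mask bits at assignment time. All names here (nodemask_sim_t, MAX_NODES_SIM, safe_entry, safe_assign, safe_first_node) are hypothetical stand-ins, and copying the bits into an array field is only one possible direction for a fix; the thread does not say which approach the next revision takes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for MAX_NUMNODES and nodemask_t. */
#define MAX_NODES_SIM 64
#define BITS_PER_LONG_SIM (8 * sizeof(unsigned long))
#define MASK_LONGS ((MAX_NODES_SIM + BITS_PER_LONG_SIM - 1) / BITS_PER_LONG_SIM)

typedef struct {
	unsigned long bits[MASK_LONGS];
} nodemask_sim_t;

/*
 * Unsafe layout: the record keeps only the address. Printing it later
 * means dereferencing memory that may no longer be valid.
 */
struct unsafe_entry {
	int pid;
	const nodemask_sim_t *mem_allowed_ptr;
};

/* Safe layout: the mask bits live inside the record itself. */
struct safe_entry {
	int pid;
	unsigned long mem_allowed[MASK_LONGS];
};

/* Plays the role of TP_fast_assign(): copy the data, not the pointer. */
static void safe_assign(struct safe_entry *e, int pid,
			const nodemask_sim_t *mask)
{
	static_assert(sizeof(e->mem_allowed) == sizeof(mask->bits),
		      "record field must hold the whole mask");

	e->pid = pid;
	memcpy(e->mem_allowed, mask->bits, sizeof(e->mem_allowed));
}

/*
 * Plays the role of TP_printk(): reads only the record's own copy.
 * Returns the lowest set node number, or -1 if the mask is empty.
 */
static int safe_first_node(const struct safe_entry *e)
{
	for (size_t i = 0; i < MAX_NODES_SIM; i++) {
		unsigned long word = e->mem_allowed[i / BITS_PER_LONG_SIM];

		if (word & (1UL << (i % BITS_PER_LONG_SIM)))
			return (int)i;
	}
	return -1;
}
```

With this layout, the source mask can be overwritten or freed after safe_assign() returns and safe_first_node() still reads valid data, because it never follows a pointer out of the record.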
Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Libo Chen 9 months, 3 weeks ago

> On Apr 23, 2025, at 8:34 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Thu, 17 Apr 2025 12:15:43 -0700
> Libo Chen <libo.chen@oracle.com> wrote:
> 
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 8994e97d86c13..25ee542fa0063 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
>>  __entry->vm_end,
>>  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>> );
>> +
>> +TRACE_EVENT(sched_skip_cpuset_numa,
>> +
>> + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
>> +
>> + TP_ARGS(tsk, mem_allowed_ptr),
>> +
>> + TP_STRUCT__entry(
>> + __array( char, comm, TASK_COMM_LEN )
>> + __field( pid_t, pid )
>> + __field( pid_t, tgid )
>> + __field( pid_t, ngid )
>> + __field( nodemask_t *, mem_allowed_ptr )
>> + ),
>> +
>> + TP_fast_assign(
>> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> + __entry->pid = task_pid_nr(tsk);
>> + __entry->tgid = task_tgid_nr(tsk);
>> + __entry->ngid = task_numa_group_id(tsk);
>> + __entry->mem_allowed_ptr = mem_allowed_ptr;
> 
> This is a bug. You can't save random pointers in the TP_fast_assign() and
> dereference them later in the TP_printk().
> 

Admittedly I was a bit nervous about dereferencing this pointer at TP_printk()
time. Will fix it!

Also wondering if we can fail the build in this scenario so it will be easier to
catch this bug at the build time.

Thanks
Libo
 
> The TP_fast_assign() is executed during the normal kernel workflow when the
> tracepoint is triggered. The pointer is saved into the ring buffer.
> 
>> + ),
>> +
>> + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
>> +  __entry->comm,
>> +  __entry->pid,
>> +  __entry->tgid,
>> +  __entry->ngid,
>> +  nodemask_pr_args(__entry->mem_allowed_ptr))
> 
> The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
> file, which could be literally months later.
> 
> The nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that
> was saved in the ring buffer, and the memory it points to could have been
> freed days ago.
> 
> If that happens, then BOOM! Kernel goes bye-bye!
> 
> The trace event verifier is made to find bugs like this. And with the recent
> update to handle "%*p" it found this bug. ;-)
> 
> -- Steve
> 
> 
>> +);
>> #endif /* CONFIG_NUMA_BALANCING */

Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Steven Rostedt 9 months, 3 weeks ago
On Wed, 23 Apr 2025 16:05:44 +0000
Libo Chen <libo.chen@oracle.com> wrote:

> Also wondering if we can fail the build in this scenario so it will be easier to
> catch this bug at the build time.

 return -EPONYS ;-)

I wish. It's hard enough to catch this at runtime.

-- Steve
Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Libo Chen 9 months, 3 weeks ago

> On Apr 23, 2025, at 9:12 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Wed, 23 Apr 2025 16:05:44 +0000
> Libo Chen <libo.chen@oracle.com> wrote:
> 
>> Also wondering if we can fail the build in this scenario so it will be easier to
>> catch this bug at the build time.
> 
> return -EPONYS ;-)
> 
> I wish. It's hard enough to catch this at runtime.
> 

Correct me if I'm wrong but can you disallow any passed-in pointers to be
dereferenced when TP_printk() is executed? This is something you can check
at the build time, right?

Libo

> -- Steve

Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Steven Rostedt 9 months, 3 weeks ago
On Wed, 23 Apr 2025 16:50:15 +0000
Libo Chen <libo.chen@oracle.com> wrote:

> Correct me if I'm wrong but can you disallow any passed-in pointers to be
> dereferenced when TP_printk() is executed? This is something you can check
> at the build time, right?

You can dereference a pointer if it points to content on the ring buffer.
For instance, you can have:

  "%*ph", &__entry->val

That dereferences content stored on the ring buffer.

What we can't have is:

  "%*ph", __entry->val

-- Steve
Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Libo Chen 9 months, 3 weeks ago

> On Apr 23, 2025, at 9:56 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Wed, 23 Apr 2025 16:50:15 +0000
> Libo Chen <libo.chen@oracle.com> wrote:
> 
>> Correct me if I'm wrong but can you disallow any passed-in pointers to be
>> dereferenced when TP_printk() is executed? This is something you can check
>> at the build time, right?
> 
> You can dereference a pointer if it points to content on the ring buffer.
> For instance, you can have:
> 
>  "%*ph", &__entry->val
> 
> That dereferences content stored on the ring buffer.
> 
> What we can't have is:
> 
>  "%*ph", __entry->val

Right, I was thinking of something stricter, such as disallowing pointer-type
fields in TP_STRUCT__entry() so that nothing can be assigned directly to a
pointer-type field and there is no chance of an unsafe dereference. But then I
realized C has no built-in mechanism to tell these kinds of pointers apart at
compile time; maybe Rust can do that. Anyway, I give up.

Thanks,
Libo

> 
> -- Steve

Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Posted by Steven Rostedt 9 months, 3 weeks ago
On Wed, 23 Apr 2025 12:56:59 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> For instance, you can have:
> 
>   "%*ph", &__entry->val

That should have been:

   "%*ph", __entry->size, &__entry->val


-- Steve
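
The corrected form passes a length and the address of a field that lives inside the record, so the print step only touches bytes that were copied at assignment time. Below is a minimal userspace sketch of that pattern. The struct entry_sim, VAL_MAX, and format_hex() are hypothetical names; format_hex() only imitates the space-separated hex output of the kernel's "%*ph" and is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define VAL_MAX 8	/* hypothetical capacity of the inline buffer */

/* Stand-in for a ring-buffer record: the bytes are stored inline. */
struct entry_sim {
	int size;
	unsigned char val[VAL_MAX];
};

/*
 * Userspace imitation of printing "%*ph", size, buf: writes the first
 * 'size' bytes of 'buf' as two-digit hex values separated by spaces.
 * Returns the number of characters written (excluding the NUL), or -1
 * if 'out' is too small.
 */
static int format_hex(char *out, size_t outlen, int size,
		      const unsigned char *buf)
{
	size_t pos = 0;

	assert(out != NULL && buf != NULL);
	if (outlen == 0)
		return -1;
	out[0] = '\0';

	for (int i = 0; i < size; i++) {
		int n = snprintf(out + pos, outlen - pos,
				 i ? " %02x" : "%02x", buf[i]);

		if (n < 0 || (size_t)n >= outlen - pos)
			return -1;
		pos += (size_t)n;
	}
	return (int)pos;
}
```

Calling format_hex(out, sizeof(out), e.size, e.val) on a struct entry_sim reads only the record's own storage, which mirrors passing &__entry->val rather than a pointer saved from elsewhere.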