[GIT PULL] tracing: Fixes for v6.17

Posted by Steven Rostedt 3 weeks, 2 days ago
Cc: Guenter Roeck <linux@roeck-us.net>, Luo Gengkun <luogengkun@huaweicloud.com>, Pu Lehui <pulehui@huawei.com>, Qianfeng Rong <rongqianfeng@vivo.com>, Vladimir Riabchun <ferr.lambarginio@gmail.com>, Wang Liang <wangliang74@huawei.com>

Linus,

Tracing fixes for v6.17:

- Remove redundant __GFP_NOWARN flag from kmem_cache_zalloc()

  As __GFP_NOWARN is now part of GFP_NOWAIT, passing it alongside
  GFP_NOWAIT is redundant and it can be removed.
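
  For reference, GFP_NOWAIT itself now carries the no-warn bit;
  paraphrased from include/linux/gfp_types.h:

      #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM | __GFP_NOWARN)

  so GFP_NOWAIT | __GFP_NOWARN just sets the same bit twice.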

- Use copy_from_user_nofault() instead of _inatomic() for trace markers

  The trace_marker files are written to by user space to quickly write
  into the tracing ring buffer. Back in 2016, the get_user_pages_fast() and
  the kmap() logic was replaced by a __copy_from_user_inatomic(). But the
  _inatomic() is somewhat a misnomer, as if the data being read faults, it can
  cause a schedule. This is not something you want to do in an atomic context.
  Since the time this was added, copy_from_user_nofault() was added which is
  what is actually needed here. Replace the inatomic() with the nofault().
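
  For context, copy_from_user_nofault() does the surrounding work
  itself; a rough paraphrase of mm/maccess.c (details such as the NMI
  check elided):

      long copy_from_user_nofault(void *dst, const void __user *src, size_t size)
      {
              long ret = -EFAULT;

              /* validate the user pointer first */
              if (!__access_ok(src, size))
                      return ret;

              /* tell the fault handler not to actually take the fault */
              pagefault_disable();
              ret = __copy_from_user_inatomic(dst, src, size);
              pagefault_enable();

              return ret ? -EFAULT : 0;
      }

  i.e. it validates the user pointer and disables page faults around the
  raw __copy_from_user_inatomic() copy.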

- Fix the assembly markup in the ftrace direct sample code

  The ftrace direct sample code (which is also used for selftests) had
  the .size directive between the "leave" and the "ret" instead of after
  the ret. Since .size records the function's size up to the directive,
  the trailing ret fell outside the function and objtool thought the
  code was unreachable.

- Only call unregister_pm_notifier() on the outermost fgraph registration

  An error path in register_ftrace_graph() did not call
  unregister_pm_notifier(), so a call to it was added there. The problem
  with that fix is that register_pm_notifier() is only called by the
  initial user of fgraph. If that first registration succeeds but a
  subsequent fgraph registration fails, unregister_pm_notifier() would
  be called incorrectly.
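
  In other words, the PM notifier is effectively refcounted by
  ftrace_graph_active. A minimal sketch of the intended pairing
  (simplified, not the actual fgraph.c code):

      ftrace_graph_active++;
      if (ftrace_graph_active == 1)	/* first user installs the notifier */
              register_pm_notifier(&ftrace_suspend_notifier);
      ...
      /* error path: unwind, but only the last user removes the notifier */
      ftrace_graph_active--;
      if (!ftrace_graph_active)
              unregister_pm_notifier(&ftrace_suspend_notifier);

  which is the check the fix below adds.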

- Fix a crash in osnoise when a zero-size cpumask is passed in

  If a zero-size CPU mask is passed in, kmalloc() returns ZERO_SIZE_PTR,
  which is not checked for, so the code would continue thinking it had
  real memory and crash. If zero is passed in as the size of the write,
  simply return 0.
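
  A minimal illustration of the failure mode (not the actual osnoise
  code): kmalloc(0, ...) returns ZERO_SIZE_PTR, a non-NULL sentinel
  ((void *)16 in include/linux/slab.h), so the usual allocation check
  passes:

      char *buf = kmalloc(0, GFP_KERNEL);	/* returns ZERO_SIZE_PTR */
      if (!buf)				/* never taken: it is non-NULL */
              return -ENOMEM;
      buf[0] = '\0';			/* dereferences ZERO_SIZE_PTR: crash */

  Bailing out early on a zero-size write avoids ever calling kmalloc(0).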

- Fix a possible warning in trace_pid_write()

  If, while processing a series of numbers written to the "set_event_pid"
  file, one of the updates fails to allocate (which can be triggered by
  fault injection), a warning can trigger. Check the return value of the
  call to trace_pid_list_set() and break out early with an error code if
  it fails.


Please pull the latest trace-v6.17-rc4 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace-v6.17-rc4

Tag SHA1: ff84e946ad59a3b7f3254d667aa7f47bf21246a0
Head SHA1: cd4453c5e983cf1fd5757e9acb915adb1e4602b6


Guenter Roeck (1):
      trace/fgraph: Fix error handling

Luo Gengkun (1):
      tracing: Fix tracing_marker may trigger page fault during preempt_disable

Pu Lehui (1):
      tracing: Silence warning when chunk allocation fails in trace_pid_write

Qianfeng Rong (1):
      trace: Remove redundant __GFP_NOWARN

Vladimir Riabchun (1):
      ftrace/samples: Fix function size computation

Wang Liang (1):
      tracing/osnoise: Fix null-ptr-deref in bitmap_parselist()

----
 kernel/trace/fgraph.c                 |  3 ++-
 kernel/trace/trace.c                  | 10 +++++++---
 kernel/trace/trace_events_user.c      |  2 +-
 kernel/trace/trace_osnoise.c          |  3 +++
 samples/ftrace/ftrace-direct-modify.c |  2 +-
 5 files changed, 14 insertions(+), 6 deletions(-)
---------------------------
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 2a42c1036ea8..1e3b32b1e82c 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1397,7 +1397,8 @@ int register_ftrace_graph(struct fgraph_ops *gops)
 		ftrace_graph_active--;
 		gops->saved_func = NULL;
 		fgraph_lru_release_index(i);
-		unregister_pm_notifier(&ftrace_suspend_notifier);
+		if (!ftrace_graph_active)
+			unregister_pm_notifier(&ftrace_suspend_notifier);
 	}
 	return ret;
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1b7db732c0b1..b3c94fbaf002 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -834,7 +834,10 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
 		/* copy the current bits to the new max */
 		ret = trace_pid_list_first(filtered_pids, &pid);
 		while (!ret) {
-			trace_pid_list_set(pid_list, pid);
+			ret = trace_pid_list_set(pid_list, pid);
+			if (ret < 0)
+				goto out;
+
 			ret = trace_pid_list_next(filtered_pids, pid + 1, &pid);
 			nr_pids++;
 		}
@@ -871,6 +874,7 @@ int trace_pid_write(struct trace_pid_list *filtered_pids,
 		trace_parser_clear(&parser);
 		ret = 0;
 	}
+ out:
 	trace_parser_put(&parser);
 
 	if (ret < 0) {
@@ -7209,7 +7213,7 @@ static ssize_t write_marker_to_buffer(struct trace_array *tr, const char __user
 	entry = ring_buffer_event_data(event);
 	entry->ip = ip;
 
-	len = __copy_from_user_inatomic(&entry->buf, ubuf, cnt);
+	len = copy_from_user_nofault(&entry->buf, ubuf, cnt);
 	if (len) {
 		memcpy(&entry->buf, FAULTED_STR, FAULTED_SIZE);
 		cnt = FAULTED_SIZE;
@@ -7306,7 +7310,7 @@ static ssize_t write_raw_marker_to_buffer(struct trace_array *tr,
 
 	entry = ring_buffer_event_data(event);
 
-	len = __copy_from_user_inatomic(&entry->id, ubuf, cnt);
+	len = copy_from_user_nofault(&entry->id, ubuf, cnt);
 	if (len) {
 		entry->id = -1;
 		memcpy(&entry->buf, FAULTED_STR, FAULTED_SIZE);
diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index af42aaa3d172..2ab283fd3032 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -496,7 +496,7 @@ static bool user_event_enabler_queue_fault(struct user_event_mm *mm,
 {
 	struct user_event_enabler_fault *fault;
 
-	fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT | __GFP_NOWARN);
+	fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT);
 
 	if (!fault)
 		return false;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index fd259da0aa64..337bc0eb5d71 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2322,6 +2322,9 @@ osnoise_cpus_write(struct file *filp, const char __user *ubuf, size_t count,
 	int running, err;
 	char *buf __free(kfree) = NULL;
 
+	if (count < 1)
+		return 0;
+
 	buf = kmalloc(count, GFP_KERNEL);
 	if (!buf)
 		return -ENOMEM;
diff --git a/samples/ftrace/ftrace-direct-modify.c b/samples/ftrace/ftrace-direct-modify.c
index cfea7a38befb..da3a9f2091f5 100644
--- a/samples/ftrace/ftrace-direct-modify.c
+++ b/samples/ftrace/ftrace-direct-modify.c
@@ -75,8 +75,8 @@ asm (
 	CALL_DEPTH_ACCOUNT
 "	call my_direct_func1\n"
 "	leave\n"
-"	.size		my_tramp1, .-my_tramp1\n"
 	ASM_RET
+"	.size		my_tramp1, .-my_tramp1\n"
 
 "	.type		my_tramp2, @function\n"
 "	.globl		my_tramp2\n"
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by pr-tracker-bot@kernel.org 3 weeks, 1 day ago
The pull request you sent on Tue, 9 Sep 2025 16:21:55 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace-v6.17-rc4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/1b5d4661c7ee7937d062a00bd336761a237870b4

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by Linus Torvalds 3 weeks, 1 day ago
On Tue, 9 Sept 2025 at 13:21, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>         Back in 2016, the get_user_pages_fast() and
>   the kmap() logic was replaced by a __copy_from_user_inatomic(). But the
>   _inatomic() is somewhat a misnomer, as if the data being read faults, it can
>   cause a schedule. This is not something you want to do in an atomic context.

Somebody is very very confused, and this "explanation" is just wrong
and entirely misleading.

__copy_from_user_inatomic() is very much atomic. But it is - as the
dual underscores indicate - a *HELPER* function that needs the caller
to do the appropriate coding around it.

In this case, the appropriate coding is to typically surround it with
a pagefault_{disable,enable}() pattern to let the page faulting code
know to not actually do the fault.

You also need to actually verify that the user address is valid - as
is typical with all the double-undercore user access functions.

>   Since the time this was added, copy_from_user_nofault() was added which is
>   what is actually needed here. Replace the inatomic() with the nofault().

I'm not disagreeing with the change, because that "nofault()" helper
(without the double underscores) does do all the "appropriate coding
around it".

And then it actually *uses* __copy_from_user_inatomic() to do the copy
- because that function really is meant for atomic contexts.

So that explanation really is very very wrong and entirely confused.

Because it was never about __copy_from_user_inatomic() not being
appropriate from atomic context. It was about the tracing code
apparently just using it wrong.

               Linus
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by Steven Rostedt 3 weeks, 1 day ago
On Wed, 10 Sep 2025 12:19:44 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 9 Sept 2025 at 13:21, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> >         Back in 2016, the get_user_pages_fast() and
> >   the kmap() logic was replaced by a __copy_from_user_inatomic(). But the
> >   _inatomic() is somewhat a misnomer, as if the data being read faults, it can
> >   cause a schedule. This is not something you want to do in an atomic context.  
> 
> Somebody is very very confused, and this "explanation" is just wrong
> and entirely misleading.
> 
> __copy_from_user_inatomic() is very much atomic. But it is - as the
> dual underscores indicate - a *HELPER* function that needs the caller
> to do the appropriate coding around it.
> 
> In this case, the appropriate coding is to typically surround it with
> a pagefault_{disable,enable}() pattern to let the page faulting code
> know to not actually do the fault.

It would have been nice if there were some comments by that function to let
one know that. This solution was suggested to me back then (2016) by a very
competent kernel developer but I was never told I needed to disable page
faults when using it.

> 
> You also need to actually verify that the user address is valid - as
> is typical with all the double-undercore user access functions.

Well, that is some tribal knowledge I didn't fully understand at the time.
Which is why I also suggested that we add comments to the code to state
that this call expects to have page faults disabled. That way others will
know why the function has double underscores.

The __copy_to_user_inatomic() has some mention of it, but it's not obvious
that those comments pertain to the _from_ variant.

> 
> >   Since the time this was added, copy_from_user_nofault() was added which is
> >   what is actually needed here. Replace the inatomic() with the nofault().  
> 
> I'm not disagreeing with the change, because that "nofault()" helper
> (without the double underscores) does do all the "appropriate coding
> around it".
> 
> And then it actually *uses* __copy_from_user_inatomic() to do the copy
> - because that function really is meant for atomic contexts.
> 
> So that explanation really is very very wrong and entirely confused.
> 
> Because it was never about __copy_from_user_inatomic() not being
> appropriate from atomic context. It was about the tracing code
> apparently just using it wrong.

Agreed. But it was incorrectly used because there's no documentation on how
to use it properly.

Let me go and remedy that.

-- Steve
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by Linus Torvalds 3 weeks, 1 day ago
On Wed, 10 Sept 2025 at 12:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In this case, the appropriate coding is to typically surround it with
> a pagefault_{disable,enable}() pattern to let the page faulting code
> know to not actually do the fault.

Btw, I say "typically", because you don't have to do that. The page
fault code uses

        if (faulthandler_disabled() ..)

to decide if it should handle the fault or not, and that checks not
just if page faults are explicitly disabled, but also checks -
surprise surprise - "in_atomic()".

So just being in an explicitly atomic context automatically means that
__copy_from_user_inatomic() is atomic.

Which makes me wonder if there is something entirely wrong.

Because the explanation for this in commit 3d62ab32df06 ("tracing: Fix
tracing_marker may trigger page fault during preempt_disable") talks
about the task being preempted in between the

  ring_buffer_lock_reserve
  ring_buffer_unlock_commit

and it sounds like maybe the tracing code isn't disabling preemption
for the whole sequence?

Because "in_atomic()" does check the preempt count, and so just being
non-preemptible should already have disabled page faults.

Maybe the page fault just ends up being expensive enough that it
exposes preemption being more *likely* just because the window now is
much wider.

Alternatively, this is perhaps an arm64-specific bug where the page
fault disabling doesn't honor the preemption disable of
faulthandler_disabled()?

I did *not* go through the whole arm64 page faulting code: that commit
talks about do_mem_abort() which is done as part of the common arm64
fault handling, and if that then doesn't honor
faulthandler_disabled(), then honestly, that perf fix isn't actually
fixing anything either. It would still do the same reschedule even
with an explicit "pagefault_disable()/enable()" if
faulthandler_disabled() simply isn't honored properly.

Adding Catalin and Will to the participants to see if they have input
on that whole do_mem_abort() angle.

             Linus
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by Steven Rostedt 3 weeks, 1 day ago
On Wed, 10 Sep 2025 12:37:50 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 10 Sept 2025 at 12:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > In this case, the appropriate coding is to typically surround it with
> > a pagefault_{disable,enable}() pattern to let the page faulting code
> > know to not actually do the fault.  
> 
> Btw, I say "typically", because you don't have to do that. The page
> fault code uses
> 
>         if (faulthandler_disabled() ..)
> 
> to decide if it should handle the fault or not, and that checks not
> just if page faults are explicitly disabled, but also checks -
> surprise surprise - "in_atomic()".
> 
> So just being in an explicitly atomic context automatically means that
> __copy_from_user_inatomic() is atomic.
> 
> Which makes me wonder if there is something entirely wrong.
> 
> Because the explanation for this in commit 3d62ab32df06 ("tracing: Fix
> tracing_marker may trigger page fault during preempt_disable") talks
> about the task being preempted in between the
> 
>   ring_buffer_lock_reserve
>   ring_buffer_unlock_commit
> 
> and it sounds like maybe the tracing code isn't disabling preemption
> for the whole sequence?
> 
> Because "in_atomic()" does check the preempt count, and so just being
> non-preemptible should already have disabled page faults.
> 
> Maybe the page fault just ends up being expensive enough that it
> exposes preemption being more *likely* just because the window now is
> much wider.
> 
> Alternatively, this is perhaps an arm64-specific bug where the page
> fault disabling doesn't honor the preemption disable of
> faulthandler_disabled()?
> 
> I did *not* go through the whole arm64 page faulting code: that commit
> talks about do_mem_abort() which is done as part of the common arm64
> fault handling, and if that then doesn't honor
> faulthandler_disabled(), then honestly, that perf fix isn't actually
> fixing anything either. It would still do the same reschedule even
> with an explicit "pagefault_disable()/enable()" if
> faulthandler_disabled() simply isn't honored properly.
> 
> Adding Catalin and Will to the participants to see if they have input
> on that whole do_mem_abort() angle.
> 

We figured out in the discussion that the user that triggered this had
CONFIG_PREEMPT_NONE, where in_atomic() always returns false.

-- Steve
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by Linus Torvalds 3 weeks, 1 day ago
On Wed, 10 Sept 2025 at 13:20, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> We figured out in the discussion that the user that triggered this had
> CONFIG_PREEMPT_NONE, where in_atomic() always returns false.

Ahh, ok. That explains it. Then that in_atomic() ends up being a no-op. Ack.

So the patch is fine and should fix things. Good.

            Linus
Re: [GIT PULL] tracing: Fixes for v6.17
Posted by Mathieu Desnoyers 3 weeks, 1 day ago
On 2025-09-10 15:37, Linus Torvalds wrote:
> On Wed, 10 Sept 2025 at 12:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> In this case, the appropriate coding is to typically surround it with
>> a pagefault_{disable,enable}() pattern to let the page faulting code
>> know to not actually do the fault.
> 
> Btw, I say "typically", because you don't have to do that. The page
> fault code uses
> 
>          if (faulthandler_disabled() ..)
> 
> to decide if it should handle the fault or not, and that checks not
> just if page faults are explicitly disabled, but also checks -
> surprise surprise - "in_atomic()".
> 
> So just being in an explicitly atomic context automatically means that
> __copy_from_user_inatomic() is atomic.
> 
> Which makes me wonder if there is something entirely wrong.
> 
> Because the explanation for this in commit 3d62ab32df06 ("tracing: Fix
> tracing_marker may trigger page fault during preempt_disable") talks
> about the task being preempted in between the
> 
>    ring_buffer_lock_reserve
>    ring_buffer_unlock_commit
> 
> and it sounds like maybe the tracing code isn't disabling preemption
> for the whole sequence?
> 
> Because "in_atomic()" does check the preempt count, and so just being
> non-preemptible should already have disabled page faults.
> 
> Maybe the page fault just ends up being expensive enough that it
> exposes preemption being more *likely* just because the window now is
> much wider.
> 
> Alternatively, this is perhaps an arm64-specific bug where the page
> fault disabling doesn't honor the preemption disable of
> faulthandler_disabled()?
> 
> I did *not* go through the whole arm64 page faulting code: that commit
> talks about do_mem_abort() which is done as part of the common arm64
> fault handling, and if that then doesn't honor
> faulthandler_disabled(), then honestly, that perf fix isn't actually
> fixing anything either. It would still do the same reschedule even
> with an explicit "pagefault_disable()/enable()" if
> faulthandler_disabled() simply isn't honored properly.
> 
> Adding Catalin and Will to the participants to see if they have input
> on that whole do_mem_abort() angle.

See include/linux/uaccess.h:

/*
 * The pagefault handler is in general disabled by pagefault_disable() or
 * when in irq context (via in_atomic()).
 *
 * This function should only be used by the fault handlers. Other users should
 * stick to pagefault_disabled().
 * Please NEVER use preempt_disable() to disable the fault handler. With
 * !CONFIG_PREEMPT_COUNT, this is like a NOP. So the handler won't be disabled.
 * in_atomic() will report different values based on !CONFIG_PREEMPT_COUNT.
 */
#define faulthandler_disabled() (pagefault_disabled() || in_atomic())

Especially the part where in_atomic() is not so useful with
!CONFIG_PREEMPT_COUNT. AFAIU, this is why we rely on explicit
pagefault disable/enable in tracing code.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com