Linus,
Ftrace fixes for 6.13:
- Fix output of trace when hashing a pointer is disabled
Pointers are recorded into the ring buffer by simply copying them.
When the ring buffer is read directly in binary format, the
pointers are unhashed, and tools like perf will show them as such.
But when the pointers are printed from the trace file, they are
hashed because the output from the trace uses vsnprintf() which will
hash "%p". Since the tracing infrastructure requires just as much
privileged as kallsyms this was an unneeded protection. An option
was created to allow the pointers to not be hashed, as in some cases
in debugging, the unhashed pointer was required.
To do this, the unhashing would add an "x" after a "%p" to make it
"%px" and not have the call to vsnprintf() hash. It used the iter->fmt
temp buffer to accomplish this. The problem was that the iter->fmt temp
buffer was already being used as a temp buffer to check if pointers in the
event format output were being called indirectly (like using a "%pI4" or
"%pI6") where the pointer may be pointing to a freed location. The check
code will catch that.
The issue was that when the hash pointer was being disabled, the
logic that used the temporary iter->fmt buffer passed that buffer
to the function that would test bad pointers and that function
would use iter->fmt buffer as well, causing the output to be
screwed up.
The solution was to change the bad pointer logic to use the iter->fmt
buffer directly and not as a temp buffer. If the fmt parameter passed
in was not the iter->fmt buffer, it would copy the content to the
iter->fmt buffer and process that buffer directly. This now allows
passing the iter->fmt buffer as the fmt parameter to the bad pointer
check function. The nohash function that added the "%px" would not be an
issue when passing the iter->fmt to the check function as the format to
check.
- Always try to initialize the idle functions when graph tracer starts
A bug was found where a CPU that is offline when graph tracing starts,
and then comes online, is not traced. The fix for that was
to move the initialization of the idle shadow stack over to the
hot plug online logic, which also handles onlining CPUs. The issue was
that it removed the initialization of the shadow stack when graph tracing
starts, but the callbacks to the hot plug logic do nothing if graph
tracing isn't currently running. Although that change fixed the onlining
of a CPU during tracing, it broke CPUs that were already online.
- Have microblaze not try to get the "true parent" in function tracing
If function tracing and graph tracing are both enabled at the same time
the parent of the functions traced by the function tracer may sometimes
be the graph tracing trampoline. The graph tracing hijacks the return
pointer of the function to trace it, but that can interfere with the
function tracing parent output. This was fixed by using the
ftrace_graph_ret_addr() function passing in the kernel stack pointer
using the ftrace_regs_get_stack_pointer() function. But Al Viro reported
that Microblaze does not implement the kernel_stack_pointer(regs)
helper function that ftrace_regs_get_stack_pointer() uses and fails
to compile when function graph tracing is enabled.
The real fix would be to have microblaze implement the kernel_stack_pointer()
function. For now, just make microblaze use the old logic which prints the
function graph trampoline in the function tracer.
Please pull the latest ftrace-v6.13-rc2 tree, which can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
ftrace-v6.13-rc2
Tag SHA1: bdc021d764d3b571031c195506eb0eed208b2087
Head SHA1: 607f2f4884c1a66c297cd0fb35d5523538d18b90
Steven Rostedt (3):
tracing: Fix trace output when pointer hash is disabled
fgraph: Still initialize idle shadow stacks when starting
ftrace/microblaze: Do not find "true_parent" for return address
----
kernel/trace/fgraph.c | 8 +++-
kernel/trace/trace.c | 90 ++++++++++++++++++++++++++----------------
kernel/trace/trace_functions.c | 3 +-
3 files changed, 64 insertions(+), 37 deletions(-)
---------------------------
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 0bf78517b5d4..ddedcb50917f 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1215,7 +1215,7 @@ void fgraph_update_pid_func(void)
static int start_graph_tracing(void)
{
unsigned long **ret_stack_list;
- int ret;
+ int ret, cpu;
ret_stack_list = kcalloc(FTRACE_RETSTACK_ALLOC_SIZE,
sizeof(*ret_stack_list), GFP_KERNEL);
@@ -1223,6 +1223,12 @@ static int start_graph_tracing(void)
if (!ret_stack_list)
return -ENOMEM;
+ /* The cpu_boot init_task->ret_stack will never be freed */
+ for_each_online_cpu(cpu) {
+ if (!idle_task(cpu)->ret_stack)
+ ftrace_graph_init_idle_task(idle_task(cpu), cpu);
+ }
+
do {
ret = alloc_retstack_tasklist(ret_stack_list);
} while (ret == -EAGAIN);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..b44b1cdaa20e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3711,8 +3711,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
{
long text_delta = 0;
long data_delta = 0;
- const char *p = fmt;
const char *str;
+ char save_ch;
+ char *buf = NULL;
+ char *p;
bool good;
int i, j;
@@ -3720,7 +3722,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
return;
if (static_branch_unlikely(&trace_no_verify))
- goto print;
+ goto print_fmt;
/*
* When the kernel is booted with the tp_printk command line
@@ -3735,8 +3737,21 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
/* Don't bother checking when doing a ftrace_dump() */
if (iter->fmt == static_fmt_buf)
- goto print;
+ goto print_fmt;
+ if (fmt != iter->fmt) {
+ int len = strlen(fmt);
+ while (iter->fmt_size < len + 1) {
+ /*
+ * If we can't expand the copy buffer,
+ * just print it.
+ */
+ if (!trace_iter_expand_format(iter))
+ goto print_fmt;
+ }
+ strscpy(iter->fmt, fmt, iter->fmt_size);
+ }
+ p = iter->fmt;
while (*p) {
bool star = false;
int len = 0;
@@ -3748,14 +3763,6 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
* as well as %p[sS] if delta is non-zero
*/
for (i = 0; p[i]; i++) {
- if (i + 1 >= iter->fmt_size) {
- /*
- * If we can't expand the copy buffer,
- * just print it.
- */
- if (!trace_iter_expand_format(iter))
- goto print;
- }
if (p[i] == '\\' && p[i+1]) {
i++;
@@ -3788,10 +3795,11 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
if (!p[i])
break;
- /* Copy up to the %s, and print that */
- strncpy(iter->fmt, p, i);
- iter->fmt[i] = '\0';
- trace_seq_vprintf(&iter->seq, iter->fmt, ap);
+ /* Print up to the %s */
+ save_ch = p[i];
+ p[i] = '\0';
+ trace_seq_vprintf(&iter->seq, p, ap);
+ p[i] = save_ch;
/* Add delta to %pS pointers */
if (p[i+1] == 'p') {
@@ -3837,6 +3845,8 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
good = trace_safe_str(iter, str, star, len);
}
+ p += i;
+
/*
* If you hit this warning, it is likely that the
* trace event in question used %s on a string that
@@ -3849,41 +3859,51 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
+#define TEMP_BUFSIZ 1024
+
+ if (!buf) {
+ buf = kmalloc(TEMP_BUFSIZ, GFP_KERNEL);
+ if (!buf) {
+ /* Need buffer to read address */
+ trace_seq_printf(&iter->seq, "(0x%px)[UNSAFE-MEMORY]", str);
+ p += j + 1;
+ goto print;
+ }
+ }
+ if (len >= TEMP_BUFSIZ)
+ len = TEMP_BUFSIZ - 1;
/* Try to safely read the string */
if (star) {
- if (len + 1 > iter->fmt_size)
- len = iter->fmt_size - 1;
- if (len < 0)
- len = 0;
- ret = copy_from_kernel_nofault(iter->fmt, str, len);
- iter->fmt[len] = 0;
- star = false;
+ ret = copy_from_kernel_nofault(buf, str, len);
+ buf[len] = 0;
} else {
- ret = strncpy_from_kernel_nofault(iter->fmt, str,
- iter->fmt_size);
+ ret = strncpy_from_kernel_nofault(buf, str, TEMP_BUFSIZ);
}
if (ret < 0)
trace_seq_printf(&iter->seq, "(0x%px)", str);
else
- trace_seq_printf(&iter->seq, "(0x%px:%s)",
- str, iter->fmt);
- str = "[UNSAFE-MEMORY]";
- strcpy(iter->fmt, "%s");
+ trace_seq_printf(&iter->seq, "(0x%px:%s)", str, buf);
+ trace_seq_puts(&iter->seq, "[UNSAFE-MEMORY]");
} else {
- strncpy(iter->fmt, p + i, j + 1);
- iter->fmt[j+1] = '\0';
+ save_ch = p[j + 1];
+ p[j + 1] = '\0';
+ if (star)
+ trace_seq_printf(&iter->seq, p, len, str);
+ else
+ trace_seq_printf(&iter->seq, p, str);
+ p[j + 1] = save_ch;
}
- if (star)
- trace_seq_printf(&iter->seq, iter->fmt, len, str);
- else
- trace_seq_printf(&iter->seq, iter->fmt, str);
- p += i + j + 1;
+ p += j + 1;
}
print:
if (*p)
trace_seq_vprintf(&iter->seq, p, ap);
+ kfree(buf);
+ return;
+ print_fmt:
+ trace_seq_vprintf(&iter->seq, fmt, ap);
}
const char *trace_event_format(struct trace_iterator *iter, const char *fmt)
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 74c353164ca1..a75d107a45f8 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -176,7 +176,8 @@ static void function_trace_start(struct trace_array *tr)
tracing_reset_online_cpus(&tr->array_buffer);
}
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+/* Microblaze currently doesn't implement kernel_stack_pointer() */
+#if defined(CONFIG_FUNCTION_GRAPH_TRACER) && !defined(CONFIG_MICROBLAZE)
static __always_inline unsigned long
function_get_true_parent_ip(unsigned long parent_ip, struct ftrace_regs *fregs)
{
On Sat, Dec 14, 2024 at 06:21:38PM -0500, Steven Rostedt wrote:
> The real fix would be to have microblaze implement the kernel_stack_pointer()
> function. For now, just make microblaze use the old logic which prints the
> function graph trampoline in the function tracer.

Correction: kernel_stack_pointer() implementation there would make it
_compile_, but it wouldn't work right. What it needs is
ftrace_regs_get_stack_pointer() *and* HAVE_ARCH_FTRACE_REGS defined to
suppress the use of defaults.

Background, one more time: kernel_stack_pointer() would give you the
kernel stack pointer by struct pt_regs (if that is possible in the first
place). ftrace_regs_get_stack_pointer() gives the kernel stack pointer by
whatever object _mcount() has set up. *IF* that object is identical to
struct pt_regs and kernel_stack_pointer() exists, you could cast the
pointer to a struct pt_regs * value and feed that to
kernel_stack_pointer().

For microblaze the _mcount-created object is *NOT* the same thing as
struct pt_regs, so the default would've done you no good even if
kernel_stack_pointer() existed on microblaze. See
arch/microblaze/kernel/mcount.S SAVE_REGS and compare with their
definition of struct pt_regs in arch/microblaze/include/uapi/asm/ptrace.h

IOW, immediate fix is right, but the long-term fix suggested in pull
request is not.
On Sun, 15 Dec 2024 04:48:43 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:
> IOW, immediate fix is right, but the long-term fix suggested in pull
> request is not.

Actually, this requires HAVE_DYNAMIC_FTRACE_WITH_ARGS, because without
that, the fregs passed in could be NULL. fregs is guaranteed to be
non-NULL for all archs that have that set. As microblaze doesn't have that
set, that would be the fix.

I'll have to resend this pull request anyway depending on the outcome of
the first commit.

-- Steve
On Sun, 15 Dec 2024 00:05:22 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I'll have to resend this pull request anyway depending on the outcome of
> the first commit.

Which will likely have to wait till Monday. I'm going to bed now and
tomorrow's my wife's birthday. We are going to do some wine touring.

-- Steve
On Sat, 14 Dec 2024 at 15:21, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> - Fix output of trace when hashing a pointer is disabled
Christ.
I'm looking at this, and my reaction is literally "tracing is doing
crazy things AGAIN".
This stuff is full of crazy special cases for things that should never
be done in the first place.
And - surprise surprise - it was buggy in odd ways as a result.
If I read the code right, this hacky code literally only catches stale
strings, but also has some *insane* code to do "text_delta" to
'%p[sS]' printouts.
And I see how it does that
text_delta = iter->tr->text_delta;
but I don't actually see where 'text_delta' would ever actually get
set. Where does tr->text_delta ever get set to anything but zero?
That code should be *removed*. It seems to be there entirely as a
sanity check, and when the sanity check itself is buggy, you don't try
to fix it, you remove the whole crazy garbage.
What makes '%s' so special in trace formats that it merits this
horrible hackery?
What makes 'text_delta' at all sane, never mind where it even gets set?
And why should we maintain that horrible hackery and make it worse?
This code shows a complete lack of taste. This needs extreme
crapectomy, not some alleged "fix".
If "people use stale pointers for '%s' and we actually care" is a real
issue, we could very possibly teach vsnprintf() about that. The code
already has a "check_pointer()" thing for just NULL pointer magic
printouts.
Linus
On Sat, 14 Dec 2024 16:37:39 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 14 Dec 2024 at 15:21, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > - Fix output of trace when hashing a pointer is disabled
>
> Christ.
>
> I'm looking at this, and my reaction is literally "tracing is doing
> crazy things AGAIN".
>
> This stuff is full of crazy special cases for things that should never
> be done in the first place.
>
> And - surprise surprise - it was buggy in odd ways as a result.
>
> If I read the code right, this hacky code literally only catches stale
> strings, but also has some *insane* code to do "text_delta" to
> '%p[sS]' printouts.
>
> And I see how it does that
>
> text_delta = iter->tr->text_delta;
>
> but I don't actually see where 'text_delta' would ever actually get
> set. Where does tr->text_delta ever get set to anything but zero?
It gets set by the persistent ring buffer code, via a call to
ring_buffer_last_boot_delta(). It's the delta between pointers from the
previous kernel boot up and the current kernel boot up. Without it, you
don't get function names in your function traces from the last boot. That
is, instead of having this:
<idle>-0 [003] d.h2. 156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: __rcu_read_unlock <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: stop_this_cpu <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: set_cpu_online <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462397: disable_local_APIC <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462398: clear_local_APIC <-disable_local_APIC
<idle>-0 [003] d.h2. 156.462574: mcheck_cpu_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: mce_intel_feature_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: lmce_supported <-mce_intel_feature_clear
You get this:
<idle>-0 [003] d.h2. 156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7
<idle>-0 [003] d.h2. 156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2
<idle>-0 [003] d.h2. 156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f
Due to KASLR.
>
> That code should be *removed*. It seems to be there entirely as a
> sanity check, and when the sanity check itself is buggy, you don't try
> to fix it, you remove the whole crazy garbage.
>
> What makes '%s' so special in trace formats that it merits this
> horrible hackery?
>
> What makes 'text_delta' at all sane, never mind where it even gets set?
>
> And why should we maintain that horrible hackery and make it worse?
>
> This code shows a complete lack of taste. This needs extreme
> crapectomy, not some alleged "fix".
>
> If "people use stale pointers for '%s' and we actually care" is a real
> issue, we could very possibly teach vsnprintf() about that. The code
> already has a "check_pointer()" thing for just NULL pointer magic
> printouts.
The check code was added because I was sick and tired of fixing bugs in
trace events. People would use the TRACE_EVENT() TP_printk() like a normal
printk with things like:
TP_fast_assign(
__entry->ipv6 = ptr_to_ipv6;
)
TP_printk("ipv6=%pI6", __entry->ipv6)
This is very buggy. Because the ipv6 pointer is added to the ring buffer at
the time of the trace event was triggered, but that "ipv6=%pI6" is executed
when the user reads the trace. Which can be minutes, hours, days, even
months later! In other words, it will read a pointer that has long been
freed.
With the check, instead it doesn't crash the kernel, but issues a nasty
WARN_ON() (as it is a bug in the kernel) and then reads the value using
safe mechanisms and injects into the trace "[UNSAFE-MEMORY]" by the pointer.
I haven't seen this bug show up again since I added that. That's because
developers see the warning when they add the buggy trace event, and it is
fixed before it ever reaches you.
Should I move that protection from the tracing code into vsnprintf()?
-- Steve
On Sat, 14 Dec 2024 at 19:03, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It gets set by the persistent ring buffer code, via a call to
> ring_buffer_last_boot_delta().
That seems entirely broken, since it basically depends on the kernel
being the same one. Which isn't even checked for.
Not to mention that it wouldn't work for modules anyway.
This kind of random hackery needs to DIE.
Tell people to turn off KASLR for cross-boot tracing. Or just not do it.
> > If "people use stale pointers for '%s' and we actually care" is a real
> > issue, we could very possibly teach vsnprintf() about that. The code
> > already has a "check_pointer()" thing for just NULL pointer magic
> > printouts.
>
> The check code was added because I was sick and tired of fixing bugs in
> trace events. People would use the TRACE_EVENT() TP_printk() like a normal
> printk with things like:
>
> TP_fast_assign(
> __entry->ipv6 = ptr_to_ipv6;
> )
>
> TP_printk("ipv6=%pI6", __entry->ipv6)
I think you are confused.
You are talking about the checks in test_event_printk(). That is fine
- that's a sanity check of the trace fmt string at trace event
registration time.
But that's not at all the code that the new "fix" is all about.
The new "fix" is for the disgusting horror that is
trace_check_vprintf(), which is the "I'll take the trace format
string, and I'll print it out partly using the regular vprintf()
machinery, but I'll partly dig into it and do horrendous things to
it".
And *THAT* is the disgusting code, and it only deals with '%s' (and
the odd symbol relocation case that I think is completely broken).
And the '%s' case we could at least partly handle in lib/vsprintf.c -
not the odd tracing special cases (that you might as well just handle
in test_event_printk()), but the "we can check that it doesn't fault".
Linus
On Sat, 14 Dec 2024 20:00:57 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 14 Dec 2024 at 19:03, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > It gets set by the persistent ring buffer code, via a call to
> > ring_buffer_last_boot_delta().
>
> That seems entirely broken, since it basically depends on the kernel
> being the same one. Which isn't even checked for.
>
> Not to mention that it wouldn't work for modules anyway.
>
> This kind of random hackery needs to DIE.
>
> Tell people to turn of KASLR for cross-boot tracing. Or just not do it.
We are using this for ChromeOS in ChromeBooks where we have to have KASLR
on. And yes, I've documented that it has to be the same kernel to work and
will not work for modules. That has been accepted by the users.
>
> > > If "people use stale pointers for '%s' and we actually care" is a real
> > > issue, we could very possibly teach vsnprintf() about that. The code
> > > already has a "check_pointer()" thing for just NULL pointer magic
> > > printouts.
> >
> > The check code was added because I was sick and tired of fixing bugs in
> > trace events. People would use the TRACE_EVENT() TP_printk() like a normal
> > printk with things like:
> >
> > TP_fast_assign(
> > __entry->ipv6 = ptr_to_ipv6;
> > )
> >
> > TP_printk("ipv6=%pI6", __entry->ipv6)
>
> I think you are confused.
>
> You are talking about the checks in test_event_printk(). That is fine
> - that's a sanity check of the trace fmt string at trace event
> registration time.
Ah, I used the wrong example. But it is still fixing the same bug. For %pI6
it is caught at registration time. But the trace_check_vprintk() was added
at the same time as test_event_printk(), to handle strings. That's because
RCU's trace events (and other trace events) pointed to static strings in
kernel data, and because the string pointer is in the ring buffer, it had
to be checked at run time and not registration time.
My mistake in using the "%pI6" as an example because, as you stated, that
is fixed at registration time. But we have the same bug with strings:
TP_fast_assign(
__entry->str = field->name;
)
TP_printk("str=%s", __entry->str)
Which is also a bug and actually a more common bug than the "%pX" ones. But
because of RCU and other trace events that point to static strings, it has to be
tested at run time.
>
> But that's not at all the code that the new "fix" is all about.
>
> The new "fix" is for the disgusting horror that is
> trace_check_vprintf(), which is the "I'll take the trace format
> string, and I'll print it out partly using the regular vprintf()
> machinery, but I'll partly dig into it and do horrendous things to
> it".
>
> And *THAT* is the disgusting code, and it only deals with '%s' (and
> the odd symbol relocation case that I think is completely broken).
The symbol relocation is a special case that just hooked into this code.
>
> And the '%s' case we could at least partly handle in lib/vsprintf.c -
> not the odd tracing special cases (that you might as well just handle
> in test_event_printk()), but the "we can check that it doesn't fault".
Again, it was added to fix the same issue that test_event_printk(), but for
strings because there's several trace events that point to static strings,
and that will be flagged as being a bug if it was done in test_event_printk().
-- Steve
On Sat, 14 Dec 2024 at 20:23, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> We are using this for ChromeOS in ChromeBooks where we have to have KASLR
> on. And yes, I've documented that it has to be the same kernel to work and
> will not work for modules. That has been accepted by the users.
"Accepted by users" is not an argument for sh*t code quality.
This smells of all the horrors you had in tracefs. Bad, unmaintainable
code, that caused endless "fix up bugs because the code was doing bad
things".
Now you're doing the same thing here. Bad, unmaintainable code that
basically misuses the vsprintf code instead of misusing the VFS code.
I can rip it out if you refuse to do so. In the kernel we don't put in
random hacks like this that "users are ok with". Code needs to make
sense and be maintainable, not be a pile of garbage hackery that then
results in even more hackery because the first hackery was broken and
caused problems.
Linus
On Sat, 14 Dec 2024 20:37:01 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 14 Dec 2024 at 20:23, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > We are using this for ChromeOS in ChromeBooks where we have to have
> > KASLR on. And yes, I've documented that it has to be the same kernel to
> > work and will not work for modules. That has been accepted by the
> > users.
>
> "Accepted by users" is not an argument for sh*t code quality.

That's just for the mapped code. I can easily add logic to make it fail if
the kernel signature does not match, and the deltas will be zero and the
buffer is cleared, and it will not affect anything here. I was planning on
adding that anyway. I can also make it ignore pointers to modules.

The use case is for always on tracing and when a crash happens, you can
retrieve the events that lead up to a crash. The text_delta was only added
to this because it was causing this logic to trigger WARN_ON()s when it
tried to read the origin address of the static strings.

> This smells of all the horrors you had in tracefs. Bad, unmaintainable
> code, that caused endless "fix up bugs because the code was doing bad
> things".
>
> Now you're doing the same thing here. Bad, unmaintainable code that
> basically misuses the vsprintf code instead of misusing the VFS code.
>
> I can rip it out if you refuse to do so. In the kernel we don't put in
> random hacks like this that "users are ok with". Code needs to make
> sense and be maintainable, not be a pile of garbage hackery that then
> results in even more hackery because the first hackery was broken and
> caused problems.

But main code of trace_check_vprintk() that you appear to be objecting to
is the hacking into vsnprintf() that was added back in 2021 to handle the
RCU and other events that point to their strings. I'm not sure how we can
check that before hand because we don't see the pointers until the event
is being printed.
Sure, we can harden the "%s" to just print "(fault)" or something, but then we don't know about it until it happens, and then we will likely get trace events that have string pointers to fields that may be freed in the future popping up in the ring buffer again. Testing will not uncover this because the strings will exist when the developer tests the events. But when a user sends a report back to someone and the trace has a bunch of "(fault)"s in it where strings should be, I'm the one that's going to be told that tracing is broken. -- Steve
On Sat, 14 Dec 2024 at 20:52, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But main code of trace_check_vprintk() that you appear to be objecting to
> is the hacking into vsnprintf() that was added back in 2021 to handle the
> RCU and other events that point to their strings. I'm not sure how we can
> check that before hand because we don't see the pointers until the event is
> being printed.
I'm objecting to ALL of it.
It's a debug feature that is *broken*. And not in some kind of subtle
way. In a very fundamental way.
Linus
On Sat, 14 Dec 2024 22:04:03 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> With the check, instead it doesn't crash the kernel, but issues a nasty
> WARN_ON() (as it is a bug in the kernel) and then reads the value using
> safe mechanisms and injects into the trace "[UNSAFE-MEMORY]" by the pointer.
I will also point out that the check triggers even if the pointer is still
around (hasn't been freed). That is, it makes sure that the contents of the
pointer is in the ring buffer event that is being printed, or if a %s, it
is also OK to point to static strings (as the RCU trace points all do that).
That is, even though:
TP_fast_assign(
__entry->ipv6 = ptr_to_ipv6;
)
TP_printk("ipv6=%pI6", __entry->ipv6)
is buggy, and the check will report it immediately even if the contents of
that ptr_to_ipv6 hasn't been freed yet. But it is perfectly fine with:
TP_fast_assign(
memcpy(__entry->ipv6, ptr_to_ipv6, 16);
)
TP_printk("ipv6=%pI6", &__entry->ipv6)
As the content of the ipv6 lives in the ring buffer itself and will not be
freed at the time the event is printed.
-- Steve
On Sat, 14 Dec 2024 at 19:11, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> That is, even though:
>
> TP_fast_assign(
> __entry->ipv6 = ptr_to_ipv6;
> )
>
> TP_printk("ipv6=%pI6", __entry->ipv6)
Again, you're not talking about the code that is being fixed.
I'm complaining about the mess that you try to fix in commit
ff54c8dc3f7a ("tracing: Fix trace output when pointer hash is
disabled"), which is that eldritch horror called
trace_check_vprintf().
You are describing something entirely different: the admittedly
similarly named test_event_printk(), called from the event init code.
I think trace_check_vprintf() should be removed entirely.
And I think test_event_printk() could probably be expanded to handle
at least some cases of '%s' (because, as you say, pointers to various
static allocations are fine for it).
And no, I don't know the code all that well, so maybe I'm overlooking
something, but trace_check_vprintf() really is a horror. And all your
arguments for it seem to be missing the mark and are actually
arguments for test_event_printk().
Linus
On Sat, 14 Dec 2024 at 20:06, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I think test_event_printk() could probably be expanded to handle
> at least some cases of '%s' (because, as you say, pointers to various
> static allocations are fine for it).
.. and I think we can at least protect against kernel faults in the
generic '%s' handling. We already do some of that in
check_pointer_msg, and it could be expanded upon with something like
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -695,13 +695,22 @@ static char *error_string(char *buf, ...
*/
static const char *check_pointer_msg(const void *ptr)
{
+ unsigned char val;
+
if (!ptr)
return "(null)";
if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
return "(efault)";
+ pagefault_disable();
+ __get_kernel_nofault(&val, (const char *)ptr, u8, Efault);
+ pagefault_enable();
return NULL;
+
+Efault:
+ pagefault_enable();
+ return "(bad address)";
}
which is admittedly
(a) ENTIRELY UNTESTED
(b) not exactly pretty
(c) only handles the first byte of the string
but it's at least very simple and cheap and fits in with the checks we
already do.
No, that vsprintf code wouldn't deal with any tracing-specific sanity
checks that say "the pointer is in the trace ring buffer", but as
mentioned, *that* code is entirely separate from the thing I'm
objecting to anyway.
Linus
On Sat, 14 Dec 2024 20:11:50 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> No, that vsprintf code wouldn't deal with any tracing-specific sanity
> checks that say "the pointer is in the trace ring buffer", but as
> mentioned, *that* code is entirely separate from the thing I'm
> objecting to anyway.
So what are you objecting to?
The current change, or the code that's already been there?
I'm also mistaken as to what caused the bug, and the Fixes tag is incorrect. I
thought the trace_check_vprintk() came before the code that added the
"%px", but it did come after. The "%px" code was added in 2020 and the
trace_check_vprintk() was added in 2021. The mistake was that it missed the
"hash-ptr" off case that then passes the iter->fmt as the "fmt" and the
trace_check_printk() was expecting the TP_printk() format to be the format
passed in. Which is the default case. The bug only happens when we clear
the hash-ptr option (which is on by default).
It should say:
Fixes: 9a6944fee68e2 ("tracing: Add a verifier to check string pointers for trace events")
-- Steve
On Sat, 14 Dec 2024 at 20:38, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> So what are you objecting to?
I'm objecting to "we wrote bad code, so let's hack it up some more".
> The current change, or the code that's already been there?
Both.
The current change looked so random that it made me go "WTF?" and then
I looked at the code it's changing.
The code is literally making assumptions about how va_list arguments
even work. It calls trace_seq_printf() and expects that to keep the
"va_list" argument in sync as it walks the argument list.
The code even seems to *know* how hacky and wrong it is, because it
does a RUNTIME CHECK for this in test_can_verify(). But that really is
just testing one random implementation. The whole approach
fundamentally doesn't work on some architectures at all, and on others
it would be broken by seq_buf_vprintf() (or vsprintf) doing a
va_copy() and using an internal version or any number of other things.
So the code IS WRONG.
It's wrong at a fundamental level. The thing where it modifies
iter->fmt in place is just a detail in the bigger picture of that
function being completely broken.
It's literally hacking together a really *bad* version of KASAN
checking, re-implementing 'vsnprintf()' *badly* by trying to use the
real vsnprintf() for the heavy lifting and literally hacking around
corner cases. And when I say that the "va_list" handling is buggy, I'm
*not* asking you to try to fix it. The va_list bug is just another
symptom of the deeper problem in that thing.
This is why I compared it to the tracefs situation. It has exactly the
same pattern: fundamentally misusing core kernel infrastructure,
hacking around it to get something that "works", but that is wrong on
a fundamental level, and then adding more hacks on top when some new
issue rears its ugly head.
I absolutely believe that commit ff54c8dc3f7a fixes some behavior.
But I'm looking at it and going "this code is not salvageable". All
you do is add more bandaids on top.
Do it right. And honestly, "right" in this case is almost certainly
"get rid of trace_check_vprintf() entirely".
The fact that you basically disable the thing already with a static
branch when 'va_list' doesn't work the way you want is a BIG sign.
Just disable it unconditionally.
Linus
On Sat, 14 Dec 2024 21:19:01 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

I'll start by saying that I'm fine with ripping this out, but I'm going
to expect to see bugs in trace events where people just reference strings
in the TP_printk() instead of using the assign_str() logic. I used to get
reports of this a few times a year, and have not had one since this code
was added back in 2021. That's over 3 years of this bug going away in
mainline.

> The code is literally making assumptions about how va_list arguments
> even work. It calls trace_seq_printf() and expects that to keep the
> "va_list" argument in sync as it walks the argument list.

I did my research on this before adding it. With the eventfs, I was very
open about not understanding the file system logic and asked for help,
which I received very little of. I had to go by tribal knowledge.

> The code even seems to *know* how hacky and wrong it is, because it
> does a RUNTIME CHECK for this in test_can_verify(). But that really is
> just testing one random implementation. The whole approach
> fundamentally doesn't work on some architectures at all, and on others
> it would be broken by seq_buf_vprintf() (or vsprintf) doing a
> va_copy() and using an internal version or any number of other things.
>
> So the code IS WRONG.

As I said, I did my homework on this and knew exactly what it was doing.
I did my research on va_list and it is possible to do this if the va_list
is a reference and not a copy (as my comment states), which is what the
verify was checking.

> It's wrong at a fundamental level. The thing where it modifies
> iter->fmt in place is just a detail in the bigger picture of that
> function being completely broken.

BTW, this code has had only two minor bugs in it since it was created,
and the last one was in March 2022. The bug here was simply that the temp
buffer it used could also be passed into the function. That would break
any algorithm. The logic itself was not the bug at all.
One solution I was thinking of doing was to introduce a second temp
buffer to use, and not change this code at all.

> It's literally hacking together a really *bad* version of KASAN
> checking, re-implementing 'vsnprintf()' *badly* by trying to use the
> real vsnprintf() for the heavy lifting and literally hacking around
> corner cases. And when I say that the "va_list" handling is buggy, I'm
> *not* asking you to try to fix it. The va_list bug is just another
> symptom of the deeper problem in that thing.

What va_list bug? That logic didn't break. The breakage came here:

	strncpy(iter->fmt, p, i);
	iter->fmt[i] = '\0';

Where it didn't expect p to be pointing to iter->fmt! It expected the fmt
parameter to point to the "const char * printk_fmt" of the trace event,
and not to iter->fmt itself.

My fault for not noticing in the call in trace_event_printf():

	trace_check_vprintf(iter, trace_event_format(iter, fmt), ap);

that trace_event_format() can return iter->fmt if the hash-ptr option is
disabled. p was pointing into iter->fmt because iter->fmt was passed into
the function as the "fmt" parameter that p was assigned to.

The fmt parameter started with "comm=%s", and in this case iter->fmt ==
fmt and p was iterating over it. When the fmt had:

	"comm=%s pid=%d"

we get to i = 5, and since fmt == iter->fmt and p == fmt, that means
p == iter->fmt! The

	iter->fmt[i] = '\0';

just changed p[i] to be '\0'. That was the bug. It had nothing to do with
the va_list logic.

	trace_seq_vprintf(&iter->seq, iter->fmt, ap);

would copy the "comm=" into the seq buffer, but then the code after that
was broken, as p[i] == '\0'.

	strncpy(iter->fmt, p + i, j + 1);
	iter->fmt[j+1] = '\0';

just copied an empty string to iter->fmt, so when we do:

	trace_seq_printf(&iter->seq, iter->fmt, str);

it also copied an empty string into the seq buffer. The result was:

	"comm= pid=1234"

> This is why I compared it to the tracefs situation. It has exactly the
> same pattern: fundamentally misusing core kernel infrastructure,
> hacking around it to get something that "works", but that is wrong on
> a fundamental level, and then adding more hacks on top when some new
> issue rears its ugly head.
>
> I absolutely believe that commit ff54c8dc3f7a fixes some behavior.
> But I'm looking at it and going "this code is not salvageable". All
> you do is add more bandaids on top.

That commit fixes the iter->fmt being passed into the function by making
the assumption that p will iterate over the temp buffer. It has nothing
to do with the current code that hasn't seen a bug since 2022, and has
stopped constant "%s" bugs since then. I could have replaced ff54c8dc3f7a
with using another temp buffer instead and not changed the logic of
trace_check_vprintf() at all. Hence, the va_list logic wasn't the bug.

> Do it right. And honestly, "right" in this case is almost certainly
> "get rid of trace_check_vprintf() entirely".

I can get rid of it, but I highly expect to see the "%s" bug start
showing up within a few months again.

> The fact that you basically disable the thing already with a static
> branch when 'va_list' doesn't work the way you want is a BIG sign.

It gets disabled because I know there were two implementations of
va_list: one that used pass by reference and another that used a copy.
This wouldn't work with the copy (32-bit x86 does that) but does work
with pass by reference (64-bit x86 does that). And given that the "%s"
bug stopped showing up after I implemented it, and I was even asked by
someone about the WARN_ON() triggering (but the bug never made it into
your tree), I know that it was the reason the bug stopped showing up.

> Just disable it unconditionally.
I can do that, but I'm not looking forward to seeing random crashes in
the trace event code again :-(

Honestly, I did not like this code when I wrote it, but I have no idea
how to stop the "%s" bug from happening before it gets out to production.
This worked. Do you have any suggestions for alternatives?

-- Steve
On Sun, 15 Dec 2024 at 02:04, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> What va_list bug? That logic didn't break.
That logic HAS NEVER EVER WORKED RIGHT.
You are literally mis-using va_list. The code is *wrong*. It depends
on the exact calling convention of varargs, and just happens to work
on many platforms.
And even when the calling conventions happened to match, it only
happens to work because of how vsnprintf() treats its incoming
argument.
So my complaint is literally that the code you are fixing is
unfixable. It's a "it happens to work" situation (except when it
didn't), not a "this code is right" situation.
And dammit, you must have known this code was garbage, because when I
do a "git blame", I see your name on that test_can_verify_check()
function.
Basically, you are walking the va_list in invalid ways, because you
want to use vsnprintf() in invalid ways. As part of that walk, you are
(a) fundamentally mis-using va_list and (b) modifying the format
string as you go along (both in the sense of actually changing the
format string, but also switching that format pointer around).
My issue with this is that (b) was buggy, but it was a bug that came
from a much more *fundamental* bug - the misuse of vsnprintf. Which
is why I'm saying "rip it out". Because your fix is fixing symptoms.
It's not fixing the fact that you are doing things that are wrong.
Looking up more, I also think the whole "replace "%p" with "%px" is
broken and wrong.
Isn't every single case of '%p' in this context from a TP_printk() situation?
IOW, instead of dynamically creating a temporary buffer and adding
that 'x' by hand, why wasn't that just a 'sed' script and done
statically?
In fact, wouldn't *most* of the sanity checking be possible to just do
entirely statically instead of at runtime?
Yeah, yeah, there's more than a few '%p' users, but
git grep 'TP_printk.*%p[^A-Za-z]' | wc
shows that it's manageable. That probably misses some multi-line
cases, but still - doing this kind of "rewrite constant string at
runtime because we didn't do it statically" seems *wrong*.
And in this case, that wrongness was literally the cause of the bug.
Linus
On Sun, 15 Dec 2024 09:23:18 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And dammit, you must have known this code was garbage, because when I
> do a "git blame", I see your name on that test_can_verify_check()
> function.

Yes, that was added in the same commit that added the
trace_check_vprintf() code.

> Basically, you are walking the va_list in invalid ways, because you
> want to use vsnprintf() in invalid ways. As part of that walk, you are
> (a) fundamentally mis-using va_list and (b) modifying the format
> string as you go along (both in the sense of actually changing the
> format string, but also switching that format pointer around).
>
> My issue with this is that (b) was buggy, but it was a bug that came
> from a much more *fundamental* bug - the misuse of vsnprintf. Which
> is why I'm saying "rip it out". Because your fix is fixing symptoms.
> It's not fixing the fact that you are doing things that are wrong.

I'm not disagreeing with you. I didn't like the code when I wrote it, but
I did write it because it was the only way I could stop the "%s" bug.

Last night, thinking about this, I think I have another solution that can
be added to test_event_printk() to flag the fields that can be printed
via "%s" and make sure that they are correct *before* processing the
vsnprintf() code. That should give the same warning and checks as
trace_check_vprintf(). I would also add a check in the vsnprintf() that
uses __get_kernel_nofault() like you suggested.

> Looking up more, I also think the whole "replace "%p" with "%px" is
> broken and wrong.
>
> Isn't every single case of '%p' in this context from a TP_printk() situation?
>
> IOW, instead of dynamically creating a temporary buffer and adding
> that 'x' by hand, why wasn't that just a 'sed' script and done
> statically?
>
> In fact, wouldn't *most* of the sanity checking be possible to just do
> entirely statically instead of at runtime?
>
> Yeah, yeah, there's more than a few '%p' users, but
>
>     git grep 'TP_printk.*%p[^A-Za-z]' | wc
>
> shows that it's manageable. That probably misses some multi-line
> cases, but still - doing this kind of "rewrite constant string at
> runtime because we didn't do it statically" seems *wrong*.
>
> And in this case, that wrongness was literally the cause of the bug.

I'm also OK with that. Should that be done for 6.13 or something to be
added for 6.14?

-- Steve
On Sun, 15 Dec 2024 at 17:24, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > IOW, instead of dynamically creating a temporary buffer and adding
> > that 'x' by hand, why wasn't that just a 'sed' script and done
> > statically?
>
> I'm also OK with that. Should that be done for 6.13 or something to be
> added for 6.14?
So a scripted thing could probably be done right now, assuming it's
obvious enough. Something like
git grep -l TP_printk.*%p |
xargs sed -i '/TP_printk/s/%p\([^a-zA-Z]\)/%px\1/'
would seem to do the RightThing(tm) for at least simple cases. And I
didn't actually find any cases of people using %p with precision
modifiers or anything like that, so "simple cases" seems to be all
that exists.
What the above does *not* do is handle any multi-line cases, and there
are probably several of those.
That said, honestly, looking at the resulting diff, I really think a
lot of those %p users are mindless drivel in the first place, and I am
not convinced that the real address is even wanted or needed. I think
people have completely mindlessly added "print out address" without
any actual reason for it.
I'm seeing things like just random usb_request pointers being printed
out, and I'm not convinced that is ever really useful or a situation
where a '%px' is *any* different from a plain '%p' (ie the main thing
I suspect you can do with that value is just say "it's the same
request", and the hashed value works just fine for that).
So I am not convinced that the original reason for the mindless
(runtime) conversion of '%p' to '%px' was really even well-motivated.
I think it was a bad idea, and implemented badly.
End result: instead of doing that automated sed-script (which is
certainly easy, but noisy), I actually think it might be much better
to just remove the runtime '%p' -> '%px' conversion entirely, and see
if anybody even notices - and then when somebody has a good reason for
actually caring about a random kernel data structure address, change
it at *THAT* point.
Linus
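As a sanity check, the sed expression proposed earlier in the thread can be
tried on a single sample line instead of the whole tree (the event fields
below are hypothetical). Note that without a `g` flag it rewrites only the
first `%p` on each line, and a `%p` with no following character is left
alone, matching the caveats already discussed:

```shell
# Try the TP_printk() %p -> %px rewrite on one sample line.
line='TP_printk("req=%p status=%d", __entry->req, __entry->status)'
echo "$line" | sed '/TP_printk/s/%p\([^a-zA-Z]\)/%px\1/'
# prints: TP_printk("req=%px status=%d", __entry->req, __entry->status)
```

Adding the `g` flag would cover lines with more than one `%p`; the
multi-line TP_printk() cases would still need manual attention, as noted.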
On Mon, 16 Dec 2024 09:59:35 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> That said, honestly, looking at the resulting diff, I really think a
> lot of those %p users are mindless drivel in the first place, and I am
> not convinced that the real address is even wanted or needed. I think
> people have completely mindlessly added "print out address" without
> any actual reason for it.
>
> I'm seeing things like just random usb_request pointers being printed
> out, and I'm not convinced that is ever really useful or a situation
> where a '%px' is *any* different from a plain '%p' (ie the main thing
> I suspect you can do with that value is just say "it's the same
> request", and the hashed value works just fine for that).
>
> So I am not convinced that the original reason for the mindless
> (runtime) conversion of '%p' to '%px' was really even well-motivated.
> I think it was a bad idea, and implemented badly.
>
> End result: instead of doing that automated sed-script (which is
> certainly easy, but noisy), I actually think it might be much better
> to just remove the runtime '%p' -> '%px' conversion entirely, and see
> if anybody even notices - and then when somebody has a good reason for
> actually caring about a random kernel data structure address, change
> it at *THAT* point.

Masami was the one that implemented this for debugging reasons.

Masami, what was the motivation behind using %px? I know you needed that
to debug crashes to compare some event pointers with where they exist in
the crash dump. But perhaps for those events that you used for crashes,
we can just change the TP_printk() over to use "%px" and remove the full
conversion of all events that you have.

Thoughts?

-- Steve
On Sun, 15 Dec 2024 20:24:04 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > Isn't every single case of '%p' in this context from a TP_printk() situation?
> >
> > IOW, instead of dynamically creating a temporary buffer and adding
> > that 'x' by hand, why wasn't that just a 'sed' script and done
> > statically?
> >
> > In fact, wouldn't *most* of the sanity checking be possible to just do
> > entirely statically instead of at runtime?
> >
> > Yeah, yeah, there's more than a few '%p' users, but
> >
> >     git grep 'TP_printk.*%p[^A-Za-z]' | wc
> >
> > shows that it's manageable. That probably misses some multi-line
> > cases, but still - doing this kind of "rewrite constant string at
> > runtime because we didn't do it statically" seems *wrong*.
> >
> > And in this case, that wrongness was literally the cause of the bug.
>
> I'm also OK with that. Should that be done for 6.13 or something to be
> added for 6.14?

One issue with just doing it manually is that it would need to be
maintained, as one or two "%p" are added per release cycle.

-- Steve
On Mon, 16 Dec 2024 at 05:32, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> One issue with just doing it manually, is that it would need to be
> maintained, as one or two "%p" are added per release cycle.
See my other thing. I think people add those things entirely
mindlessly without any thinking.
If you add garbage, you get garbage. I do *not* think "people add one
or two of these every release" is an argument for it being valid at
all.
Linus
On Sun, Dec 15, 2024 at 09:23:18AM -0800, Linus Torvalds wrote:
>
> You are literally mis-using va_list. The code is *wrong*. It depends
> on the exact calling convention of varargs, and just happens to work
> on many platforms.

It seems to me that the disagreement is fundamentally about whether we
can depend on implementation details, or the formal abstraction of
interfaces like varargs.

One school of thought is that we should only depend on the formally
defined abstraction, so that we are proof against random breakage caused
by compilers trying to win benchmarketing wars by proving that they are
0.001% faster, because that makes a big deal when you can advertise on
the back cover of Businessweek. (OK, that's really more a trick that
enterprise databases play, but you get the point.)

The other school of thought is that when trying to squeeze every last
cycle of performance (because *we* are the ones engaging in
benchmarketing wars), it's fair game to depend on implementation details
if it gets us a sufficiently large performance advantage, or if it
allows us to preserve interface semantics (perhaps for something which
was imprudently guaranteed by us when we or the code was younger and
more foolish, but we really don't want to break programs depending on
the current semantics).

I've been on both sides of this debate, although when I do the second,
it's often because I know something specific about my operating
environment (such as $WORK's data centers will *never* worry about using
big-endian systems, or some such). I *have* gotten in trouble when I do
this, so these days I insist on documenting with big red flags what
abstractions I am violating, and the justification for doing this, and
adding tests that check to make sure that the assumptions I am making
won't suddenly break with a new version of the compiler, or when someone
tries to do something like introduce Rust bindings that might not know
about the terrible assumptions we are making.

I'm not convinced that it's worth it in this particular case, so I think
I side with Linus here; maybe all of this hackery isn't worth it?
Steven, what am I missing? Why did we go down this particular path in
the first place? I assume there must have been something that seemed
like a good reason at the time?

- Ted
[ I'm back from a lovely day with my wife and friends visiting the
  Finger Lakes wineries, which I would love to share with the Linux
  Plumbers attendees if we ever get to have Plumbers at the Ithaca
  Conference center. ;-) ]

On Sun, 15 Dec 2024 16:40:34 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> I'm not convinced that it's worth it in this particular case, so I
> think I side with Linus here; maybe all of this hackery isn't worth
> it? Steven, what am I missing? Why did we go down this particular
> path in the first place? I assume there must have been something that
> seemed like a good reason at the time?

It has nothing to do with performance. It has only to do with debugging
and catching the issue where developers use "%s" in the TP_printk() of
TRACE_EVENT() for a string that can be freed after the trace event was
executed but before it is read from the ring buffer.

I stated in another email that I didn't like this solution when I wrote
it, but it was the only solution I could think of to catch this common
bug. But I also stated in another email that I think there's another
solution that doesn't deal with va_list implementations and should be
acceptable to Linus.

I'm not against ripping out the code. I agree with Linus that it's a
hack, but the hack that caught the bugs early seemed to me worth it at
the time.

-- Steve
On Sun, 15 Dec 2024 05:05:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Honestly, I did not like this code when I wrote it, but I have no idea how
> to stop the "%s" bug from happening before it gets out to production. This
> worked. Do you have any suggestions for alternatives?

My mind wouldn't stop thinking about this all night, and I think I found
an alternative.

The test_event_printk() catches everything but the "%s" issue, only
because we allow events to print to static strings and not have to copy
them into the ring buffer. But that function also has access to the
event fields. It can figure out which field is being printed by "%s",
and flag it. Then on output, it can check the field that is going to be
printed before the vsnprintf() is called on the TP_printk() format. If
it is a bad field, it can inject into the trace that the field is bad,
not print the event at all, and trigger a WARN_ON_ONCE().

Hmm,

-- Steve
On 2024-12-15 05:05, Steven Rostedt wrote:
> On Sat, 14 Dec 2024 21:19:01 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:

[...]

>>
>> Just disable it unconditionally.
>>
>
> I can do that, but I'm not looking forward to seeing random crashes in the
> trace event code again :-(
>
> Honestly, I did not like this code when I wrote it, but I have no idea how
> to stop the "%s" bug from happening before it gets out to production. This
> worked. Do you have any suggestions for alternatives?

IMHO, deferred execution of TP_printk() code in kernel context is
a fundamental mistake causing all those problems. This opens the
door to store pointers to strings (or anything else really)
that sit in kernel modules which can be unloaded between
tracing and TP_printk() execution, or as we are seeing here
pointers to data which can be mapped at different addresses
across kernel reboot, into the ring buffer.

If TP_printk() don't have access to load data from random kernel
memory in the first place, and can only read from the buffer, we
would not be having those misuses, and there would be nothing to
work-around as the strings/data would all be serialized into the
ring buffer.

In LTTng we've taken the approach to only read the trace data
at post-processing from user-space (we don't have the equivalent
of TP_printk(), and that's on purpose).

I wonder if we could keep the ftrace trace_pipe pretty-printing
behavior, while isolating the TP_printk() execution into a
userspace process which would only map the ring buffer? This way,
users trying to misuse TP_printk() would get immediate feedback
about their mistake because they cannot print the trace. We could
print a dmesg warning about crash of a usermode helper program,
for instance.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Sun, 15 Dec 2024 07:42:35 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2024-12-15 05:05, Steven Rostedt wrote:
> > On Sat, 14 Dec 2024 21:19:01 -0800
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> [...]
>
> >>
> >> Just disable it unconditionally.
> >>
> >
> > I can do that, but I'm not looking forward to seeing random crashes in the
> > trace event code again :-(
> >
> > Honestly, I did not like this code when I wrote it, but I have no idea how
> > to stop the "%s" bug from happening before it gets out to production. This
> > worked. Do you have any suggestions for alternatives?
>
> IMHO, deferred execution of TP_printk() code in kernel context is
> a fundamental mistake causing all those problems. This opens the
> door to store pointers to strings (or anything else really)
> that sit in kernel modules which can be unloaded between

Module unloading will clear out the ring buffers to prevent issues.

> tracing and TP_printk() execution, or as we are seeing here
> pointers to data which can be mapped at different addresses
> across kernel reboot, into the ring buffer.
>
> If TP_printk() don't have access to load data from random kernel
> memory in the first place, and can only read from the buffer, we
> would not be having those misuses, and there would be nothing to
> work-around as the strings/data would all be serialized into the
> ring buffer.
>
> In LTTng we've taken the approach to only read the trace data
> at post-processing from user-space (we don't have the equivalent
> of TP_printk(), and that's on purpose).
>
> I wonder if we could keep the ftrace trace_pipe pretty-printing
> behavior, while isolating the TP_printk() execution into a
> userspace process which would only map the ring buffer? This way,

That would change the entire use of tracefs, especially in the embedded
world. Note, this hasn't been a major issue since the test/check logic
was put in place. It catches pretty much all issues with the delayed
printing.

-- Steve

> users trying to misuse TP_printk() would get immediate feedback
> about their mistake because they cannot print the trace. We could
> print a dmesg warning about crash of a usermode helper program,
> for instance.
On 2024-12-15 08:47, Steven Rostedt wrote:
> On Sun, 15 Dec 2024 07:42:35 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> On 2024-12-15 05:05, Steven Rostedt wrote:
>>> On Sat, 14 Dec 2024 21:19:01 -0800
>>> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> [...]
>>
>>>>
>>>> Just disable it unconditionally.
>>>>
>>>
>>> I can do that, but I'm not looking forward to seeing random crashes in the
>>> trace event code again :-(
>>>
>>> Honestly, I did not like this code when I wrote it, but I have no idea how
>>> to stop the "%s" bug from happening before it gets out to production. This
>>> worked. Do you have any suggestions for alternatives?
>>
>> IMHO, deferred execution of TP_printk() code in kernel context is
>> a fundamental mistake causing all those problems. This opens the
>> door to store pointers to strings (or anything else really)
>> that sit in kernel modules which can be unloaded between
>
> Module unloading will clear out the ring buffers to prevent issues.

As a side-effect, issues caused by module unloading won't be observable
with tracing.

>
>> tracing and TP_printk() execution, or as we are seeing here
>> pointers to data which can be mapped at different addresses
>> across kernel reboot, into the ring buffer.
>>
>> If TP_printk() don't have access to load data from random kernel
>> memory in the first place, and can only read from the buffer, we
>> would not be having those misuses, and there would be nothing to
>> work-around as the strings/data would all be serialized into the
>> ring buffer.
>>
>> In LTTng we've taken the approach to only read the trace data
>> at post-processing from user-space (we don't have the equivalent
>> of TP_printk(), and that's on purpose).
>>
>> I wonder if we could keep the ftrace trace_pipe pretty-printing
>> behavior, while isolating the TP_printk() execution into a
>> userspace process which would only map the ring buffer? This way,
>
> That would change the entire use of tracefs, especially in the embedded
> world. Note, this hasn't been a major issue since the test/check logic was
> put in place. It catches pretty much all issues with the delayed printing.

This is not at all what I have in mind, so let me rephrase. What I am
saying is: is there a way we could execute TP_printk() in userspace mode
_while preserving the trace_pipe tracefs ABI_?

I suspect that inserting this small userspace program into the kernel
image with objcopy would be a start. Then adapting the usermode helper
code to run a program from a preexisting in-kernel copy could be a
second step. Then modifying trace_pipe so it blocks and communicates
with this helper program to consume the formatted output would come
last.

Thanks,

Mathieu

> -- Steve
>
>> users trying to misuse TP_printk() would get immediate feedback
>> about their mistake because they cannot print the trace. We could
>> print a dmesg warning about crash of a usermode helper program,
>> for instance.

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Sun, 15 Dec 2024 05:05:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> That commit fixes the iter->fmt being passed into the function by making
> the assumption that p will iterate over the temp buffer. It has nothing to
> do with the current code that hasn't seen a bug since 2022, and has stopped
> constant "%s" bugs since then.

In fact, I could have fixed this with this one line change:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..d990687335b1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3722,6 +3722,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
 	if (static_branch_unlikely(&trace_no_verify))
 		goto print;
 
+	/* If hash-ptr is not set, the fmt points to our temp buffer */
+	if (p == iter->fmt)
+		goto print;
+
 	/*
 	 * When the kernel is booted with the tp_printk command line
 	 * parameter, trace events go directly through to printk().

-- Steve