Linus,
Ftrace fixes for 6.13:
- Fix output of trace when hashing a pointer is disabled
Pointers are recorded into the ring buffer by simply copying them.
When the ring buffer is read directly in binary format, the
pointers are unhashed, and tools like perf will show them as such.
But when the pointers are printed from the trace file, they are
hashed because the output from the trace uses vsnprintf() which will
hash "%p". Since the tracing infrastructure requires just as much
privileged as kallsyms this was an unneeded protection. An option
was created to allow the pointers to not be hashed, as in some cases
in debugging, the unhashed pointer was required.
To do this, the unhashing would add an "x" after a "%p" to make it
"%px" and not have the call to vsnprintf() hash. It used the iter->fmt
temp buffer to accomplish this. The problem was that the iter->fmt temp
buffer was already being used as a temp buffer to check if pointers in the
event format output were being called indirectly (like using a "%pI4" or
"%pI6") where the pointer may be pointing to a freed location. The check
code will catch that.
The issue was that when the hash pointer was being disabled, the
logic that used the temporary iter->fmt buffer passed that buffer
to the function that would test bad pointers and that function
would use iter->fmt buffer as well, causing the output to be
screwed up.
The solution was to change the bad pointer logic to use the iter->fmt
buffer directly and not as a temp buffer. If the fmt parameter passed
in was not the iter->fmt buffer, it would copy the content to the
iter->fmt buffer and process that buffer directly. This now allows
passing the iter->fmt buffer as the fmt parameter to the bad pointer
check function. The nohash function that added the "%px" would not be an
issue when passing the iter->fmt to the check function as the format to
check.
- Always try to initialize the idle functions when graph tracer starts
A bug was found where a CPU that is offline when graph tracing starts,
and then comes online, is not traced. The fix for that was
to move the initialization of the idle shadow stack over to the
hot plug online logic, which also handles onlining CPUs. The issue was
that it removed the initialization of the shadow stack when graph tracing
starts, but the callbacks to the hot plug logic do nothing if graph
tracing isn't currently running. Although that change fixed the onlining
of a CPU during tracing, it broke CPUs that were already online.
- Have microblaze not try to get the "true parent" in function tracing
If function tracing and graph tracing are both enabled at the same time
the parent of the functions traced by the function tracer may sometimes
be the graph tracing trampoline. The graph tracing hijacks the return
pointer of the function to trace it, but that can interfere with the
function tracing parent output. This was fixed by using the
ftrace_graph_ret_addr() function passing in the kernel stack pointer
using the ftrace_regs_get_stack_pointer() function. But Al Viro reported
that Microblaze does not implement the kernel_stack_pointer(regs)
helper function that ftrace_regs_get_stack_pointer() uses and fails
to compile when function graph tracing is enabled.
The real fix would be to have microblaze implement the kernel_stack_pointer()
function. For now, just make microblaze use the old logic which prints the
function graph trampoline in the function tracer.
Please pull the latest ftrace-v6.13-rc2 tree, which can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
ftrace-v6.13-rc2
Tag SHA1: bdc021d764d3b571031c195506eb0eed208b2087
Head SHA1: 607f2f4884c1a66c297cd0fb35d5523538d18b90
Steven Rostedt (3):
tracing: Fix trace output when pointer hash is disabled
fgraph: Still initialize idle shadow stacks when starting
ftrace/microblaze: Do not find "true_parent" for return address
----
kernel/trace/fgraph.c | 8 +++-
kernel/trace/trace.c | 90 ++++++++++++++++++++++++++----------------
kernel/trace/trace_functions.c | 3 +-
3 files changed, 64 insertions(+), 37 deletions(-)
---------------------------
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 0bf78517b5d4..ddedcb50917f 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1215,7 +1215,7 @@ void fgraph_update_pid_func(void)
static int start_graph_tracing(void)
{
unsigned long **ret_stack_list;
- int ret;
+ int ret, cpu;
ret_stack_list = kcalloc(FTRACE_RETSTACK_ALLOC_SIZE,
sizeof(*ret_stack_list), GFP_KERNEL);
@@ -1223,6 +1223,12 @@ static int start_graph_tracing(void)
if (!ret_stack_list)
return -ENOMEM;
+ /* The cpu_boot init_task->ret_stack will never be freed */
+ for_each_online_cpu(cpu) {
+ if (!idle_task(cpu)->ret_stack)
+ ftrace_graph_init_idle_task(idle_task(cpu), cpu);
+ }
+
do {
ret = alloc_retstack_tasklist(ret_stack_list);
} while (ret == -EAGAIN);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..b44b1cdaa20e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3711,8 +3711,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
{
long text_delta = 0;
long data_delta = 0;
- const char *p = fmt;
const char *str;
+ char save_ch;
+ char *buf = NULL;
+ char *p;
bool good;
int i, j;
@@ -3720,7 +3722,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
return;
if (static_branch_unlikely(&trace_no_verify))
- goto print;
+ goto print_fmt;
/*
* When the kernel is booted with the tp_printk command line
@@ -3735,8 +3737,21 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
/* Don't bother checking when doing a ftrace_dump() */
if (iter->fmt == static_fmt_buf)
- goto print;
+ goto print_fmt;
+ if (fmt != iter->fmt) {
+ int len = strlen(fmt);
+ while (iter->fmt_size < len + 1) {
+ /*
+ * If we can't expand the copy buffer,
+ * just print it.
+ */
+ if (!trace_iter_expand_format(iter))
+ goto print_fmt;
+ }
+ strscpy(iter->fmt, fmt, iter->fmt_size);
+ }
+ p = iter->fmt;
while (*p) {
bool star = false;
int len = 0;
@@ -3748,14 +3763,6 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
* as well as %p[sS] if delta is non-zero
*/
for (i = 0; p[i]; i++) {
- if (i + 1 >= iter->fmt_size) {
- /*
- * If we can't expand the copy buffer,
- * just print it.
- */
- if (!trace_iter_expand_format(iter))
- goto print;
- }
if (p[i] == '\\' && p[i+1]) {
i++;
@@ -3788,10 +3795,11 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
if (!p[i])
break;
- /* Copy up to the %s, and print that */
- strncpy(iter->fmt, p, i);
- iter->fmt[i] = '\0';
- trace_seq_vprintf(&iter->seq, iter->fmt, ap);
+ /* Print up to the %s */
+ save_ch = p[i];
+ p[i] = '\0';
+ trace_seq_vprintf(&iter->seq, p, ap);
+ p[i] = save_ch;
/* Add delta to %pS pointers */
if (p[i+1] == 'p') {
@@ -3837,6 +3845,8 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
good = trace_safe_str(iter, str, star, len);
}
+ p += i;
+
/*
* If you hit this warning, it is likely that the
* trace event in question used %s on a string that
@@ -3849,41 +3859,51 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
+#define TEMP_BUFSIZ 1024
+
+ if (!buf) {
+ buf = kmalloc(TEMP_BUFSIZ, GFP_KERNEL);
+ if (!buf) {
+ /* Need buffer to read address */
+ trace_seq_printf(&iter->seq, "(0x%px)[UNSAFE-MEMORY]", str);
+ p += j + 1;
+ goto print;
+ }
+ }
+ if (len >= TEMP_BUFSIZ)
+ len = TEMP_BUFSIZ - 1;
/* Try to safely read the string */
if (star) {
- if (len + 1 > iter->fmt_size)
- len = iter->fmt_size - 1;
- if (len < 0)
- len = 0;
- ret = copy_from_kernel_nofault(iter->fmt, str, len);
- iter->fmt[len] = 0;
- star = false;
+ ret = copy_from_kernel_nofault(buf, str, len);
+ buf[len] = 0;
} else {
- ret = strncpy_from_kernel_nofault(iter->fmt, str,
- iter->fmt_size);
+ ret = strncpy_from_kernel_nofault(buf, str, TEMP_BUFSIZ);
}
if (ret < 0)
trace_seq_printf(&iter->seq, "(0x%px)", str);
else
- trace_seq_printf(&iter->seq, "(0x%px:%s)",
- str, iter->fmt);
- str = "[UNSAFE-MEMORY]";
- strcpy(iter->fmt, "%s");
+ trace_seq_printf(&iter->seq, "(0x%px:%s)", str, buf);
+ trace_seq_puts(&iter->seq, "[UNSAFE-MEMORY]");
} else {
- strncpy(iter->fmt, p + i, j + 1);
- iter->fmt[j+1] = '\0';
+ save_ch = p[j + 1];
+ p[j + 1] = '\0';
+ if (star)
+ trace_seq_printf(&iter->seq, p, len, str);
+ else
+ trace_seq_printf(&iter->seq, p, str);
+ p[j + 1] = save_ch;
}
- if (star)
- trace_seq_printf(&iter->seq, iter->fmt, len, str);
- else
- trace_seq_printf(&iter->seq, iter->fmt, str);
- p += i + j + 1;
+ p += j + 1;
}
print:
if (*p)
trace_seq_vprintf(&iter->seq, p, ap);
+ kfree(buf);
+ return;
+ print_fmt:
+ trace_seq_vprintf(&iter->seq, fmt, ap);
}
const char *trace_event_format(struct trace_iterator *iter, const char *fmt)
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 74c353164ca1..a75d107a45f8 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -176,7 +176,8 @@ static void function_trace_start(struct trace_array *tr)
tracing_reset_online_cpus(&tr->array_buffer);
}
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+/* Microblaze currently doesn't implement kernel_stack_pointer() */
+#if defined(CONFIG_FUNCTION_GRAPH_TRACER) && !defined(CONFIG_MICROBLAZE)
static __always_inline unsigned long
function_get_true_parent_ip(unsigned long parent_ip, struct ftrace_regs *fregs)
{
On Sat, Dec 14, 2024 at 06:21:38PM -0500, Steven Rostedt wrote:
> The real fix would be to have microblaze implement the kernel_stack_pointer()
> function. For now, just make microblaze use the old logic which prints the
> function graph trampoline in the function tracer.

Correction: kernel_stack_pointer() implementation there would make it
_compile_, but it wouldn't work right. What it needs is
ftrace_regs_get_stack_pointer() *and* HAVE_ARCH_FTRACE_REGS defined to
suppress the use of defaults.

Background, one more time: kernel_stack_pointer() would give you the
kernel stack pointer by struct pt_regs (if that is possible in the first
place). ftrace_regs_get_stack_pointer() gives the kernel stack pointer by
whatever object _mcount() has set up. *IF* that object is identical to
struct pt_regs and kernel_stack_pointer() exists, you could cast the
pointer to a struct pt_regs * value and feed that to
kernel_stack_pointer().

For microblaze the _mcount-created object is *NOT* the same thing as
struct pt_regs, so the default would've done you no good even if
kernel_stack_pointer() existed on microblaze. See
arch/microblaze/kernel/mcount.S SAVE_REGS and compare with their
definition of struct pt_regs in arch/microblaze/include/uapi/asm/ptrace.h

IOW, immediate fix is right, but the long-term fix suggested in pull
request is not.
On Sun, 15 Dec 2024 04:48:43 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:
> IOW, immediate fix is right, but the long-term fix suggested in pull
> request is not.

Actually, this requires HAVE_DYNAMIC_FTRACE_WITH_ARGS, because without
that, the fregs passed in could be NULL. fregs is guaranteed to be
non-NULL for all archs that have that set. As microblaze doesn't have that
set, that would be the fix.

I'll have to resend this pull request anyway depending on the outcome of
the first commit.

-- Steve
On Sun, 15 Dec 2024 00:05:22 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I'll have to resend this pull request anyway depending on the outcome of
> the first commit.

Which will likely have to wait till Monday. I'm going to bed now and
tomorrow's my wife's birthday. We are going to do some wine touring.

-- Steve
On Sat, 14 Dec 2024 at 15:21, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> - Fix output of trace when hashing a pointer is disabled
Christ.
I'm looking at this, and my reaction is literally "tracing is doing
crazy things AGAIN".
This stuff is full of crazy special cases for things that should never
be done in the first place.
And - surprise surprise - it was buggy in odd ways as a result.
If I read the code right, this hacky code literally only catches stale
strings, but also has some *insane* code to do "text_delta" to
'%p[sS]' printouts.
And I see how it does that
text_delta = iter->tr->text_delta;
but I don't actually see where 'text_delta' would ever actually get
set. Where does tr->text_delta ever get set to anything but zero?
That code should be *removed*. It seems to be there entirely as a
sanity check, and when the sanity check itself is buggy, you don't try
to fix it, you remove the whole crazy garbage.
What makes '%s' so special in trace formats that it merits this
horrible hackery?
What makes 'text_delta' at all sane, never mind where it even gets set?
And why should we maintain that horrible hackery and make it worse?
This code shows a complete lack of taste. This needs extreme
crapectomy, not some alleged "fix".
If "people use stale pointers for '%s' and we actually care" is a real
issue, we could very possibly teach vsnprintf() about that. The code
already has a "check_pointer()" thing for just NULL pointer magic
printouts.
Linus
On Sat, 14 Dec 2024 16:37:39 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 14 Dec 2024 at 15:21, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > - Fix output of trace when hashing a pointer is disabled
>
> Christ.
>
> I'm looking at this, and my reaction is literally "tracing is doing
> crazy things AGAIN".
>
> This stuff is full of crazy special cases for things that should never
> be done in the first place.
>
> And - surprise surprise - it was buggy in odd ways as a result.
>
> If I read the code right, this hacky code literally only catches stale
> strings, but also has some *insane* code to do "text_delta" to
> '%p[sS]' printouts.
>
> And I see how it does that
>
> text_delta = iter->tr->text_delta;
>
> but I don't actually see where 'text_delta' would ever actually get
> set. Where does tr->text_delta ever get set to anything but zero?
It gets set by the persistent ring buffer code, via a call to
ring_buffer_last_boot_delta(). It's the delta between pointers from the
previous kernel boot up and the current kernel boot up. Without it, you
don't get function names in your function traces from the last boot. That
is, instead of having this:
<idle>-0 [003] d.h2. 156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: __rcu_read_unlock <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: stop_this_cpu <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: set_cpu_online <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462397: disable_local_APIC <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462398: clear_local_APIC <-disable_local_APIC
<idle>-0 [003] d.h2. 156.462574: mcheck_cpu_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: mce_intel_feature_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: lmce_supported <-mce_intel_feature_clear
You get this:
<idle>-0 [003] d.h2. 156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7
<idle>-0 [003] d.h2. 156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2
<idle>-0 [003] d.h2. 156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f
Due to KASLR.
>
> That code should be *removed*. It seems to be there entirely as a
> sanity check, and when the sanity check itself is buggy, you don't try
> to fix it, you remove the whole crazy garbage.
>
> What makes '%s' so special in trace formats that it merits this
> horrible hackery?
>
> What makes 'text_delta' at all sane, never mind where it even gets set?
>
> And why should we maintain that horrible hackery and make it worse?
>
> This code shows a complete lack of taste. This needs extreme
> crapectomy, not some alleged "fix".
>
> If "people use stale pointers for '%s' and we actually care" is a real
> issue, we could very possibly teach vsnprintf() about that. The code
> already has a "check_pointer()" thing for just NULL pointer magic
> printouts.
The check code was added because I was sick and tired of fixing bugs in
trace events. People would use the TRACE_EVENT() TP_printk() like a normal
printk with things like:
TP_fast_assign(
__entry->ipv6 = ptr_to_ipv6;
)
TP_printk("ipv6=%pI6", __entry->ipv6)
This is very buggy. Because the ipv6 pointer is added to the ring buffer at
the time of the trace event was triggered, but that "ipv6=%pI6" is executed
when the user reads the trace. Which can be minutes, hours, days, even
months later! In other words, it will read a pointer that has long been
freed.
With the check, instead it doesn't crash the kernel, but issues a nasty
WARN_ON() (as it is a bug in the kernel) and then reads the value using
safe mechanisms and injects into the trace "[UNSAFE-MEMORY]" by the pointer.
I haven't seen this bug show up again since I added that. That's because
developers see the warning when they add the buggy trace event, and it is
fixed before it ever reaches you.
Should I move that protection from the tracing code into vsnprintf()?
-- Steve
On Sat, 14 Dec 2024 at 19:03, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> It gets set by the persistent ring buffer code, via a call to
> ring_buffer_last_boot_delta().
That seems entirely broken, since it basically depends on the kernel
being the same one. Which isn't even checked for.
Not to mention that it wouldn't work for modules anyway.
This kind of random hackery needs to DIE.
Tell people to turn off KASLR for cross-boot tracing. Or just not do it.
> > If "people use stale pointers for '%s' and we actually care" is a real
> > issue, we could very possibly teach vsnprintf() about that. The code
> > already has a "check_pointer()" thing for just NULL pointer magic
> > printouts.
>
> The check code was added because I was sick and tired of fixing bugs in
> trace events. People would use the TRACE_EVENT() TP_printk() like a normal
> printk with things like:
>
> TP_fast_assign(
> __entry->ipv6 = ptr_to_ipv6;
> )
>
> TP_printk("ipv6=%pI6", __entry->ipv6)
I think you are confused.
You are talking about the checks in test_event_printk(). That is fine
- that's a sanity check of the trace fmt string at trace event
registration time.
But that's not at all the code that the new "fix" is all about.
The new "fix" is for the disgusting horror that is
trace_check_vprintf(), which is the "I'll take the trace format
string, and I'll print it out partly using the regular vprintf()
machinery, but I'll partly dig into it and do horrendous things to
it".
And *THAT* is the disgusting code, and it only deals with '%s' (and
the odd symbol relocation case that I think is completely broken).
And the '%s' case we could at least partly handle in lib/vsprintf.c -
not the odd tracing special cases (that you might as well just handle
in test_event_printk()), but the "we can check that it doesn't fault".
Linus
On Sat, 14 Dec 2024 20:00:57 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 14 Dec 2024 at 19:03, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > It gets set by the persistent ring buffer code, via a call to
> > ring_buffer_last_boot_delta().
>
> That seems entirely broken, since it basically depends on the kernel
> being the same one. Which isn't even checked for.
>
> Not to mention that it wouldn't work for modules anyway.
>
> This kind of random hackery needs to DIE.
>
> Tell people to turn of KASLR for cross-boot tracing. Or just not do it.
We are using this for ChromeOS in ChromeBooks where we have to have KASLR
on. And yes, I've documented that it has to be the same kernel to work and
will not work for modules. That has been accepted by the users.
>
> > > If "people use stale pointers for '%s' and we actually care" is a real
> > > issue, we could very possibly teach vsnprintf() about that. The code
> > > already has a "check_pointer()" thing for just NULL pointer magic
> > > printouts.
> >
> > The check code was added because I was sick and tired of fixing bugs in
> > trace events. People would use the TRACE_EVENT() TP_printk() like a normal
> > printk with things like:
> >
> > TP_fast_assign(
> > __entry->ipv6 = ptr_to_ipv6;
> > )
> >
> > TP_printk("ipv6=%pI6", __entry->ipv6)
>
> I think you are confused.
>
> You are talking about the checks in test_event_printk(). That is fine
> - that's a sanity check of the trace fmt string at trace event
> registration time.
Ah, I used the wrong example. But it is still fixing the same bug. For %pI6
it is caught at registration time. But the trace_check_vprintk() was added
at the same time as test_event_printk(), to handle strings. That's because
RCU's trace events (and other trace events) pointed to static strings in
kernel data, and because the string pointer is in the ring buffer, it had
to be checked at run time and not registration time.
My mistake in using the "%pI6" as an example because, as you stated, that
is fixed at registration time. But we have the same bug with strings:
TP_fast_assign(
__entry->str = field->name;
)
TP_printk("str=%s", __entry->str)
Which is also a bug and actually a more common bug than the "%pX" ones. But
because of RCU and other trace events that point to static strings, it has to be
tested at run time.
>
> But that's not at all the code that the new "fix" is all about.
>
> The new "fix" is for the disgusting horror that is
> trace_check_vprintf(), which is the "I'll take the trace format
> string, and I'll print it out partly using the regular vprintf()
> machinery, but I'll partly dig into it and do horrendous things to
> it".
>
> And *THAT* is the disgusting code, and it only deals with '%s' (and
> the odd symbol relocation case that I think is completely broken).
The symbol relocation is a special case that just hooked into this code.
>
> And the '%s' case we could at least partly handle in lib/vsprintf.c -
> not the odd tracing special cases (that you might as well just handle
> in test_event_printk()), but the "we can check that it doesn't fault".
Again, it was added to fix the same issue that test_event_printk(), but for
strings because there's several trace events that point to static strings,
and that will be flagged as being a bug if it was done in test_event_printk().
-- Steve
On Sat, 14 Dec 2024 at 20:23, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> We are using this for ChromeOS in ChromeBooks where we have to have KASLR
> on. And yes, I've documented that it has to be the same kernel to work and
> will not work for modules. That has been accepted by the users.
"Accepted by users" is not an argument for sh*t code quality.
This smells of all the horrors you had in tracefs. Bad, unmaintainable
code, that caused endless "fix up bugs because the code was doing bad
things".
Now you're doing the same thing here. Bad, unmaintainable code that
basically misuses the vsprintf code instead of misusing the VFS code.
I can rip it out if you refuse to do so. In the kernel we don't put in
random hacks like this that "users are ok with". Code needs to make
sense and be maintainable, not be a pile of garbage hackery that then
results in even more hackery because the first hackery was broken and
caused problems.
Linus
On Sat, 14 Dec 2024 20:37:01 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sat, 14 Dec 2024 at 20:23, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > We are using this for ChromeOS in ChromeBooks where we have to have
> > KASLR on. And yes, I've documented that it has to be the same kernel to
> > work and will not work for modules. That has been accepted by the
> > users.
>
> "Accepted by users" is not an argument for sh*t code quality.

That's just for the mapped code. I can easily add logic to make it fail if
the kernel signature does not match, and the deltas will be zero and the
buffer is cleared, and it will not affect anything here. I was planning on
adding that anyway. I can also make it ignore pointers to modules.

The use case is for always on tracing and when a crash happens, you can
retrieve the events that lead up to a crash. The text_delta was only added
to this because it was causing this logic to trigger WARN_ON()s when it
tried to read the origin address of the static strings.

> This smells of all the horrors you had in tracefs. Bad, unmaintainable
> code, that caused endless "fix up bugs because the code was doing bad
> things".
>
> Now you're doing the same thing here. Bad, unmaintainable code that
> basically misuses the vsprintf code instead of misusing the VFS code.
>
> I can rip it out if you refuse to do so. In the kernel we don't put in
> random hacks like this that "users are ok with". Code needs to make
> sense and be maintainable, not be a pile of garbage hackery that then
> results in even more hackery because the first hackery was broken and
> caused problems.

But main code of trace_check_vprintk() that you appear to be objecting to
is the hacking into vsnprintf() that was added back in 2021 to handle the
RCU and other events that point to their strings. I'm not sure how we can
check that before hand because we don't see the pointers until the event
is being printed.
Sure, we can harden the "%s" to just print "(fault)" or something, but then we don't know about it until it happens, and then we will likely get trace events that have string pointers to fields that may be freed in the future popping up in the ring buffer again. Testing will not uncover this because the strings will exist when the developer tests the events. But when a user sends a report back to someone and the trace has a bunch of "(fault)"s in it where strings should be, I'm the one that's going to be told that tracing is broken. -- Steve
On Sat, 14 Dec 2024 at 20:52, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But main code of trace_check_vprintk() that you appear to be objecting to
> is the hacking into vsnprintf() that was added back in 2021 to handle the
> RCU and other events that point to their strings. I'm not sure how we can
> check that before hand because we don't see the pointers until the event is
> being printed.
I'm objecting to ALL of it.
It's a debug feature that is *broken*. And not in some kind of subtle
way. In a very fundamental way.
Linus
On Sat, 14 Dec 2024 22:04:03 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> With the check, instead it doesn't crash the kernel, but issues a nasty
> WARN_ON() (as it is a bug in the kernel) and then reads the value using
> safe mechanisms and injects into the trace "[UNSAFE-MEMORY]" by the pointer.
I will also point out that the check triggers even if the pointer is still
around (hasn't been freed). That is, it makes sure that the contents of the
pointer is in the ring buffer event that is being printed, or if a %s, it
is also OK to point to static strings (as the RCU trace points all do that).
That is, even though:
TP_fast_assign(
__entry->ipv6 = ptr_to_ipv6;
)
TP_printk("ipv6=%pI6", __entry->ipv6)
is buggy, and the check will report it immediately even if the contents of
that ptr_to_ipv6 hasn't been freed yet. But it is perfectly fine with:
TP_fast_assign(
memcpy(__entry->ipv6, ptr_to_ipv6, 16);
)
TP_printk("ipv6=%pI6", &__entry->ipv6)
As the content of the ipv6 lives in the ring buffer itself and will not be
freed at the time the event is printed.
-- Steve
On Sat, 14 Dec 2024 at 19:11, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> That is, even though:
>
> TP_fast_assign(
> __entry->ipv6 = ptr_to_ipv6;
> )
>
> TP_printk("ipv6=%pI6", __entry->ipv6)
Again, you're not talking about the code that is being fixed.
I'm complaining about the mess that you try to fix in commit
ff54c8dc3f7a ("tracing: Fix trace output when pointer hash is
disabled"), which is that eldritch horror called
trace_check_vprintf().
You are describing something entirely different: the admittedly
similarly named test_event_printk(), called from the event init code.
I think trace_check_vprintf() should be removed entirely.
And I think test_event_printk() could probably be expanded to handle
at least some cases of '%s' (because, as you say, pointers to various
static allocations are fine for it).
And no, I don't know the code all that well, so maybe I'm overlooking
something, but trace_check_vprintf() really is a horror. And all your
arguments for it seem to be missing the mark and are actually
arguments for test_event_printk().
Linus
On Sat, 14 Dec 2024 at 20:06, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And I think test_event_printk() could probably be expanded to handle
> at least some cases of '%s' (because, as you say, pointers to various
> static allocations are fine for it).
.. and I think we can at least protect against kernel faults in the
generic '%s' handling. We already do some of that in
check_pointer_msg, and it could be expanded upon with something like
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -695,13 +695,22 @@ static char *error_string(char *buf, ...
*/
static const char *check_pointer_msg(const void *ptr)
{
+ unsigned char val;
+
if (!ptr)
return "(null)";
if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
return "(efault)";
+ pagefault_disable();
+ __get_kernel_nofault(&val, (const char *)ptr, u8, Efault);
+ pagefault_enable();
return NULL;
+
+Efault:
+ pagefault_enable();
+ return "(bad address)";
}
which is admittedly
(a) ENTIRELY UNTESTED
(b) not exactly pretty
(c) only handles the first byte of the string
but it's at least very simple and cheap and fits in with the checks we
already do.
No, that vsprintf code wouldn't deal with any tracing-specific sanity
checks that say "the pointer is in the trace ring buffer", but as
mentioned, *that* code is entirely separate from the thing I'm
objecting to anyway.
Linus
On Sat, 14 Dec 2024 20:11:50 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> No, that vsprintf code wouldn't deal with any tracing-specific sanity
> checks that say "the pointer is in the trace ring buffer", but as
> mentioned, *that* code is entirely separate from the thing I'm
> objecting to anyway.
So what are you objecting to?
The current change, or the code that's already been there?
I'm also mistaken as to what caused the bug, and the Fixes tag is incorrect. I
thought the trace_check_vprintk() came before the code that added the
"%px", but it did come after. The "%px" code was added in 2020 and the
trace_check_vprintk() was added in 2021. The mistake was that it missed the
"hash-ptr" off case that then passes the iter->fmt as the "fmt" and the
trace_check_printk() was expecting the TP_printk() format to be the format
passed in. Which is the default case. The bug only happens when we clear
the hash-ptr option (which is on by default).
It should say:
Fixes: 9a6944fee68e2 ("tracing: Add a verifier to check string pointers for trace events")
-- Steve
On Sat, 14 Dec 2024 at 20:38, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> So what are you objecting to?
I'm objecting to "we wrote bad code, so let's hack it up some more".
> The current change, or the code that's already been there?
Both.
The current change looked so random that it made me go "WTF?" and then
I looked at the code it's changing.
The code is literally making assumptions about how va_list arguments
even work. It calls trace_seq_printf() and expects that to keep the
"va_list" argument in sync as it walks the argument list.
The code even seems to *know* how hacky and wrong it is, because it
does a RUNTIME CHECK for this in test_can_verify(). But that really is
just testing one random implementation. The whole approach
fundamentally doesn't work on some architectures at all, and on others
it would be broken by seq_buf_vprintf() (or vsprintf) doing a
va_copy() and using an internal version or any number of other things.
So the code IS WRONG.
It's wrong at a fundamental level. The thing where it modifies
iter->fmt in place is just a detail in the bigger picture of that
function being completely broken.
It's literally hacking together a really *bad* version of KASAN
checking, re-implementing 'vsnprintf()' *badly* by trying to use the
real vsnprintf() for the heavy lifting and literally hacking around
corner cases. And when I say that the "va_list" handling is buggy, I'm
*not* asking you to try to fix it. The va_list bug is just another
symptom of the deeper problem in that thing.
This is why I compared it to the tracefs situation. It has exactly the
same pattern: fundamentally misusing core kernel infrastructure,
hacking around it to get something that "works", but that is wrong on
a fundamental level, and then adding more hacks on top when some new
issue rears its ugly head.
I absolutely believe that commit ff54c8dc3f7a fixes some behavior.
But I'm looking at it and going "this code is not salvageable". All
you do is add more bandaids on top.
Do it right. And honestly, "right" in this case is almost certainly
"get rid of trace_check_vprintf() entirely".
The fact that you basically disable the thing already with a static
branch when 'va_list' doesn't work the way you want is a BIG sign.
Just disable it unconditionally.
Linus
On Sat, 14 Dec 2024 21:19:01 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

I'll start by saying that I'm fine with ripping this out, but I'm going
to expect to see bugs in trace events where people just reference strings
in the TP_printk() instead of using the assign_str() logic. I used to get
reports of this a few times a year, and have not had one since this code
was added back in 2021. That's over 3 years of this bug going away in
mainline.

> The code is literally making assumptions about how va_list arguments
> even work. It calls trace_seq_printf() and expects that to keep the
> "va_list" argument in sync as it walks the argument list.

I did my research on this before adding it. With the eventfs, I was very
open about not understanding the file system logic and asked for help,
which I received very little of. I had to go by tribal knowledge.

> The code even seems to *know* how hacky and wrong it is, because it
> does a RUNTIME CHECK for this in test_can_verify(). But that really is
> just testing one random implementation. The whole approach
> fundamentally doesn't work on some architectures at all, and on others
> it would be broken by seq_buf_vprintf() (or vsprintf) doing a
> va_copy() and using an internal version or any number of other things.
>
> So the code IS WRONG.

As I said, I did my homework on this and knew exactly what it was doing.
I did my research on va_list and it is possible to do this if the va_list
is a reference and not a copy (as my comment states), which is what the
verify was checking.

> It's wrong at a fundamental level. The thing where it modifies
> iter->fmt in place is just a detail in the bigger picture of that
> function being completely broken.

BTW, this code has had only two minor bugs in it since it was created,
and the last one was in March 2022. The bug here was simply that the temp
buffer it used could also be passed into the function. That would break
any algorithm. The logic itself was not the bug at all.
One solution I was thinking of doing was to introduce a second temp
buffer to use, and not change this code at all.

> It's literally hacking together a really *bad* version of KASAN
> checking, re-implementing 'vsnprintf()' *badly* by trying to use the
> real vsnprintf() for the heavy lifting and literally hacking around
> corner cases. And when I say that the "va_list" handling is buggy, I'm
> *not* asking you to try to fix it. The va_list bug is just another
> symptom of the deeper problem in that thing.

What va_list bug? That logic didn't break. The breakage came here:

	strncpy(iter->fmt, p, i);
	iter->fmt[i] = '\0';

Where it didn't expect p to be pointing to iter->fmt! It expected the fmt
parameter to point to the "const char * printk_fmt" of the trace event,
and not to iter->fmt itself.

My fault for not noticing in the call in trace_event_printf():

	trace_check_vprintf(iter, trace_event_format(iter, fmt), ap);

that trace_event_format() can return iter->fmt if the hash-ptr option is
disabled. p was pointing into iter->fmt because iter->fmt was passed into
the function as the "fmt" parameter that p was assigned to.

The fmt parameter started with "comm=%s", and in this case iter->fmt ==
fmt and p was iterating over it. When the fmt had:

	"comm=%s pid=%d"

we get to i = 5, and since fmt == iter->fmt and p == fmt, that means
p == iter->fmt! The

	iter->fmt[i] = '\0';

just changed p[i] to be '\0'. That was the bug. It had nothing to do with
the va_list logic.

	trace_seq_vprintf(&iter->seq, iter->fmt, ap);

would copy the "comm=" into the seq buffer, but then the code after that
was broken, as p[i] == '\0'.

	strncpy(iter->fmt, p + i, j + 1);
	iter->fmt[j+1] = '\0';

just copied an empty string to iter->fmt, so when we do:

	trace_seq_printf(&iter->seq, iter->fmt, str);

it also copied an empty string into the seq buffer. The result was:

	"comm= pid=1234"

> This is why I compared it to the tracefs situation. It has exactly the
> same pattern: fundamentally misusing core kernel infrastructure,
> hacking around it to get something that "works", but that is wrong on
> a fundamental level, and then adding more hacks on top when some new
> issue rears its ugly head.
>
> I absolutely believe that commit ff54c8dc3f7a fixes some behavior.
> But I'm looking at it and going "this code is not salvageable". All
> you do is add more bandaids on top.

That commit fixes the iter->fmt being passed into the function by making
the assumption that p will iterate over the temp buffer. It has nothing
to do with the current code that hasn't seen a bug since 2022, and has
stopped constant "%s" bugs since then. I could have replaced ff54c8dc3f7a
with using another temp buffer instead and not changed the logic of
trace_check_vprintf() at all. Hence, the va_list logic wasn't the bug.

> Do it right. And honestly, "right" in this case is almost certainly
> "get rid of trace_check_vprintf() entirely".

I can get rid of it, but I highly expect to see the "%s" bug start
showing up within a few months again.

> The fact that you basically disable the thing already with a static
> branch when 'va_list' doesn't work the way you want is a BIG sign.

It gets disabled because I know there were two implementations of
va_list: one that used pass by reference and another that used a copy.
This wouldn't work with the copy (32-bit x86 does that) but does work
with pass by reference (64-bit x86 does that). And given that the "%s"
bug stopped showing up after I implemented it, and I was even asked by
someone about the WARN_ON() triggering (but the bug never made it into
your tree), I know that it was the reason the bug stopped showing up.

> Just disable it unconditionally.
I can do that, but I'm not looking forward to seeing random crashes in
the trace event code again :-(

Honestly, I did not like this code when I wrote it, but I have no idea
how to stop the "%s" bug from happening before it gets out to production.
This worked. Do you have any suggestions for alternatives?

-- Steve
On Sun, 15 Dec 2024 at 02:04, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> What va_list bug? That logic didn't break.
That logic HAS NEVER EVER WORKED RIGHT.
You are literally mis-using va_list. The code is *wrong*. It depends
on the exact calling convention of varargs, and just happens to work
on many platforms.
And even when the calling conventions happened to match, it only
happens to work because of how vsnprintf() treats its incoming
argument.
So my complaint is literally that the code you are fixing is
unfixable. It's a "it happens to work" situation (except when it
didn't), not a "this code is right" situation.
And dammit, you must have known this code was garbage, because when I
do a "git blame", I see your name on that test_can_verify_check()
function.
Basically, you are walking the va_list in invalid ways, because you
want to use vsnprintf() in invalid ways. As part of that walk, you are
(a) fundamentally mis-using va_list and (b) modifying the format
string as you go along (both in the sense of actually changing the
format string, but also switching that format pointer around).
My issue with this is that (b) was buggy, but it was a bug that came
from a much more *fundamental* bug - the misuse of vsnprintf. Which
is why I'm saying "rip it out". Because your fix is fixing symptoms.
It's not fixing the fact that you are doing things that are wrong.
Looking up more, I also think the whole "replace "%p" with "%px" is
broken and wrong.
Isn't every single case of '%p' in this context from a TP_printk() situation?
IOW, instead of dynamically creating a temporary buffer and adding
that 'x' by hand, why wasn't that just a 'sed' script and done
statically?
In fact, wouldn't *most* of the sanity checking be possible to just do
entirely statically instead of at runtime?
Yeah, yeah, there's more than a few '%p' users, but
git grep 'TP_printk.*%p[^A-Za-z]' | wc
shows that it's manageable. That probably misses some multi-line
cases, but still - doing this kind of "rewrite constant string at
runtime because we didn't do it statically" seems *wrong*.
And in this case, that wrongness was literally the cause of the bug.
Linus
On Sun, 15 Dec 2024 09:23:18 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And dammit, you must have known this code was garbage, because when I
> do a "git blame", I see your name on that test_can_verify_check()
> function.

Yes, that was added in the same commit that added the
trace_check_vprintf() code.

> Basically, you are walking the va_list in invalid ways, because you
> want to use vsnprintf() in invalid ways. As part of that walk, you are
> (a) fundamentally mis-using va_list and (b) modifying the format
> string as you go along (both in the sense of actually changing the
> format string, but also switching that format pointer around).
>
> My issue with this is that (b) was buggy, but it was a bug that came
> from a much more *fundamental* bug - the misuse of vsnprintf. Which
> is why I'm saying "rip it out". Because your fix is fixing symptoms.
> It's not fixing the fact that you are doing things that are wrong.

I'm not disagreeing with you. I didn't like the code when I wrote it, but
I did write it because it was the only way I could stop the "%s" bug.

Last night, thinking about this, I think I have another solution that can
be added to test_event_printk() to flag the fields that can be printed
via "%s" and make sure that they are correct *before* processing the
vsnprintf() code. That should give the same warning and checks as
trace_check_vprintf(). I would also add a check in the vsnprintf() that
uses __get_kernel_nofault() like you suggested.

> Looking up more, I also think the whole "replace "%p" with "%px" is
> broken and wrong.
>
> Isn't every single case of '%p' in this context from a TP_printk() situation?
>
> IOW, instead of dynamically creating a temporary buffer and adding
> that 'x' by hand, why wasn't that just a 'sed' script and done
> statically?
>
> In fact, wouldn't *most* of the sanity checking be possible to just do
> entirely statically instead of at runtime?
>
> Yeah, yeah, there's more than a few '%p' users, but
>
>     git grep 'TP_printk.*%p[^A-Za-z]' | wc
>
> shows that it's manageable. That probably misses some multi-line
> cases, but still - doing this kind of "rewrite constant string at
> runtime because we didn't do it statically" seems *wrong*.
>
> And in this case, that wrongness was literally the cause of the bug.

I'm also OK with that. Should that be done for 6.13 or something to be
added for 6.14?

-- Steve
On Sun, 15 Dec 2024 at 17:24, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > IOW, instead of dynamically creating a temporary buffer and adding
> > that 'x' by hand, why wasn't that just a 'sed' script and done
> > statically?
>
> I'm also OK with that. Should that be done for 6.13 or something to be
> added for 6.14?
So a scripted thing could probably be done right now, assuming it's
obvious enough. Something like
git grep -l TP_printk.*%p |
xargs sed -i '/TP_printk/s/%p\([^a-zA-Z]\)/%px\1/'
would seem to do the RightThing(tm) for at least simple cases. And I
didn't actually find any cases of people using %p with precision
modifiers or anything like that, so "simple cases" seems to be all
that exists.
What the above does *not* do is handle any multi-line cases, and there
are probably several of those.
That said, honestly, looking at the resulting diff, I really think a
lot of those %p users are mindless drivel in the first place, and I am
not convinced that the real address is even wanted or needed. I think
people have completely mindlessly added "print out address" without
any actual reason for it.
I'm seeing things like just random usb_request pointers being printed
out, and I'm not convinced that is ever really useful or a situation
where a '%px' is *any* different from a plain '%p' (ie the main thing
I suspect you can do with that value is just say "it's the same
request", and the hashed value works just fine for that).
So I am not convinced that the original reason for the mindless
(runtime) conversion of '%p' to '%px' was really even well-motivated.
I think it was a bad idea, and implemented badly.
End result: instead of doing that automated sed-script (which is
certainly easy, but noisy), I actually think it might be much better
to just remove the runtime '%p' -> '%px' conversion entirely, and see
if anybody even notices - and then when somebody has a good reason for
actually caring about a random kernel data structure address, change
it at *THAT* point.
Linus
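As a sanity check, the sed expression proposed earlier in the thread can be
tried on a single sample line instead of the whole tree (the event fields
below are hypothetical). Note that without a `g` flag it rewrites only the
first `%p` on each line, and a `%p` with no following character is left
alone, matching the caveats already discussed:

```shell
# Try the TP_printk() %p -> %px rewrite on one sample line.
line='TP_printk("req=%p status=%d", __entry->req, __entry->status)'
echo "$line" | sed '/TP_printk/s/%p\([^a-zA-Z]\)/%px\1/'
# prints: TP_printk("req=%px status=%d", __entry->req, __entry->status)
```

Adding the `g` flag would cover lines with more than one `%p`; the
multi-line TP_printk() cases would still need manual attention, as noted.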
On Mon, 16 Dec 2024 09:59:35 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> That said, honestly, looking at the resulting diff, I really think a
> lot of those %p users are mindless drivel in the first place, and I am
> not convinced that the real address is even wanted or needed. I think
> people have completely mindlessly added "print out address" without
> any actual reason for it.
>
> I'm seeing things like just random usb_request pointers being printed
> out, and I'm not convinced that is ever really useful or a situation
> where a '%px' is *any* different from a plain '%p' (ie the main thing
> I suspect you can do with that value is just say "it's the same
> request", and the hashed value works just fine for that).
>
> So I am not convinced that the original reason for the mindless
> (runtime) conversion of '%p' to '%px' was really even well-motivated.
> I think it was a bad idea, and implemented badly.
>
> End result: instead of doing that automated sed-script (which is
> certainly easy, but noisy), I actually think it might be much better
> to just remove the runtime '%p' -> '%px' conversion entirely, and see
> if anybody even notices - and then when somebody has a good reason for
> actually caring about a random kernel data structure address, change
> it at *THAT* point.

Masami was the one that implemented this for debugging reasons.

Masami, what was the motivation behind using %px? I know you needed that
to debug crashes to compare some event pointers with where they exist in
the crash dump. But perhaps for those events that you used for crashes,
we can just change the TP_printk() over to use "%px" and remove the full
conversion of all events that you have.

Thoughts?

-- Steve
On Sun, 15 Dec 2024 20:24:04 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > Isn't every single case of '%p' in this context from a TP_printk() situation?
> >
> > IOW, instead of dynamically creating a temporary buffer and adding
> > that 'x' by hand, why wasn't that just a 'sed' script and done
> > statically?
> >
> > In fact, wouldn't *most* of the sanity checking be possible to just do
> > entirely statically instead of at runtime?
> >
> > Yeah, yeah, there's more than a few '%p' users, but
> >
> >     git grep 'TP_printk.*%p[^A-Za-z]' | wc
> >
> > shows that it's manageable. That probably misses some multi-line
> > cases, but still - doing this kind of "rewrite constant string at
> > runtime because we didn't do it statically" seems *wrong*.
> >
> > And in this case, that wrongness was literally the cause of the bug.
>
> I'm also OK with that. Should that be done for 6.13 or something to be
> added for 6.14?

One issue with just doing it manually is that it would need to be
maintained, as one or two "%p" are added per release cycle.

-- Steve
On Mon, 16 Dec 2024 at 05:32, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> One issue with just doing it manually, is that it would need to be
> maintained, as one or two "%p" are added per release cycle.
See my other thing. I think people add those things entirely
mindlessly without any thinking.
If you add garbage, you get garbage. I do *not* think "people add one
or two of these every release" is an argument for it being valid at
all.
Linus
On Sun, Dec 15, 2024 at 09:23:18AM -0800, Linus Torvalds wrote:
>
> You are literally mis-using va_list. The code is *wrong*. It depends
> on the exact calling convention of varargs, and just happens to work
> on many platforms.

It seems to me that the disagreement is fundamentally about whether we
can depend on implementation details, or the formal abstraction of
interfaces like varargs.

One school of thought is that we should only depend on the formally
defined abstraction, so that we are proof against random breakage caused
by compilers trying to win benchmarketing wars by proving that they are
0.001% faster, because that makes a big deal when you can advertise on
the back cover of Businessweek. (OK, that's really more a trick that
enterprise databases play, but you get the point.)

The other school of thought is that when trying to squeeze every last
cycle of performance (because *we* are the ones engaging in
benchmarketing wars), it's fair game to depend on implementation details
if it gets us a sufficiently large performance advantage, or if it
allows us to preserve interface semantics (perhaps for something which
was imprudently guaranteed by us when we or the code was younger and
more foolish, but we really don't want to break programs depending on
the current semantics).

I've been on both sides of this debate, although when I do the second,
it's often because I know something specific about my operating
environment (such as $WORK's data centers will *never* worry about using
big-endian systems, or some such). I *have* gotten in trouble when I do
this, so these days I insist on documenting with big red flags what
abstractions I am violating, and the justification for doing this, and
adding tests that check to make sure that the assumptions I am making
won't suddenly break with a new version of the compiler, or when someone
tries to do something like introduce Rust bindings that might not know
about the terrible assumptions we are making.

I'm not convinced that it's worth it in this particular case, so I think
I side with Linus here; maybe all of this hackery isn't worth it?
Steven, what am I missing? Why did we go down this particular path in
the first place? I assume there must have been something that seemed
like a good reason at the time?

- Ted
[ I'm back from a lovely day with my wife and friends visiting the
  Finger Lakes wineries, which I would love to share with the Linux
  Plumbers attendees if we ever get to have Plumbers at the Ithaca
  Conference center. ;-) ]

On Sun, 15 Dec 2024 16:40:34 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> I'm not convinced that it's worth it in this particular case, so I
> think I side with Linus here; maybe all of this hackery isn't worth
> it? Steven, what am I missing? Why did we go down this particular
> path in the first place? I assume there must have been something that
> seemed like a good reason at the time?

It has nothing to do with performance. It has only to do with debugging
and catching the issue where developers use "%s" in the TP_printk() of
TRACE_EVENT() for a string that can be freed after the trace event was
executed but before it is read from the ring buffer.

I stated in another email that I didn't like this solution when I wrote
it, but it was the only solution I could think of to catch this common
bug. But I also stated in another email that I think there's another
solution that doesn't deal with va_list implementations and should be
acceptable to Linus.

I'm not against ripping out the code. I agree with Linus that it's a
hack, but the hack that caught the bugs early seemed to me worth it at
the time.

-- Steve
On Sun, 15 Dec 2024 05:05:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Honestly, I did not like this code when I wrote it, but I have no idea how
> to stop the "%s" bug from happening before it gets out to production. This
> worked. Do you have any suggestions for alternatives?

My mind wouldn't stop thinking about this all night, and I think I found
an alternative.

The test_event_printk() catches everything but the "%s" issue, only
because we allow events to print to static strings and not have to copy
them into the ring buffer. But that function also has access to the
event fields. It can figure out which field is being printed by "%s",
and flag it. Then on output, it can check the field that is going to be
printed before the vsnprintf() is called on the TP_printk() format. If
it is a bad field, it can inject into the trace that the field is bad,
not print the event at all, and trigger a WARN_ON_ONCE().

Hmm,

-- Steve
On 2024-12-15 05:05, Steven Rostedt wrote:
> On Sat, 14 Dec 2024 21:19:01 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:

[...]

>>
>> Just disable it unconditionally.
>>
>
> I can do that, but I'm not looking forward to seeing random crashes in the
> trace event code again :-(
>
> Honestly, I did not like this code when I wrote it, but I have no idea how
> to stop the "%s" bug from happening before it gets out to production. This
> worked. Do you have any suggestions for alternatives?

IMHO, deferred execution of TP_printk() code in kernel context is
a fundamental mistake causing all those problems. This opens the
door to store pointers to strings (or anything else really)
that sit in kernel modules which can be unloaded between
tracing and TP_printk() execution, or as we are seeing here
pointers to data which can be mapped at different addresses
across kernel reboot, into the ring buffer.

If TP_printk() don't have access to load data from random kernel
memory in the first place, and can only read from the buffer, we
would not be having those misuses, and there would be nothing to
work-around as the strings/data would all be serialized into the
ring buffer.

In LTTng we've taken the approach to only read the trace data
at post-processing from user-space (we don't have the equivalent
of TP_printk(), and that's on purpose).

I wonder if we could keep the ftrace trace_pipe pretty-printing
behavior, while isolating the TP_printk() execution into a
userspace process which would only map the ring buffer? This way,
users trying to misuse TP_printk() would get immediate feedback
about their mistake because they cannot print the trace. We could
print a dmesg warning about crash of a usermode helper program,
for instance.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Sun, 15 Dec 2024 07:42:35 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2024-12-15 05:05, Steven Rostedt wrote:
> > On Sat, 14 Dec 2024 21:19:01 -0800
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> [...]
>
> >>
> >> Just disable it unconditionally.
> >>
> >
> > I can do that, but I'm not looking forward to seeing random crashes in the
> > trace event code again :-(
> >
> > Honestly, I did not like this code when I wrote it, but I have no idea how
> > to stop the "%s" bug from happening before it gets out to production. This
> > worked. Do you have any suggestions for alternatives?
>
> IMHO, deferred execution of TP_printk() code in kernel context is
> a fundamental mistake causing all those problems. This opens the
> door to store pointers to strings (or anything else really)
> that sit in kernel modules which can be unloaded between

Module unloading will clear out the ring buffers to prevent issues.

> tracing and TP_printk() execution, or as we are seeing here
> pointers to data which can be mapped at different addresses
> across kernel reboot, into the ring buffer.
>
> If TP_printk() don't have access to load data from random kernel
> memory in the first place, and can only read from the buffer, we
> would not be having those misuses, and there would be nothing to
> work-around as the strings/data would all be serialized into the
> ring buffer.
>
> In LTTng we've taken the approach to only read the trace data
> at post-processing from user-space (we don't have the equivalent
> of TP_printk(), and that's on purpose).
>
> I wonder if we could keep the ftrace trace_pipe pretty-printing
> behavior, while isolating the TP_printk() execution into a
> userspace process which would only map the ring buffer? This way,

That would change the entire use of tracefs, especially in the embedded
world. Note, this hasn't been a major issue since the test/check logic
was put in place. It catches pretty much all issues with the delayed
printing.

-- Steve

> users trying to misuse TP_printk() would get immediate feedback
> about their mistake because they cannot print the trace. We could
> print a dmesg warning about crash of a usermode helper program,
> for instance.
On 2024-12-15 08:47, Steven Rostedt wrote:
> On Sun, 15 Dec 2024 07:42:35 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> On 2024-12-15 05:05, Steven Rostedt wrote:
>>> On Sat, 14 Dec 2024 21:19:01 -0800
>>> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> [...]
>>
>>>>
>>>> Just disable it unconditionally.
>>>>
>>>
>>> I can do that, but I'm not looking forward to seeing random crashes in the
>>> trace event code again :-(
>>>
>>> Honestly, I did not like this code when I wrote it, but I have no idea how
>>> to stop the "%s" bug from happening before it gets out to production. This
>>> worked. Do you have any suggestions for alternatives?
>>
>> IMHO, deferred execution of TP_printk() code in kernel context is
>> a fundamental mistake causing all those problems. This opens the
>> door to store pointers to strings (or anything else really)
>> that sit in kernel modules which can be unloaded between
>
> Module unloading will clear out the ring buffers to prevent issues.

As a side-effect, issues caused by module unloading won't be observable
with tracing.

>
>> tracing and TP_printk() execution, or as we are seeing here
>> pointers to data which can be mapped at different addresses
>> across kernel reboot, into the ring buffer.
>>
>> If TP_printk() don't have access to load data from random kernel
>> memory in the first place, and can only read from the buffer, we
>> would not be having those misuses, and there would be nothing to
>> work-around as the strings/data would all be serialized into the
>> ring buffer.
>>
>> In LTTng we've taken the approach to only read the trace data
>> at post-processing from user-space (we don't have the equivalent
>> of TP_printk(), and that's on purpose).
>>
>> I wonder if we could keep the ftrace trace_pipe pretty-printing
>> behavior, while isolating the TP_printk() execution into a
>> userspace process which would only map the ring buffer? This way,
>
> That would change the entire use of tracefs, especially in the embedded
> world. Note, this hasn't been a major issue since the test/check logic was
> put in place. It catches pretty much all issues with the delayed printing.

This is not at all what I have in mind, so let me rephrase. What I am
saying is: is there a way we could execute TP_printk() in userspace mode
_while preserving the trace_pipe tracefs ABI_?

I suspect that inserting this small userspace program into the kernel
image with objcopy would be a start. Then adapting the usermode helper
code to run a program from a preexisting in-kernel copy could be a
second step. Then modifying trace_pipe so it blocks and communicates
with this helper program to consume the formatted output would come
last.

Thanks,

Mathieu

> -- Steve
>
>> users trying to misuse TP_printk() would get immediate feedback
>> about their mistake because they cannot print the trace. We could
>> print a dmesg warning about crash of a usermode helper program,
>> for instance.

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Sun, 15 Dec 2024 05:05:17 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> That commit fixes the iter->fmt being passed into the function by making
> the assumption that p will iterate over the temp buffer. It has nothing to
> do with the current code that hasn't seen a bug since 2022, and has stopped
> constant "%s" bugs since then.

In fact, I could have fixed this with this one line change:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index be62f0ea1814..d990687335b1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3722,6 +3722,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
 	if (static_branch_unlikely(&trace_no_verify))
 		goto print;
 
+	/* If hash-ptr is not set, the fmt points to our temp buffer */
+	if (p == iter->fmt)
+		goto print;
+
 	/*
 	 * When the kernel is booted with the tp_printk command line
 	 * parameter, trace events go directly through to printk().

-- Steve