Message-ID: <20251118003531.644484343@kernel.org>
Date: Mon, 17 Nov 2025 19:29:51 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
    Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
    Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
Subject: [POC][RFC][PATCH 1/3] tracing: Add perf events
References: <20251118002950.680329246@kernel.org>

From: Steven Rostedt

Add perf events into the ftrace ring buffer.

Create a new ftrace event called "perf_event". This event contains a
dynamic array of u64 words. Each entry stores the low 56 bits of a raw
perf PMU counter value, leaving the top 8 bits as an identifier for
which counter that word holds.

One may ask "what happens when the counter is greater than 56 bits?"
The answer is that you really shouldn't care. The value is written for
user space to consume and do any calculations. If one wants to see the
difference between two events, they can simply subtract the previous
value from the next one. If the counter wrapped over the 56 bits,
adding 1ULL << 56 to the second value when it is less than the first
gives the correct result.

"What happens if the difference of the counters is 1 << 55 apart?"

Let's look at CPU cycles, as they probably go up the quickest. At 4GHz,
that's 4,000,000,000 increments a second:

  (1 << 55) / 4000000000 = 9007199 seconds
  9007199 / 60 = 150119 minutes
  150119 / 60 = 2501 hours
  2501 / 24 = 104 days!

This only breaks down if the two events being compared are more than
104 days apart. Do we care?
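For illustration, the consumer-side delta math could look like the
following minimal user-space sketch. Only the 56-bit value / 8-bit type
split is taken from the patch (matching PERF_TRACE_VALUE() and
PERF_TRACE_TYPE()); the helper names themselves are hypothetical:

  #include <stdint.h>

  #define PERF_VALUE_BITS	56
  #define PERF_VALUE_MASK	((1ULL << PERF_VALUE_BITS) - 1)

  /* The low 56 bits hold the raw counter value. */
  static inline uint64_t perf_word_value(uint64_t word)
  {
  	return word & PERF_VALUE_MASK;
  }

  /* The top 8 bits identify which counter this word holds. */
  static inline uint8_t perf_word_type(uint64_t word)
  {
  	return word >> PERF_VALUE_BITS;
  }

  /* Delta of two samples, correcting for a single wrap past 2^56. */
  static inline uint64_t perf_word_delta(uint64_t prev, uint64_t next)
  {
  	uint64_t p = perf_word_value(prev);
  	uint64_t n = perf_word_value(next);

  	if (n < p)
  		n += 1ULL << PERF_VALUE_BITS;
  	return n - p;
  }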
Currently only cpu cycles and cache misses are supported, but more can
be added in the future.

Two new options are added: event_cache_misses and event_cpu_cycles

  # cd /sys/kernel/tracing
  # echo 1 > options/event_cache_misses
  # echo 1 > events/syscalls/enable
  # cat trace
[..]
   bash-1009  [005] .....   566.863956: sys_write -> 0x2
   bash-1009  [005] .....   566.863973: cache_misses: 26544738
   bash-1009  [005] .....   566.864003: sys_dup2(oldfd: 0xa, newfd: 1)
   bash-1009  [005] .....   566.864004: cache_misses: 26546241
   bash-1009  [005] .....   566.864021: sys_dup2 -> 0x1
   bash-1009  [005] .....   566.864022: cache_misses: 26549598
   bash-1009  [005] .....   566.864059: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
   bash-1009  [005] .....   566.864060: cache_misses: 26558778

The option causes the perf event to be recorded after every event. If
cpu_cycles is also enabled:

  # echo 1 > options/event_cpu_cycles
  # cat trace
[..]
   bash-1009  [006] .....   683.223244: sys_write -> 0x2
   bash-1009  [006] .....   683.223245: cpu_cycles: 273245 cache_misses: 40481492
   bash-1009  [006] .....   683.223262: sys_dup2(oldfd: 0xa, newfd: 1)
   bash-1009  [006] .....   683.223263: cpu_cycles: 286640 cache_misses: 40483017
   bash-1009  [006] .....   683.223278: sys_dup2 -> 0x1
   bash-1009  [006] .....   683.223279: cpu_cycles: 301412 cache_misses: 40486560
   bash-1009  [006] .....   683.223309: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
   bash-1009  [006] .....   683.223310: cpu_cycles: 335188 cache_misses: 40495672
   bash-1009  [006] .....   683.223317: sys_fcntl -> 0x1
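Reading the samples above as deltas (tracing overhead included): the
sys_dup2 call cost 26549598 - 26546241 = 3357 cache misses between its
entry and return events, and with both counters enabled,
301412 - 286640 = 14772 cpu cycles.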
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c            | 113 +++++++++++++++++++++-
 kernel/trace/trace.h            |  28 ++++++
 kernel/trace/trace_entries.h    |  13 +++
 kernel/trace/trace_event_perf.c | 162 ++++++++++++++++++++++++++++++++
 kernel/trace/trace_output.c     |  70 ++++++++++++++
 5 files changed, 385 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 59cd4ed8af6d..64d966a3ec8b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1110,7 +1110,6 @@ void tracing_on(void)
 }
 EXPORT_SYMBOL_GPL(tracing_on);
 
-
 static __always_inline void
 __buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
 {
@@ -2915,6 +2914,103 @@ void trace_event_buffer_commit(struct trace_event_buffer *fbuffer)
 }
 EXPORT_SYMBOL_GPL(trace_event_buffer_commit);
 
+#ifdef CONFIG_PERF_EVENTS
+static inline void record_perf_event(struct trace_array *tr,
+				     struct trace_buffer *buffer,
+				     unsigned int trace_ctx)
+{
+	struct ring_buffer_event *event;
+	struct perf_event_entry *entry;
+	int entries = READ_ONCE(tr->perf_events);
+	struct trace_array_cpu *data;
+	u64 *value;
+	int size;
+	int cpu;
+
+	if (!entries)
+		return;
+
+	guard(preempt_notrace)();
+	cpu = smp_processor_id();
+
+	/* Prevent this from recursing */
+	data = per_cpu_ptr(tr->array_buffer.data, cpu);
+	if (unlikely(!data) || local_read(&data->disabled))
+		return;
+
+	if (local_inc_return(&data->disabled) != 1)
+		goto out;
+
+	size = struct_size(entry, values, entries);
+	event = trace_buffer_lock_reserve(buffer, TRACE_PERF_EVENT, size,
+					  trace_ctx);
+	if (!event)
+		goto out;
+	entry = ring_buffer_event_data(event);
+	value = entry->values;
+
+	if (tr->trace_flags & TRACE_ITER(PERF_CYCLES)) {
+		*value++ = TRACE_PERF_VALUE(PERF_TRACE_CYCLES);
+		entries--;
+	}
+
+	if (entries && tr->trace_flags & TRACE_ITER(PERF_CACHE)) {
+		*value++ = TRACE_PERF_VALUE(PERF_TRACE_CACHE);
+		entries--;
+	}
+
+	/* If something changed, zero the rest */
+	if (unlikely(entries))
+		memset(value, 0, sizeof(u64) * entries);
+
+	trace_buffer_unlock_commit_nostack(buffer, event);
+ out:
+	local_dec(&data->disabled);
+}
+
+static int handle_perf_event(struct trace_array *tr, u64 mask, int enabled)
+{
+	int ret = 0;
+	int type;
+
+	switch (mask) {
+
+	case TRACE_ITER(PERF_CYCLES):
+		type = PERF_TRACE_CYCLES;
+		break;
+	case TRACE_ITER(PERF_CACHE):
+		type = PERF_TRACE_CACHE;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (enabled)
+		ret = trace_perf_event_enable(type);
+	else
+		trace_perf_event_disable(type);
+
+	if (ret < 0)
+		return ret;
+
+	if (enabled)
+		tr->perf_events++;
+	else
+		tr->perf_events--;
+
+	if (WARN_ON_ONCE(tr->perf_events < 0))
+		tr->perf_events = 0;
+
+	return 0;
+}
+#else
+static inline void record_perf_event(struct trace_array *tr,
+				     struct trace_buffer *buffer,
+				     unsigned int trace_ctx)
+{
+}
+#endif
+
 /*
  * Skip 3:
  *
@@ -2932,6 +3028,8 @@ void trace_buffer_unlock_commit_regs(struct trace_array *tr,
 {
 	__buffer_unlock_commit(buffer, event);
 
+	record_perf_event(tr, buffer, trace_ctx);
+
 	/*
 	 * If regs is not set, then skip the necessary functions.
 	 * Note, we can still get here via blktrace, wakeup tracer
@@ -5287,7 +5385,20 @@ int set_tracer_flag(struct trace_array *tr, u64 mask, int enabled)
 		update_marker_trace(tr, enabled);
 		/* update_marker_trace updates the tr->trace_flags */
 		return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+	case TRACE_ITER(PERF_CACHE):
+	case TRACE_ITER(PERF_CYCLES):
+	{
+		int ret = 0;
+
+		ret = handle_perf_event(tr, mask, enabled);
+		if (ret < 0)
+			return ret;
+		break;
 	}
+#endif
+	} /* switch (mask) */
 
 	if (enabled)
 		tr->trace_flags |= mask;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 58be6d741d72..094a156b0c70 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -56,6 +56,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_PERF_EVENT,
 
 	__TRACE_LAST_TYPE,
 };
@@ -363,6 +364,8 @@ struct trace_array {
 
 	int			buffer_disabled;
 
+	int			perf_events;
+
 	struct trace_pid_list	__rcu *filtered_pids;
 	struct trace_pid_list	__rcu *filtered_no_pids;
 	/*
@@ -537,6 +540,7 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct hwlat_entry, TRACE_HWLAT);	\
 		IF_ASSIGN(var, ent, struct osnoise_entry, TRACE_OSNOISE);\
 		IF_ASSIGN(var, ent, struct timerlat_entry, TRACE_TIMERLAT);\
+		IF_ASSIGN(var, ent, struct perf_event_entry, TRACE_PERF_EVENT); \
 		IF_ASSIGN(var, ent, struct raw_data_entry, TRACE_RAW_DATA);\
 		IF_ASSIGN(var, ent, struct trace_mmiotrace_rw,		\
 			  TRACE_MMIO_RW);				\
@@ -1382,6 +1386,29 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define TRACE_ITER_PROF_TEXT_OFFSET_BIT	-1
 #endif
 
+#ifdef CONFIG_PERF_EVENTS
+#define PERF_MAKE_VALUE(type, val) (((type) << 56) | ((val) & ~(0xffULL << 56)))
+/* Not required, but keep consistent with include/uapi/linux/perf_event.h */
+#define PERF_TRACE_CYCLES	0ULL
+#define PERF_TRACE_CACHE	5ULL
+#define TRACE_PERF_VALUE(type)				\
+	PERF_MAKE_VALUE((type), do_trace_perf_event(type))
+#define PERF_TRACE_VALUE(val)	((val) & ~(0xffULL << 56))
+#define PERF_TRACE_TYPE(val)	((val) >> 56)
+# define PERF_FLAGS						\
+		C(PERF_CACHE,		"event_cache_misses"),	\
+		C(PERF_CYCLES,		"event_cpu_cycles"),
+
+u64 do_trace_perf_event(int type);
+int trace_perf_event_enable(int type);
+void trace_perf_event_disable(int type);
+#else
+# define PERF_FLAGS
+static inline u64 do_trace_perf_event(int type) { return 0; }
+static inline int trace_perf_event_enable(int type) { return -ENOTSUPP; }
+static inline void trace_perf_event_disable(int type) { }
+#endif /* CONFIG_PERF_EVENTS */
+
 /*
  * trace_iterator_flags is an enumeration that defines bit
  * positions into trace_flags that controls the output.
@@ -1420,6 +1447,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
+		PERF_FLAGS					\
 		BRANCH_FLAGS					\
 		PROFILER_FLAGS					\
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index de294ae2c5c5..ecda463a9d8e 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -456,3 +456,16 @@ FTRACE_ENTRY(timerlat, timerlat_entry,
 		 __entry->context,
 		 __entry->timer_latency)
 );
+
+#ifdef CONFIG_PERF_EVENTS
+FTRACE_ENTRY(perf_event, perf_event_entry,
+
+	TRACE_PERF_EVENT,
+
+	F_STRUCT(
+		__dynamic_array(u64,	values	)
+	),
+
+	F_printk("values: %lld\n", __entry->values[0])
+);
+#endif
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..ff864d300251 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -430,6 +430,168 @@ void perf_trace_buf_update(void *record, u16 type)
 }
 NOKPROBE_SYMBOL(perf_trace_buf_update);
 
+static void perf_callback(struct perf_event *event,
+			  struct perf_sample_data *data,
+			  struct pt_regs *regs)
+{
+	/* nop */
+}
+
+struct trace_perf_event {
+	struct perf_event	*event;
+};
+
+static struct trace_perf_event __percpu *perf_cache_events;
+static struct trace_perf_event __percpu *perf_cycles_events;
+static DEFINE_MUTEX(perf_event_mutex);
+static int perf_cache_cnt;
+static int perf_cycles_cnt;
+
+static inline int set_perf_type(int type, int *ptype, int *pconfig, int **pcount,
+				struct trace_perf_event __percpu ***pevents)
+{
+	switch (type) {
+	case PERF_TRACE_CYCLES:
+		if (ptype)
+			*ptype = PERF_TYPE_HARDWARE;
+		if (pconfig)
+			*pconfig = PERF_COUNT_HW_CPU_CYCLES;
+		*pcount = &perf_cycles_cnt;
+		*pevents = &perf_cycles_events;
+		return 0;
+
+	case PERF_TRACE_CACHE:
+		if (ptype)
+			*ptype = PERF_TYPE_HW_CACHE;
+		if (pconfig)
+			*pconfig = PERF_COUNT_HW_CACHE_MISSES;
+		*pcount = &perf_cache_cnt;
+		*pevents = &perf_cache_events;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+u64 do_trace_perf_event(int type)
+{
+	struct trace_perf_event __percpu **pevents;
+	struct trace_perf_event __percpu *events;
+	struct perf_event *e;
+	int *count;
+	int cpu;
+
+	if (set_perf_type(type, NULL, NULL, &count, &pevents) < 0)
+		return 0;
+
+	if (!*count)
+		return 0;
+
+	guard(preempt)();
+
+	events = READ_ONCE(*pevents);
+	if (!events)
+		return 0;
+
+	cpu = smp_processor_id();
+
+	e = per_cpu_ptr(events, cpu)->event;
+	if (!e)
+		return 0;
+
+	e->pmu->read(e);
+	return local64_read(&e->count);
+}
+
+static void __free_trace_perf_events(struct trace_perf_event __percpu *events)
+{
+	struct perf_event *e;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		e = per_cpu_ptr(events, cpu)->event;
+		per_cpu_ptr(events, cpu)->event = NULL;
+		/* May be NULL when cleaning up a partial failure */
+		if (e)
+			perf_event_release_kernel(e);
+	}
+	free_percpu(events);
+}
+
+int trace_perf_event_enable(int type)
+{
+	struct perf_event_attr __free(kfree) *attr = NULL;
+	struct trace_perf_event __percpu **pevents;
+	struct trace_perf_event __percpu *events;
+	struct perf_event *e;
+	int *count;
+	int config;
+	int cpu;
+
+	if (set_perf_type(type, &type, &config, &count, &pevents) < 0)
+		return -EINVAL;
+
+	guard(mutex)(&perf_event_mutex);
+
+	if (*count) {
+		(*count)++;
+		return 0;
+	}
+
+	attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+	if (!attr)
+		return -ENOMEM;
+
+	events = alloc_percpu(struct trace_perf_event);
+	if (!events)
+		return -ENOMEM;
+
+	attr->type = type;
+	attr->config = config;
+	attr->size = sizeof(struct perf_event_attr);
+	attr->pinned = 1;
+
+	/* initialize in case of failure */
+	for_each_possible_cpu(cpu) {
+		per_cpu_ptr(events, cpu)->event = NULL;
+	}
+
+	for_each_online_cpu(cpu) {
+		e = perf_event_create_kernel_counter(attr, cpu, NULL,
+						     perf_callback, NULL);
+		if (IS_ERR_OR_NULL(e)) {
+			__free_trace_perf_events(events);
+			return PTR_ERR(e);
+		}
+		per_cpu_ptr(events, cpu)->event = e;
+	}
+
+	WRITE_ONCE(*pevents, events);
+	(*count)++;
+
+	return 0;
+}
+
+void trace_perf_event_disable(int type)
+{
+	struct trace_perf_event __percpu **pevents;
+	struct trace_perf_event __percpu *events;
+	int *count;
+
+	if (set_perf_type(type, NULL, NULL, &count, &pevents) < 0)
+		return;
+
+	guard(mutex)(&perf_event_mutex);
+
+	if (WARN_ON_ONCE(!*count))
+		return;
+
+	if (--(*count))
+		return;
+
+	events = READ_ONCE(*pevents);
+	WRITE_ONCE(*pevents, NULL);
+
+	__free_trace_perf_events(events);
+}
+
 #ifdef CONFIG_FUNCTION_TRACER
 static void
 perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index ebbab3e9622b..a0f21cec9eed 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1661,6 +1661,75 @@ static struct trace_event trace_timerlat_event = {
 	.funcs	= &trace_timerlat_funcs,
 };
 
+/* TRACE_PERF_EVENT */
+
+static enum print_line_t
+trace_perf_event_print(struct trace_iterator *iter, int flags,
+		       struct trace_event *event)
+{
+	struct trace_entry *entry = iter->ent;
+	struct trace_seq *s = &iter->seq;
+	struct perf_event_entry *field;
+	u64 value;
+	u64 *val;
+	u64 *end;
+
+	end = (u64 *)((long)iter->ent + iter->ent_size);
+
+	trace_assign_type(field, entry);
+
+	for (val = field->values; val < end; val++) {
+		if (val != field->values)
+			trace_seq_putc(s, ' ');
+		value = PERF_TRACE_VALUE(*val);
+		switch (PERF_TRACE_TYPE(*val)) {
+		case PERF_TRACE_CYCLES:
+			trace_seq_printf(s, "cpu_cycles: %lld", value);
+			break;
+		case PERF_TRACE_CACHE:
+			trace_seq_printf(s, "cache_misses: %lld", value);
+			break;
+		default:
+			trace_seq_printf(s, "unknown(%d): %lld",
+					 (int)PERF_TRACE_TYPE(*val), value);
+		}
+	}
+	trace_seq_putc(s, '\n');
+	return trace_handle_return(s);
+}
+
+static enum print_line_t
+trace_perf_event_raw(struct trace_iterator *iter, int flags,
+		     struct trace_event *event)
+{
+	struct perf_event_entry *field;
+	struct trace_seq *s = &iter->seq;
+	u64 *val;
+	u64 *end;
+
+	end = (u64 *)((long)iter->ent + iter->ent_size);
+
+	trace_assign_type(field, iter->ent);
+
+	for (val = field->values; val < end; val++) {
+		if (val != field->values)
+			trace_seq_putc(s, ' ');
+		trace_seq_printf(s, "%lld", *val);
+	}
+	trace_seq_putc(s, '\n');
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_perf_event_funcs = {
+	.trace		= trace_perf_event_print,
+	.raw		= trace_perf_event_raw,
+};
+
+static struct trace_event trace_perf_event_event = {
+	.type		= TRACE_PERF_EVENT,
+	.funcs		= &trace_perf_event_funcs,
+};
+
 /* TRACE_BPUTS */
 static enum print_line_t
 trace_bputs_print(struct trace_iterator *iter, int flags,
@@ -1878,6 +1947,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_timerlat_event,
 	&trace_raw_data_event,
 	&trace_func_repeats_event,
+	&trace_perf_event_event,
 	NULL
 };
 
-- 
2.51.0
Message-ID: <20251118003531.805883570@kernel.org>
Date: Mon, 17 Nov 2025 19:29:52 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
    Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
    Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
Subject: [POC][RFC][PATCH 2/3] ftrace: Add perf counters to function tracing
References: <20251118002950.680329246@kernel.org>

From: Steven Rostedt

Add an option to trigger perf events from function tracing.

Two new options are added: func-cpu-cycles and func-cache-misses

  # cd /sys/kernel/tracing
  # echo 1 > options/func-cache-misses
  # echo function > current_tracer
  # cat trace
[..]
   sshd-session-1014  [005] .....   327.836708: __x64_sys_read <-do_syscall_64
   sshd-session-1014  [005] .....   327.836708: cache_misses: 741719054
   sshd-session-1014  [005] .....   327.836712: ksys_read <-do_syscall_64
   sshd-session-1014  [005] .....   327.836713: cache_misses: 741720271
   sshd-session-1014  [005] .....   327.836716: fdget_pos <-ksys_read
   sshd-session-1014  [005] .....   327.836717: cache_misses: 741721483
   sshd-session-1014  [005] .....   327.836720: vfs_read <-ksys_read
   sshd-session-1014  [005] .....   327.836721: cache_misses: 741722726
   sshd-session-1014  [005] .....   327.836724: rw_verify_area <-vfs_read
   sshd-session-1014  [005] .....   327.836725: cache_misses: 741723940
   sshd-session-1014  [005] .....   327.836728: security_file_permission <-rw_verify_area
   sshd-session-1014  [005] .....   327.836729: cache_misses: 741725151
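As with the event case, the numbers are meant to be read as deltas:
consecutive cache_misses samples above differ by roughly 1200, e.g.
741720271 - 741719054 = 1217 between the __x64_sys_read and ksys_read
samples, a figure that includes the overhead of the tracer itself.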
The option will cause the perf event to be recorded after every
function call. If cpu_cycles is also enabled:

  # echo 1 > options/func-cpu-cycles
  # cat trace
[..]
   sshd-session-1014  [005] b..1.   536.844538: preempt_count_sub <-_raw_spin_unlock
   sshd-session-1014  [005] b..1.   536.844539: cpu_cycles: 1919425978 cache_misses: 3431216952
   sshd-session-1014  [005] b....   536.844545: validate_xmit_skb_list <-sch_direct_xmit
   sshd-session-1014  [005] b....   536.844545: cpu_cycles: 1919429935 cache_misses: 3431218535
   sshd-session-1014  [005] b....   536.844551: validate_xmit_skb.isra.0 <-validate_xmit_skb_list
   sshd-session-1014  [005] b....   536.844552: cpu_cycles: 1919433763 cache_misses: 3431220112
   sshd-session-1014  [005] b....   536.844557: netif_skb_features <-validate_xmit_skb.isra.0
   sshd-session-1014  [005] b....   536.844558: cpu_cycles: 1919437574 cache_misses: 3431221688

Signed-off-by: Steven Rostedt (Google)
---
 include/linux/trace_recursion.h |   5 +-
 kernel/trace/trace.c            |  58 ++++++++++++---
 kernel/trace/trace.h            |   6 ++
 kernel/trace/trace_functions.c  | 124 ++++++++++++++++++++++++++++++--
 4 files changed, 178 insertions(+), 15 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index ae04054a1be3..c42d86d81afa 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -132,9 +132,12 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
 	 * will think a recursion occurred, and the event will be dropped.
 	 * Let a single instance happen via the TRANSITION_BIT to
 	 * not drop those events.
+	 *
+	 * When ip is zero, the caller is purposely trying to cause
+	 * recursion. Don't record it.
 	 */
 	bit = TRACE_CTX_TRANSITION + start;
-	if (val & (1 << bit)) {
+	if ((val & (1 << bit)) && ip) {
 		do_ftrace_record_recursion(ip, pip);
 		return -1;
 	}
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 64d966a3ec8b..42bf1c046de1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2915,21 +2915,18 @@ void trace_event_buffer_commit(struct trace_event_buffer *fbuffer)
 EXPORT_SYMBOL_GPL(trace_event_buffer_commit);
 
 #ifdef CONFIG_PERF_EVENTS
-static inline void record_perf_event(struct trace_array *tr,
-				     struct trace_buffer *buffer,
-				     unsigned int trace_ctx)
+static inline void trace_perf_event(struct trace_array *tr,
+				    struct trace_buffer *buffer,
+				    int entries, u64 flags,
+				    unsigned int trace_ctx)
 {
 	struct ring_buffer_event *event;
 	struct perf_event_entry *entry;
-	int entries = READ_ONCE(tr->perf_events);
 	struct trace_array_cpu *data;
 	u64 *value;
 	int size;
 	int cpu;
 
-	if (!entries)
-		return;
-
 	guard(preempt_notrace)();
 	cpu = smp_processor_id();
 
@@ -2949,12 +2946,12 @@ static inline void record_perf_event(struct trace_array *tr,
 	entry = ring_buffer_event_data(event);
 	value = entry->values;
 
-	if (tr->trace_flags & TRACE_ITER(PERF_CYCLES)) {
+	if (flags & TRACE_ITER(PERF_CYCLES)) {
 		*value++ = TRACE_PERF_VALUE(PERF_TRACE_CYCLES);
 		entries--;
 	}
 
-	if (entries && tr->trace_flags & TRACE_ITER(PERF_CACHE)) {
+	if (entries && flags & TRACE_ITER(PERF_CACHE)) {
 		*value++ = TRACE_PERF_VALUE(PERF_TRACE_CACHE);
 		entries--;
 	}
@@ -2968,6 +2965,49 @@ static inline void record_perf_event(struct trace_array *tr,
 	local_dec(&data->disabled);
 }
 
+static inline void record_perf_event(struct trace_array *tr,
+				     struct trace_buffer *buffer,
+				     unsigned int trace_ctx)
+{
+	int entries = READ_ONCE(tr->perf_events);
+
+	if (!entries)
+		return;
+
+	trace_perf_event(tr, buffer, entries, tr->trace_flags, trace_ctx);
+}
+
+#ifdef CONFIG_FUNCTION_TRACER
+void ftrace_perf_events(struct trace_array *tr, int perf_events,
+			u64 perf_mask, unsigned int trace_ctx)
+{
+	struct trace_buffer *buffer;
+	int bit;
+
+	/*
+	 * Prevent any ftrace recursion.
+	 * The ftrace_test_recursion_trylock() allows one nested loop
+	 * to handle the case where an interrupt comes in and traces
+	 * before the preempt_count is updated to the new context.
+	 * This one instance allows that function to still be traced.
+	 *
+	 * Reading the perf counters will call functions that function
+	 * tracing will want to trace. Prevent this one loop from happening
+	 * by taking the lock again. If an interrupt comes in now,
+	 * it may still be dropped, but there's really nothing that can
+	 * be done about that until all those locations get fixed.
+	 */
+	bit = ftrace_test_recursion_trylock(0, 0);
+
+	buffer = tr->array_buffer.buffer;
+	trace_perf_event(tr, buffer, perf_events, perf_mask, trace_ctx);
+
+	/* bit < 0 means the trylock failed and does not need to be unlocked */
+	if (bit >= 0)
+		ftrace_test_recursion_unlock(bit);
+}
+#endif
+
 static int handle_perf_event(struct trace_array *tr, u64 mask, int enabled)
 {
 	int ret = 0;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 094a156b0c70..bb764a2255c7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -365,6 +365,8 @@ struct trace_array {
 	int			buffer_disabled;
 
 	int			perf_events;
+	int			ftrace_perf_events;
+	u64			ftrace_perf_mask;
 
 	struct trace_pid_list	__rcu *filtered_pids;
 	struct trace_pid_list	__rcu *filtered_no_pids;
@@ -1402,6 +1404,10 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 u64 do_trace_perf_event(int type);
 int trace_perf_event_enable(int type);
 void trace_perf_event_disable(int type);
+#ifdef CONFIG_FUNCTION_TRACER
+void ftrace_perf_events(struct trace_array *tr, int perf_events,
+			u64 perf_mask, unsigned int trace_ctx);
+#endif
 #else
 # define PERF_FLAGS
 static inline u64 do_trace_perf_event(int type) { return 0; }
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index c12795c2fb39..97f46ac7ef21 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -47,8 +47,12 @@ enum {
 	TRACE_FUNC_OPT_NO_REPEATS	= 0x2,
 	TRACE_FUNC_OPT_ARGS		= 0x4,
 
-	/* Update this to next highest bit. */
-	TRACE_FUNC_OPT_HIGHEST_BIT	= 0x8
+	/* Update this to next highest function bit. */
+	TRACE_FUNC_OPT_HIGHEST_BIT	= 0x8,
+
+	/* These are just other options */
+	TRACE_FUNC_OPT_PERF_CYCLES	= 0x10,
+	TRACE_FUNC_OPT_PERF_CACHE	= 0x20,
 };
 
 #define TRACE_FUNC_OPT_MASK	(TRACE_FUNC_OPT_HIGHEST_BIT - 1)
@@ -143,6 +147,105 @@ static bool handle_func_repeats(struct trace_array *tr, u32 flags_val)
 	return true;
 }
 
+#ifdef CONFIG_PERF_EVENTS
+static inline void
+do_trace_function(struct trace_array *tr, unsigned long ip,
+		  unsigned long parent_ip, unsigned int trace_ctx,
+		  struct ftrace_regs *fregs)
+{
+	trace_function(tr, ip, parent_ip, trace_ctx, fregs);
+
+	if (likely(!tr->ftrace_perf_events))
+		return;
+
+	ftrace_perf_events(tr, tr->ftrace_perf_events, tr->ftrace_perf_mask, trace_ctx);
+}
+
+static bool handle_perf_event_flag(struct trace_array *tr, int bit, int set, int *err)
+{
+	u64 mask;
+	int type;
+
+	*err = 0;
+
+	switch (bit) {
+	case TRACE_FUNC_OPT_PERF_CYCLES:
+		mask = TRACE_ITER(PERF_CYCLES);
+		type = PERF_TRACE_CYCLES;
+		break;
+
+	case TRACE_FUNC_OPT_PERF_CACHE:
+		mask = TRACE_ITER(PERF_CACHE);
+		type = PERF_TRACE_CACHE;
+		break;
+
+	default:
+		return false;
+	}
+
+	if (set)
+		*err = trace_perf_event_enable(type);
+	else
+		trace_perf_event_disable(type);
+
+	if (*err < 0)
+		return true;
+
+	if (set) {
+		tr->ftrace_perf_events++;
+		tr->ftrace_perf_mask |= mask;
+	} else {
+		tr->ftrace_perf_mask &= ~mask;
+		tr->ftrace_perf_events--;
+	}
+	return true;
+}
+
+static void ftrace_perf_enable(struct trace_array *tr, int bit)
+{
+	int err;
+
+	if (!(tr->current_trace_flags->val & bit))
+		return;
+
+	handle_perf_event_flag(tr, bit, 1, &err);
+	if (err < 0)
+		tr->current_trace_flags->val &= ~bit;
+}
+
+static void ftrace_perf_disable(struct trace_array *tr, int bit)
+{
+	int err;
+
+	/* Only disable if it was enabled */
+	if (!(tr->current_trace_flags->val & bit))
+		return;
+
+	handle_perf_event_flag(tr, bit, 0, &err);
+}
+
+static void ftrace_perf_init(struct trace_array *tr)
+{
+	ftrace_perf_enable(tr, TRACE_FUNC_OPT_PERF_CYCLES);
+	ftrace_perf_enable(tr, TRACE_FUNC_OPT_PERF_CACHE);
+}
+
+static void ftrace_perf_reset(struct trace_array *tr)
+{
+	ftrace_perf_disable(tr, TRACE_FUNC_OPT_PERF_CYCLES);
+	ftrace_perf_disable(tr, TRACE_FUNC_OPT_PERF_CACHE);
+}
+#else
+#define do_trace_function trace_function
+static inline bool handle_perf_event_flag(struct trace_array *tr, int bit,
+					  int set, int *err)
+{
+	return false;
+}
+static inline void ftrace_perf_init(struct trace_array *tr) { }
+static inline void ftrace_perf_reset(struct trace_array *tr) { }
+#endif /* CONFIG_PERF_EVENTS */
+
 static int function_trace_init(struct trace_array *tr)
 {
 	ftrace_func_t func;
@@ -165,6 +268,8 @@ static int function_trace_init(struct trace_array *tr)
 
 	tr->array_buffer.cpu = raw_smp_processor_id();
 
+	ftrace_perf_init(tr);
+
 	tracing_start_cmdline_record();
 	tracing_start_function_trace(tr);
 	return 0;
@@ -172,6 +277,7 @@ static int function_trace_init(struct trace_array *tr)
 
 static void function_trace_reset(struct trace_array *tr)
 {
+	ftrace_perf_reset(tr);
 	tracing_stop_function_trace(tr);
 	tracing_stop_cmdline_record();
 	ftrace_reset_array_ops(tr);
@@ -223,7 +329,7 @@ function_trace_call(unsigned long ip, unsigned long parent_ip,
 
 	trace_ctx = tracing_gen_ctx_dec();
 
-	trace_function(tr, ip, parent_ip, trace_ctx, NULL);
+	do_trace_function(tr, ip, parent_ip, trace_ctx, NULL);
 
 	ftrace_test_recursion_unlock(bit);
 }
@@ -245,7 +351,7 @@ function_args_trace_call(unsigned long ip, unsigned long parent_ip,
 
 	trace_ctx = tracing_gen_ctx();
 
-	trace_function(tr, ip, parent_ip, trace_ctx, fregs);
+	do_trace_function(tr, ip, parent_ip, trace_ctx, fregs);
 
 	ftrace_test_recursion_unlock(bit);
 }
@@ -372,7 +478,7 @@ function_no_repeats_trace_call(unsigned long ip, unsigned long parent_ip,
 	trace_ctx = tracing_gen_ctx_dec();
 	process_repeats(tr, ip, parent_ip, last_info, trace_ctx);
 
-	trace_function(tr, ip, parent_ip, trace_ctx, NULL);
+	do_trace_function(tr, ip, parent_ip, trace_ctx, NULL);
 
 out:
 	ftrace_test_recursion_unlock(bit);
@@ -428,6 +534,10 @@ static struct tracer_opt func_opts[] = {
 	{ TRACER_OPT(func-no-repeats, TRACE_FUNC_OPT_NO_REPEATS) },
 #ifdef CONFIG_FUNCTION_TRACE_ARGS
 	{ TRACER_OPT(func-args, TRACE_FUNC_OPT_ARGS) },
+#endif
+#ifdef CONFIG_PERF_EVENTS
+	{ TRACER_OPT(func-cpu-cycles, TRACE_FUNC_OPT_PERF_CYCLES) },
+	{ TRACER_OPT(func-cache-misses, TRACE_FUNC_OPT_PERF_CACHE) },
 #endif
 	{ } /* Always set a last empty entry */
 };
@@ -457,6 +567,7 @@ func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 {
 	ftrace_func_t func;
 	u32 new_flags;
+	int err;
 
 	/* Do nothing if already set. */
 	if (!!set == !!(tr->current_trace_flags->val & bit))
@@ -466,6 +577,9 @@ func_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 	if (tr->current_trace != &function_trace)
 		return 0;
 
+	if (handle_perf_event_flag(tr, bit, set, &err))
+		return err;
+
 	new_flags = (tr->current_trace_flags->val & ~bit) | (set ? bit : 0);
 	func = select_trace_function(new_flags);
 	if (!func)
-- 
2.51.0

Message-ID: <20251118003531.944170593@kernel.org>
Date: Mon, 17 Nov 2025 19:29:53 -0500
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
    Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
    Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
Subject: [POC][RFC][PATCH 3/3] fgraph: Add perf counters to function graph tracer
References: <20251118002950.680329246@kernel.org>

From: Steven Rostedt

Add an option to trigger perf events from the function graph tracer.

Two new options are added: funcgraph-cpu-cycles and funcgraph-cache-misses

This records the perf event right after the entry of a function and
again just before its return.

  # cd /sys/kernel/tracing
  # echo 1 > options/funcgraph-cache-misses
  # echo vfs_read > set_graph_function
  # echo function_graph > current_tracer
  # cat trace
[..]
 5)               |  vfs_read() {
 5)               |    /* cache_misses: 822565 */
 5)               |    rw_verify_area() {
 5)               |      /* cache_misses: 824003 */
 5)               |      security_file_permission() {
 5)               |        /* cache_misses: 825440 */
 5)               |        apparmor_file_permission() {
 5)               |          /* cache_misses: 826875 */
 5)               |          aa_file_perm() {
 5)               |            /* cache_misses: 828326 */
 5)               |            __rcu_read_lock() {
 5)               |              /* cache_misses: 829766 */
 5)               |              /* cache_misses: 830785 */
 5)   5.116 us    |            }
 5)               |            __rcu_read_unlock() {
 5)               |              /* cache_misses: 832611 */
 5)               |              /* cache_misses: 833632 */
 5)   5.223 us    |            }
 5)               |            /* cache_misses: 835043 */
 5) + 25.462 us   |          }
 5)               |          /* cache_misses: 836454 */
 5) + 35.518 us   |        }
 5)               |        bpf_lsm_file_permission() {
 5)               |          /* cache_misses: 838276 */
 5)               |          /* cache_misses: 839292 */
 5)   4.613 us    |        }
 5)               |        /* cache_misses: 840697 */
 5) + 54.684 us   |      }
 5)               |      /* cache_misses: 842107 */
 5) + 64.449 us   |    }

The option will cause the perf event to be recorded around every traced
function. If cpu_cycles is also enabled:

  # echo 1 > options/funcgraph-cpu-cycles
  # cat trace
[..]
 3)               |  vfs_read() {
 3)               |    /* cpu_cycles: 2947481793 cache_misses: 2002984031 */
 3)               |    rw_verify_area() {
 3)               |      /* cpu_cycles: 2947488061 cache_misses: 2002985922 */
 3)               |      security_file_permission() {
 3)               |        /* cpu_cycles: 2947492867 cache_misses: 2002987812 */
 3)               |        apparmor_file_permission() {
 3)               |          /* cpu_cycles: 2947497713 cache_misses: 2002989700 */
 3)               |          aa_file_perm() {
 3)               |            /* cpu_cycles: 2947502560 cache_misses: 2002991604 */
 3)               |            __rcu_read_lock() {
 3)               |              /* cpu_cycles: 2947507398 cache_misses: 2002993497 */
 3)               |              /* cpu_cycles: 2947512435 cache_misses: 2002994969 */
 3)   7.586 us    |            }
 3)               |            __rcu_read_unlock() {
 3)               |              /* cpu_cycles: 2947518226 cache_misses: 2002997248 */
 3)               |              /* cpu_cycles: 2947522328 cache_misses: 2002998722 */
 3)   7.211 us    |            }
 3)               |            /* cpu_cycles: 2947527067 cache_misses: 2003000586 */
 3) + 37.581 us   |          }
 3)               |          /* cpu_cycles: 2947531727 cache_misses: 2003002450 */
 3) + 52.061 us   |        }
 3)               |        bpf_lsm_file_permission() {
 3)               |          /* cpu_cycles: 2947537274 cache_misses: 2003004725 */
 3)               |          /* cpu_cycles: 2947541104 cache_misses: 2003006194 */
 3)   7.029 us    |        }
 3)               |        /* cpu_cycles: 2947545762 cache_misses: 2003008052 */
 3) + 80.971 us   |      }
 3)               |      /* cpu_cycles: 2947550459 cache_misses: 2003009915 */
 3) + 95.515 us   |    }
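The two samples inside a leaf function bracket its body, so subtracting
them gives a per-function cost: for __rcu_read_lock() above,
830785 - 829766 = 1019 cache misses between its entry and return events
(again including the tracing overhead).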
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.h                 |   4 +
 kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++++--
 2 files changed, 116 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index bb764a2255c7..64cdb6fda3fb 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -366,7 +366,9 @@ struct trace_array {
 
 	int			perf_events;
 	int			ftrace_perf_events;
+	int			fgraph_perf_events;
 	u64			ftrace_perf_mask;
+	u64			fgraph_perf_mask;
 
 	struct trace_pid_list	__rcu *filtered_pids;
 	struct trace_pid_list	__rcu *filtered_no_pids;
@@ -946,6 +948,8 @@ static __always_inline bool ftrace_hash_empty(struct ftrace_hash *hash)
 #define TRACE_GRAPH_PRINT_RETVAL_HEX	0x1000
 #define TRACE_GRAPH_PRINT_RETADDR	0x2000
 #define TRACE_GRAPH_ARGS		0x4000
+#define TRACE_GRAPH_PERF_CACHE		0x8000
+#define TRACE_GRAPH_PERF_CYCLES		0x10000
 #define TRACE_GRAPH_PRINT_FILL_SHIFT	28
 #define TRACE_GRAPH_PRINT_FILL_MASK	(0x3 << TRACE_GRAPH_PRINT_FILL_SHIFT)
 
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 44d5dc5031e2..e618dd12ca0c 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -22,6 +22,8 @@ static int ftrace_graph_skip_irqs;
 /* Do not record function time when task is sleeping */
 unsigned int fgraph_no_sleep_time;
 
+static struct tracer graph_trace;
+
 struct fgraph_cpu_data {
 	pid_t		last_pid;
 	int		depth;
@@ -88,6 +90,11 @@ static struct tracer_opt trace_opts[] = {
 	/* Include sleep time (scheduled out) between entry and return */
 	{ TRACER_OPT(sleep-time, TRACE_GRAPH_SLEEP_TIME) },
 
+#ifdef CONFIG_PERF_EVENTS
+	{ TRACER_OPT(funcgraph-cache-misses, TRACE_GRAPH_PERF_CACHE) },
+	{ TRACER_OPT(funcgraph-cpu-cycles, TRACE_GRAPH_PERF_CYCLES) },
+#endif
+
 	{ } /* Empty entry */
 };
 
@@ -104,6 +111,97 @@ static bool tracer_flags_is_set(struct trace_array *tr, u32 flags)
 	return (tr->current_trace_flags->val & flags) == flags;
 }
 
+#ifdef CONFIG_PERF_EVENTS
+static inline void handle_perf_event(struct trace_array *tr, unsigned int trace_ctx)
+{
+	if (!tr->fgraph_perf_events)
+		return;
+	ftrace_perf_events(tr, tr->fgraph_perf_events, tr->fgraph_perf_mask, trace_ctx);
+}
+
+static int ftrace_graph_perf_event(struct trace_array *tr, int set, int bit)
+{
+	u64 mask;
+	int type;
+	int ret = 0;
+
+	/* Do nothing if the current tracer is not this tracer */
+	if (tr->current_trace != &graph_trace)
+		return 0;
+
+	switch (bit) {
+	case TRACE_GRAPH_PERF_CACHE:
+		mask = TRACE_ITER(PERF_CACHE);
+		type = PERF_TRACE_CACHE;
+		break;
+	case TRACE_GRAPH_PERF_CYCLES:
+		mask = TRACE_ITER(PERF_CYCLES);
+		type = PERF_TRACE_CYCLES;
+		break;
+	default:
+		return 0;
+	}
+
+	if (set)
+		ret = trace_perf_event_enable(type);
+	else
+		trace_perf_event_disable(type);
+
+	if (ret < 0)
+		return ret;
+
+	if (set) {
+		tr->fgraph_perf_events++;
+		tr->fgraph_perf_mask |= mask;
+	} else {
+		tr->fgraph_perf_mask &= ~mask;
+		tr->fgraph_perf_events--;
+	}
+	return 0;
+}
+
+static void ftrace_graph_perf_enable(struct trace_array *tr, int bit)
+{
+	int err;
+
+	if (!(tr->current_trace_flags->val & bit))
+		return;
+
+	err = ftrace_graph_perf_event(tr, 1, bit);
+	if (err < 0)
+		tr->current_trace_flags->val &= ~bit;
+}
+
+static void ftrace_graph_perf_disable(struct trace_array *tr, int bit)
+{
+	/* Only disable if it was enabled */
+	if (!(tr->current_trace_flags->val & bit))
+		return;
+
+	ftrace_graph_perf_event(tr, 0, bit);
+}
+
+static void fgraph_perf_init(struct trace_array *tr)
+{
+	ftrace_graph_perf_enable(tr, TRACE_GRAPH_PERF_CYCLES);
+	ftrace_graph_perf_enable(tr, TRACE_GRAPH_PERF_CACHE);
+}
+
+static void fgraph_perf_reset(struct trace_array *tr)
+{
+	ftrace_graph_perf_disable(tr, TRACE_GRAPH_PERF_CYCLES);
+	ftrace_graph_perf_disable(tr, TRACE_GRAPH_PERF_CACHE);
+}
+#else
+static inline void handle_perf_event(struct trace_array *tr, unsigned int trace_ctx)
+{
+}
+static inline void fgraph_perf_init(struct trace_array *tr)
+{
+}
+static inline void fgraph_perf_reset(struct trace_array *tr)
+{
+}
+#endif
+
 /*
  * DURATION column is being also used to display IRQ signs,
  * following values are used by print_graph_irq and others
@@ -272,6 +370,9 @@ static int graph_entry(struct ftrace_graph_ent *trace,
 		ret = __graph_entry(tr, trace, trace_ctx, fregs);
 	}
 
+	if (ret)
+		handle_perf_event(tr, trace_ctx);
+
 	return ret;
 }
 
@@ -324,6 +425,8 @@ void __trace_graph_return(struct trace_array *tr,
 	struct trace_buffer *buffer = tr->array_buffer.buffer;
 	struct ftrace_graph_ret_entry *entry;
 
+	handle_perf_event(tr, trace_ctx);
+
 	event = trace_buffer_lock_reserve(buffer, TRACE_GRAPH_RET,
 					  sizeof(*entry), trace_ctx);
 	if (!event)
@@ -465,6 +568,8 @@ static int graph_trace_init(struct trace_array *tr)
 	if (!tracer_flags_is_set(tr, TRACE_GRAPH_SLEEP_TIME))
 		fgraph_no_sleep_time++;
 
+	fgraph_perf_init(tr);
+
 	/* Make gops functions visible before we start tracing */
 	smp_mb();
 
@@ -476,8 +581,6 @@ static int graph_trace_init(struct trace_array *tr)
 	return 0;
 }
 
-static struct tracer graph_trace;
-
 static int ftrace_graph_trace_args(struct trace_array *tr, int set)
 {
 	trace_func_graph_ent_t entry;
@@ -512,6 +615,7 @@ static void graph_trace_reset(struct trace_array *tr)
 	if (WARN_ON_ONCE(fgraph_no_sleep_time < 0))
 		fgraph_no_sleep_time = 0;
 
+	fgraph_perf_reset(tr);
 	tracing_stop_cmdline_record();
 	unregister_ftrace_graph(tr->gops);
 }
@@ -1684,9 +1788,15 @@ func_graph_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
 		ftrace_graph_skip_irqs = 0;
 		break;
 
 	case TRACE_GRAPH_ARGS:
 		return ftrace_graph_trace_args(tr, set);
+
+#ifdef CONFIG_PERF_EVENTS
+	case TRACE_GRAPH_PERF_CACHE:
+	case TRACE_GRAPH_PERF_CYCLES:
+		return ftrace_graph_perf_event(tr, set, bit);
+#endif
 	}
 
 	return 0;
 }
-- 
2.51.0