This series adds a perf event to the ftrace ring buffer.
It is currently a proof of concept as I'm not happy with the interface,
and I expect the recorded perf event format to change as well.
This proof-of-concept interface (which I have no plans of keeping)
currently just adds 6 new trace options.
event_cache_misses
event_cpu_cycles
func-cache-misses
func-cpu-cycles
funcgraph-cache-misses
funcgraph-cpu-cycles
The first two trigger a perf event after every event, the second two trigger
a perf event after every function and the last two trigger a perf event
right after the start of a function and again at the end of the function.
As this will eventually work with many more perf events than just cache-misses
and cpu-cycles, using options is not appropriate. Especially since the
options are limited to a 64 bit bitmask, and the number of events can easily
go much higher.
I'm thinking about having a file instead that will act as a way to enable
perf events for events, function and function graph tracing.
set_event_perf, set_ftrace_perf, set_fgraph_perf
And an available_perf_events file that shows what can be written into these
files (similar to how set_ftrace_filter works). But for now, it was just easier to
implement them as options.
As for the perf event that is triggered: it is currently a dynamic array of
64 bit values. Each value is broken up into 8 bits for what type of perf
event it is, and 56 bits for the counter. It only writes a per CPU raw
counter and does not do any math. That would need to be done by any
post processing.
The values are left for user space to do the subtraction to figure out the
difference between events. For example, the function_graph tracer may have:
is_vmalloc_addr() {
/* cpu_cycles: 5582263593 cache_misses: 2869004572 */
/* cpu_cycles: 5582267527 cache_misses: 2869006049 */
}
User space would subtract 2869006049 - 2869004572 = 1477
Then 56 bits should be plenty. Counting conservatively with 55 bits:
2^55 / 1,000,000,000 / 60 / 60 / 24 ≈ 416
416 / 4 = 104
That is, at 1GHz the 55 bits last about 416 days; if you have a 4GHz machine,
the cpu-cycles will overflow the 55 bits in 104 days. This tooling is not for
seeing how many cycles run over 104 days.
User space tooling would just need to be aware that the value is 56 bits and
when calculating the difference between start and end do something like:
if (start > end)
end |= 1ULL << 56;
delta = end - start;
The next question is how to label the perf events in the 8 bit
portion. It could simply be a value that is registered, and listed in the
available_perf_events file.
cpu_cycles:1
cache_misses:2
[..]
And this would need to be recorded by any tooling reading the events
so that it knows how to map the events with their attached ids.
But again, this is just a proof-of-concept. How this will eventually be
implemented is yet to be determined.
But to test these patches (which are based on top of my linux-next branch,
which should now be in linux-next):
# cd /sys/kernel/tracing
# echo 1 > options/event_cpu_cycles
# echo 1 > options/event_cache_misses
# echo 1 > events/syscalls/enable
# cat trace
[..]
bash-995 [007] ..... 98.255252: sys_write -> 0x2
bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
bash-995 [007] ..... 98.255369: sys_close -> 0x0
Comments welcomed.
Steven Rostedt (3):
tracing: Add perf events
ftrace: Add perf counters to function tracing
fgraph: Add perf counters to function graph tracer
----
include/linux/trace_recursion.h | 5 +-
kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 38 ++++++++
kernel/trace/trace_entries.h | 13 +++
kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
kernel/trace/trace_output.c | 70 +++++++++++++++
8 files changed, 670 insertions(+), 12 deletions(-)
Hi Steve,
On Mon, Nov 17, 2025 at 07:29:50PM -0500, Steven Rostedt wrote:
>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
> event_cache_misses
> event_cpu_cycles
> func-cache-misses
> func-cpu-cycles
> funcgraph-cache-misses
> funcgraph-cpu-cycles
Unfortunately the hardware cache event is ambiguous about which cache
level it refers to, and architectures define it differently. There are
encodings to clearly define the cache levels and accesses, but the
support depends on the hardware capabilities.
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventual work with many more perf events than just cache-misses
> and cpu-cycles , using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
> set_event_perf, set_ftrace_perf, set_fgraph_perf
>
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
If you want to keep the perf events per CPU, you may need to consider CPU
migrations for the func-graph case. Otherwise userspace may not
calculate the diff from the beginning correctly.
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
> is_vmalloc_addr() {
> /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
> 2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
> 416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the vale is 56 bits and
> when calculating the difference between start and end do something like:
>
> if (start > end)
> end |= 1ULL << 56;
>
> delta = end - start;
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
> cpu_cycles:1
> cach_misses:2
> [..]
>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
> # cd /sys/kernel/tracing
> # echo 1 > options/event_cpu_cycles
> # echo 1 > options/event_cache_misses
> # echo 1 > events/syscalls/enable
> # cat trace
> [..]
> bash-995 [007] ..... 98.255252: sys_write -> 0x2
> bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
> bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
> bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
> bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
> bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
> bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
> bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
> bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
> bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
> bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
> bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
> bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
Just FYI, I did a similar thing (like the fgraph case) in uftrace and
grouped two related events to produce a metric.
$ uftrace -T a@read=pmu-cycle ~/tmp/abc
# DURATION TID FUNCTION
[ 521741] | main() {
[ 521741] | a() {
[ 521741] | /* read:pmu-cycle (cycles=482 instructions=38) */
[ 521741] | b() {
[ 521741] | c() {
0.659 us [ 521741] | getpid();
1.600 us [ 521741] | } /* c */
1.780 us [ 521741] | } /* b */
[ 521741] | /* diff:pmu-cycle (cycles=+7361 instructions=+3955 IPC=0.54) */
24.485 us [ 521741] | } /* a */
34.797 us [ 521741] | } /* main */
It reads the cycles and instructions events (specified by 'pmu-cycle') at
entry and exit of the given function ('a') and shows the diff with the
IPC metric.
Thanks,
Namhyung
>
>
> Steven Rostedt (3):
> tracing: Add perf events
> ftrace: Add perf counters to function tracing
> fgraph: Add perf counters to function graph tracer
>
> ----
> include/linux/trace_recursion.h | 5 +-
> kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 38 ++++++++
> kernel/trace/trace_entries.h | 13 +++
> kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
> kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
> kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
> kernel/trace/trace_output.c | 70 +++++++++++++++
> 8 files changed, 670 insertions(+), 12 deletions(-)
On Mon, 17 Nov 2025 23:25:56 -0800
Namhyung Kim <namhyung@kernel.org> wrote:
> > As for the perf event that is triggered. It currently is a dynamic array of
> > 64 bit values. Each value is broken up into 8 bits for what type of perf
> > event it is, and 56 bits for the counter. It only writes a per CPU raw
> > counter and does not do any math. That would be needed to be done by any
> > post processing.
>
> If you want to keep the perf events per CPU, you may consider CPU
> migrations for the func-graph case. Otherwise userspace may not
> calculate the diff from the begining correctly.
That's easily solved by the user space tool adding a sched_switch perf event
trigger. ;-)
>
> Just FYI, I did the similar thing (like fgraph case) in uftrace and I
> grouped two related events to produce a metric.
>
> $ uftrace -T a@read=pmu-cycle ~/tmp/abc
> # DURATION TID FUNCTION
> [ 521741] | main() {
> [ 521741] | a() {
> [ 521741] | /* read:pmu-cycle (cycles=482 instructions=38) */
> [ 521741] | b() {
> [ 521741] | c() {
> 0.659 us [ 521741] | getpid();
> 1.600 us [ 521741] | } /* c */
> 1.780 us [ 521741] | } /* b */
> [ 521741] | /* diff:pmu-cycle (cycles=+7361 instructions=+3955 IPC=0.54) */
> 24.485 us [ 521741] | } /* a */
> 34.797 us [ 521741] | } /* main */
>
> It reads cycles and instructions events (specified by 'pmu-cycle') at
> entry and exit of the given function ('a') and shows the diff with the
> metric IPC.
I originally tried to implement this, but then it became more complex than
I wanted in the kernel. As then I need to add a hook in the sched_switch
and record the perf event counter there, and keep track of it for every
task. That would require memory to be saved somewhere. I started adding it
to the function graph shadow stack and then just decided that it would be
so much easier to let user space figure it out.
By running function graph tracer and showing the start and end counters, as
well as the counters at the sched_switch trace event, user space could do
all the math and accounting, and the code in the kernel can remain simple.
-- Steve
Hi Steve,
Thanks for the great idea!
On Mon, 17 Nov 2025 19:29:50 -0500
Steven Rostedt <rostedt@kernel.org> wrote:
>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
> event_cache_misses
> event_cpu_cycles
> func-cache-misses
> func-cpu-cycles
> funcgraph-cache-misses
> funcgraph-cpu-cycles
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventual work with many more perf events than just cache-misses
> and cpu-cycles , using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
> set_event_perf, set_ftrace_perf, set_fgraph_perf
What about adding a global `trigger` action file so that users can
write these "perf" actions into it? It would be something like
stacktrace for events. (Maybe we can move stacktrace/user-stacktrace
into it too.)
For pre-defined/software counters:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
For some hardware event sources (see /sys/bus/event_source/devices/):
# echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
If we need to set those counters for tracers and events separately,
we can add `events/trigger` and `tracer-trigger` files.
echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
To disable counters, we can use '!' the same as with event triggers.
echo !perf:cpu_cycles > trigger
To add more than 2 counters, connect them with ':'.
(Or we can allow appending new perf counters.)
This allows users to set perf counter options for each event.
Maybe we should also move the 'stacktrace'/'userstacktrace' option
flags into it eventually.
>
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
> is_vmalloc_addr() {
> /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
Just a style question: Would this mean the first line is for function entry
and the second one is function return?
>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
> 2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
> 416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the vale is 56 bits and
> when calculating the difference between start and end do something like:
>
> if (start > end)
> end |= 1ULL << 56;
>
> delta = end - start;
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
> cpu_cycles:1
> cach_misses:2
> [..]
Looks good to me. I think the pre-defined events of `perf list`
will be there and have fixed numbers.
Thank you,
>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
> # cd /sys/kernel/tracing
> # echo 1 > options/event_cpu_cycles
> # echo 1 > options/event_cache_misses
> # echo 1 > events/syscalls/enable
> # cat trace
> [..]
> bash-995 [007] ..... 98.255252: sys_write -> 0x2
> bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
> bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
> bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
> bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
> bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
> bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
> bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
> bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
> bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
> bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
> bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
> bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
>
>
> Steven Rostedt (3):
> tracing: Add perf events
> ftrace: Add perf counters to function tracing
> fgraph: Add perf counters to function graph tracer
>
> ----
> include/linux/trace_recursion.h | 5 +-
> kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 38 ++++++++
> kernel/trace/trace_entries.h | 13 +++
> kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
> kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
> kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
> kernel/trace/trace_output.c | 70 +++++++++++++++
> 8 files changed, 670 insertions(+), 12 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Tue, 18 Nov 2025 12:08:21 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Hi Steve,
>
> Thanks for the great idea!
Thanks!
> >
> > As this will eventual work with many more perf events than just cache-misses
> > and cpu-cycles , using options is not appropriate. Especially since the
> > options are limited to a 64 bit bitmask, and that can easily go much higher.
> > I'm thinking about having a file instead that will act as a way to enable
> > perf events for events, function and function graph tracing.
> >
> > set_event_perf, set_ftrace_perf, set_fgraph_perf
>
> What about adding a global `trigger` action file so that user can
> add these "perf" actions to write into it. It is something like
> stacktrace for events. (Maybe we can move stacktrace/user-stacktrace
> into it too)
>
> For pre-defined/software counters:
> # echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
For events, it would make more sense to put it into the events directory:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
As there is already an events/enable
Heck we could even add it per system:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/syscalls/trigger
>
> For some hardware event sources (see /sys/bus/event_source/devices/):
> # echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
>
> echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
Still need a way to add an identifier list. Currently, if the size of
the type identifier is one byte, then it can only support up to 256 events.
Do we need every event for this? Or just have a subset of events that
would be supported?
>
> If we need to set those counters for tracers and events separately,
> we can add `events/trigger` and `tracer-trigger` files.
As I mentioned, the trigger for events should be in the events directory.
We could add a ftrace_trigger that can affect both function and
function graph tracer.
>
> echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
>
> To disable counters, we can use '!' as same as event triggers.
>
> echo !perf:cpu_cycles > trigger
Yes, it would follow the current way to disable a trigger.
>
> To add more than 2 counters, connect it with ':'.
> (or, we will allow to append new perf counters)
> This allows user to set perf counter options for each events.
>
> Maybe we also should move 'stacktrace'/'userstacktrace' option
> flags to it too eventually.
We can add them, but may never be able to remove them due to backward
compatibility.
> >
> > And an available_perf_events that show what can be written into these files,
> > (similar to how set_ftrace_filter works). But for now, it was just easier to
> > implement them as options.
> >
> > As for the perf event that is triggered. It currently is a dynamic array of
> > 64 bit values. Each value is broken up into 8 bits for what type of perf
> > event it is, and 56 bits for the counter. It only writes a per CPU raw
> > counter and does not do any math. That would be needed to be done by any
> > post processing.
> >
> > Since the values are for user space to do the subtraction to figure out the
> > difference between events, for example, the function_graph tracer may have:
> >
> > is_vmalloc_addr() {
> > /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> > /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> > }
>
> Just a style question: Would this mean the first line is for function entry
> and the second one is function return?
Yes.
Perhaps we could add a field to the perf event to allow for annotation,
so the above could look like:
is_vmalloc_addr() {
/* --> cpu_cycles: 5582263593 cache_misses: 2869004572 */
/* <-- cpu_cycles: 5582267527 cache_misses: 2869006049 */
}
Or something similar?
> > The next question is how to label the perf events to be in the 8 bit
> > portion. It could simply be a value that is registered, and listed in the
> > available_perf_events file.
> >
> > cpu_cycles:1
> > cach_misses:2
> > [..]
>
> Looks good to me. I think pre-definied events of `perf list`
> will be there and have fixed numbers.
Thanks for looking at this,
-- Steve
On Mon, 17 Nov 2025 22:42:27 -0500
Steven Rostedt <rostedt@kernel.org> wrote:
> > > As this will eventual work with many more perf events than just cache-misses
> > > and cpu-cycles , using options is not appropriate. Especially since the
> > > options are limited to a 64 bit bitmask, and that can easily go much higher.
> > > I'm thinking about having a file instead that will act as a way to enable
> > > perf events for events, function and function graph tracing.
> > >
> > > set_event_perf, set_ftrace_perf, set_fgraph_perf
> >
> > What about adding a global `trigger` action file so that user can
> > add these "perf" actions to write into it. It is something like
> > stacktrace for events. (Maybe we can move stacktrace/user-stacktrace
> > into it too)
> >
> > For pre-defined/software counters:
> > # echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
>
> For events, it would make more sense to put it into the events directory:
>
> # echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
>
> As there is already a events/enable
>
> Heck we could even add it per system:
>
> # echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/syscalls/trigger
Yes, this will be very useful!
>
> >
> > For some hardware event sources (see /sys/bus/event_source/devices/):
> > # echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
> >
> > echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
>
> Still need a way to add an identifier list. Currently, if the size of
> the type identifier is one byte, then it can only support up to 256 events.
Yes, so if the user adds more than that, it will return -ENOSPC.
>
> Do we need every event for this? Or just have a subset of events that
> would be supported?
For event tracing, these may be used for measuring the delta between
paired events. For such a use case, the user may want to set it only on
those events.
>
>
> >
> > If we need to set those counters for tracers and events separately,
> > we can add `events/trigger` and `tracer-trigger` files.
>
> As I mentioned, the trigger for events should be in the events directory.
Agreed.
>
> We could add a ftrace_trigger that can affect both function and
> function graph tracer.
Got it.
>
> >
> > echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
> >
> > To disable counters, we can use '!' as same as event triggers.
> >
> > echo !perf:cpu_cycles > trigger
>
> Yes, it would follow the current way to disable a trigger.
>
> >
> > To add more than 2 counters, connect it with ':'.
> > (or, we will allow to append new perf counters)
> > This allows user to set perf counter options for each events.
> >
> > Maybe we also should move 'stacktrace'/'userstacktrace' option
> > flags to it too eventually.
>
> We can add them, but may never be able to remove them due to backward
> compatibility.
Ah, indeed.
>
> > >
> > > And an available_perf_events that show what can be written into these files,
> > > (similar to how set_ftrace_filter works). But for now, it was just easier to
> > > implement them as options.
> > >
> > > As for the perf event that is triggered. It currently is a dynamic array of
> > > 64 bit values. Each value is broken up into 8 bits for what type of perf
> > > event it is, and 56 bits for the counter. It only writes a per CPU raw
> > > counter and does not do any math. That would be needed to be done by any
> > > post processing.
> > >
> > > Since the values are for user space to do the subtraction to figure out the
> > > difference between events, for example, the function_graph tracer may have:
> > >
> > > is_vmalloc_addr() {
> > > /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> > > /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> > > }
> >
> > Just a style question: Would this mean the first line is for function entry
> > and the second one is function return?
>
> Yes.
>
> Perhaps we could add field to the perf event to allow for annotation,
> so the above could look like:
>
> is_vmalloc_addr() {
> /* --> cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* <-- cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
>
> Or something similar?
Yeah, it looks more readable.
Thank you!
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Tue, 18 Nov 2025 17:11:47 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > > If we need to set those counters for tracers and events separately,
> > > we can add `events/trigger` and `tracer-trigger` files.
> >
> > As I mentioned, the trigger for events should be in the events directory.
>
> Agreed.
>
> > We could add a ftrace_trigger that can affect both function and
> > function graph tracer.

Actually, I should add "trigger" files in the ftrace events:

  events/ftrace/function/trigger
  events/ftrace/funcgraph_entry/trigger
  events/ftrace/funcgraph_exit/trigger

Hmm,

-- Steve
On Tue, 18 Nov 2025 17:11:47 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > > echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
> >
> > Still need a way to add an identifier list. Currently, if the size of
> > the type identifier is one byte, then it can only support up to 256 events.
>
> Yes, so if user adds more than that, it will return -ENOSPC.

The issue is that the ids are defined by what is possible, not by what the
user enables.

-- Steve
On Tue, 18 Nov 2025 08:53:24 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> > Yes, so if user adds more than that, it will return -ENOSPC.
>
> The issue is that the ids are defined by what is possible, not by what the
> user enables.

Now we could take 4 more bits from the mask and bring the raw value down to
just 52 bits. 2^51 at 4GHz is still 6 days. Which is plenty more than
required. This will make the id 12 bits, or 4096 different defined events.

-- Steve