Message-ID: <20250729161912.226505358@kernel.org>
Date: Tue, 29 Jul 2025 12:18:18 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Namhyung Kim, Jonathan Corbet, Randy Dunlap
Subject: [PATCH v2 2/2] Documentation: tracing: Add documentation about eprobes
References: <20250729161816.678462962@kernel.org>

From: Steven Rostedt

Eprobes was added back in 5.15, but was never documented. It became a
"secret" interface even though it has been a topic of several
presentations. For some reason, when eprobes was added, documenting it
never became a priority, until now.

Signed-off-by: Steven Rostedt (Google)
Acked-by: Masami Hiramatsu (Google)
Reviewed-by: Randy Dunlap
---
Changes since v1: https://lore.kernel.org/20250728171522.7d54e116@batman.local.home

- Renamed to eprobetrace.rst (Masami Hiramatsu)

- Fixed title of document (Masami Hiramatsu)

- Fixed grammar and spellings (Randy Dunlap)

 Documentation/trace/eprobetrace.rst | 269 ++++++++++++++++++++++++++++
 Documentation/trace/index.rst       |   1 +
 2 files changed, 270 insertions(+)
 create mode 100644 Documentation/trace/eprobetrace.rst

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
new file mode 100644
index 000000000000..6d8946983466
--- /dev/null
+++ b/Documentation/trace/eprobetrace.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Eprobe - Event-based Probe Tracing
+==================================
+
+:Author: Steven Rostedt
+
+- Written for v6.17
+
+Overview
+========
+
+Eprobes are dynamic events that are placed on existing events to either
+dereference a field that is a pointer, or simply to limit what fields are
+recorded in the trace event.
+
+Eprobes depend on kprobe events, so to enable this feature, build your kernel
+with CONFIG_EPROBE_EVENTS=y.
+
+Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
+
+Synopsis of eprobe_events
+-------------------------
+::
+
+  e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]       : Set a probe
+  -:[EGRP/][EEVENT]                               : Clear a probe
+
+ EGRP           : Group name of the new event. If omitted, use "eprobes" for it.
+ EEVENT         : Event name. If omitted, the event name is generated and will
+                  be the same event name as the event it attached to.
+ GRP            : Group name of the event to attach to.
+ EVENT          : Event name of the event to attach to.
+
+ FETCHARGS      : Arguments. Each probe can have up to 128 args.
+  $FIELD        : Fetch the value of the event field called FIELD.
+  @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
+  @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
+  $comm         : Fetch current task comm.
+  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  \IMM          : Store an immediate value to the argument.
+  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
+  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
+                  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+                  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+                  "string", "ustring", "symbol", "symstr" and "bitfield" are
+                  supported.
+
+Types
+-----
+The FETCHARGS above are very similar to those of kprobe events as described in
+Documentation/trace/kprobetrace.rst.
+
+The difference between eprobes and kprobes FETCHARGS is that eprobes have a
+$FIELD command that returns the content of the event field of the event
+they are attached to. Eprobes do not have access to the registers, stacks and
+function arguments that kprobes have.
+
+If a field argument is a pointer, it may be dereferenced just like a memory
+address using the FETCHARGS syntax.
+
+
+Attaching to dynamic events
+---------------------------
+
+Eprobes may attach to dynamic events as well as to normal events. They may
+attach to a kprobe event, a synthetic event or an fprobe event. This is useful
+if the type of a field needs to be changed. See Example 2 below.
+
+Usage examples
+==============
+
+Example 1
+---------
+
+The basic usage of eprobes is to limit the data that is being recorded into
+the tracing buffer. For example, a common event to trace is the sched_switch
+trace event. That has a format of::
+
+    field:unsigned short common_type;          offset:0;  size:2;  signed:0;
+    field:unsigned char common_flags;          offset:2;  size:1;  signed:0;
+    field:unsigned char common_preempt_count;  offset:3;  size:1;  signed:0;
+    field:int common_pid;                      offset:4;  size:4;  signed:1;
+
+    field:char prev_comm[16];                  offset:8;  size:16; signed:0;
+    field:pid_t prev_pid;                      offset:24; size:4;  signed:1;
+    field:int prev_prio;                       offset:28; size:4;  signed:1;
+    field:long prev_state;                     offset:32; size:8;  signed:1;
+    field:char next_comm[16];                  offset:40; size:16; signed:0;
+    field:pid_t next_pid;                      offset:56; size:4;  signed:1;
+    field:int next_prio;                       offset:60; size:4;  signed:1;
+
+The first four fields are common to all events and can not be limited. But the
+rest of the event has 56 bytes of information. It records the names of the
+previous and next tasks being scheduled out and in, as well as their pids and
+priorities. It also records the state of the previous task. If only the pids
+of the tasks are of interest, why waste the ring buffer with all the other
+fields?
+
+An eprobe can limit what gets recorded. Note that it does not help performance,
+as all the fields are recorded in a temporary buffer to process the eprobe.
+::
+
+ # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
+ # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 2721/2721   #P:8
+ #
+ #                                _-----=> irqs-off/BH-disabled
+ #                               / _----=> need-resched
+ #                              | / _---=> hardirq/softirq
+ #                              || / _--=> preempt-depth
+ #                              ||| / _-=> migrate-disable
+ #                              |||| /     delay
+ #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+ #              | |         |   |||||     |         |
+     sshd-session-1082    [004] d..4.  5041.239906: switch: (sched.sched_switch) prev=1082 next=0
+             bash-1085    [001] d..4.  5041.240198: switch: (sched.sched_switch) prev=1085 next=141
+    kworker/u34:5-141     [001] d..4.  5041.240259: switch: (sched.sched_switch) prev=141 next=1085
+           <idle>-0       [004] d..4.  5041.240354: switch: (sched.sched_switch) prev=0 next=1082
+             bash-1085    [001] d..4.  5041.240385: switch: (sched.sched_switch) prev=1085 next=141
+    kworker/u34:5-141     [001] d..4.  5041.240410: switch: (sched.sched_switch) prev=141 next=1085
+             bash-1085    [001] d..4.  5041.240478: switch: (sched.sched_switch) prev=1085 next=0
+     sshd-session-1082    [004] d..4.  5041.240526: switch: (sched.sched_switch) prev=1082 next=0
+           <idle>-0       [001] d..4.  5041.247524: switch: (sched.sched_switch) prev=0 next=90
+           <idle>-0       [002] d..4.  5041.247545: switch: (sched.sched_switch) prev=0 next=16
+      kworker/1:1-90      [001] d..4.  5041.247580: switch: (sched.sched_switch) prev=90 next=0
+        rcu_sched-16      [002] d..4.  5041.247591: switch: (sched.sched_switch) prev=16 next=0
+           <idle>-0       [002] d..4.  5041.257536: switch: (sched.sched_switch) prev=0 next=16
+        rcu_sched-16      [002] d..4.  5041.257573: switch: (sched.sched_switch) prev=16 next=0
+
+Note, without adding the "u32" after the prev_pid and next_pid, the values
+would default to being shown in hexadecimal.
+
+Example 2
+---------
+
+If a specific system call is to be recorded but the syscalls events are not
+enabled, the raw_syscalls can still be used (syscall events are not normal
+events, but are created from the raw_syscalls events within the kernel).
+In order to trace the openat system call, one can create
+an event probe on top of the raw_syscalls event:
+::
+
+ # cd /sys/kernel/tracing
+ # cat events/raw_syscalls/sys_enter/format
+ name: sys_enter
+ ID: 395
+ format:
+    field:unsigned short common_type;          offset:0;  size:2;  signed:0;
+    field:unsigned char common_flags;          offset:2;  size:1;  signed:0;
+    field:unsigned char common_preempt_count;  offset:3;  size:1;  signed:0;
+    field:int common_pid;                      offset:4;  size:4;  signed:1;
+
+    field:long id;                             offset:8;  size:8;  signed:1;
+    field:unsigned long args[6];               offset:16; size:48; signed:0;
+
+ print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
+
+From the source code, sys_openat() has:
+::
+
+ int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
+ {
+     return my_syscall4(__NR_openat, dirfd, path, flags, mode);
+ }
+
+The path is the second parameter, and that is what is wanted.
+::
+
+ # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
+
+This is being run on x86_64 where the word size is 8 bytes and the openat
+system call __NR_openat is set at 257.
+::
+
+ # echo 'nr == 257' > events/eprobes/openat/filter
+
+Now enable the event and look at the trace.
+::
+
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4   #P:8
+ #
+ #                                _-----=> irqs-off/BH-disabled
+ #                               / _----=> need-resched
+ #                              | / _---=> hardirq/softirq
+ #                              || / _--=> preempt-depth
+ #                              ||| / _-=> migrate-disable
+ #                              |||| /     delay
+ #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+ #              | |         |   |||||     |         |
+              cat-1298    [003] ...2.  2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+              cat-1298    [003] ...2.  2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+              cat-1298    [003] ...2.  2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+              cat-1298    [003] ...2.  2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+
+The filename shows "(fault)". This is likely because the filename has not been
+pulled into memory yet and currently trace events cannot fault in memory that
+is not present. When an eprobe tries to read memory that has not been faulted
+in yet, it will show the "(fault)" text.
+
+To get around this, note that the kernel will likely pull in this filename and
+make it present while it processes the system call. A synthetic event can pass
+the address of the filename from the entry of the system call to its exit,
+where the filename can then be shown when the system call returns.
+
+Remove the old eprobe::
+
+ # echo 0 > events/eprobes/openat/enable
+ # echo '-:openat' >> dynamic_events
+
+This time make an eprobe where the address of the filename is saved::
+
+ # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
+
+Create a synthetic event that passes the address of the filename to the
+end of the system call::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
+
+Now that the address of the filename has been passed to the end of the
+system call, create another eprobe to attach to the exit event to show the
+string::
+
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4   #P:8
+ #
+ #                                _-----=> irqs-off/BH-disabled
+ #                               / _----=> need-resched
+ #                              | / _---=> hardirq/softirq
+ #                              || / _--=> preempt-depth
+ #                              ||| / _-=> migrate-disable
+ #                              |||| /     delay
+ #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+ #              | |         |   |||||     |         |
+              cat-1331    [001] ...5.  2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
+              cat-1331    [001] ...5.  2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
+              cat-1331    [001] ...5.  2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
+              cat-1331    [001] ...5.  2944.831362: openat: (synthetic.filename) filename="trace"
+
+Example 3
+---------
+
+If syscall trace events are available, the above would not need the first
+eprobe, but it would still need the last one::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+
+And this would produce the same result as Example 2.
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index cc1dc5a087e8..b4a429dc4f7a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -36,6 +36,7 @@ the Linux kernel.
    kprobes
    kprobetrace
    fprobetrace
+   eprobetrace
    fprobe
    ring-buffer-design
 
-- 
2.47.2
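
The eprobe definitions in the examples all follow the synopsis form
"e:[EGRP/]EEVENT GRP.EVENT [FETCHARGS]", and the +8 offset in Example 2 is
the second slot of args[] with 8-byte words. As a minimal sketch of that
arithmetic and string layout, the shell helpers below compose such
definition lines. The names eprobe_def and arg_offset are hypothetical
(they exist neither in the kernel tree nor in this patch), and the script
only prints strings; it does not touch tracefs.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only. They build eprobe definition
# strings and never write to /sys/kernel/tracing.

# eprobe_def NEW_EVENT TARGET_GROUP TARGET_EVENT [FETCHARG...]
# Prints one line in the form "e:[EGRP/]EEVENT GRP.EVENT [FETCHARGS]".
eprobe_def() {
    new_event=$1
    target="$2.$3"
    shift 3
    printf 'e:%s %s' "$new_event" "$target"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# arg_offset INDEX [WORD_SIZE]
# Byte offset of the INDEX-th (0-based) slot in an array of machine words.
# WORD_SIZE defaults to 8, as assumed for x86_64 in Example 2.
arg_offset() {
    echo $(( $1 * ${2:-8} ))
}

# Example 1: record only the two pids of sched_switch.
eprobe_def sched/switch sched sched_switch 'prev=$prev_pid:u32' 'next=$next_pid:u32'

# Example 2: save the second syscall argument (index 1) as a 64-bit hex value.
eprobe_def openat_start raw_syscalls sys_enter 'nr=$id' \
    "filename=+$(arg_offset 1)(\$args):x64"
```

Run as is, the script prints the same two definition strings that Example 1
and Example 2 write to dynamic_events. To actually create an event, the
output of eprobe_def would be appended to
/sys/kernel/tracing/dynamic_events as root on a kernel built with
CONFIG_EPROBE_EVENTS=y.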