Hi
Here is a stab at adding an ioctl for sideband events.
This is to overcome races when reading the same information
from /proc.
To keep it simple, the ioctl is limited to emitting existing
sideband events (fork, namespaces, comm, mmap) to an already
active context.
There are not yet any perf tools patches at this stage.
Adrian Hunter (5):
perf: Add ioctl to emit sideband events
perf: Add fork to the sideband ioctl
perf: Add namespaces to the sideband ioctl
perf: Add comm to the sideband ioctl
perf: Add mmap to the sideband ioctl
include/uapi/linux/perf_event.h | 19 ++-
kernel/events/core.c | 315 +++++++++++++++++++++++++++++++++-------
2 files changed, 280 insertions(+), 54 deletions(-)
Regards
Adrian
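For illustration, a tool-side call into such an ioctl might look like the sketch below. The request name and number here are hypothetical placeholders invented for this sketch (the real definitions live in the patches, which are not shown in this cover letter); only the overall shape is taken from the description above: an ioctl on an already-open event fd asking the kernel to emit the existing sideband records (fork, namespaces, comm, mmap) into the already active context. The `'$'` magic is the one the existing `PERF_EVENT_IOC_*` requests use.

```c
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/*
 * Hypothetical sketch only: the real request name and number are
 * defined by the patch series, not by this cover letter.
 */
#define PERF_EVENT_IOC_EMIT_SIDEBAND_SKETCH	_IOW('$', 99, int)

/*
 * Ask the kernel to emit the existing sideband records (fork,
 * namespaces, comm, mmap) for @pid on the event's active ring buffer,
 * instead of the tool synthesizing them from /proc.
 */
static inline int emit_sideband(int event_fd, int pid)
{
	return ioctl(event_fd, PERF_EVENT_IOC_EMIT_SIDEBAND_SKETCH, pid);
}
```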
On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> Hi
>
> Here is a stab at adding an ioctl for sideband events.
>
> This is to overcome races when reading the same information
> from /proc.

What races? Are you talking about reading old state in /proc, the kernel
delivering a sideband event for the new state, and then you writing the
old state out?

Surely that's something perf tool can fix without kernel changes?
On 17/04/23 14:02, Peter Zijlstra wrote:
> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>> Hi
>>
>> Here is a stab at adding an ioctl for sideband events.
>>
>> This is to overcome races when reading the same information
>> from /proc.
>
> What races? Are you talking about reading old state in /proc, the kernel
> delivering a sideband event for the new state, and then you writing the
> old state out?
>
> Surely that's something perf tool can fix without kernel changes?

Yes, and it was a bit of a brain fart not to realise that.

There may still be corner cases where different kinds of events are
interdependent; perhaps NAMESPACES events vs MMAP events could
have ordering issues.

Putting that aside, the ioctl may be quicker than reading from
/proc. I could get some numbers and see what people think.
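The tool-side fix being alluded to can be modeled simply: enable the events first, snapshot /proc second, and when both describe the same pid, let the newer observation win so a sideband record for new state is never clobbered by stale /proc data. A minimal sketch of that resolution rule (the struct and names are illustrative, not actual perf tool code):

```c
/*
 * Illustrative model, not perf tool code: one piece of per-pid state,
 * either synthesized from a /proc snapshot or delivered by the kernel
 * as a sideband event, stamped with when it was observed.
 */
struct pid_state {
	unsigned long long time;	/* observation timestamp */
	int from_sideband;		/* 1 = kernel sideband, 0 = /proc */
};

/*
 * Resolution rule: the newer observation wins; on a timestamp tie the
 * kernel sideband record wins, because the /proc snapshot may already
 * have been stale by the time it was read.
 */
static const struct pid_state *resolve(const struct pid_state *a,
				       const struct pid_state *b)
{
	if (a->time != b->time)
		return a->time > b->time ? a : b;
	return a->from_sideband ? a : b;
}
```

This does not address interdependent event kinds (e.g. NAMESPACES vs MMAP ordering), which is the corner case noted above.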
On 18/04/23 09:18, Adrian Hunter wrote:
> On 17/04/23 14:02, Peter Zijlstra wrote:
>> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Here is a stab at adding an ioctl for sideband events.
>>>
>>> This is to overcome races when reading the same information
>>> from /proc.
>>
>> What races? Are you talking about reading old state in /proc the kernel
>> delivering a sideband event for the new state, and then you writing the
>> old state out?
>>
>> Surely that's something perf tool can fix without kernel changes?
>
> Yes, and it was a bit of a brain fart not to realise that.
>
> There may still be corner cases, where different kinds of events are
> interdependent, perhaps NAMESPACES events vs MMAP events could
> have ordering issues.
>
> Putting that aside, the ioctl may be quicker than reading from
> /proc. I could get some numbers and see what people think.
>
Here's a result with a quick hack to use the ioctl but without
handling the buffer becoming full (hence the -m4M)
# ps -e | wc -l
1171
# perf.old stat -- perf.old record -o old.data --namespaces -a true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.095 MB old.data (100 samples) ]
Performance counter stats for 'perf.old record -o old.data --namespaces -a true':
498.15 msec task-clock # 0.987 CPUs utilized
126 context-switches # 252.935 /sec
64 cpu-migrations # 128.475 /sec
4396 page-faults # 8.825 K/sec
1927096347 cycles # 3.868 GHz
4563059399 instructions # 2.37 insn per cycle
914232559 branches # 1.835 G/sec
6618052 branch-misses # 0.72% of all branches
9633787105 slots # 19.339 G/sec
4394300990 topdown-retiring # 38.8% Retiring
3693815286 topdown-bad-spec # 32.6% Bad Speculation
1692356927 topdown-fe-bound # 14.9% Frontend Bound
1544151518 topdown-be-bound # 13.6% Backend Bound
0.504636742 seconds time elapsed
0.158237000 seconds user
0.340625000 seconds sys
# perf.old stat -- perf.new record -o new.data -m4M --namespaces -a true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.095 MB new.data (103 samples) ]
Performance counter stats for 'perf.new record -o new.data -m4M --namespaces -a true':
386.61 msec task-clock # 0.988 CPUs utilized
100 context-switches # 258.658 /sec
65 cpu-migrations # 168.128 /sec
4935 page-faults # 12.765 K/sec
1495905137 cycles # 3.869 GHz
3647660473 instructions # 2.44 insn per cycle
735822370 branches # 1.903 G/sec
5765668 branch-misses # 0.78% of all branches
7477722620 slots # 19.342 G/sec
3415835954 topdown-retiring # 39.5% Retiring
2748625759 topdown-bad-spec # 31.8% Bad Speculation
1221594670 topdown-fe-bound # 14.1% Frontend Bound
1256150733 topdown-be-bound # 14.5% Backend Bound
0.391472763 seconds time elapsed
0.141207000 seconds user
0.246277000 seconds sys
# ls -lh old.data
-rw------- 1 root root 1.2M Apr 18 13:19 old.data
# ls -lh new.data
-rw------- 1 root root 1.2M Apr 18 13:19 new.data
#
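As a cross-check of the two runs above, this is plain arithmetic on the numbers perf stat printed: task-clock falls from 498.15 ms to 386.61 ms (about a 22% reduction) and the instruction count falls by about 0.92 billion.

```c
/* Percentage reduction between the two perf stat runs above. */
static double pct_reduction(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/*
 * task-clock:   498.15 ms -> 386.61 ms, roughly a 22% reduction.
 * instructions: 4563059399 -> 3647660473, 915398926 (~0.92e9) fewer.
 */
```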
On Tue, Apr 18, 2023 at 6:36 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 18/04/23 09:18, Adrian Hunter wrote:
> > On 17/04/23 14:02, Peter Zijlstra wrote:
> >> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> >>> Hi
> >>>
> >>> Here is a stab at adding an ioctl for sideband events.
> >>>
> >>> This is to overcome races when reading the same information
> >>> from /proc.
> >>
> >> What races? Are you talking about reading old state in /proc, the kernel
> >> delivering a sideband event for the new state, and then you writing the
> >> old state out?
> >>
> >> Surely that's something perf tool can fix without kernel changes?
> >
> > Yes, and it was a bit of a brain fart not to realise that.
> >
> > There may still be corner cases where different kinds of events are
> > interdependent; perhaps NAMESPACES events vs MMAP events could
> > have ordering issues.
> >
> > Putting that aside, the ioctl may be quicker than reading from
> > /proc. I could get some numbers and see what people think.
>
> Here's a result with a quick hack to use the ioctl but without
> handling the buffer becoming full (hence the -m4M)
>
> # ps -e | wc -l
> 1171
> # perf.old stat -- perf.old record -o old.data --namespaces -a true
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.095 MB old.data (100 samples) ]
>
> Performance counter stats for 'perf.old record -o old.data --namespaces -a true':
>
> 498.15 msec task-clock # 0.987 CPUs utilized
> 126 context-switches # 252.935 /sec
> 64 cpu-migrations # 128.475 /sec
> 4396 page-faults # 8.825 K/sec
> 1927096347 cycles # 3.868 GHz
> 4563059399 instructions # 2.37 insn per cycle
> 914232559 branches # 1.835 G/sec
> 6618052 branch-misses # 0.72% of all branches
> 9633787105 slots # 19.339 G/sec
> 4394300990 topdown-retiring # 38.8% Retiring
> 3693815286 topdown-bad-spec # 32.6% Bad Speculation
> 1692356927 topdown-fe-bound # 14.9% Frontend Bound
> 1544151518 topdown-be-bound # 13.6% Backend Bound
>
> 0.504636742 seconds time elapsed
>
> 0.158237000 seconds user
> 0.340625000 seconds sys
>
> # perf.old stat -- perf.new record -o new.data -m4M --namespaces -a true
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.095 MB new.data (103 samples) ]
>
> Performance counter stats for 'perf.new record -o new.data -m4M --namespaces -a true':
>
> 386.61 msec task-clock # 0.988 CPUs utilized
> 100 context-switches # 258.658 /sec
> 65 cpu-migrations # 168.128 /sec
> 4935 page-faults # 12.765 K/sec
> 1495905137 cycles # 3.869 GHz
> 3647660473 instructions # 2.44 insn per cycle
> 735822370 branches # 1.903 G/sec
> 5765668 branch-misses # 0.78% of all branches
> 7477722620 slots # 19.342 G/sec
> 3415835954 topdown-retiring # 39.5% Retiring
> 2748625759 topdown-bad-spec # 31.8% Bad Speculation
> 1221594670 topdown-fe-bound # 14.1% Frontend Bound
> 1256150733 topdown-be-bound # 14.5% Backend Bound
>
> 0.391472763 seconds time elapsed
>
> 0.141207000 seconds user
> 0.246277000 seconds sys
>
> # ls -lh old.data
> -rw------- 1 root root 1.2M Apr 18 13:19 old.data
> # ls -lh new.data
> -rw------- 1 root root 1.2M Apr 18 13:19 new.data
> #

Cool, so the headline is a ~20% or 1 billion instruction reduction in
perf startup overhead?

Thanks,
Ian
On Mon, Apr 17, 2023 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
> > Hi
> >
> > Here is a stab at adding an ioctl for sideband events.
> >
> > This is to overcome races when reading the same information
> > from /proc.
>
> What races? Are you talking about reading old state in /proc, the kernel
> delivering a sideband event for the new state, and then you writing the
> old state out?
>
> Surely that's something perf tool can fix without kernel changes?

So my reading is that during event synthesis there are races between
reading the different /proc files. There is still, I believe, a race
with perf record/top with uid filtering which reminds me of this.
The uid filtering race is that we scan /proc to find processes (pids)
for a uid, we then synthesize the maps for each of these pids, but if a
pid starts or exits we either error out or don't sample that pid. I
believe the error-out behavior is easy to hit 100% of the time, making
uid mode of limited use.

This may be for something other than synthesis, but for synthesis a
few points are:
- as servers get bigger and consequently more jobs get consolidated
  on them, synthesis is slow (hence --num-thread-synthesize) and also
  the events dominate the perf.data file - perhaps >90% of the file
  size, and a lot of that will be for processes with no samples in
  them. Another issue here is that all those file descriptors don't
  come for free in the kernel.
- BPF has buildid+offset stack traces that remove the need for
  synthesis by having more expensive stack generation. I believe this
  is unpopular as adding this as a variant for every kind of event
  would be hard, but perhaps we can do some low-hanging fruit like
  instructions and cycles.
- I believe Jiri looked at doing synthesis with BPF. Perhaps we could
  do something similar to the off-cpu and tail-synthesize, where more
  things happen at the tail end of perf. Off-cpu records data in maps
  that it then synthesizes into samples.

There is also a long-standing issue around not sampling munmap (or
mremap) that causes plenty of issues. Perhaps if we had less mmap in
the perf.data file we could add these.

Thanks,
Ian
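The error-out behavior in the uid-filtering scan comes down to how a vanished pid is handled between the /proc scan and reading its maps. A tool-side sketch of the tolerant handling (the function name and return convention are illustrative, not the actual perf synthesis code):

```c
#include <errno.h>
#include <stdio.h>

/*
 * Illustrative only, not the actual perf synthesis code: read
 * /proc/<pid>/maps for one pid found during the uid scan. The pid can
 * exit between the scan and this read, so a missing file must be
 * treated as "skip this pid", not as a fatal error.
 *
 * Returns 1 if maps were read, 0 if the process vanished (benign),
 * -1 on a real error.
 */
static int synthesize_pid_maps(int pid)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	f = fopen(path, "r");
	if (!f)
		return (errno == ENOENT || errno == ESRCH) ? 0 : -1;
	/* ... parse each line and emit a synthetic MMAP event ... */
	fclose(f);
	return 1;
}
```

With this convention, the caller drops vanished pids from its set instead of aborting the whole uid-filtered record session.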
On 17/04/23 19:37, Ian Rogers wrote:
> On Mon, Apr 17, 2023 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Fri, Apr 14, 2023 at 11:22:55AM +0300, Adrian Hunter wrote:
>>> Hi
>>>
>>> Here is a stab at adding an ioctl for sideband events.
>>>
>>> This is to overcome races when reading the same information
>>> from /proc.
>>
>> What races? Are you talking about reading old state in /proc, the kernel
>> delivering a sideband event for the new state, and then you writing the
>> old state out?
>>
>> Surely that's something perf tool can fix without kernel changes?
>
> So my reading is that during event synthesis there are races between
> reading the different /proc files. There is still, I believe, a race
> with perf record/top with uid filtering which reminds me of this.
> The uid filtering race is that we scan /proc to find processes (pids)
> for a uid, we then synthesize the maps for each of these pids but if a
> pid starts or exits we either error out or don't sample that pid. I
> believe the error out behavior is easy to hit 100% of the time making
> uid mode of limited use.
>
> This may be for something other than synthesis, but for synthesis a
> few points are:
> - as servers get bigger and consequently more jobs get consolidated
> on them, synthesis is slow (hence --num-thread-synthesize) and also
> the events dominate the perf.data file - perhaps >90% of the file
> size, and a lot of that will be for processes with no samples in them.

Note also, for hardware tracing, it isn't generally possible to know
that during tracing, and figuring it out afterwards and working
backwards may not be feasible.

> Another issue here is that all those file descriptors don't come for
> free in the kernel.
> - BPF has buildid+offset stack traces that remove the need for
> synthesis by having more expensive stack generation. I believe this is
> unpopular as adding this as a variant for every kind of event would be
> hard, but perhaps we can do some low-hanging fruit like instructions
> and cycles.
> - I believe Jiri looked at doing synthesis with BPF. Perhaps we could
> do something similar to the off-cpu and tail-synthesize, where more
> things happen at the tail end of perf. Off-cpu records data in maps
> that it then synthesizes into samples.
>
> There is also a long standing issue around not sampling munmap (or
> mremap) that causes plenty of issues. Perhaps if we had less mmap in
> the perf.data file we could add these.
>
> Thanks,
> Ian