glibc's opendir allocates a minimum of 32kb; when called recursively
for a directory tree the memory consumption can add up - nearly 300kb
during perf start-up when processing modules. Add a stack-allocated
variant of readdir sized a little more than 1kb.
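
As a rough illustration of the idea, here is a minimal sketch of what such a
stack-allocated directory iterator could look like. The io_dir/io_dir__readdir
names follow the series, but the field layout, helper functions and exact
buffer size below are assumptions for illustration, not the contents of
tools/lib/api/io_dir.h:

```
/* Sketch of a stack-allocated readdir replacement. The directory is
 * read with the raw getdents64 syscall into a fixed buffer embedded
 * in the struct, so no heap allocation is needed. */
#include <sys/syscall.h>
#include <unistd.h>

struct io_dirent64 {                  /* mirrors the kernel's linux_dirent64 */
	unsigned long long d_ino;
	long long          d_off;
	unsigned short     d_reclen;  /* total size of this record */
	unsigned char      d_type;
	char               d_name[256];
};

struct io_dir {
	int dirfd;       /* caller-owned fd opened with O_DIRECTORY */
	int used;        /* bytes of buff already handed out */
	int available;   /* bytes filled by the last getdents64 call */
	/* "a little more than 1kb", 8-byte aligned for dirent records */
	char buff[1024 + 256] __attribute__((aligned(8)));
};

static inline void io_dir__init(struct io_dir *iod, int dirfd)
{
	iod->dirfd = dirfd;
	iod->used = 0;
	iod->available = 0;
}

static inline struct io_dirent64 *io_dir__readdir(struct io_dir *iod)
{
	struct io_dirent64 *entry;

	if (iod->used >= iod->available) {
		/* Buffer drained: refill straight from the syscall. */
		long n = syscall(SYS_getdents64, iod->dirfd, iod->buff,
				 sizeof(iod->buff));

		if (n <= 0)   /* end of directory or error */
			return NULL;
		iod->available = (int)n;
		iod->used = 0;
	}
	entry = (struct io_dirent64 *)(iod->buff + iod->used);
	iod->used += entry->d_reclen;
	return entry;
}
```

A caller would openat() the directory with O_DIRECTORY, hand the fd to
io_dir__init(), loop on io_dir__readdir() until it returns NULL, and close the
fd afterwards; the roughly 1kb buffer lives in the caller's stack frame
instead of a 32kb-minimum heap allocation per open directory.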
v2: Remove the feature test and always use a perf-supplied getdents64
    to work around an Alpine Linux issue in v1:
    https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
    As suggested by Krzysztof Łopatowski
    <krzysztof.m.lopatowski@gmail.com>, who also pointed to the
    start-up time improvements that eliminating stat calls can achieve
    for perf trace:
    https://lore.kernel.org/lkml/20250206113314.335376-2-krzysztof.m.lopatowski@gmail.com/

    Convert parse-events and hwmon_pmu to use io_dir.
v1: This was previously part of the memory saving change set:
    https://lore.kernel.org/lkml/20231127220902.1315692-1-irogers@google.com/
    It is separated out here, with a feature check and a syscall
    workaround for missing getdents64 added.
Ian Rogers (7):
tools lib api: Add io_dir an allocation free readdir alternative
perf maps: Switch modules tree walk to io_dir__readdir
perf pmu: Switch to io_dir__readdir
perf header: Switch mem topology to io_dir__readdir
perf events: Remove scandir in thread synthesis
perf parse-events: Switch tracepoints to io_dir__readdir
perf hwmon_pmu: Switch event discovery to io_dir__readdir
 tools/lib/api/Makefile             |  2 +-
 tools/lib/api/io_dir.h             | 91 ++++++++++++++++++++++++++++++
 tools/perf/util/header.c           | 31 +++++-----
 tools/perf/util/hwmon_pmu.c        | 42 ++++++--------
 tools/perf/util/machine.c          | 19 +++----
 tools/perf/util/parse-events.c     | 32 ++++++-----
 tools/perf/util/pmu.c              | 46 +++++++--------
 tools/perf/util/pmus.c             | 30 ++++------
 tools/perf/util/synthetic-events.c | 22 ++++----
 9 files changed, 194 insertions(+), 121 deletions(-)
 create mode 100644 tools/lib/api/io_dir.h
--
2.48.1.502.g6dc24dfdaf-goog
On Fri, Feb 07, 2025 at 03:24:41PM -0800, Ian Rogers wrote:
> glibc's opendir allocates a minimum of 32kb; when called recursively
> for a directory tree the memory consumption can add up - nearly 300kb
> during perf start-up when processing modules. Add a stack-allocated
> variant of readdir sized a little more than 1kb.
>
> v2: Remove the feature test and always use a perf-supplied getdents64
>     to work around an Alpine Linux issue in v1:
>     https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
>     As suggested by Krzysztof Łopatowski
>     <krzysztof.m.lopatowski@gmail.com>, who also pointed to the
>     start-up time improvements that eliminating stat calls can achieve
>     for perf trace:
>     https://lore.kernel.org/lkml/20250206113314.335376-2-krzysztof.m.lopatowski@gmail.com/

Let me pick up Krzysztof's patch first.

Thanks,
Namhyung

>     Convert parse-events and hwmon_pmu to use io_dir.
>
> v1: This was previously part of the memory saving change set:
>     https://lore.kernel.org/lkml/20231127220902.1315692-1-irogers@google.com/
>     It is separated out here, with a feature check and a syscall
>     workaround for missing getdents64 added.
>
> Ian Rogers (7):
>   tools lib api: Add io_dir an allocation free readdir alternative
>   perf maps: Switch modules tree walk to io_dir__readdir
>   perf pmu: Switch to io_dir__readdir
>   perf header: Switch mem topology to io_dir__readdir
>   perf events: Remove scandir in thread synthesis
>   perf parse-events: Switch tracepoints to io_dir__readdir
>   perf hwmon_pmu: Switch event discovery to io_dir__readdir
>
>  tools/lib/api/Makefile             |  2 +-
>  tools/lib/api/io_dir.h             | 91 ++++++++++++++++++++++++++++++
>  tools/perf/util/header.c           | 31 +++++-----
>  tools/perf/util/hwmon_pmu.c        | 42 ++++++--------
>  tools/perf/util/machine.c          | 19 +++----
>  tools/perf/util/parse-events.c     | 32 ++++++-----
>  tools/perf/util/pmu.c              | 46 +++++++--------
>  tools/perf/util/pmus.c             | 30 ++++------
>  tools/perf/util/synthetic-events.c | 22 ++++----
>  9 files changed, 194 insertions(+), 121 deletions(-)
>  create mode 100644 tools/lib/api/io_dir.h
>
> --
> 2.48.1.502.g6dc24dfdaf-goog
>
On Fri, 7 Feb 2025 15:24:41 -0800
Ian Rogers <irogers@google.com> wrote:

> glibc's opendir allocates a minimum of 32kb; when called recursively
> for a directory tree the memory consumption can add up - nearly 300kb
> during perf start-up when processing modules. Add a stack-allocated
> variant of readdir sized a little more than 1kb.

Does 300kB really matter?

You seem to be trading memory allocation against system calls.
My 'gut feel' is that the system call cost will dominate.
(Unless all the directories are small.)

There might, of course, be other code in glibc that is designed
to make it all slower than necessary.

	David
On Sat, Feb 8, 2025 at 2:58 AM David Laight <david.laight.linux@gmail.com> wrote:
>
> On Fri, 7 Feb 2025 15:24:41 -0800
> Ian Rogers <irogers@google.com> wrote:
>
> > glibc's opendir allocates a minimum of 32kb; when called recursively
> > for a directory tree the memory consumption can add up - nearly 300kb
> > during perf start-up when processing modules. Add a stack-allocated
> > variant of readdir sized a little more than 1kb.
>
> Does 300kB really matter?

For someone on a laptop, probably no. For a company running the
command a large number of times in consolidated and heavily loaded
data centers, where binaries run in memory constrained containers,
memory usage contributes to the latency of the 99th percentile slow
executions, which because of the number of machines is something seen
all the time. That said, do you need to optimize usages where you
recurse over the kernel modules? That's something you wouldn't try to
do in a data centre, so there isn't so much need.

> You seem to be trading memory allocation against system calls.
> My 'gut feel' is that the system call cost will dominate.
> (Unless all the directories are small.)

It isn't so much the directories being small as the filenames within
them. getdents64 will vary the size of each entry based on the
filename size, which can be up to 256 bytes. As an example, for trace
points:
```
$ sudo ls -al /sys/kernel/debug/tracing/events/raw_syscalls/
total 0
drwxr-xr-x 1 root root 0 Feb  8 03:23 .
drwxr-xr-x 1 root root 0 Feb  8 03:20 ..
-rw-r----- 1 root root 0 Feb  8 03:23 enable
-rw-r----- 1 root root 0 Feb  8 03:23 filter
drwxr-xr-x 1 root root 0 Feb  8 03:23 sys_enter
drwxr-xr-x 1 root root 0 Feb  8 03:23 sys_exit
$ sudo ls -al /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter
total 0
drwxr-xr-x 1 root root 0 Feb  8 03:23 .
drwxr-xr-x 1 root root 0 Feb  8 03:23 ..
-rw-r----- 1 root root 0 Feb  8 03:23 enable
-rw-r----- 1 root root 0 Feb  8 03:23 filter
-r--r----- 1 root root 0 Feb  8 03:23 format
-r--r----- 1 root root 0 Feb  8 03:23 hist
-r--r----- 1 root root 0 Feb  8 03:23 id
-rw-r----- 1 root root 0 Feb  8 03:23 trigger
```
So each d_entry is going to cost you something like 20 bytes plus the
length of the filename. So for 6 entries with filenames varying up to
9 characters, you need 24 or 32 bytes per entry after padding. If
each entry were 32 bytes then you'd need a buffer of 192 bytes to
hold all the data; the code here is using 1kb. The value used by
glibc is between 32KB and 1MB, which is pretty generous when all you
need is 192 bytes. An issue in perf is that we may recursively
descend directories and use opendir on each level, so the buffer size
gets multiplied by the depth of the tree. So at 100s of KB these
memory allocations are some of the largest seen in profiles of perf -
BPF has some large allocations too.

So why not fix opendir? I've found trying to get things landed in
glibc is slow to glacial [1], and someone once thought those buffer
sizes were good, so who am I to argue? The LLVM libc and similar are
making similarly smaller-than-glibc buffer size choices. The addition
of io_dir means we can tune for the use cases that matter to perf,
such as kernel modules, procfs, sysfs, ... In general we don't see
directories with lots of entries and we're a long way from the 256
byte file name max length. While the code could potentially increase
the number of system calls, in practice it doesn't.

Thanks,
Ian

[1] I sent patches bringing back the "stack:tid" convention in
/proc/pid/maps by naming stacks on allocation (an extra syscall).
This can be used, for example, to identify stack vs heap memory
accesses. Linux used to support this but Linus himself removed the
support - the code to do it had quadratic execution time due to
searching, for each stack allocation, for the corresponding thread
stack pointer. Fixing the opendir memory size and stack naming in
glibc would be good things imo. Were libc better we may not need this
series (we still know our use cases better than a generic
implementation).

> There might, of course, be other code in glibc that is designed
> to make it all slower than necessary.
>
>       David
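
As a concrete check of the per-entry arithmetic in the reply above, the
following small standalone program is a sketch assuming the kernel's
linux_dirent64 layout (a 19-byte fixed header for d_ino, d_off, d_reclen and
d_type, with each record padded to an 8-byte boundary); the file names are
taken from the raw_syscalls listing:

```
#include <stdio.h>
#include <string.h>

/* Padded size of one getdents64 record: a 19-byte fixed header
 * (d_ino 8 + d_off 8 + d_reclen 2 + d_type 1), the name plus its
 * terminating NUL, rounded up to an 8-byte boundary. */
static unsigned int dirent64_reclen(const char *name)
{
	unsigned int raw = 19 + strlen(name) + 1;

	return (raw + 7) & ~7u;
}

int main(void)
{
	/* Entries of .../events/raw_syscalls/ from the listing above. */
	const char *names[] = { ".", "..", "enable", "filter",
				"sys_enter", "sys_exit" };
	unsigned int total = 0;

	for (unsigned int i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		unsigned int len = dirent64_reclen(names[i]);

		printf("%-10s %u bytes\n", names[i], len);
		total += len;
	}
	/* A couple of hundred bytes in total, comfortably inside a ~1kb
	 * buffer and far below glibc's 32kb minimum. */
	printf("total      %u bytes\n", total);
	return 0;
}
```

Even with the rounding, the whole directory fits in a few hundred bytes,
which is the gap between a roughly 1kb stack buffer and glibc's 32kb minimum
that the series exploits.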