[PATCH v2 0/7] Add io_dir to avoid memory overhead from opendir

Ian Rogers posted 7 patches 1 year ago
There is a newer version of this series
tools/lib/api/Makefile             |  2 +-
tools/lib/api/io_dir.h             | 91 ++++++++++++++++++++++++++++++
tools/perf/util/header.c           | 31 +++++-----
tools/perf/util/hwmon_pmu.c        | 42 ++++++--------
tools/perf/util/machine.c          | 19 +++----
tools/perf/util/parse-events.c     | 32 ++++++-----
tools/perf/util/pmu.c              | 46 +++++++--------
tools/perf/util/pmus.c             | 30 ++++------
tools/perf/util/synthetic-events.c | 22 ++++----
9 files changed, 194 insertions(+), 121 deletions(-)
create mode 100644 tools/lib/api/io_dir.h
[PATCH v2 0/7] Add io_dir to avoid memory overhead from opendir
Posted by Ian Rogers 1 year ago
glibc's opendir allocates a minimum of 32kb, when called recursively
for a directory tree the memory consumption can add up - nearly 300kb
during perf start-up when processing modules. Add a stack allocated
variant of readdir sized a little more than 1kb

v2: Remove the feature test and always use a perf supplied getdents64
    to workaround an Alpine Linux issue in v1:
    https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
    As suggested by Krzysztof Łopatowski
    <krzysztof.m.lopatowski@gmail.com> who also pointed to the perf
    trace performance improvements in start-up time eliminating stat
    calls can achieve:
    https://lore.kernel.org/lkml/20250206113314.335376-2-krzysztof.m.lopatowski@gmail.com/
    Convert parse-events and hwmon_pmu to use io_dir.
v1: This was previously part of the memory saving change set:
    https://lore.kernel.org/lkml/20231127220902.1315692-1-irogers@google.com/
    It is separated here and a feature check and syscall workaround
    for missing getdents64 added.

Ian Rogers (7):
  tools lib api: Add io_dir an allocation free readdir alternative
  perf maps: Switch modules tree walk to io_dir__readdir
  perf pmu: Switch to io_dir__readdir
  perf header: Switch mem topology to io_dir__readdir
  perf events: Remove scandir in thread synthesis
  perf parse-events: Switch tracepoints to io_dir__readdir
  perf hwmon_pmu: Switch event discovery to io_dir__readdir

 tools/lib/api/Makefile             |  2 +-
 tools/lib/api/io_dir.h             | 91 ++++++++++++++++++++++++++++++
 tools/perf/util/header.c           | 31 +++++-----
 tools/perf/util/hwmon_pmu.c        | 42 ++++++--------
 tools/perf/util/machine.c          | 19 +++----
 tools/perf/util/parse-events.c     | 32 ++++++-----
 tools/perf/util/pmu.c              | 46 +++++++--------
 tools/perf/util/pmus.c             | 30 ++++------
 tools/perf/util/synthetic-events.c | 22 ++++----
 9 files changed, 194 insertions(+), 121 deletions(-)
 create mode 100644 tools/lib/api/io_dir.h

-- 
2.48.1.502.g6dc24dfdaf-goog
Re: [PATCH v2 0/7] Add io_dir to avoid memory overhead from opendir
Posted by Namhyung Kim 11 months, 3 weeks ago
On Fri, Feb 07, 2025 at 03:24:41PM -0800, Ian Rogers wrote:
> glibc's opendir allocates a minimum of 32kb, when called recursively
> for a directory tree the memory consumption can add up - nearly 300kb
> during perf start-up when processing modules. Add a stack allocated
> variant of readdir sized a little more than 1kb
> 
> v2: Remove the feature test and always use a perf supplied getdents64
>     to workaround an Alpine Linux issue in v1:
>     https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
>     As suggested by Krzysztof Łopatowski
>     <krzysztof.m.lopatowski@gmail.com> who also pointed to the perf
>     trace performance improvements in start-up time eliminating stat
>     calls can achieve:
>     https://lore.kernel.org/lkml/20250206113314.335376-2-krzysztof.m.lopatowski@gmail.com/

Let me pick up Krzysztof's patch first.

Thanks,
Namhyung


>     Convert parse-events and hwmon_pmu to use io_dir.
> v1: This was previously part of the memory saving change set:
>     https://lore.kernel.org/lkml/20231127220902.1315692-1-irogers@google.com/
>     It is separated here and a feature check and syscall workaround
>     for missing getdents64 added.
> 
> Ian Rogers (7):
>   tools lib api: Add io_dir an allocation free readdir alternative
>   perf maps: Switch modules tree walk to io_dir__readdir
>   perf pmu: Switch to io_dir__readdir
>   perf header: Switch mem topology to io_dir__readdir
>   perf events: Remove scandir in thread synthesis
>   perf parse-events: Switch tracepoints to io_dir__readdir
>   perf hwmon_pmu: Switch event discovery to io_dir__readdir
> 
>  tools/lib/api/Makefile             |  2 +-
>  tools/lib/api/io_dir.h             | 91 ++++++++++++++++++++++++++++++
>  tools/perf/util/header.c           | 31 +++++-----
>  tools/perf/util/hwmon_pmu.c        | 42 ++++++--------
>  tools/perf/util/machine.c          | 19 +++----
>  tools/perf/util/parse-events.c     | 32 ++++++-----
>  tools/perf/util/pmu.c              | 46 +++++++--------
>  tools/perf/util/pmus.c             | 30 ++++------
>  tools/perf/util/synthetic-events.c | 22 ++++----
>  9 files changed, 194 insertions(+), 121 deletions(-)
>  create mode 100644 tools/lib/api/io_dir.h
> 
> -- 
> 2.48.1.502.g6dc24dfdaf-goog
> 
Re: [PATCH v2 0/7] Add io_dir to avoid memory overhead from opendir
Posted by David Laight 12 months ago
On Fri,  7 Feb 2025 15:24:41 -0800
Ian Rogers <irogers@google.com> wrote:

> glibc's opendir allocates a minimum of 32kb, when called recursively
> for a directory tree the memory consumption can add up - nearly 300kb
> during perf start-up when processing modules. Add a stack allocated
> variant of readdir sized a little more than 1kb

Does 300kB really matter?

You seem to be trading memory allocation against system calls.
My 'gut feel' is that the system call cost will dominate.
(Unless all the directories are small.)

There might, of course, be other code in glibc that is designed
to make it all slower than necessary.

	David
Re: [PATCH v2 0/7] Add io_dir to avoid memory overhead from opendir
Posted by Ian Rogers 12 months ago
On Sat, Feb 8, 2025 at 2:58 AM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Fri,  7 Feb 2025 15:24:41 -0800
> Ian Rogers <irogers@google.com> wrote:
>
> > glibc's opendir allocates a minimum of 32kb, when called recursively
> > for a directory tree the memory consumption can add up - nearly 300kb
> > during perf start-up when processing modules. Add a stack allocated
> > variant of readdir sized a little more than 1kb
>
> Does 300kB really matter?

For someone on a laptop, probably no. For a company running the
command a large number of times in consolidated and heavily loaded
data centers, where binaries run in memory constrained containers,
memory usage contributes to latency of the 99th percentile slow
executions, which because of the number of machines is something seen
all the time.

That said, do you need to optimize usages where you recurse over the
kernel modules? That's something you wouldn't try to do in a data
centre, so there isn't so much need.

> You seem to be trading memory allocation against system calls.
> My 'gut feel' is that the system call cost will dominate.
> (Unless all the directories are small.)

It isn't so much directories being small as the filenames within them.
getdents64 will vary the size of each entry based on the filename size
which can be up to 256 bytes. As an example, for trace points:
```
$ sudo ls -al /sys/kernel/debug/tracing/events/raw_syscalls/
total 0
drwxr-xr-x 1 root root 0 Feb  8 03:23 .
drwxr-xr-x 1 root root 0 Feb  8 03:20 ..
-rw-r----- 1 root root 0 Feb  8 03:23 enable
-rw-r----- 1 root root 0 Feb  8 03:23 filter
drwxr-xr-x 1 root root 0 Feb  8 03:23 sys_enter
drwxr-xr-x 1 root root 0 Feb  8 03:23 sys_exit
$ sudo ls -al /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter
total 0
drwxr-xr-x 1 root root 0 Feb  8 03:23 .
drwxr-xr-x 1 root root 0 Feb  8 03:23 ..
-rw-r----- 1 root root 0 Feb  8 03:23 enable
-rw-r----- 1 root root 0 Feb  8 03:23 filter
-r--r----- 1 root root 0 Feb  8 03:23 format
-r--r----- 1 root root 0 Feb  8 03:23 hist
-r--r----- 1 root root 0 Feb  8 03:23 id
-rw-r----- 1 root root 0 Feb  8 03:23 trigger
```
So each d_entry is going to cost you like 20 bytes plus the length of
the filename. So 6 entries with filenames varying up to 9 characters,
you need 24 or 32 bytes per entry after padding. If each entry were 32
bytes then you'd need a buffer to hold all the data of size 192 bytes,
the code here is using 1kb. The value used by glibc is between 32KB
and 1MB, which is pretty generous when all you need is 192 bytes.

An issue in perf is that we may recursively descend directories and
use opendir on each level, so the buffer size gets multiplied by the
depth of the tree. So at 100s of KB these memory allocations are some
of the largest seen in profiles of perf - BPF has some large
allocations too.

So why not fix opendir? I've found trying to get things landed in
glibc is slow to glacial [1], and someone once thought those buffer
sizes were good so who am I to argue? The LLVM libc and similar are
making similar smaller than glibc buffer size choices. The addition of
io_dir means we can tune for the use cases that matter to perf, such
as kernel modules, procfs, sysfs, .. in general we don't see
directories with lots of entries and we're a long way from the 256
byte file name max length. While the code could potentially increase
the number of system calls, in practice it doesn't.

Thanks,
Ian

[1] I sent patches bringing back the "stack:tid" convention in
/prod/pid/maps by naming stacks on allocation (an extra syscall). This
can be used, for example, to identify stack vs heap memory accesses.
Linux used to support this but Linus himself removed the support - the
code to do it had quadratic execution time due to searching for each
stack allocation the corresponding thread stack pointer. Fixing
opendir memory size and stack naming in glibc would be good things
imo. Were libc better we may not need this series (we still know our
use cases better than a generic implementation).

> There might, of course, be other code in glibc that is designed
> to make it all slower than necessary.
>
>         David