[PATCH] perf: Improve startup time by reducing unnecessary stat() calls

Krzysztof Łopatowski posted 1 patch 10 months, 1 week ago
tools/perf/util/machine.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
[PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Krzysztof Łopatowski 10 months, 1 week ago
When testing perf trace on NixOS, I noticed significant startup delays:
- `ls`: ~2ms
- `strace ls`: ~10ms
- `perf trace ls`: ~550ms

Profiling showed that 51% of the time is spent reading files,
26% in loading BPF programs, and 11% in `newfstatat`.

This patch optimizes module path exploration by avoiding `stat()` calls
unless necessary. For filesystems that do not implement `d_type`
(DT_UNKNOWN), it falls back to the old behavior.
See `readdir(3)` for details.

This reduces `perf trace ls` time to ~500ms.

A more thorough startup optimization based on command parameters would
be ideal, but that is a larger effort.

Signed-off-by: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com>
---
 tools/perf/util/machine.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2d51badfbf2e..76b857fd752b 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1354,7 +1354,7 @@ static int maps__set_module_path(struct maps *maps, const char *path, struct kmo
 
 static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, int depth)
 {
-	struct dirent *dent;
+	const struct dirent *dent;
 	DIR *dir = opendir(dir_name);
 	int ret = 0;
 
@@ -1365,14 +1365,20 @@ static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, i
 
 	while ((dent = readdir(dir)) != NULL) {
 		char path[PATH_MAX];
-		struct stat st;
+		unsigned char d_type = dent->d_type;
 
-		/*sshfs might return bad dent->d_type, so we have to stat*/
 		path__join(path, sizeof(path), dir_name, dent->d_name);
-		if (stat(path, &st))
-			continue;
 
-		if (S_ISDIR(st.st_mode)) {
+		if (d_type == DT_UNKNOWN) {
+			struct stat st;
+
+			if (stat(path, &st))
+				continue;
+			if (S_ISDIR(st.st_mode))
+				d_type = DT_DIR;
+		}
+
+		if (d_type == DT_DIR) {
 			if (!strcmp(dent->d_name, ".") ||
 			    !strcmp(dent->d_name, ".."))
 				continue;
-- 
2.47.2

Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Namhyung Kim 9 months, 3 weeks ago
On Thu, 06 Feb 2025 12:33:15 +0100, Krzysztof Łopatowski wrote:
> When testing perf trace on NixOS, I noticed significant startup delays:
> - `ls`: ~2ms
> - `strace ls`: ~10ms
> - `perf trace ls`: ~550ms
> 
> Profiling showed that 51% of the time is spent reading files,
> 26% in loading BPF programs, and 11% in `newfstatat`.
> 
> [...]
Applied to perf-tools-next, thanks!

Best regards,
Namhyung


Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Howard Chu 10 months, 1 week ago
Hello Krzysztof,

On Thu, Feb 6, 2025 at 3:36 AM Krzysztof Łopatowski
<krzysztof.m.lopatowski@gmail.com> wrote:
>
> When testing perf trace on NixOS, I noticed significant startup delays:
> - `ls`: ~2ms
> - `strace ls`: ~10ms
> - `perf trace ls`: ~550ms

Thank you so much for this insight.

>
> Profiling showed that 51% of the time is spent reading files,
> 26% in loading BPF programs, and 11% in `newfstatat`.

That's a valuable piece of info, thank you.

>
> This patch optimizes module path exploration by avoiding `stat()` calls
> unless necessary. For filesystems that do not implement `d_type`
> (DT_UNKNOWN), it falls back to the old behavior.
> See `readdir(3)` for details.
>
> This reduces `perf trace ls` time to ~500ms.

Before:

perf $ time sudo ./perf trace -a --max-events=1 -S 2 >&1 1>/dev/null
         ? (         ): :2278675/2278675  ... [continued]: read())
                                        = 1
real    0m0.833s

So 833ms.

After:

perf $ time sudo ./perf trace -a --max-events=1 -S 2 >&1 1>/dev/null
         ? (         ): :2274650/2274650  ... [2: No such file or directory
<SNIP>
real    0m0.805s
user    0m0.004s
sys     0m0.011s

805ms.

And the results are consistent. Apparently works for me, thank you :)

>
> A more thorough startup optimization based on command parameters would
> be ideal, but that is a larger effort.
>
> Signed-off-by: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com>
> ---
>  tools/perf/util/machine.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 2d51badfbf2e..76b857fd752b 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1354,7 +1354,7 @@ static int maps__set_module_path(struct maps *maps, const char *path, struct kmo
>
>  static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, int depth)
>  {
> -       struct dirent *dent;
> +       const struct dirent *dent;

Is it necessary to constify dent?

>         DIR *dir = opendir(dir_name);
>         int ret = 0;
>
> @@ -1365,14 +1365,20 @@ static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, i
>
>         while ((dent = readdir(dir)) != NULL) {
>                 char path[PATH_MAX];
> -               struct stat st;
> +               unsigned char d_type = dent->d_type;
>
> -               /*sshfs might return bad dent->d_type, so we have to stat*/
>                 path__join(path, sizeof(path), dir_name, dent->d_name);
> -               if (stat(path, &st))
> -                       continue;
>
> -               if (S_ISDIR(st.st_mode)) {
> +               if (d_type == DT_UNKNOWN) {
> +                       struct stat st;
> +
> +                       if (stat(path, &st))
> +                               continue;
> +                       if (S_ISDIR(st.st_mode))
> +                               d_type = DT_DIR;
> +               }
> +
> +               if (d_type == DT_DIR) {
>                         if (!strcmp(dent->d_name, ".") ||
>                             !strcmp(dent->d_name, ".."))
>                                 continue;
> --
> 2.47.2
>
>

Acked-by: Howard Chu <howardchu95@gmail.com>

Thanks,
Howard
Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Krzysztof Łopatowski 10 months, 1 week ago
Hi Howard,

> Is it necessary to constify dent?

It's not necessary. I have used it because I wasn't sure what libc would
do when we wrote to that memory.

Thanks,
Krzysztof
Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Ian Rogers 10 months, 1 week ago
On Thu, Feb 6, 2025 at 3:35 AM Krzysztof Łopatowski
<krzysztof.m.lopatowski@gmail.com> wrote:
>
> When testing perf trace on NixOS, I noticed significant startup delays:
> - `ls`: ~2ms
> - `strace ls`: ~10ms
> - `perf trace ls`: ~550ms
>
> Profiling showed that 51% of the time is spent reading files,
> 26% in loading BPF programs, and 11% in `newfstatat`.
>
> This patch optimizes module path exploration by avoiding `stat()` calls
> unless necessary. For filesystems that do not implement `d_type`
> (DT_UNKNOWN), it falls back to the old behavior.
> See `readdir(3)` for details.
>
> This reduces `perf trace ls` time to ~500ms.
>
> A more thorough startup optimization based on command parameters would
> be ideal, but that is a larger effort.


 Hi Krzysztof,

Thanks for the contribution! I did a series and a new io_dir set of
primitives. The last version of which is:
https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
That did something very similar along with mainly memory usage
optimizations. In patch2:
```
+static inline bool io_dir__is_dir(const struct io_dir *iod, struct
io_dirent64 *dent)
+{
+ if (dent->d_type == DT_UNKNOWN) {
+ struct stat st;
+
+ if (fstatat(iod->dirfd, dent->d_name, &st, /*flags=*/0))
+ return false;
+
+ if (S_ISDIR(st.st_mode)) {
+ dent->d_type = DT_DIR;
+ return true;
+ }
+ }
+ return dent->d_type == DT_DIR;
+}
```
I stopped pursuing the series as the maintainers were complaining
about unpopular libcs/platforms missing system call definitions
(getdents) and the series breaking on those platforms. I tried to go
the usual feature testing route, etc. but we seemed to have entered
into wac-a-mole wrt those libcs/platforms and my patience had worn
thin. I carry the changes in Google's tree where the libc/platform
issue isn't a concern.

I mention this as I think that series may be a better route than this
change as it solves a little bit more of the performance issue. I can
and do rebase the changes for Google in the tree:
https://github.com/googleprodkernel/linux-perf
I don't mind this patch as an expedient, obvious performance win.

Thanks,
Ian
Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Krzysztof Łopatowski 10 months, 1 week ago
Hi Ian,
Thank you for taking the time to look into this.

> I did a series and a new io_dir set of primitives.
> The last version of which is:
> https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
> I mention this as I think that series may be a better route than this
> change as it solves a little bit more of the performance issue.

I'd much prefer to have your solution merged, as it covers more instances
of the same directory exploration pattern and provides an explicit
approach to memory allocation.

> I stopped pursuing the series as the maintainers were complaining
> about unpopular libcs/platforms missing system call definitions
> (getdents) and the series breaking on those platforms.

Yeah, I agree. I also don't think doing an #undef because of muslc is a
good approach. Would you and Namhyung be open to bypassing libc and
calling SYS_getdents64 directly instead?

Best,
Krzysztof
Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
Posted by Namhyung Kim 10 months ago
Hello,

On Thu, Feb 06, 2025 at 10:45:02PM +0100, Krzysztof Łopatowski wrote:
> Hi Ian,
> Thank you for taking the time to look into this.
> 
> > I did a series and a new io_dir set of primitives.
> > The last version of which is:
> > https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
> > I mention this as I think that series may be a better route than this
> > change as it solves a little bit more of the performance issue.
> 
> I'd much prefer to have your solution merged, as it covers more instances
> of the same directory exploration pattern and provides an explicit
> approach to memory allocation.
> 
> > I stopped pursuing the series as the maintainers were complaining
> > about unpopular libcs/platforms missing system call definitions
> > (getdents) and the series breaking on those platforms.
> 
> Yeah, I agree. I also don't think doing an #undef because of muslc is a
> good approach. Would you and Namhyung be open to bypassing libc and
> calling SYS_getdents64 directly instead?

Yep, I'm ok with that.

Thanks,
Namhyung