linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
@ 2025-02-06 11:33 Krzysztof Łopatowski
  2025-02-06 17:54 ` Ian Rogers
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Krzysztof Łopatowski @ 2025-02-06 11:33 UTC (permalink / raw)
  To: namhyung, acme, irogers
  Cc: linux-perf-users, linux-kernel, Krzysztof Łopatowski

When testing perf trace on NixOS, I noticed significant startup delays:
- `ls`: ~2ms
- `strace ls`: ~10ms
- `perf trace ls`: ~550ms

Profiling showed that 51% of the time is spent reading files,
26% in loading BPF programs, and 11% in `newfstatat`.

This patch optimizes module path exploration by avoiding `stat()` calls
unless necessary. For filesystems that do not implement `d_type`
(DT_UNKNOWN), it falls back to the old behavior.
See `readdir(3)` for details.

This reduces `perf trace ls` time to ~500ms.

A more thorough startup optimization based on command parameters would
be ideal, but that is a larger effort.

Signed-off-by: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com>
---
 tools/perf/util/machine.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 2d51badfbf2e..76b857fd752b 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1354,7 +1354,7 @@ static int maps__set_module_path(struct maps *maps, const char *path, struct kmo
 
 static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, int depth)
 {
-	struct dirent *dent;
+	const struct dirent *dent;
 	DIR *dir = opendir(dir_name);
 	int ret = 0;
 
@@ -1365,14 +1365,20 @@ static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, i
 
 	while ((dent = readdir(dir)) != NULL) {
 		char path[PATH_MAX];
-		struct stat st;
+		unsigned char d_type = dent->d_type;
 
-		/*sshfs might return bad dent->d_type, so we have to stat*/
 		path__join(path, sizeof(path), dir_name, dent->d_name);
-		if (stat(path, &st))
-			continue;
 
-		if (S_ISDIR(st.st_mode)) {
+		if (d_type == DT_UNKNOWN) {
+			struct stat st;
+
+			if (stat(path, &st))
+				continue;
+			if (S_ISDIR(st.st_mode))
+				d_type = DT_DIR;
+		}
+
+		if (d_type == DT_DIR) {
 			if (!strcmp(dent->d_name, ".") ||
 			    !strcmp(dent->d_name, ".."))
 				continue;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
  2025-02-06 11:33 [PATCH] perf: Improve startup time by reducing unnecessary stat() calls Krzysztof Łopatowski
@ 2025-02-06 17:54 ` Ian Rogers
  2025-02-06 21:45   ` Krzysztof Łopatowski
  2025-02-07  1:04 ` Howard Chu
  2025-02-20 18:26 ` Namhyung Kim
  2 siblings, 1 reply; 7+ messages in thread
From: Ian Rogers @ 2025-02-06 17:54 UTC (permalink / raw)
  To: Krzysztof Łopatowski; +Cc: namhyung, acme, linux-perf-users, linux-kernel

On Thu, Feb 6, 2025 at 3:35 AM Krzysztof Łopatowski
<krzysztof.m.lopatowski@gmail.com> wrote:
>
> When testing perf trace on NixOS, I noticed significant startup delays:
> - `ls`: ~2ms
> - `strace ls`: ~10ms
> - `perf trace ls`: ~550ms
>
> Profiling showed that 51% of the time is spent reading files,
> 26% in loading BPF programs, and 11% in `newfstatat`.
>
> This patch optimizes module path exploration by avoiding `stat()` calls
> unless necessary. For filesystems that do not implement `d_type`
> (DT_UNKNOWN), it falls back to the old behavior.
> See `readdir(3)` for details.
>
> This reduces `perf trace ls` time to ~500ms.
>
> A more thorough startup optimization based on command parameters would
> be ideal, but that is a larger effort.


 Hi Krzysztof,

Thanks for the contribution! I did a series and a new io_dir set of
primitives. The last version of which is:
https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
That did something very similar along with mainly memory usage
optimizations. In patch2:
```
+static inline bool io_dir__is_dir(const struct io_dir *iod, struct
io_dirent64 *dent)
+{
+ if (dent->d_type == DT_UNKNOWN) {
+ struct stat st;
+
+ if (fstatat(iod->dirfd, dent->d_name, &st, /*flags=*/0))
+ return false;
+
+ if (S_ISDIR(st.st_mode)) {
+ dent->d_type = DT_DIR;
+ return true;
+ }
+ }
+ return dent->d_type == DT_DIR;
+}
```
I stopped pursuing the series as the maintainers were complaining
about unpopular libcs/platforms missing system call definitions
(getdents) and the series breaking on those platforms. I tried to go
the usual feature testing route, etc. but we seemed to have entered
into wac-a-mole wrt those libcs/platforms and my patience had worn
thin. I carry the changes in Google's tree where the libc/platform
issue isn't a concern.

I mention this as I think that series may be a better route than this
change as it solves a little bit more of the performance issue. I can
and do rebase the changes for Google in the tree:
https://github.com/googleprodkernel/linux-perf
I don't mind this patch as an expedient, obvious performance win.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
  2025-02-06 17:54 ` Ian Rogers
@ 2025-02-06 21:45   ` Krzysztof Łopatowski
  2025-02-12  2:10     ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Krzysztof Łopatowski @ 2025-02-06 21:45 UTC (permalink / raw)
  To: Ian Rogers; +Cc: namhyung, acme, linux-perf-users, linux-kernel

Hi Ian,
Thank you for taking the time to look into this.

> I did a series and a new io_dir set of primitives.
> The last version of which is:
> https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
> I mention this as I think that series may be a better route than this
> change as it solves a little bit more of the performance issue.

I'd much prefer to have your solution merged, as it covers more instances
of the same directory exploration pattern and provides an explicit
approach to memory allocation.

> I stopped pursuing the series as the maintainers were complaining
> about unpopular libcs/platforms missing system call definitions
> (getdents) and the series breaking on those platforms.

Yeah, I agree. I also don't think doing an #undef because of muslc is a
good approach. Would you and Namhyung be open to bypassing libc and
calling SYS_getdents64 directly instead?

Best,
Krzysztof

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
  2025-02-06 11:33 [PATCH] perf: Improve startup time by reducing unnecessary stat() calls Krzysztof Łopatowski
  2025-02-06 17:54 ` Ian Rogers
@ 2025-02-07  1:04 ` Howard Chu
  2025-02-07  8:31   ` Krzysztof Łopatowski
  2025-02-20 18:26 ` Namhyung Kim
  2 siblings, 1 reply; 7+ messages in thread
From: Howard Chu @ 2025-02-07  1:04 UTC (permalink / raw)
  To: Krzysztof Łopatowski
  Cc: namhyung, acme, irogers, linux-perf-users, linux-kernel

Hello Krzysztof,

On Thu, Feb 6, 2025 at 3:36 AM Krzysztof Łopatowski
<krzysztof.m.lopatowski@gmail.com> wrote:
>
> When testing perf trace on NixOS, I noticed significant startup delays:
> - `ls`: ~2ms
> - `strace ls`: ~10ms
> - `perf trace ls`: ~550ms

Thank you so much for this insight.

>
> Profiling showed that 51% of the time is spent reading files,
> 26% in loading BPF programs, and 11% in `newfstatat`.

That's a valuable piece of info, thank you.

>
> This patch optimizes module path exploration by avoiding `stat()` calls
> unless necessary. For filesystems that do not implement `d_type`
> (DT_UNKNOWN), it falls back to the old behavior.
> See `readdir(3)` for details.
>
> This reduces `perf trace ls` time to ~500ms.

Before:

perf $ time sudo ./perf trace -a --max-events=1 -S 2 >&1 1>/dev/null
         ? (         ): :2278675/2278675  ... [continued]: read())
                                        = 1
real    0m0.833s

So 833ms.

After:

perf $ time sudo ./perf trace -a --max-events=1 -S 2 >&1 1>/dev/null
         ? (         ): :2274650/2274650  ... [2: No such file or directory
<SNIP>
real    0m0.805s
user    0m0.004s
sys     0m0.011s

805ms.

And the results are consistent. Apparently works for me, thank you :)

>
> A more thorough startup optimization based on command parameters would
> be ideal, but that is a larger effort.
>
> Signed-off-by: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com>
> ---
>  tools/perf/util/machine.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 2d51badfbf2e..76b857fd752b 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1354,7 +1354,7 @@ static int maps__set_module_path(struct maps *maps, const char *path, struct kmo
>
>  static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, int depth)
>  {
> -       struct dirent *dent;
> +       const struct dirent *dent;

Is it necessary to constify dent?

>         DIR *dir = opendir(dir_name);
>         int ret = 0;
>
> @@ -1365,14 +1365,20 @@ static int maps__set_modules_path_dir(struct maps *maps, const char *dir_name, i
>
>         while ((dent = readdir(dir)) != NULL) {
>                 char path[PATH_MAX];
> -               struct stat st;
> +               unsigned char d_type = dent->d_type;
>
> -               /*sshfs might return bad dent->d_type, so we have to stat*/
>                 path__join(path, sizeof(path), dir_name, dent->d_name);
> -               if (stat(path, &st))
> -                       continue;
>
> -               if (S_ISDIR(st.st_mode)) {
> +               if (d_type == DT_UNKNOWN) {
> +                       struct stat st;
> +
> +                       if (stat(path, &st))
> +                               continue;
> +                       if (S_ISDIR(st.st_mode))
> +                               d_type = DT_DIR;
> +               }
> +
> +               if (d_type == DT_DIR) {
>                         if (!strcmp(dent->d_name, ".") ||
>                             !strcmp(dent->d_name, ".."))
>                                 continue;
> --
> 2.47.2
>
>

Acked-by: Howard Chu <howardchu95@gmail.com>

Thanks,
Howard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
  2025-02-07  1:04 ` Howard Chu
@ 2025-02-07  8:31   ` Krzysztof Łopatowski
  0 siblings, 0 replies; 7+ messages in thread
From: Krzysztof Łopatowski @ 2025-02-07  8:31 UTC (permalink / raw)
  To: Howard Chu; +Cc: namhyung, acme, irogers, linux-perf-users, linux-kernel

Hi Howard,

> Is it necessary to constify dent?

It's not necessary. I have used it because I wasn't sure what libc would
do when we wrote to that memory.

Thanks,
Krzysztof

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
  2025-02-06 21:45   ` Krzysztof Łopatowski
@ 2025-02-12  2:10     ` Namhyung Kim
  0 siblings, 0 replies; 7+ messages in thread
From: Namhyung Kim @ 2025-02-12  2:10 UTC (permalink / raw)
  To: Krzysztof Łopatowski
  Cc: Ian Rogers, acme, linux-perf-users, linux-kernel

Hello,

On Thu, Feb 06, 2025 at 10:45:02PM +0100, Krzysztof Łopatowski wrote:
> Hi Ian,
> Thank you for taking the time to look into this.
> 
> > I did a series and a new io_dir set of primitives.
> > The last version of which is:
> > https://lore.kernel.org/lkml/20231207050433.1426834-1-irogers@google.com/
> > I mention this as I think that series may be a better route than this
> > change as it solves a little bit more of the performance issue.
> 
> I'd much prefer to have your solution merged, as it covers more instances
> of the same directory exploration pattern and provides an explicit
> approach to memory allocation.
> 
> > I stopped pursuing the series as the maintainers were complaining
> > about unpopular libcs/platforms missing system call definitions
> > (getdents) and the series breaking on those platforms.
> 
> Yeah, I agree. I also don't think doing an #undef because of muslc is a
> good approach. Would you and Namhyung be open to bypassing libc and
> calling SYS_getdents64 directly instead?

Yep, I'm ok with that.

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf: Improve startup time by reducing unnecessary stat() calls
  2025-02-06 11:33 [PATCH] perf: Improve startup time by reducing unnecessary stat() calls Krzysztof Łopatowski
  2025-02-06 17:54 ` Ian Rogers
  2025-02-07  1:04 ` Howard Chu
@ 2025-02-20 18:26 ` Namhyung Kim
  2 siblings, 0 replies; 7+ messages in thread
From: Namhyung Kim @ 2025-02-20 18:26 UTC (permalink / raw)
  To: acme, irogers, Krzysztof Łopatowski; +Cc: linux-perf-users, linux-kernel

On Thu, 06 Feb 2025 12:33:15 +0100, Krzysztof Łopatowski wrote:
> When testing perf trace on NixOS, I noticed significant startup delays:
> - `ls`: ~2ms
> - `strace ls`: ~10ms
> - `perf trace ls`: ~550ms
> 
> Profiling showed that 51% of the time is spent reading files,
> 26% in loading BPF programs, and 11% in `newfstatat`.
> 
> [...]
Applied to perf-tools-next, thanks!

Best regards,
Namhyung



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-02-20 18:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-06 11:33 [PATCH] perf: Improve startup time by reducing unnecessary stat() calls Krzysztof Łopatowski
2025-02-06 17:54 ` Ian Rogers
2025-02-06 21:45   ` Krzysztof Łopatowski
2025-02-12  2:10     ` Namhyung Kim
2025-02-07  1:04 ` Howard Chu
2025-02-07  8:31   ` Krzysztof Łopatowski
2025-02-20 18:26 ` Namhyung Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).