* [PATCH v1 0/4] Perf tool LTO support
@ 2023-07-24 20:12 Ian Rogers
  2023-07-24 20:12 ` [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config Ian Rogers
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 20:12 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Nathan Chancellor, Nick Desaulniers,
	Tom Rix, Kan Liang, Yang Jihong, Ravi Bangoria, Carsten Haitzler,
	Zhengjun Xing, James Clark, linux-perf-users, linux-kernel, bpf,
	llvm
  Cc: maskray
Add a build flag, LTO=1, so that perf is built with the -flto
flag. Address some build errors this configuration throws up.
For me on my Debian-derived OS, "CC=clang CXX=clang++ LD=ld.lld" works
fine. With GCC LTO this fails with:
```
lto-wrapper: warning: using serial compilation of 50 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more information
/usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel.ro+0x28): undefined reference to `memset_orig'
/usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel.ro+0x40): undefined reference to `__memset'
/usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel+0x28): undefined reference to `memcpy_orig'
/usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel+0x40): undefined reference to `__memcpy'
/usr/bin/ld: /tmp/ccK8kXAu.ltrans44.ltrans.o: in function `test__arch_unwind_sample':
/home/irogers/kernel.org/tools/perf/arch/x86/tests/dwarf-unwind.c:72: undefined reference to `perf_regs_load'
collect2: error: ld returned 1 exit status
```
The issue is that we build multiple .o files in a directory and then
link them into a .o with "ld -r" (cmd_ld_multi). This early link step
appears to trigger GCC to remove the .S file definition of the symbol
and break the later link step (the perf-in.o shows perf_regs_load, for
example, going from the text section to being undefined at the link
step which doesn't happen with clang or without LTO). It is possible
to work around this by taking the final perf link command and adding
the .o files generated from .S back into it, namely:
arch/x86/tests/regs_load.o
bench/mem-memset-x86-64-asm.o
bench/mem-memcpy-x86-64-asm.o
A quick performance check shows noticeable performance improvements
from LTO:
Non-LTO
```
$ perf bench internals synthesize
 # Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 202.216 usec (+- 0.160 usec)
  Average num. events: 51.000 (+- 0.000)
  Average time per event 3.965 usec
  Average data synthesis took: 230.875 usec (+- 0.285 usec)
  Average num. events: 271.000 (+- 0.000)
  Average time per event 0.852 usec
```
LTO
```
$ perf bench internals synthesize
 # Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
  Average synthesis took: 104.530 usec (+- 0.074 usec)
  Average num. events: 51.000 (+- 0.000)
  Average time per event 2.050 usec
  Average data synthesis took: 112.660 usec (+- 0.114 usec)
  Average num. events: 273.000 (+- 0.000)
  Average time per event 0.413 usec
```
Ian Rogers (4):
  perf stat: Avoid uninitialized use of perf_stat_config
  perf parse-events: Avoid use uninitialized warning
  perf test: Avoid weak symbol for arch_tests
  perf build: Add LTO build option
 tools/perf/Makefile.config      |  5 +++++
 tools/perf/tests/builtin-test.c | 11 ++++++++++-
 tools/perf/tests/stat.c         |  2 +-
 tools/perf/util/parse-events.c  |  2 +-
 tools/perf/util/stat.c          |  2 +-
 5 files changed, 18 insertions(+), 4 deletions(-)
-- 
2.41.0.487.g6d72f3e995-goog
* [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config
  2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
@ 2023-07-24 20:12 ` Ian Rogers
  2023-07-24 21:09   ` Nick Desaulniers
  2023-07-24 20:12 ` [PATCH v1 2/4] perf parse-events: Avoid use uninitialized warning Ian Rogers
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 20:12 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Nathan Chancellor, Nick Desaulniers,
	Tom Rix, Kan Liang, Yang Jihong, Ravi Bangoria, Carsten Haitzler,
	Zhengjun Xing, James Clark, linux-perf-users, linux-kernel, bpf,
	llvm
  Cc: maskray
perf_event__read_stat_config will assign values based on the number of
tags and tag values. Initialize the structs to zero before they are
assigned so that no uninitialized values can be seen.
This potential error was reported by GCC with LTO enabled.
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/tests/stat.c | 2 +-
 tools/perf/util/stat.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/tests/stat.c b/tools/perf/tests/stat.c
index 500974040fe3..706780fb5695 100644
--- a/tools/perf/tests/stat.c
+++ b/tools/perf/tests/stat.c
@@ -27,7 +27,7 @@ static int process_stat_config_event(struct perf_tool *tool __maybe_unused,
 				     struct machine *machine __maybe_unused)
 {
 	struct perf_record_stat_config *config = &event->stat_config;
-	struct perf_stat_config stat_config;
+	struct perf_stat_config stat_config = {};
 
 #define HAS(term, val) \
 	has_term(config, PERF_STAT_CONFIG_TERM__##term, val)
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 967e583392c7..ec3506042217 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -729,7 +729,7 @@ size_t perf_event__fprintf_stat_round(union perf_event *event, FILE *fp)
 
 size_t perf_event__fprintf_stat_config(union perf_event *event, FILE *fp)
 {
-	struct perf_stat_config sc;
+	struct perf_stat_config sc = {};
 	size_t ret;
 
 	perf_event__read_stat_config(&sc, &event->stat_config);
-- 
2.41.0.487.g6d72f3e995-goog
* [PATCH v1 2/4] perf parse-events: Avoid use uninitialized warning
  2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
  2023-07-24 20:12 ` [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config Ian Rogers
@ 2023-07-24 20:12 ` Ian Rogers
  2023-07-24 21:01   ` Nick Desaulniers
  2023-07-24 20:12 ` [PATCH v1 3/4] perf test: Avoid weak symbol for arch_tests Ian Rogers
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 20:12 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Nathan Chancellor, Nick Desaulniers,
	Tom Rix, Kan Liang, Yang Jihong, Ravi Bangoria, Carsten Haitzler,
	Zhengjun Xing, James Clark, linux-perf-users, linux-kernel, bpf,
	llvm
  Cc: maskray
With GCC LTO, a potential use of an uninitialized variable is reported:
```
In function ‘parse_events_config_bpf’,
    inlined from ‘parse_events_load_bpf’ at util/parse-events.c:874:8:
util/parse-events.c:792:37: error: ‘error_pos’ may be used uninitialized [-Werror=maybe-uninitialized]
  792 |                                 idx = term->err_term + error_pos;
      |                                     ^
util/parse-events.c: In function ‘parse_events_load_bpf’:
util/parse-events.c:765:13: note: ‘error_pos’ was declared here
  765 |         int error_pos;
      |             ^
```
So initialize at declaration.
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/parse-events.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index acde097e327c..da29061ecf49 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -762,7 +762,7 @@ parse_events_config_bpf(struct parse_events_state *parse_state,
 			struct list_head *head_config)
 {
 	struct parse_events_term *term;
-	int error_pos;
+	int error_pos = 0;
 
 	if (!head_config || list_empty(head_config))
 		return 0;
-- 
2.41.0.487.g6d72f3e995-goog
* [PATCH v1 3/4] perf test: Avoid weak symbol for arch_tests
  2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
  2023-07-24 20:12 ` [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config Ian Rogers
  2023-07-24 20:12 ` [PATCH v1 2/4] perf parse-events: Avoid use uninitialized warning Ian Rogers
@ 2023-07-24 20:12 ` Ian Rogers
  2023-07-24 20:12 ` [PATCH v1 4/4] perf build: Add LTO build option Ian Rogers
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 20:12 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Nathan Chancellor, Nick Desaulniers,
	Tom Rix, Kan Liang, Yang Jihong, Ravi Bangoria, Carsten Haitzler,
	Zhengjun Xing, James Clark, linux-perf-users, linux-kernel, bpf,
	llvm
  Cc: maskray
GCC LTO will complain that the array length varies for the arch_tests
weak symbol. Use extern/static and an architecture-determining #if to
work around this problem.
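A minimal sketch of the resulting pattern, assuming a hypothetical
HAVE_ARCH_TESTS macro in place of the real architecture checks and a
cut-down struct test_suite; count_suites() is an illustrative helper and
not part of perf:
```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for perf's struct test_suite (illustration only). */
struct test_suite {
	const char *desc;
};

/*
 * HAVE_ARCH_TESTS is a hypothetical macro standing in for the
 * architecture checks. When it is set, exactly one other object file is
 * expected to provide the definition of arch_tests[].
 */
#ifdef HAVE_ARCH_TESTS
extern struct test_suite *arch_tests[];
#else
/* Fallback: an empty, NULL-terminated list private to this file. */
static struct test_suite *arch_tests[] = {
	NULL,
};
#endif

/* Count the entries that precede the NULL terminator. */
static size_t count_suites(struct test_suite **suites)
{
	size_t n = 0;

	assert(suites != NULL);
	while (suites[n] != NULL)
		n++;
	return n;
}
```
With this shape there is a single definition per build, so no weak and
strong definitions with differing array lengths are left to be merged.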
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/tests/builtin-test.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 1f6557ce3b0a..5291fb5f54d7 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -33,9 +33,18 @@
 static bool dont_fork;
 const char *dso_to_test;
 
-struct test_suite *__weak arch_tests[] = {
+/*
+ * List of architecture specific tests. Not a weak symbol as the array length is
+ * dependent on the initialization, as such GCC with LTO complains of
+ * conflicting definitions with a weak symbol.
+ */
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__)
+extern struct test_suite *arch_tests[];
+#else
+static struct test_suite *arch_tests[] = {
 	NULL,
 };
+#endif
 
 static struct test_suite *generic_tests[] = {
 	&suite__vmlinux_matches_kallsyms,
-- 
2.41.0.487.g6d72f3e995-goog
* [PATCH v1 4/4] perf build: Add LTO build option
  2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
                   ` (2 preceding siblings ...)
  2023-07-24 20:12 ` [PATCH v1 3/4] perf test: Avoid weak symbol for arch_tests Ian Rogers
@ 2023-07-24 20:12 ` Ian Rogers
  2023-07-24 21:15 ` [PATCH v1 0/4] Perf tool LTO support Nick Desaulniers
  2023-07-24 21:29 ` Arnaldo Carvalho de Melo
  5 siblings, 0 replies; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 20:12 UTC
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Nathan Chancellor, Nick Desaulniers,
	Tom Rix, Kan Liang, Yang Jihong, Ravi Bangoria, Carsten Haitzler,
	Zhengjun Xing, James Clark, linux-perf-users, linux-kernel, bpf,
	llvm
  Cc: maskray
Add an LTO build option that sets the appropriate CFLAGS and CXXFLAGS
values.
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/Makefile.config | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index c5db0de49868..a9cfe83638a9 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -256,6 +256,11 @@ ifdef PARSER_DEBUG
   $(call detected_var,PARSER_DEBUG_FLEX)
 endif
 
+ifdef LTO
+  CORE_CFLAGS += -flto
+  CXXFLAGS += -flto
+endif
+
 # Try different combinations to accommodate systems that only have
 # python[2][3]-config in weird combinations in the following order of
 # priority from lowest to highest:
-- 
2.41.0.487.g6d72f3e995-goog
* Re: [PATCH v1 2/4] perf parse-events: Avoid use uninitialized warning
  2023-07-24 20:12 ` [PATCH v1 2/4] perf parse-events: Avoid use uninitialized warning Ian Rogers
@ 2023-07-24 21:01   ` Nick Desaulniers
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Desaulniers @ 2023-07-24 21:01 UTC
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Tom Rix, Kan Liang, Yang Jihong,
	Ravi Bangoria, Carsten Haitzler, Zhengjun Xing, James Clark,
	linux-perf-users, linux-kernel, bpf, llvm, maskray
On Mon, Jul 24, 2023 at 1:13 PM Ian Rogers <irogers@google.com> wrote:
>
> With GCC LTO, a potential use of an uninitialized variable is reported:
> ```
> In function ‘parse_events_config_bpf’,
>     inlined from ‘parse_events_load_bpf’ at util/parse-events.c:874:8:
> util/parse-events.c:792:37: error: ‘error_pos’ may be used uninitialized [-Werror=maybe-uninitialized]
>   792 |                                 idx = term->err_term + error_pos;
>       |                                     ^
> util/parse-events.c: In function ‘parse_events_load_bpf’:
> util/parse-events.c:765:13: note: ‘error_pos’ was declared here
>   765 |         int error_pos;
>       |             ^
> ```
> So initialize at declaration.
This common pattern in C is error-prone (conditional assignment in the
callee; callers may forget to initialize, then unconditionally use
the value). Clang's static analyzer can spot these, but isn't run for
tools/ AFAIK.
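A minimal sketch of that pattern, using hypothetical names
(find_bad_char, report_offset) that are not taken from perf:
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical callee: writes *error_pos only on the failure path, so a
 * caller that skips initialization may later read an indeterminate value.
 */
static int find_bad_char(const char *s, int *error_pos)
{
	assert(s != NULL && error_pos != NULL);
	for (int i = 0; s[i] != '\0'; i++) {
		if (s[i] == '!') {
			*error_pos = i;	/* the only assignment */
			return -1;
		}
	}
	return 0;	/* success: *error_pos is left untouched */
}

/* Caller that reads the out-parameter unconditionally. */
static int report_offset(const char *s, int base)
{
	int error_pos = 0;	/* without "= 0" the read below may see garbage */

	find_bad_char(s, &error_pos);
	return base + error_pos;
}
```
Initializing at the declaration, as the patch does, makes the
unconditional read well defined whichever path the callee takes.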
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/parse-events.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index acde097e327c..da29061ecf49 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -762,7 +762,7 @@ parse_events_config_bpf(struct parse_events_state *parse_state,
>                         struct list_head *head_config)
>  {
>         struct parse_events_term *term;
> -       int error_pos;
> +       int error_pos = 0;
>
>         if (!head_config || list_empty(head_config))
>                 return 0;
> --
> 2.41.0.487.g6d72f3e995-goog
>
-- 
Thanks,
~Nick Desaulniers
* Re: [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config
  2023-07-24 20:12 ` [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config Ian Rogers
@ 2023-07-24 21:09   ` Nick Desaulniers
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Desaulniers @ 2023-07-24 21:09 UTC
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Tom Rix, Kan Liang, Yang Jihong,
	Ravi Bangoria, Carsten Haitzler, Zhengjun Xing, James Clark,
	linux-perf-users, linux-kernel, bpf, llvm, maskray
On Mon, Jul 24, 2023 at 1:12 PM Ian Rogers <irogers@google.com> wrote:
>
> perf_event__read_stat_config will assign values based on the number of
> tags and tag values. Initialize the structs to zero before they are
> assigned so that no uninitialized values can be seen.
>
> This potential error was reported by GCC with LTO enabled.
>
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/tests/stat.c | 2 +-
>  tools/perf/util/stat.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/tests/stat.c b/tools/perf/tests/stat.c
> index 500974040fe3..706780fb5695 100644
> --- a/tools/perf/tests/stat.c
> +++ b/tools/perf/tests/stat.c
> @@ -27,7 +27,7 @@ static int process_stat_config_event(struct perf_tool *tool __maybe_unused,
>                                      struct machine *machine __maybe_unused)
>  {
>         struct perf_record_stat_config *config = &event->stat_config;
> -       struct perf_stat_config stat_config;
> +       struct perf_stat_config stat_config = {};
^ how did this code ever work?
1. stat_config is not initialized
2. perf_event__read_stat_config maybe assigns to &stat_config->__val
3. process_stat_config_event() tests other members of stat_config
I hope I've missed something obvious.
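To make the three steps concrete, here is a cut-down sketch. The struct,
the tags and read_demo_config() are hypothetical stand-ins, and the real
perf_event__read_stat_config may differ in detail:
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down config; the field names are illustrative. */
struct demo_config {
	int aggr_mode;
	int scale;
	int interval;
};

enum demo_tag { DEMO_TAG_AGGR_MODE, DEMO_TAG_SCALE, DEMO_TAG_INTERVAL };

struct demo_term {
	enum demo_tag tag;
	int val;
};

/* Assigns only the fields whose tags appear in terms[]; others are untouched. */
static void read_demo_config(struct demo_config *cfg,
			     const struct demo_term *terms, size_t nr)
{
	assert(cfg != NULL && (terms != NULL || nr == 0));
	for (size_t i = 0; i < nr; i++) {
		switch (terms[i].tag) {
		case DEMO_TAG_AGGR_MODE:
			cfg->aggr_mode = terms[i].val;
			break;
		case DEMO_TAG_SCALE:
			cfg->scale = terms[i].val;
			break;
		case DEMO_TAG_INTERVAL:
			cfg->interval = terms[i].val;
			break;
		}
	}
}
```
If the caller declares the struct with `= {0}` (or the GNU/C23 `= {}`
form used in the patch), fields whose tags are absent read as zero
rather than as indeterminate stack contents.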
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
>
>  #define HAS(term, val) \
>         has_term(config, PERF_STAT_CONFIG_TERM__##term, val)
> diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
> index 967e583392c7..ec3506042217 100644
> --- a/tools/perf/util/stat.c
> +++ b/tools/perf/util/stat.c
> @@ -729,7 +729,7 @@ size_t perf_event__fprintf_stat_round(union perf_event *event, FILE *fp)
>
>  size_t perf_event__fprintf_stat_config(union perf_event *event, FILE *fp)
>  {
> -       struct perf_stat_config sc;
> +       struct perf_stat_config sc = {};
>         size_t ret;
>
>         perf_event__read_stat_config(&sc, &event->stat_config);
> --
> 2.41.0.487.g6d72f3e995-goog
>
-- 
Thanks,
~Nick Desaulniers
* Re: [PATCH v1 0/4] Perf tool LTO support
  2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
                   ` (3 preceding siblings ...)
  2023-07-24 20:12 ` [PATCH v1 4/4] perf build: Add LTO build option Ian Rogers
@ 2023-07-24 21:15 ` Nick Desaulniers
  2023-07-24 21:48   ` Ian Rogers
  2023-07-24 21:29 ` Arnaldo Carvalho de Melo
  5 siblings, 1 reply; 12+ messages in thread
From: Nick Desaulniers @ 2023-07-24 21:15 UTC
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Tom Rix, Kan Liang, Yang Jihong,
	Ravi Bangoria, Carsten Haitzler, Zhengjun Xing, James Clark,
	linux-perf-users, linux-kernel, bpf, llvm, maskray
On Mon, Jul 24, 2023 at 1:12 PM Ian Rogers <irogers@google.com> wrote:
>
> Add a build flag, LTO=1, so that perf is built with the -flto
> flag. Address some build errors this configuration throws up.
Hi Ian,
Thanks for the performance numbers. Any sense of what the build time
numbers might look like for building perf with LTO?
Does `-flto=thin` in clang's case make a meaningful difference versus
`-flto`? I'd recommend that over "full LTO" `-flto` when the
performance difference of the result isn't too meaningful.  ThinLTO
should be faster to build, but I don't know that I've ever built perf,
so IDK what to expect.
-- 
Thanks,
~Nick Desaulniers
* Re: [PATCH v1 0/4] Perf tool LTO support
  2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
                   ` (4 preceding siblings ...)
  2023-07-24 21:15 ` [PATCH v1 0/4] Perf tool LTO support Nick Desaulniers
@ 2023-07-24 21:29 ` Arnaldo Carvalho de Melo
  5 siblings, 0 replies; 12+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-07-24 21:29 UTC
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Nathan Chancellor,
	Nick Desaulniers, Tom Rix, Kan Liang, Yang Jihong, Ravi Bangoria,
	Carsten Haitzler, Zhengjun Xing, James Clark, linux-perf-users,
	linux-kernel, bpf, llvm, maskray
Em Mon, Jul 24, 2023 at 01:12:43PM -0700, Ian Rogers escreveu:
> Add a build flag, LTO=1, so that perf is built with the -flto
> flag. Address some build errors this configuration throws up.
> 
> For me on my Debian-derived OS, "CC=clang CXX=clang++ LD=ld.lld" works
> fine. With GCC LTO this fails with:
> ```
> lto-wrapper: warning: using serial compilation of 50 LTRANS jobs
> lto-wrapper: note: see the ‘-flto’ option documentation for more information
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel.ro+0x28): undefined reference to `memset_orig'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel.ro+0x40): undefined reference to `__memset'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel+0x28): undefined reference to `memcpy_orig'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans10.ltrans.o:(.data.rel+0x40): undefined reference to `__memcpy'
> /usr/bin/ld: /tmp/ccK8kXAu.ltrans44.ltrans.o: in function `test__arch_unwind_sample':
> /home/irogers/kernel.org/tools/perf/arch/x86/tests/dwarf-unwind.c:72: undefined reference to `perf_regs_load'
> collect2: error: ld returned 1 exit status
> ```
> 
> The issue is that we build multiple .o files in a directory and then
> link them into a .o with "ld -r" (cmd_ld_multi). This early link step
> appears to trigger GCC to remove the .S file definition of the symbol
> and break the later link step (the perf-in.o shows perf_regs_load, for
> example, going from the text section to being undefined at the link
> step which doesn't happen with clang or without LTO). It is possible
> to work around this by taking the final perf link command and adding
> the .o files generated from .S back into it, namely:
> arch/x86/tests/regs_load.o
> bench/mem-memset-x86-64-asm.o
> bench/mem-memcpy-x86-64-asm.o
> 
> A quick performance check shows noticeable performance improvements
> from LTO:
> 
> Non-LTO
> ```
> $ perf bench internals synthesize
>  # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 202.216 usec (+- 0.160 usec)
>   Average num. events: 51.000 (+- 0.000)
>   Average time per event 3.965 usec
>   Average data synthesis took: 230.875 usec (+- 0.285 usec)
>   Average num. events: 271.000 (+- 0.000)
>   Average time per event 0.852 usec
> ```
> 
> LTO
> ```
> $ perf bench internals synthesize
>  # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>   Average synthesis took: 104.530 usec (+- 0.074 usec)
>   Average num. events: 51.000 (+- 0.000)
>   Average time per event 2.050 usec
>   Average data synthesis took: 112.660 usec (+- 0.114 usec)
>   Average num. events: 273.000 (+- 0.000)
>   Average time per event 0.413 usec
Cool stuff! Applied locally, test building now on the container suite.
- Arnaldo
> ```
> 
> Ian Rogers (4):
>   perf stat: Avoid uninitialized use of perf_stat_config
>   perf parse-events: Avoid use uninitialized warning
>   perf test: Avoid weak symbol for arch_tests
>   perf build: Add LTO build option
> 
>  tools/perf/Makefile.config      |  5 +++++
>  tools/perf/tests/builtin-test.c | 11 ++++++++++-
>  tools/perf/tests/stat.c         |  2 +-
>  tools/perf/util/parse-events.c  |  2 +-
>  tools/perf/util/stat.c          |  2 +-
>  5 files changed, 18 insertions(+), 4 deletions(-)
> 
> -- 
> 2.41.0.487.g6d72f3e995-goog
> 
-- 
- Arnaldo
* Re: [PATCH v1 0/4] Perf tool LTO support
  2023-07-24 21:15 ` [PATCH v1 0/4] Perf tool LTO support Nick Desaulniers
@ 2023-07-24 21:48   ` Ian Rogers
  2023-07-24 22:27     ` Nick Desaulniers
  0 siblings, 1 reply; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 21:48 UTC
  To: Nick Desaulniers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Tom Rix, Kan Liang, Yang Jihong,
	Ravi Bangoria, Carsten Haitzler, Zhengjun Xing, James Clark,
	linux-perf-users, linux-kernel, bpf, llvm, maskray
On Mon, Jul 24, 2023 at 2:15 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Mon, Jul 24, 2023 at 1:12 PM Ian Rogers <irogers@google.com> wrote:
> >
> > Add a build flag, LTO=1, so that perf is built with the -flto
> > flag. Address some build errors this configuration throws up.
>
> Hi Ian,
> Thanks for the performance numbers. Any sense of what the build time
> numbers might look like for building perf with LTO?
>
> Does `-flto=thin` in clang's case make a meaningful difference versus
> `-flto`? I'd recommend that over "full LTO" `-flto` when the
> performance difference of the result isn't too meaningful.  ThinLTO
> should be faster to build, but I don't know that I've ever built perf,
> so IDK what to expect.
Hi Nick,
I'm not sure how much the perf build will benefit from LTO to say
whether thin is good enough or not. Things like "perf record" are
designed to spend the majority of their time blocking on a poll system
call. We have benchmarks at least :-)
I grabbed some clang build times in an unscientific way on my loaded laptop:
no LTO
real    0m48.846s
user    3m11.452s
sys     0m29.598s

-flto=thin
real    0m55.910s
user    4m2.342s
sys     0m30.120s

real    0m50.330s
user    3m36.986s
sys     0m28.519s

-flto
real    1m12.002s
user    3m27.676s
sys     0m30.305s

real    1m5.187s
user    3m19.348s
sys     0m29.031s
So perhaps thin LTO increases total build time by 10%, whilst full LTO
increases it by 50%.
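For reference, the arithmetic on the wall-clock ("real") times can be
sketched as below. mean() and pct_increase() are illustrative helpers,
and averaging the two runs per configuration is an assumption about how
to summarize them:
```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be non-zero). */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(v != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Percentage increase of "after" relative to "before". */
static double pct_increase(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Wall-clock ("real") seconds from the runs above. */
static const double no_lto_real[] = { 48.846 };
static const double thin_real[]   = { 55.910, 50.330 };
static const double full_real[]   = { 72.002, 65.187 };
```
Averaged over both runs this comes to roughly 9% for thin LTO and
roughly 40% for full LTO; the slower full LTO run on its own is about
47% above the no-LTO time.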
Gathering some clang performance numbers:
no LTO
$ perf bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
 Average synthesis took: 178.694 usec (+- 0.171 usec)
 Average num. events: 52.000 (+- 0.000)
 Average time per event 3.436 usec
 Average data synthesis took: 194.545 usec (+- 0.088 usec)
 Average num. events: 277.000 (+- 0.000)
 Average time per event 0.702 usec
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
 Average synthesis took: 175.381 usec (+- 0.105 usec)
 Average num. events: 52.000 (+- 0.000)
 Average time per event 3.373 usec
 Average data synthesis took: 188.980 usec (+- 0.071 usec)
 Average num. events: 278.000 (+- 0.000)
 Average time per event 0.680 usec
-flto=thin
$ perf bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
 Average synthesis took: 183.122 usec (+- 0.082 usec)
 Average num. events: 52.000 (+- 0.000)
 Average time per event 3.522 usec
 Average data synthesis took: 196.468 usec (+- 0.102 usec)
 Average num. events: 277.000 (+- 0.000)
 Average time per event 0.709 usec
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
 Average synthesis took: 177.684 usec (+- 0.094 usec)
 Average num. events: 52.000 (+- 0.000)
 Average time per event 3.417 usec
 Average data synthesis took: 190.079 usec (+- 0.077 usec)
 Average num. events: 275.000 (+- 0.000)
 Average time per event 0.691 usec
-flto
$ perf bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
 Average synthesis took: 112.599 usec (+- 0.040 usec)
 Average num. events: 52.000 (+- 0.000)
 Average time per event 2.165 usec
 Average data synthesis took: 119.012 usec (+- 0.070 usec)
 Average num. events: 278.000 (+- 0.000)
 Average time per event 0.428 usec
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
 Average synthesis took: 107.606 usec (+- 0.147 usec)
 Average num. events: 52.000 (+- 0.000)
 Average time per event 2.069 usec
 Average data synthesis took: 114.633 usec (+- 0.159 usec)
 Average num. events: 279.000 (+- 0.000)
 Average time per event 0.411 usec
The performance win from thin LTO doesn't look to be there. Full LTO
appears to be reducing event synthesis time down to 60% of what it
was. The clang numbers are looking better than the GCC ones. I think
from this it makes sense to use -flto.
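The ratio behind that estimate, as a small sketch over the "Average
synthesis took" values of the clang runs above (remaining_fraction() is
an illustrative helper, and each configuration is averaged over its two
runs):
```c
#include <assert.h>

/* Fraction of the baseline time that remains (1.0 means no change). */
static double remaining_fraction(double baseline_usec, double new_usec)
{
	assert(baseline_usec > 0.0);
	return new_usec / baseline_usec;
}

/* "Average synthesis took" (usec), averaged over the two clang runs. */
static const double clang_no_lto_usec = (178.694 + 175.381) / 2.0;
static const double clang_thin_usec   = (183.122 + 177.684) / 2.0;
static const double clang_full_usec   = (112.599 + 107.606) / 2.0;
```
This gives about 0.62 for full LTO and about 1.02 for thin LTO, relative
to the no-LTO build.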
Thanks,
Ian
> --
> Thanks,
> ~Nick Desaulniers
* Re: [PATCH v1 0/4] Perf tool LTO support
  2023-07-24 21:48   ` Ian Rogers
@ 2023-07-24 22:27     ` Nick Desaulniers
  2023-07-24 22:38       ` Ian Rogers
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Desaulniers @ 2023-07-24 22:27 UTC
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Tom Rix, Kan Liang, Yang Jihong,
	Ravi Bangoria, Carsten Haitzler, Zhengjun Xing, James Clark,
	linux-perf-users, linux-kernel, bpf, llvm, maskray
On Mon, Jul 24, 2023 at 2:48 PM Ian Rogers <irogers@google.com> wrote:
>
> On Mon, Jul 24, 2023 at 2:15 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Mon, Jul 24, 2023 at 1:12 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > Add a build flag, LTO=1, so that perf is built with the -flto
> > > flag. Address some build errors this configuration throws up.
> >
> > Hi Ian,
> > Thanks for the performance numbers. Any sense of what the build time
> > numbers might look like for building perf with LTO?
> >
> > Does `-flto=thin` in clang's case make a meaningful difference versus
> > `-flto`? I'd recommend that over "full LTO" `-flto` when the
> > performance difference of the result isn't too meaningful.  ThinLTO
> > should be faster to build, but I don't know that I've ever built perf,
> > so IDK what to expect.
>
> Hi Nick,
>
> I'm not sure how much the perf build will benefit from LTO to say
> whether thin is good enough or not. Things like "perf record" are
> designed to spend the majority of their time blocking on a poll system
> call. We have benchmarks at least :-)
>
> I grabbed some clang build times in an unscientific way on my loaded laptop:
>
> no LTO
> real    0m48.846s
> user    3m11.452s
> sys     0m29.598s
>
> -flto=thin
> real    0m55.910s
> user    4m2.342s
> sys     0m30.120s
>
> real    0m50.330s
> user    3m36.986s
> sys     0m28.519s
>
> -flto
> real    1m12.002s
> user    3m27.676s
> sys     0m30.305s
>
> real    1m5.187s
> user    3m19.348s
> sys     0m29.031s
>
> So perhaps thin LTO increases total build time by 10%, whilst full LTO
> increases it by 50%.
>
> Gathering some clang performance numbers:
>
> no LTO
> $ perf bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>  Average synthesis took: 178.694 usec (+- 0.171 usec)
>  Average num. events: 52.000 (+- 0.000)
>  Average time per event 3.436 usec
>  Average data synthesis took: 194.545 usec (+- 0.088 usec)
>  Average num. events: 277.000 (+- 0.000)
>  Average time per event 0.702 usec
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>  Average synthesis took: 175.381 usec (+- 0.105 usec)
>  Average num. events: 52.000 (+- 0.000)
>  Average time per event 3.373 usec
>  Average data synthesis took: 188.980 usec (+- 0.071 usec)
>  Average num. events: 278.000 (+- 0.000)
>  Average time per event 0.680 usec
>
> -flto=thin
> $ perf bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>  Average synthesis took: 183.122 usec (+- 0.082 usec)
>  Average num. events: 52.000 (+- 0.000)
>  Average time per event 3.522 usec
>  Average data synthesis took: 196.468 usec (+- 0.102 usec)
>  Average num. events: 277.000 (+- 0.000)
>  Average time per event 0.709 usec
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>  Average synthesis took: 177.684 usec (+- 0.094 usec)
>  Average num. events: 52.000 (+- 0.000)
>  Average time per event 3.417 usec
>  Average data synthesis took: 190.079 usec (+- 0.077 usec)
>  Average num. events: 275.000 (+- 0.000)
>  Average time per event 0.691 usec
>
> -flto
> $ perf bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>  Average synthesis took: 112.599 usec (+- 0.040 usec)
>  Average num. events: 52.000 (+- 0.000)
>  Average time per event 2.165 usec
>  Average data synthesis took: 119.012 usec (+- 0.070 usec)
>  Average num. events: 278.000 (+- 0.000)
>  Average time per event 0.428 usec
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
>  Average synthesis took: 107.606 usec (+- 0.147 usec)
>  Average num. events: 52.000 (+- 0.000)
>  Average time per event 2.069 usec
>  Average data synthesis took: 114.633 usec (+- 0.159 usec)
>  Average num. events: 279.000 (+- 0.000)
>  Average time per event 0.411 usec
>
> The performance win from thin LTO doesn't look to be there. Full LTO
> appears to be reducing event synthesis time down to 60% of what it
> was. The clang numbers are looking better than the GCC ones. I think
> from this it makes sense to use -flto.
Without any context, I'm not really sure what numbers are good vs. bad
("is larger better?").  More so I was curious if thinLTO perhaps got
most of the win without significant performance regressions. If not,
oh well, and if the slower full LTO has numbers that make sense to
other reviewers, well then *Chuck Norris thumbs up*.  Thanks for the
stats.
>
> Thanks,
> Ian
>
> > --
> > Thanks,
> > ~Nick Desaulniers
-- 
Thanks,
~Nick Desaulniers
* Re: [PATCH v1 0/4] Perf tool LTO support
  2023-07-24 22:27     ` Nick Desaulniers
@ 2023-07-24 22:38       ` Ian Rogers
  0 siblings, 0 replies; 12+ messages in thread
From: Ian Rogers @ 2023-07-24 22:38 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Tom Rix, Kan Liang, Yang Jihong,
	Ravi Bangoria, Carsten Haitzler, Zhengjun Xing, James Clark,
	linux-perf-users, linux-kernel, bpf, llvm, maskray
On Mon, Jul 24, 2023 at 3:27 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Mon, Jul 24, 2023 at 2:48 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Mon, Jul 24, 2023 at 2:15 PM Nick Desaulniers
> > <ndesaulniers@google.com> wrote:
> > >
> > > On Mon, Jul 24, 2023 at 1:12 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > Add a build flag, LTO=1, so that perf is built with the -flto
> > > > flag. Address some build errors this configuration throws up.
> > >
> > > Hi Ian,
> > > Thanks for the performance numbers. Any sense of what the build time
> > > numbers might look like for building perf with LTO?
> > >
> > > Does `-flto=thin` in clang's case make a meaningful difference of
> > > `-flto`? I'd recommend that over "full LTO" `-flto` when the
> > > performance difference of the result isn't too meaningful.  ThinLTO
> > > should be faster to build, but I don't know that I've ever built perf,
> > > so IDK what to expect.
> >
> > Hi Nick,
> >
> > I'm not sure how much the perf build will benefit from LTO to say
> > whether thin is good enough or not. Things like "perf record" are
> > designed to spend the majority of their time blocking on a poll system
> > call. We have benchmarks at least :-)
> >
> > I grabbed some clang build times in an unscientific way on my loaded laptop:
> >
> > no LTO
> > real    0m48.846s
> > user    3m11.452s
> > sys     0m29.598s
> >
> > -flto=thin
> > real    0m55.910s
> > user    4m2.342s
> > sys     0m30.120s
> >
> > real    0m50.330s
> > user    3m36.986s
> > sys     0m28.519s
> >
> > -flto
> > real    1m12.002s
> > user    3m27.676s
> > sys     0m30.305s
> >
> > real    1m5.187s
> > user    3m19.348s
> > sys     0m29.031s
> >
> > So perhaps thin LTO increases total build time by 10%, whilst full LTO
> > increases it by 50%.
> >
> > Gathering some clang performance numbers:
> >
> > no LTO
> > $ perf bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >  Average synthesis took: 178.694 usec (+- 0.171 usec)
> >  Average num. events: 52.000 (+- 0.000)
> >  Average time per event 3.436 usec
> >  Average data synthesis took: 194.545 usec (+- 0.088 usec)
> >  Average num. events: 277.000 (+- 0.000)
> >  Average time per event 0.702 usec
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >  Average synthesis took: 175.381 usec (+- 0.105 usec)
> >  Average num. events: 52.000 (+- 0.000)
> >  Average time per event 3.373 usec
> >  Average data synthesis took: 188.980 usec (+- 0.071 usec)
> >  Average num. events: 278.000 (+- 0.000)
> >  Average time per event 0.680 usec
> >
> > -flto=thin
> > $ perf bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >  Average synthesis took: 183.122 usec (+- 0.082 usec)
> >  Average num. events: 52.000 (+- 0.000)
> >  Average time per event 3.522 usec
> >  Average data synthesis took: 196.468 usec (+- 0.102 usec)
> >  Average num. events: 277.000 (+- 0.000)
> >  Average time per event 0.709 usec
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >  Average synthesis took: 177.684 usec (+- 0.094 usec)
> >  Average num. events: 52.000 (+- 0.000)
> >  Average time per event 3.417 usec
> >  Average data synthesis took: 190.079 usec (+- 0.077 usec)
> >  Average num. events: 275.000 (+- 0.000)
> >  Average time per event 0.691 usec
> >
> > -flto
> > $ perf bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >  Average synthesis took: 112.599 usec (+- 0.040 usec)
> >  Average num. events: 52.000 (+- 0.000)
> >  Average time per event 2.165 usec
> >  Average data synthesis took: 119.012 usec (+- 0.070 usec)
> >  Average num. events: 278.000 (+- 0.000)
> >  Average time per event 0.428 usec
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >  Average synthesis took: 107.606 usec (+- 0.147 usec)
> >  Average num. events: 52.000 (+- 0.000)
> >  Average time per event 2.069 usec
> >  Average data synthesis took: 114.633 usec (+- 0.159 usec)
> >  Average num. events: 279.000 (+- 0.000)
> >  Average time per event 0.411 usec
> >
> > The performance win from thin LTO doesn't look to be there. Full LTO
> > appears to be reducing event synthesis time down to 60% of what it
> > was. The clang numbers are looking better than the GCC ones. I think
> > from this it makes sense to use -flto.
>
> Without any context, I'm not really sure what numbers are good vs. bad
> ("is larger better?").  More so I was curious if thinLTO perhaps got
> most of the win without significant performance regressions. If not,
> oh well, and if the slower full LTO has numbers that make sense to
> other reviewers, well then *Chuck Norris thumbs up*.  Thanks for the
> stats.
I can at least explain the stats. When perf starts it has to
"synthesize" the state of the machine: it generates fake events to
describe the mmaps in each process by reading /proc. This is most
typically done so that a virtual address can be turned into a filename
and line number. Generally this is done for the text part of a binary,
but it may also be done for the data. Large systems may take a long
time to synthesize all the state for, hence the benchmark.
The result I normally look at above is the "Average time per event",
where smaller is better. Without LTO or with thin LTO, a synthesis pass
takes approx. 180 microseconds (roughly 3.4 microseconds per event).
With full LTO a pass takes approx. 110 microseconds (roughly 2.1
microseconds per event), which could be a noticeable start-up time win.
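As a rough cross-check of the ratios discussed in this thread, here is a minimal sketch that recomputes them from the build times and "Average synthesis took" values quoted above. The dictionary and function names are illustrative only and are not part of perf, and with one or two runs per configuration on a loaded laptop the results are indicative at best.

```python
from statistics import mean

# "real" build times in seconds, copied from the runs quoted above.
build_real = {
    "no LTO": [48.846],
    "-flto=thin": [55.910, 50.330],
    "-flto": [72.002, 65.187],
}

# "Average synthesis took" values in usec, copied from the same runs.
synthesis_usec = {
    "no LTO": [178.694, 175.381],
    "-flto=thin": [183.122, 177.684],
    "-flto": [112.599, 107.606],
}


def relative_to_baseline(samples, baseline="no LTO"):
    """Return mean(config) / mean(baseline) for every configuration."""
    base = mean(samples[baseline])
    return {name: mean(values) / base for name, values in samples.items()}


build_ratio = relative_to_baseline(build_real)
synth_ratio = relative_to_baseline(synthesis_usec)

for name in build_real:
    print(f"{name:12s} build x{build_ratio[name]:.2f}  "
          f"synthesis x{synth_ratio[name]:.2f}")
```

With these inputs the mean thin LTO build is about 9% slower than the baseline and the mean full LTO build about 40% slower, while full LTO synthesis runs in about 62% of the baseline time and thin LTO shows no improvement.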
Thanks,
Ian
> >
> > Thanks,
> > Ian
> >
> > > --
> > > Thanks,
> > > ~Nick Desaulniers
>
>
>
> --
> Thanks,
> ~Nick Desaulniers
Thread overview: 12+ messages
2023-07-24 20:12 [PATCH v1 0/4] Perf tool LTO support Ian Rogers
2023-07-24 20:12 ` [PATCH v1 1/4] perf stat: Avoid uninitialized use of perf_stat_config Ian Rogers
2023-07-24 21:09   ` Nick Desaulniers
2023-07-24 20:12 ` [PATCH v1 2/4] perf parse-events: Avoid use uninitialized warning Ian Rogers
2023-07-24 21:01   ` Nick Desaulniers
2023-07-24 20:12 ` [PATCH v1 3/4] perf test: Avoid weak symbol for arch_tests Ian Rogers
2023-07-24 20:12 ` [PATCH v1 4/4] perf build: Add LTO build option Ian Rogers
2023-07-24 21:15 ` [PATCH v1 0/4] Perf tool LTO support Nick Desaulniers
2023-07-24 21:48   ` Ian Rogers
2023-07-24 22:27     ` Nick Desaulniers
2023-07-24 22:38       ` Ian Rogers
2023-07-24 21:29 ` Arnaldo Carvalho de Melo