Linux Perf Users
* Re: [PATCH v4 12/53] perf bpf: Don't synthesize BPF events when disabled
From: Arnaldo Carvalho de Melo @ 2023-11-09 16:10 UTC (permalink / raw)
  To: Song Liu
  Cc: Ian Rogers, Song Liu, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Adrian Hunter,
	Nick Terrell, Kan Liang, Andi Kleen, Kajol Jain, Athira Rajeev,
	Huacai Chen, Masami Hiramatsu, Vincent Whitchurch,
	Steinar H. Gunderson, Liam Howlett, Miguel Ojeda, Colin Ian King,
	Dmitrii Dolgov, Yang Jihong, Ming Wang, James Clark,
	K Prateek Nayak, Sean Christopherson, Leo Yan, Ravi Bangoria,
	German Gomez, Changbin Du, Paolo Bonzini, Li Dong, Sandipan Das,
	liuwenyu, linux-kernel, linux-perf-users
In-Reply-To: <CAPhsuW4ftvoUFnNfcZgBg7=SeaHmev7roFnix=+c+zSq3LawFQ@mail.gmail.com>

On Wed, Nov 08, 2023 at 03:03:15PM -0800, Song Liu wrote:
> On Wed, Nov 8, 2023 at 8:15 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >
> > On Thu, Nov 02, 2023 at 10:56:54AM -0700, Ian Rogers wrote:
> > > If BPF sideband events are disabled on the command line, don't
> > > synthesize BPF events either.
> >
> >
> > Interesting, in 71184c6ab7e60fd5 ("perf record: Replace option
> > --bpf-event with --no-bpf-event") we checked that, but only down at
> > perf_event__synthesize_one_bpf_prog(), where we have:
> >
> >         if (!opts->no_bpf_event) {
> >                 /* Synthesize PERF_RECORD_BPF_EVENT */
> >                 *bpf_event = (struct perf_record_bpf_event)
> >
> >
> > So we'd better remove that now-redundant check? I'll apply your patch as
> > is and then we can remove that other check.
> >
> > Song, can I have your Acked-by or Reviewed-by, please?
> >
> > - Arnaldo
> >
> > > Signed-off-by: Ian Rogers <irogers@google.com>
> 
> Good catch!
> 
> Acked-by: Song Liu <song@kernel.org>

Thanks, applied the patch with your Acked-by, will revisit this after
this gets published.

- Arnaldo


* Re: s390x stack unwinding with perf?
From: Heiko Carstens @ 2023-11-09 14:48 UTC (permalink / raw)
  To: Neal Gompa
  Cc: Daan De Meyer, linux-s390, linux-perf-users, Andreas Krebbel,
	Ilya Leoshkevich, Thomas Richter, Sumanth Korikkar, Vasily Gorbik,
	Davide Cavalca
In-Reply-To: <CAEg-Je_eyVRFmtCtAH+BLvqfPut3LtZQL7NFASzv7Er=iJjqAw@mail.gmail.com>

On Mon, Oct 30, 2023 at 09:19:02AM -0400, Neal Gompa wrote:
> On Mon, Oct 30, 2023 at 9:02 AM Heiko Carstens <hca@linux.ibm.com> wrote:
> >
> > On Fri, Oct 27, 2023 at 11:22:42AM -0400, Neal Gompa wrote:
> > > On Fri, Oct 27, 2023 at 6:10 AM Heiko Carstens <hca@linux.ibm.com> wrote:
> > > >
> > > > On Fri, Oct 27, 2023 at 10:00:53AM +0200, Daan De Meyer wrote:
> > > > >
> > > > > If the kernel gets support for s390x user space unwinding using the backchain,
> > > > > we'll propose to enable -mbackchain in the default compilation flags for Fedora
> > > > > so that s390x on Fedora will have the same profiling experience as x86-64, arm64
> > > > > and ppc64. For now we'll keep the status quo since compiling with the backchain
> > > > > doesn't provide any benefit until the kernel unwinder can unwind user
> > > > > space stacks
> > > > > using it.
> > > > >
> > > > > Thanks for clarifying the current state of user space stack unwinding on s390x!
> > > >
> > > > We will implement the missing pieces and let you know when things are
> > > > supposed to work.
> > >
> > > Do you think we could have an initial patch set for implementing the
> > > missing pieces in time for the Linux 6.8 merge window? Then we can
> > > look at enabling this for s390x as a Fedora Linux 40 Change.
> >
> > This will very likely be the case. Actually the plan is to go with the
> > patch below. I gave it some testing with Fedora 38 and replaced (only)
> > glibc with a variant that was compiled with -mbackchain.
...
> This patch LGTM. I'd love to see it land in Linux 6.7!
> 
> Reviewed-by: Neal Gompa <ngompa@fedoraproject.org>

FWIW, this is now upstream and will land in 6.7, together with a similar
patch which adds user stacktrace support:

504b73d00a55 ("s390/perf: implement perf_callchain_user()")
aa44433ac4ee ("s390: add USER_STACKTRACE support")

Please let us know if there are any problems.


* [PATCH v2] perf report: Add s390 raw data interpretation for PAI counters
From: Thomas Richter @ 2023-11-09 12:41 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, acme, namhyung
  Cc: svens, gor, sumanthk, hca, Thomas Richter

commit 1bf54f32f525 ("s390/pai: Add support for cryptography counters")
added support for Processor Activity Instrumentation Facility (PAI)
counters.  These counter values are added as raw data with the perf
sample during perf record.
Now add support to display these counters in the perf report command.
The counter number, its assigned name and its value are now printed in
addition to the hexadecimal output.

Version 2 changes the PMU lookup so that it is done only when evsel->pmu is NULL.

Output before:
 # perf report -D

 6 514766399626050 0x7b058 [0x48]: PERF_RECORD_SAMPLE(IP, 0x1):
				303977/303977: 0 period: 1 addr: 0
 ... thread: paitest:303977
 ...... dso: <not found>

 0x7b0a0@/root/perf.data.paicrypto [0x48]: event: 9
 .
 . ... raw event: size 72 bytes
 . 0000:  00 00 00 09 00 01 00 48 00 00 00 00 00 00 00 00  .......H........
 . 0010:  00 04 a3 69 00 04 a3 69 00 01 d4 2d 76 de a0 bb  ...i...i...-v...
 . 0020:  00 00 00 00 00 01 5c 53 00 00 00 06 00 00 00 00  ......\S........
 . 0030:  00 00 00 00 00 00 00 01 00 00 00 0c 00 07 00 00  ................
 . 0040:  00 00 00 53 96 af 00 00                          ...S....

Output after:
 # perf report -D

 6 514766399626050 0x7b058 [0x48]: PERF_RECORD_SAMPLE(IP, 0x1):
				303977/303977: 0 period: 1 addr: 0
 ... thread: paitest:303977
 ...... dso: <not found>

 0x7b0a0@/root/perf.data.paicrypto [0x48]: event: 9
 .
 . ... raw event: size 72 bytes
 . 0000:  00 00 00 09 00 01 00 48 00 00 00 00 00 00 00 00  .......H........
 . 0010:  00 04 a3 69 00 04 a3 69 00 01 d4 2d 76 de a0 bb  ...i...i...-v...
 . 0020:  00 00 00 00 00 01 5c 53 00 00 00 06 00 00 00 00  ......\S........
 . 0030:  00 00 00 00 00 00 00 01 00 00 00 0c 00 07 00 00  ................
 . 0040:  00 00 00 53 96 af 00 00                          ...S....

        Counter:007 km_aes_128 Value:0x00000000005396af     <--- new

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
---
 tools/perf/util/s390-cpumcf-kernel.h |   2 +
 tools/perf/util/s390-sample-raw.c    | 103 ++++++++++++++++++++++++---
 2 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/s390-cpumcf-kernel.h b/tools/perf/util/s390-cpumcf-kernel.h
index f55ca07f3ca1..74b36644e384 100644
--- a/tools/perf/util/s390-cpumcf-kernel.h
+++ b/tools/perf/util/s390-cpumcf-kernel.h
@@ -12,6 +12,8 @@
 #define	S390_CPUMCF_DIAG_DEF	0xfeef	/* Counter diagnostic entry ID */
 #define	PERF_EVENT_CPUM_CF_DIAG	0xBC000	/* Event: Counter sets */
 #define PERF_EVENT_CPUM_SF_DIAG	0xBD000 /* Event: Combined-sampling */
+#define PERF_EVENT_PAI_CRYPTO_ALL	0x1000 /* Event: CRYPTO_ALL */
+#define PERF_EVENT_PAI_NNPA_ALL	0x1800 /* Event: NNPA_ALL */
 
 struct cf_ctrset_entry {	/* CPU-M CF counter set entry (8 byte) */
 	unsigned int def:16;	/* 0-15  Data Entry Format */
diff --git a/tools/perf/util/s390-sample-raw.c b/tools/perf/util/s390-sample-raw.c
index 115b16edb451..85a54a018d15 100644
--- a/tools/perf/util/s390-sample-raw.c
+++ b/tools/perf/util/s390-sample-raw.c
@@ -125,6 +125,9 @@ static int get_counterset_start(int setnr)
 		return 128;
 	case CPUMF_CTR_SET_MT_DIAG:		/* Diagnostic counter set */
 		return 448;
+	case PERF_EVENT_PAI_NNPA_ALL:		/* PAI NNPA counter set */
+	case PERF_EVENT_PAI_CRYPTO_ALL:		/* PAI CRYPTO counter set */
+		return setnr;
 	default:
 		return -1;
 	}
@@ -212,27 +215,111 @@ static void s390_cpumcfdg_dump(struct perf_pmu *pmu, struct perf_sample *sample)
 	}
 }
 
+/*
+ * Check for consistency of PAI_CRYPTO/PAI_NNPA raw data.
+ */
+struct pai_data {		/* Event number and value */
+	u16 event_nr;
+	u64 event_val;
+} __packed;
+
+/*
+ * Test for valid raw data. At least one PAI event should be in the raw
+ * data section.
+ */
+static bool s390_pai_all_test(struct perf_sample *sample)
+{
+	unsigned char *buf = sample->raw_data;
+	size_t len = sample->raw_size;
+
+	if (len < 0xa || !buf)
+		return false;
+	return true;
+}
+
+static void s390_pai_all_dump(struct evsel *evsel, struct perf_sample *sample)
+{
+	size_t len = sample->raw_size, offset = 0;
+	unsigned char *p = sample->raw_data;
+	const char *color = PERF_COLOR_BLUE;
+	struct pai_data pai_data;
+	char *ev_name;
+
+	while (offset < len) {
+		memcpy(&pai_data.event_nr, p, sizeof(pai_data.event_nr));
+		pai_data.event_nr = be16_to_cpu(pai_data.event_nr);
+		p += sizeof(pai_data.event_nr);
+		offset += sizeof(pai_data.event_nr);
+
+		memcpy(&pai_data.event_val, p, sizeof(pai_data.event_val));
+		pai_data.event_val = be64_to_cpu(pai_data.event_val);
+		p += sizeof(pai_data.event_val);
+		offset += sizeof(pai_data.event_val);
+
+		ev_name = get_counter_name(evsel->core.attr.config,
+					   pai_data.event_nr, evsel->pmu);
+		color_fprintf(stdout, color, "\tCounter:%03d %s Value:%#018lx\n",
+			      pai_data.event_nr, ev_name ?: "<unknown>",
+			      pai_data.event_val);
+
+		if (offset + 0xa > len)
+			break;
+	}
+	color_fprintf(stdout, color, "\n");
+}
+
 /* S390 specific trace event function. Check for PERF_RECORD_SAMPLE events
- * and if the event was triggered by a counter set diagnostic event display
- * its raw data.
+ * and if the event was triggered by a
+ * - counter set diagnostic event
+ * - processor activity instrumentation (PAI) crypto counter event
+ * - processor activity instrumentation (PAI) neural network processor assist (NNPA)
+ *   counter event
+ * display its raw data.
  * The function is only invoked when the dump flag -D is set.
+ *
+ * Function evlist__s390_sample_raw() is defined as call back after it has
+ * been verified that the perf.data file was created on s390 platform.
  */
-void evlist__s390_sample_raw(struct evlist *evlist, union perf_event *event, struct perf_sample *sample)
+void evlist__s390_sample_raw(struct evlist *evlist, union perf_event *event,
+			     struct perf_sample *sample)
 {
+	const char *pai_name;
 	struct evsel *evsel;
 
 	if (event->header.type != PERF_RECORD_SAMPLE)
 		return;
 
 	evsel = evlist__event2evsel(evlist, event);
-	if (evsel == NULL ||
-	    evsel->core.attr.config != PERF_EVENT_CPUM_CF_DIAG)
+	if (evsel == NULL)
 		return;
 
 	/* Display raw data on screen */
-	if (!s390_cpumcfdg_testctr(sample)) {
-		pr_err("Invalid counter set data encountered\n");
+	if (evsel->core.attr.config == PERF_EVENT_CPUM_CF_DIAG) {
+		if (!evsel->pmu)
+			evsel->pmu = perf_pmus__find("cpum_cf");
+		if (!s390_cpumcfdg_testctr(sample))
+			pr_err("Invalid counter set data encountered\n");
+		else
+			s390_cpumcfdg_dump(evsel->pmu, sample);
 		return;
 	}
-	s390_cpumcfdg_dump(evsel->pmu, sample);
+
+	switch (evsel->core.attr.config) {
+	case PERF_EVENT_PAI_NNPA_ALL:
+		pai_name = "NNPA_ALL";
+		break;
+	case PERF_EVENT_PAI_CRYPTO_ALL:
+		pai_name = "CRYPTO_ALL";
+		break;
+	default:
+		return;
+	}
+
+	if (!s390_pai_all_test(sample))
+		pr_err("Invalid %s raw data encountered\n", pai_name);
+	else {
+		if (!evsel->pmu)
+			evsel->pmu = perf_pmus__find_by_type(evsel->core.attr.type);
+		s390_pai_all_dump(evsel, sample);
+	}
 }
-- 
2.41.0



* Re: [PATCH v2 2/2] perf test: Add support for setting objdump binary via perf config
From: James Clark @ 2023-11-09 10:26 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: linux-perf-users, irogers, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	Yonghong Song, Fangrui Song, Kan Liang, Yang Jihong,
	Athira Rajeev, Ravi Bangoria, linux-kernel, llvm
In-Reply-To: <ZUv1TgveArYdvTsl@kernel.org>



On 08/11/2023 20:53, Arnaldo Carvalho de Melo wrote:
> On Mon, Nov 06, 2023 at 03:10:49PM +0000, James Clark wrote:
>> Add a perf config variable that does the same thing as "perf test
>> --objdump <x>".
>>
>> Also update the man page.
> 
> That is ok, if one wants to change objdump just for testing, as a
> followup improvement it may be interesting to allow that for the other
> tools that have --objdump as well as to add this as a global option,
> that affects all tools, no?

For the tools they already all share annotate.objdump in the config. Do
you mean that the tests could share the same config instead?

Maybe I could have used annotate.objdump for the tests, but it was used
in a slightly different way, and I thought it would be easier for people
to find if it started with "test."

> 
> Anyway, applied both patches.
> 
> - Arnaldo
>  
>> Signed-off-by: James Clark <james.clark@arm.com>
>> ---
>>  tools/perf/Documentation/perf-config.txt |  4 ++++
>>  tools/perf/tests/builtin-test.c          | 12 ++++++++++++
>>  2 files changed, 16 insertions(+)
>>
>> diff --git a/tools/perf/Documentation/perf-config.txt b/tools/perf/Documentation/perf-config.txt
>> index 0b4e79dbd3f6..16398babd1ef 100644
>> --- a/tools/perf/Documentation/perf-config.txt
>> +++ b/tools/perf/Documentation/perf-config.txt
>> @@ -722,6 +722,10 @@ session-<NAME>.*::
>>  		Defines new record session for daemon. The value is record's
>>  		command line without the 'record' keyword.
>>  
>> +test.*::
>> +
>> +	test.objdump::
>> +		objdump binary to use for disassembly and annotations.
>>  
>>  SEE ALSO
>>  --------
>> diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
>> index a8d17dd50588..113e92119e1d 100644
>> --- a/tools/perf/tests/builtin-test.c
>> +++ b/tools/perf/tests/builtin-test.c
>> @@ -14,6 +14,7 @@
>>  #include <sys/wait.h>
>>  #include <sys/stat.h>
>>  #include "builtin.h"
>> +#include "config.h"
>>  #include "hist.h"
>>  #include "intlist.h"
>>  #include "tests.h"
>> @@ -514,6 +515,15 @@ static int run_workload(const char *work, int argc, const char **argv)
>>  	return -1;
>>  }
>>  
>> +static int perf_test__config(const char *var, const char *value,
>> +			     void *data __maybe_unused)
>> +{
>> +	if (!strcmp(var, "test.objdump"))
>> +		test_objdump_path = value;
>> +
>> +	return 0;
>> +}
>> +
>>  int cmd_test(int argc, const char **argv)
>>  {
>>  	const char *test_usage[] = {
>> @@ -541,6 +551,8 @@ int cmd_test(int argc, const char **argv)
>>          if (ret < 0)
>>                  return ret;
>>  
>> +	perf_config(perf_test__config, NULL);
>> +
>>  	/* Unbuffered output */
>>  	setvbuf(stdout, NULL, _IONBF, 0);
>>  
>> -- 
>> 2.34.1
>>
> 


* [PATCH v3] perf vendor events riscv: add StarFive Dubhe-90 JSON file
From: Ji Sheng Teoh @ 2023-11-09  7:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Nikita Shubin
  Cc: Ji Sheng Teoh, Ley Foon Tan, linux-perf-users, linux-kernel,
	linux-riscv

StarFive's Dubhe-90 supports raw event IDs 0x00 - 0x22.
The raw events are enabled through the PMU node of the DT binding.
Besides the raw events, add the standard RISC-V firmware events to
support monitoring of firmware events.

Example of PMU DT node:
pmu {
	compatible = "riscv,pmu";
	riscv,raw-event-to-mhpmcounters =
		/* Event ID 1-31 */
		<0x00 0x00 0xFFFFFFFF 0xFFFFFFE0 0x00007FF8>,
		/* Event ID 32-33 */
		<0x00 0x20 0xFFFFFFFF 0xFFFFFFFE 0x00007FF8>,
		/* Event ID 34 */
		<0x00 0x22 0xFFFFFFFF 0xFFFFFF22 0x00007FF8>;
};

Perf stat output:
[root@user]# perf stat -a \
	-e access_mmu_stlb \
	-e miss_mmu_stlb \
	-e access_mmu_pte_c \
	-e rob_flush \
	-e btb_prediction_miss \
	-e itlb_miss \
	-e sync_del_fetch_g \
	-e icache_miss \
	-e bpu_br_retire \
	-e bpu_br_miss \
	-e ret_ins_retire \
	-e ret_ins_miss \
	-- openssl speed rsa2048
Doing 2048 bits private rsa's for 10s: 39 2048 bits private RSA's in
10.03s
Doing 2048 bits public rsa's for 10s: 1469 2048 bits public RSA's in
9.47s
version: 3.0.10
built on: Tue Aug  1 13:47:24 2023 UTC
options: bn(64,64)
CPUINFO: N/A
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.257179s 0.006447s      3.9    155.1

 Performance counter stats for 'system wide':

           3112882      access_mmu_stlb
             10550      miss_mmu_stlb
             18251      access_mmu_pte_c
            274765      rob_flush
          22470560      btb_prediction_miss
           3035839      itlb_miss
         643549060      sync_del_fetch_g
            133013      icache_miss
          62982796      bpu_br_retire
            287548      bpu_br_miss
           8935910      ret_ins_retire
              8308      ret_ins_miss

      20.656182600 seconds time elapsed

Signed-off-by: Ji Sheng Teoh <jisheng.teoh@starfivetech.com>
Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com>
---
Changelog:
v2 -> v3:
- Add standard RISC-V firmware event
- Update commit message to reflect addition of standard
  RISC-V firmware event.
v1 -> v2:
- Rename 'Starfive Dubhe' to 'StarFive Dubhe-90' in commit message.
- Rename 'starfive/dubhe' pmu-events folder to 'starfive/dubhe-90'
- Update MARCHID to 0x80000000db000090 in mapfile.csv
---
 tools/perf/pmu-events/arch/riscv/mapfile.csv  |   1 +
 .../arch/riscv/starfive/dubhe-90/common.json  | 172 ++++++++++++++++++
 .../riscv/starfive/dubhe-90/firmware.json     |  68 +++++++
 3 files changed, 241 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/common.json
 create mode 100644 tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/firmware.json

diff --git a/tools/perf/pmu-events/arch/riscv/mapfile.csv b/tools/perf/pmu-events/arch/riscv/mapfile.csv
index c61b3d6ef616..5b75ecfe206d 100644
--- a/tools/perf/pmu-events/arch/riscv/mapfile.csv
+++ b/tools/perf/pmu-events/arch/riscv/mapfile.csv
@@ -15,3 +15,4 @@
 #
 #MVENDORID-MARCHID-MIMPID,Version,Filename,EventType
 0x489-0x8000000000000007-0x[[:xdigit:]]+,v1,sifive/u74,core
+0x67e-0x80000000db000090-0x[[:xdigit:]]+,v1,starfive/dubhe-90,core
\ No newline at end of file
diff --git a/tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/common.json b/tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/common.json
new file mode 100644
index 000000000000..fbffcacb2ace
--- /dev/null
+++ b/tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/common.json
@@ -0,0 +1,172 @@
+[
+  {
+    "EventName": "ACCESS_MMU_STLB",
+    "EventCode": "0x1",
+    "BriefDescription": "access MMU STLB"
+  },
+  {
+    "EventName": "MISS_MMU_STLB",
+    "EventCode": "0x2",
+    "BriefDescription": "miss MMU STLB"
+  },
+  {
+    "EventName": "ACCESS_MMU_PTE_C",
+    "EventCode": "0x3",
+    "BriefDescription": "access MMU PTE-Cache"
+  },
+  {
+    "EventName": "MISS_MMU_PTE_C",
+    "EventCode": "0x4",
+    "BriefDescription": "miss MMU PTE-Cache"
+  },
+  {
+    "EventName": "ROB_FLUSH",
+    "EventCode": "0x5",
+    "BriefDescription": "ROB flush (all kinds of exceptions)"
+  },
+  {
+    "EventName": "BTB_PREDICTION_MISS",
+    "EventCode": "0x6",
+    "BriefDescription": "BTB prediction miss"
+  },
+  {
+    "EventName": "ITLB_MISS",
+    "EventCode": "0x7",
+    "BriefDescription": "ITLB miss"
+  },
+  {
+    "EventName": "SYNC_DEL_FETCH_G",
+    "EventCode": "0x8",
+    "BriefDescription": "SYNC delivery a fetch-group"
+  },
+  {
+    "EventName": "ICACHE_MISS",
+    "EventCode": "0x9",
+    "BriefDescription": "ICache miss"
+  },
+  {
+    "EventName": "BPU_BR_RETIRE",
+    "EventCode": "0xA",
+    "BriefDescription": "condition branch instruction retire"
+  },
+  {
+    "EventName": "BPU_BR_MISS",
+    "EventCode": "0xB",
+    "BriefDescription": "condition branch instruction miss"
+  },
+  {
+    "EventName": "RET_INS_RETIRE",
+    "EventCode": "0xC",
+    "BriefDescription": "return instruction retire"
+  },
+  {
+    "EventName": "RET_INS_MISS",
+    "EventCode": "0xD",
+    "BriefDescription": "return instruction miss"
+  },
+  {
+    "EventName": "INDIRECT_JR_MISS",
+    "EventCode": "0xE",
+    "BriefDescription": "indirect JR instruction miss (include without target)"
+  },
+  {
+    "EventName": "IBUF_VAL_ID_NORDY",
+    "EventCode": "0xF",
+    "BriefDescription": "IBUF valid while ID not ready"
+  },
+  {
+    "EventName": "IBUF_NOVAL_ID_RDY",
+    "EventCode": "0x10",
+    "BriefDescription": "IBUF not valid while ID ready"
+  },
+  {
+    "EventName": "REN_INT_PHY_REG_NORDY",
+    "EventCode": "0x11",
+    "BriefDescription": "REN integer physical register file is not ready"
+  },
+  {
+    "EventName": "REN_FP_PHY_REG_NORDY",
+    "EventCode": "0x12",
+    "BriefDescription": "REN floating point physical register file is not ready"
+  },
+  {
+    "EventName": "REN_CP_NORDY",
+    "EventCode": "0x13",
+    "BriefDescription": "REN checkpoint is not ready"
+  },
+  {
+    "EventName": "DEC_VAL_ROB_NORDY",
+    "EventCode": "0x14",
+    "BriefDescription": "DEC is valid and ROB is not ready"
+  },
+  {
+    "EventName": "OOD_FLUSH_LS_DEP",
+    "EventCode": "0x15",
+    "BriefDescription": "out of order flush due to load/store dependency"
+  },
+  {
+    "EventName": "BRU_RET_IJR_INS",
+    "EventCode": "0x16",
+    "BriefDescription": "BRU retire an IJR instruction"
+  },
+  {
+    "EventName": "ACCESS_DTLB",
+    "EventCode": "0x17",
+    "BriefDescription": "access DTLB"
+  },
+  {
+    "EventName": "MISS_DTLB",
+    "EventCode": "0x18",
+    "BriefDescription": "miss DTLB"
+  },
+  {
+    "EventName": "LOAD_INS_DCACHE",
+    "EventCode": "0x19",
+    "BriefDescription": "load instruction access DCache"
+  },
+  {
+    "EventName": "LOAD_INS_MISS_DCACHE",
+    "EventCode": "0x1A",
+    "BriefDescription": "load instruction miss DCache"
+  },
+  {
+    "EventName": "STORE_INS_DCACHE",
+    "EventCode": "0x1B",
+    "BriefDescription": "store/amo instruction access DCache"
+  },
+  {
+    "EventName": "STORE_INS_MISS_DCACHE",
+    "EventCode": "0x1C",
+    "BriefDescription": "store/amo instruction miss DCache"
+  },
+  {
+    "EventName": "LOAD_SCACHE",
+    "EventCode": "0x1D",
+    "BriefDescription": "load access SCache"
+  },
+  {
+    "EventName": "STORE_SCACHE",
+    "EventCode": "0x1E",
+    "BriefDescription": "store access SCache"
+  },
+  {
+    "EventName": "LOAD_MISS_SCACHE",
+    "EventCode": "0x1F",
+    "BriefDescription": "load miss SCache"
+  },
+  {
+    "EventName": "STORE_MISS_SCACHE",
+    "EventCode": "0x20",
+    "BriefDescription": "store miss SCache"
+  },
+  {
+    "EventName": "L2C_PF_REQ",
+    "EventCode": "0x21",
+    "BriefDescription": "L2C data-prefetcher request"
+  },
+  {
+    "EventName": "L2C_PF_HIT",
+    "EventCode": "0x22",
+    "BriefDescription": "L2C data-prefetcher hit"
+  }
+]
diff --git a/tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/firmware.json b/tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/firmware.json
new file mode 100644
index 000000000000..9b4a032186a7
--- /dev/null
+++ b/tools/perf/pmu-events/arch/riscv/starfive/dubhe-90/firmware.json
@@ -0,0 +1,68 @@
+[
+  {
+    "ArchStdEvent": "FW_MISALIGNED_LOAD"
+  },
+  {
+    "ArchStdEvent": "FW_MISALIGNED_STORE"
+  },
+  {
+    "ArchStdEvent": "FW_ACCESS_LOAD"
+  },
+  {
+    "ArchStdEvent": "FW_ACCESS_STORE"
+  },
+  {
+    "ArchStdEvent": "FW_ILLEGAL_INSN"
+  },
+  {
+    "ArchStdEvent": "FW_SET_TIMER"
+  },
+  {
+    "ArchStdEvent": "FW_IPI_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_IPI_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_FENCE_I_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_FENCE_I_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_SFENCE_VMA_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_SFENCE_VMA_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_SFENCE_VMA_ASID_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_SFENCE_VMA_ASID_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_GVMA_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_GVMA_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_GVMA_VMID_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_GVMA_VMID_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_VVMA_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_VVMA_RECEIVED"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_VVMA_ASID_SENT"
+  },
+  {
+    "ArchStdEvent": "FW_HFENCE_VVMA_ASID_RECEIVED"
+  }
+]
-- 
2.25.1



* Re: [PATCH 28/48] perf dwarf-aux: Add die_find_variable_by_addr()
From: Namhyung Kim @ 2023-11-09  5:36 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ian Rogers,
	Adrian Hunter, Ingo Molnar, LKML, linux-perf-users,
	Linus Torvalds, Stephane Eranian, linux-toolchains,
	linux-trace-devel
In-Reply-To: <20231107002523.f0643720eac144841dedb8a4@kernel.org>

On Mon, Nov 6, 2023 at 7:25 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 11 Oct 2023 20:50:51 -0700
> Namhyung Kim <namhyung@kernel.org> wrote:
>
> > die_find_variable_by_addr() finds a variable in the given DIE
> > using the given (PC-relative) address.  Global variables have a
> > location expression with DW_OP_addr, which carries an address that can
> > simply be compared with the given address.
> >
> >   <1><143a7>: Abbrev Number: 2 (DW_TAG_variable)
> >       <143a8>   DW_AT_name        : loops_per_jiffy
> >       <143ac>   DW_AT_type        : <0x1cca>
> >       <143b0>   DW_AT_external    : 1
> >       <143b0>   DW_AT_decl_file   : 193
> >       <143b1>   DW_AT_decl_line   : 213
> >       <143b2>   DW_AT_location    : 9 byte block: 3 b0 46 41 82 ff ff ff ff
> >                                      (DW_OP_addr: ffffffff824146b0)
> >
> > Note that the type-offset should be calculated from the base address of
> > the global variable.
> >
> > Signed-off-by: Namhyung Kim <namhyung@kernel.org>
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> BTW, for the global variable, you can also find it via maps. Can't you?

What do you mean by 'via maps'?  The map in perf?

Thanks,
Namhyung
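
To illustrate the DW_OP_addr comparison described in the quoted commit
message: the 9-byte location block is the opcode 0x03 followed by the
address in target byte order. The following is only a sketch, assuming a
64-bit little-endian target; decode_op_addr() and var_offset() are
hypothetical helpers, not functions from the patch or from libdw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DW_OP_addr 0x03	/* opcode value from the DWARF specification */

/*
 * Decode a location block of the form "DW_OP_addr <8-byte address>".
 * Assumes a 64-bit little-endian target. Returns false for any other
 * expression.
 */
bool decode_op_addr(const uint8_t *block, size_t len, uint64_t *addr)
{
	uint64_t v = 0;

	assert(block && addr);

	if (len != 9 || block[0] != DW_OP_addr)
		return false;

	/* Bytes 1..8 hold the address, least significant byte first. */
	for (int i = 8; i >= 1; i--)
		v = (v << 8) | block[i];

	*addr = v;
	return true;
}

/*
 * If 'addr' falls inside the variable [base, base + size), store the
 * offset from the start of the variable in 'off' and return true.
 */
bool var_offset(uint64_t base, uint64_t size, uint64_t addr, uint64_t *off)
{
	if (addr < base || addr - base >= size)
		return false;
	*off = addr - base;
	return true;
}
```

For the block shown for loops_per_jiffy (3 b0 46 41 82 ff ff ff ff),
decode_op_addr() yields 0xffffffff824146b0, and var_offset() then gives
the offset of a sampled address relative to that base, which is the
type-offset mentioned in the commit message.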


* Re: [PATCH 34/48] perf dwarf-aux: Add die_collect_vars()
From: Namhyung Kim @ 2023-11-09  5:05 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ian Rogers,
	Adrian Hunter, Ingo Molnar, LKML, linux-perf-users,
	Linus Torvalds, Stephane Eranian, linux-toolchains,
	linux-trace-devel
In-Reply-To: <20231108195204.a3ddfe5965e9c33661460ff4@kernel.org>

On Wed, Nov 8, 2023 at 2:52 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 11 Oct 2023 20:50:57 -0700
> Namhyung Kim <namhyung@kernel.org> wrote:
>
> > die_collect_vars() finds all variable information in the scope,
> > including function parameters.  struct die_var_type saves the
> > type of the variable with its location (reg and offset) as well as where
> > it's defined in the code (addr).
> >
>
> This looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

>
> BTW, I did a similar thing in collect_variables_cb()@probe-finder.c, maybe
> this can simplify that too.

Ok, I'll take a look later.

Thanks,
Namhyung


* Re: [RFC 00/48] perf tools: Introduce data type profiling (v1)
From: Namhyung Kim @ 2023-11-09  4:48 UTC (permalink / raw)
  To: Joe Mario
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ian Rogers,
	Adrian Hunter, Ingo Molnar, LKML, linux-perf-users,
	Linus Torvalds, Stephane Eranian, Masami Hiramatsu,
	linux-toolchains, linux-trace-devel, Ben Woodard, Kees Cook,
	David Blaikie, Xu Liu, Kan Liang, Ravi Bangoria
In-Reply-To: <82cd8b7e-bd46-49ed-9160-eabcfd4c3c20@redhat.com>

Hello,

On Wed, Nov 8, 2023 at 9:12 AM Joe Mario <jmario@redhat.com> wrote:
>
> Hi Namhyung:
>
> I've been playing with your datatype profile patch and it looks really promising.
> I think it would be a big help if it could be integrated into perf c2c.

Great!  Yeah, I think we can collaborate on it.

>
> Perf c2c gives a great insight into what's contributing to cpu cacheline contention, but it
> can be difficult to understand the output.  Having visuals with your datatype profile output
> would be a big help.

Exactly.

>
> I have a simple test program with readers and writers tugging on the data below:
>
>   uint64_t hotVar;
>   typedef struct __foo {
>      uint64_t m1;
>      uint64_t m2;
>      uint64_t m3;
>   } FOO;
>
> The rest of this reply looks at both your datatype output and c2c to see where they
> might complement each other.
>
>
> When I run perf with your patches on a simple program to cause contention on the above data, I get the following:
>
> # perf mem record --ldlat=1 --all-user --  ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
> # perf report -s type,typeoff --hierarchy --stdio
>
>    # Samples: 26K of event 'cpu/mem-loads,ldlat=1/P'
>    # Event count (approx.): 2958226
>    #
>    #    Overhead  Data Type / Data Type Offset
>    # ...........  ............................
>    #
>        54.50%     int
>           54.50%     int +0 (no field)
>        23.21%     long int
>           23.21%     long int +0 (no field)
>        18.30%     struct __foo
>            9.57%     struct __foo +8 (m2)
>            8.73%     struct __foo +0 (m1)
>         3.86%     long unsigned int
>            3.86%     long unsigned int +0 (no field)
>        <snip>
>
>    # Samples: 30K of event 'cpu/mem-stores/P'
>    # Event count (approx.): 33880197
>    #
>    #    Overhead  Data Type / Data Type Offset
>    # ...........  ............................
>    #
>        99.85%     struct __foo
>           70.48%     struct __foo +0 (m1)
>           29.34%     struct __foo +16 (m3)
>            0.03%     struct __foo +8 (m2)
>         0.09%     long unsigned int
>            0.09%     long unsigned int +0 (no field)
>         0.06%     (unknown)
>            0.06%     (unknown) +0 (no field)
>        <snip>
>
> Then I run perf annotate with your patches, and I get the following:
>
>   # perf annotate  --data-type
>
>    Annotate type: 'long int' in /home/joe/tugtest/tugtest (2901 samples):
>    ============================================================================
>        samples     offset       size  field
>           2901          0          8  long int  ;
>
>    Annotate type: 'struct __foo' in /home/joe/tugtest/tugtest (5593 samples):
>    ============================================================================
>        samples     offset       size  field
>           5593          0         24  struct __foo       {
>           2755          0          8      uint64_t      m1;
>           2838          8          8      uint64_t      m2;
>              0         16          8      uint64_t      m3;
>                                       };
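
The "+0 (m1)", "+8 (m2)" and "+16 (m3)" offsets in the report and the
offset/size columns in the annotate output follow directly from the struct
layout. A small sketch that maps a byte offset back to a member name, plus
the offset of an address within its cacheline, might look like this.
foo_field_at() and cacheline_offset() are illustrative names, and the
64-byte cacheline size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct __foo {
	uint64_t m1;
	uint64_t m2;
	uint64_t m3;
} FOO;

/* Three 8-byte members without padding, matching the reported size. */
static_assert(sizeof(FOO) == 24, "unexpected padding in struct __foo");

/* Return the name of the FOO member covering byte 'off', or NULL. */
const char *foo_field_at(size_t off)
{
	if (off >= sizeof(FOO))
		return NULL;
	if (off >= offsetof(FOO, m3))
		return "m3";
	if (off >= offsetof(FOO, m2))
		return "m2";
	return "m1";
}

/* Offset of 'addr' within its cacheline, assuming 64-byte lines. */
uint64_t cacheline_offset(uint64_t addr)
{
	return addr & 63;
}
```

foo_field_at(8) returns "m2", in line with the "struct __foo +8 (m2)" row,
and an address such as 0x406110 has cacheline offset 0x10.
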
>
> Now when I run that same simple test using perf c2c, and I focus on the cacheline that the struct and hotVar reside in, I get:
>
> # perf c2c record --all-user -- ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
> # perf c2c report -NNN --stdio
> # <snip>
> #
> #      ----- HITM -----  ------- Store Refs ------  ------ Data address ------                ---------- cycles ----------    Total    cpu               Shared
> # Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records    cnt      Symbol   Object    Source:Line  Node{cpu list}
> #....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
> #
>  ---------------------------------------------------------------
>     0     1094     2008    17071    13762        0      0x406100
>  ---------------------------------------------------------------
>          0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>          0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>         68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>         31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>          0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>          0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>          0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>          0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>          0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>
> With the above c2c output, we can see:
>  - the hottest contended addresses, and the load latencies they caused.
>  - the cacheline offset for the contended addresses.
>  - the cpus and numa nodes where the accesses came from.
>  - the cacheline alignment for the data of interest.
>  - the number of cpus and threads concurrently accessing each address.
>  - the breakdown of reads causing HITM (contention) and writes hitting or missing the cacheline.
>  - the object name, source line and line number for where the accesses occurred.
>  - the numa node where the data is allocated.
>  - the number of physical pages the virtual addresses were mapped to (e.g. numa_balancing).
>
> What would really help the c2c output be more usable is if it had a better visual to it.
> It's likely the current c2c output can be trimmed a bit.
>
> Here's one idea that incorporates your datatype info, though I'm sure there are better ways, as this may get unwieldy:
>
> #      ----- HITM -----  ------- Store Refs ------  ------ Data address ------                ---------- cycles ----------    Total    cpu               Shared
> # Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records    cnt      Symbol   Object    Source:Line  Node{cpu list}
> #....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
> #
>  ---------------------------------------------------------------
>     0     1094     2008    17071    13762        0      0x406100
>  ---------------------------------------------------------------
>   uint64_t hotVar: tugtest.c:38
>          0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>          0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>   struct __foo uint64_t m1: tugtest.c:39
>         68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>         31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>          0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>   struct __foo uint64_t m2: tugtest.c:40
>          0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>          0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>          0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>   struct __foo uint64_t m3: tugtest.c:41
>          0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>
> And then it would be good to find a clean way to incorporate your sample counts.

I'm not sure we can get the exact source line for the data type/fields.
Of course, we can aggregate the results for each field.  Actually you
can use `perf report -s type,typeoff,symoff --hierarchy` for something
similar. :)
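
As a rough illustration of what "aggregate the results for each field"
amounts to, here is a minimal user-space sketch.  It assumes the layout
from the quoted annotate output (three uint64_t members at offsets 0, 8
and 16 of a 24-byte struct).  The names struct foo, struct field_stat
and account_sample() are hypothetical and are not perf internals; perf
derives the layout from DWARF rather than from a hard-coded table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for 'struct __foo' from the quoted output (hypothetical name). */
struct foo {
	uint64_t m1;
	uint64_t m2;
	uint64_t m3;
};

/* The quoted annotate output reports a 24-byte struct. */
static_assert(sizeof(struct foo) == 24, "layout must match the quoted output");

/* One row of a per-field table: where the field lives and how often it was hit. */
struct field_stat {
	const char *name;
	size_t offset;
	size_t size;
	unsigned long samples;
};

static struct field_stat foo_fields[] = {
	{ "m1", offsetof(struct foo, m1), sizeof(uint64_t), 0 },
	{ "m2", offsetof(struct foo, m2), sizeof(uint64_t), 0 },
	{ "m3", offsetof(struct foo, m3), sizeof(uint64_t), 0 },
};

/*
 * Credit one sample at byte offset 'off' (relative to the start of the
 * struct) to the field that covers it.  Returns 0 on success, -1 if the
 * offset falls outside every known field.
 */
static int account_sample(struct field_stat *fields, size_t nr, size_t off)
{
	for (size_t i = 0; i < nr; i++) {
		if (off >= fields[i].offset &&
		    off < fields[i].offset + fields[i].size) {
			fields[i].samples++;
			return 0;
		}
	}
	return -1;
}
```

Feeding every sampled data offset through account_sample() and then
printing foo_fields[] gives a per-field breakdown of the same shape as
the quoted table; it does not recover the source line of each field.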

>
> On a related note, is there a way the accesses could be broken down into read counts
> and write counts?   That, with the above source line info for all the accesses,
> helps to convey a picture of "the affinity of the accesses".

Sure, perf report already supports showing events in a group
together.  You can use the --group option to force grouping of
individual events.  perf annotate with --data-type doesn't have
that yet.  I'll update it in v2.

>
> For example, while it's normally good to separate read-mostly data from hot
> written data, if the reads and writes are done together in the same block of
> code by the same thread, then keeping the two data symbols in the same cacheline
> could be a win.  I've seen this often. Your datatype info might be able to
> make these affinities more visible to the user.
>
> Thanks for doing this. This is great.
> Joe

Thanks for your feedback!
Namhyung

^ permalink raw reply

* Re: [PATCH RFC 00/10] perf: user space sframe unwinding
From: Josh Poimboeuf @ 2023-11-09  0:45 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

On Wed, Nov 08, 2023 at 04:41:05PM -0800, Josh Poimboeuf wrote:
> Some distros have started compiling frame pointers into all their
> packages to enable the kernel to do system-wide profiling of user space.
> Unfortunately that creates a runtime performance penalty across the
> entire system.  Using DWARF (or .eh_frame) instead isn't feasible
> because of complexity and slowness.
> 
> For in-kernel unwinding we solved this problem with the creation of the
> ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
> has created the SFrame ("Simple Frame") format starting with binutils
> 2.40.
> 
> These patches add support for unwinding user space from the kernel using
> SFrame with perf.  It should be easy to add user unwinding support for
> other components like ftrace.
> 
> I tested it on Gentoo by recompiling everything with -Wa,-gsframe and
> using a custom glibc patch (which I'll send in a reply to this email).

Here's my glibc patch:

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 2923b1141d..333d7c39fd 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -29,6 +29,7 @@
 #include <bits/wordsize.h>
 #include <sys/mman.h>
 #include <sys/param.h>
+#include <sys/prctl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <gnu/lib-names.h>
@@ -88,6 +89,10 @@ struct filebuf
 
 #define STRING(x) __STRING (x)
 
+#ifndef PT_GNU_SFRAME
+#define PT_GNU_SFRAME 0x6474e554
+#endif
+
 
 int __stack_prot attribute_hidden attribute_relro
 #if _STACK_GROWS_DOWN && defined PROT_GROWSDOWN
@@ -1213,6 +1218,10 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	  l->l_relro_addr = ph->p_vaddr;
 	  l->l_relro_size = ph->p_memsz;
 	  break;
+
+	case PT_GNU_SFRAME:
+	  l->l_sframe_addr = ph->p_vaddr;
+	  break;
 	}
 
     if (__glibc_unlikely (nloadcmds == 0))
@@ -1263,6 +1272,8 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	l->l_map_start = l->l_map_end = 0;
 	goto lose;
       }
+
+
   }
 
   if (l->l_ld != 0)
@@ -1376,6 +1387,13 @@ cannot enable executable stack as shared object requires");
 	break;
       }
 
+#define PR_ADD_SFRAME 71
+  if (l->l_sframe_addr != 0)
+  {
+    l->l_sframe_addr += l->l_addr;
+    __prctl(PR_ADD_SFRAME, l->l_sframe_addr, NULL, NULL, NULL);
+  }
+
   /* We are done mapping in the file.  We no longer need the descriptor.  */
   if (__glibc_unlikely (__close_nocancel (fd) != 0))
     {
diff --git a/include/link.h b/include/link.h
index c6af095d87..36ac75680f 100644
--- a/include/link.h
+++ b/include/link.h
@@ -348,6 +348,8 @@ struct link_map
     ElfW(Addr) l_relro_addr;
     size_t l_relro_size;
 
+    ElfW(Addr) l_sframe_addr;
+
     unsigned long long int l_serial;
   };
 

^ permalink raw reply related

* [PATCH RFC 10/10] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Binutils 2.40 supports generating sframe for x86_64.  It works well in
testing so enable it.

NOTE: An out-of-tree glibc patch is still needed to enable setting
PR_ADD_SFRAME for shared libraries and dlopens.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
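
For readers following the NOTE above, here is a minimal sketch of the
address computation a loader would need before registering a shared
object's SFrame data.  It mirrors the logic of the out-of-tree glibc
patch in this thread (record p_vaddr of the PT_GNU_SFRAME segment, then
add the load bias).  find_sframe_addr() is a hypothetical helper name,
and the sketch works on an in-memory Elf64_Phdr table so it does not
depend on a kernel with this RFC applied.

```c
#include <elf.h>
#include <stddef.h>

/* Same value as in the RFC: PT_LOOS + 0x474e554. */
#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554
#endif

/*
 * Scan a program header table for PT_GNU_SFRAME.  On a match, store the
 * runtime address (load bias + link-time vaddr) in *sframe_addr and
 * return 1; return 0 if the object has no such segment.
 */
static int find_sframe_addr(const Elf64_Phdr *phdr, size_t phnum,
			    Elf64_Addr load_bias, Elf64_Addr *sframe_addr)
{
	for (size_t i = 0; i < phnum; i++) {
		if (phdr[i].p_type != PT_GNU_SFRAME)
			continue;
		*sframe_addr = load_bias + phdr[i].p_vaddr;
		return 1;
	}
	return 0;
}
```

With this series applied, the resulting address is what the glibc patch
passes as the second argument of prctl(PR_ADD_SFRAME, ...).  That prctl
option is defined only by this RFC, so it should not be issued on other
kernels.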
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 95939cd54dfe..770d0528e4c9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -279,6 +279,7 @@ config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_USER_UNWIND
+	select HAVE_USER_UNWIND_SFRAME		if X86_64
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_PARALLEL			if SMP && X86_64
 	select HOTPLUG_SMT			if SMP
-- 
2.41.0


^ permalink raw reply related

* [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF instead isn't feasible due to complexity and
slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
has created the SFrame format starting with binutils 2.40.  SFrame is a
simpler version of .eh_frame which gets placed in the .sframe section.

Add support for unwinding user space using SFrame.

More information about SFrame can be found here:

  - https://lwn.net/Articles/932209/
  - https://lwn.net/Articles/940686/
  - https://sourceware.org/binutils/docs/sframe-spec.html

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
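
As a small illustration of the header checks performed before a section
is accepted, here is a user-space sketch of the preamble validation.
The constants and the struct layout are copied from kernel/unwind/sframe.h
in this patch; check_preamble() is a hypothetical name, and the sketch
covers only magic, version and the sorted flag, not the remaining header
fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SFRAME_MAGIC		0xdee2
#define SFRAME_VERSION_1	1
#define SFRAME_F_FDE_SORTED	0x1

struct sframe_preamble {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
} __attribute__((packed));

/*
 * Return 0 if 'buf' starts with a preamble this sketch accepts
 * (SFrame v1 with sorted FDEs), -1 otherwise.
 */
static int check_preamble(const void *buf, size_t len)
{
	struct sframe_preamble p;

	if (len < sizeof(p))
		return -1;
	memcpy(&p, buf, sizeof(p));

	if (p.magic != SFRAME_MAGIC)
		return -1;
	if (p.version != SFRAME_VERSION_1)
		return -1;
	/* Parenthesised as !(flags & X); (!flags & X) only rejects flags == 0. */
	if (!(p.flags & SFRAME_F_FDE_SORTED))
		return -1;

	return 0;
}
```

A binary search over the FDE table only makes sense when the sorted flag
is set, which is why an unsorted section is rejected up front.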
 arch/Kconfig                |   3 +
 arch/x86/include/asm/mmu.h  |   2 +-
 fs/binfmt_elf.c             |  46 +++-
 include/linux/mm_types.h    |   3 +
 include/linux/sframe.h      |  46 ++++
 include/linux/user_unwind.h |   1 +
 include/uapi/linux/elf.h    |   1 +
 include/uapi/linux/prctl.h  |   3 +
 kernel/fork.c               |  10 +
 kernel/sys.c                |  11 +
 kernel/unwind/Makefile      |   1 +
 kernel/unwind/sframe.c      | 414 ++++++++++++++++++++++++++++++++++++
 kernel/unwind/sframe.h      | 217 +++++++++++++++++++
 kernel/unwind/user.c        |  15 +-
 mm/init-mm.c                |   2 +
 15 files changed, 768 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/sframe.h
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h

diff --git a/arch/Kconfig b/arch/Kconfig
index c4a08485835e..b133b03102c7 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -431,6 +431,9 @@ config HAVE_PERF_CALLCHAIN_DEFERRED
 config HAVE_USER_UNWIND
 	bool
 
+config HAVE_USER_UNWIND_SFRAME
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 0da5c227f490..9cf9cae8345f 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -73,7 +73,7 @@ typedef struct {
 	.context = {							\
 		.ctx_id = 1,						\
 		.lock = __MUTEX_INITIALIZER(mm.context.lock),		\
-	}
+	},
 
 void leave_mm(int cpu);
 #define leave_mm leave_mm
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5397b552fbeb..bca207844a70 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
 #include <linux/dax.h>
 #include <linux/uaccess.h>
 #include <linux/rseq.h>
+#include <linux/sframe.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
@@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 		unsigned long no_base, struct elf_phdr *interp_elf_phdata,
 		struct arch_elf_state *arch_state)
 {
-	struct elf_phdr *eppnt;
+	struct elf_phdr *eppnt, *sframe_phdr = NULL;
 	unsigned long load_addr = 0;
 	int load_addr_set = 0;
 	unsigned long error = ~0UL;
 	unsigned long total_size;
+	unsigned long start_code = ~0UL;
+	unsigned long end_code = 0;
 	int i;
 
 	/* First of all, some simple consistency checks */
@@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 
 	eppnt = interp_elf_phdata;
 	for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
-		if (eppnt->p_type == PT_LOAD) {
+		switch (eppnt->p_type) {
+		case PT_LOAD: {
 			int elf_type = MAP_PRIVATE;
 			int elf_prot = make_prot(eppnt->p_flags, arch_state,
 						 true, true);
@@ -698,7 +702,29 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				error = -ENOMEM;
 				goto out;
 			}
+
+			if ((eppnt->p_flags & PF_X) && k < start_code)
+				start_code = k;
+
+			k = load_addr + eppnt->p_vaddr + eppnt->p_filesz;
+			if ((eppnt->p_flags & PF_X) && k > end_code)
+				end_code = k;
+			break;
 		}
+		case PT_GNU_SFRAME:
+			sframe_phdr = eppnt;
+			break;
+		}
+	}
+
+	if (sframe_phdr) {
+		struct sframe_file sfile = {
+			.sframe_addr	= load_addr + sframe_phdr->p_vaddr,
+			.text_start	= start_code,
+			.text_end	= end_code,
+		};
+
+		__sframe_add_section(&sfile);
 	}
 
 	error = load_addr;
@@ -823,7 +849,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	int first_pt_load = 1;
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
-	struct elf_phdr *elf_property_phdata = NULL;
+	struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
 	unsigned long elf_brk;
 	int retval, i;
 	unsigned long elf_entry;
@@ -931,6 +957,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				executable_stack = EXSTACK_DISABLE_X;
 			break;
 
+		case PT_GNU_SFRAME:
+			sframe_phdr = elf_ppnt;
+			break;
+
 		case PT_LOPROC ... PT_HIPROC:
 			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
 						  bprm->file, false,
@@ -1279,6 +1309,16 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				MAP_FIXED | MAP_PRIVATE, 0);
 	}
 
+	if (sframe_phdr) {
+		struct sframe_file sfile = {
+			.sframe_addr	= load_bias + sframe_phdr->p_vaddr,
+			.text_start	= start_code,
+			.text_end	= end_code,
+		};
+
+		__sframe_add_section(&sfile);
+	}
+
 	regs = current_pt_regs();
 #ifdef ELF_PLAT_INIT
 	/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 957ce38768b2..7c361a9ccf75 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -974,6 +974,9 @@ struct mm_struct {
 #endif
 		} lru_gen;
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+		struct maple_tree sframe_mt;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..72a2e8625026
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+
+struct sframe_file {
+	unsigned long sframe_addr, text_start, text_end;
+};
+
+struct user_unwind_frame;
+
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+
+extern void sframe_free_mm(struct mm_struct *mm);
+
+extern int __sframe_add_section(struct sframe_file *file);
+extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end);
+extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct user_unwind_frame *frame);
+
+static inline bool sframe_enabled_current(void)
+{
+	struct mm_struct *mm = current->mm;
+
+	return mm && !mtree_empty(&mm->sframe_mt);
+}
+
+#else /* !CONFIG_HAVE_USER_UNWIND_SFRAME */
+
+#define INIT_MM_SFRAME
+
+static inline void sframe_free_mm(struct mm_struct *mm) {}
+
+static inline int __sframe_add_section(struct sframe_file *file) { return -EINVAL; }
+static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
+static inline int sframe_find(unsigned long ip, struct user_unwind_frame *frame) { return -EINVAL; }
+
+static inline bool sframe_enabled_current(void) { return false; }
+
+#endif /* CONFIG_HAVE_USER_UNWIND_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
index 2812b88c95fd..9a5e6e557530 100644
--- a/include/linux/user_unwind.h
+++ b/include/linux/user_unwind.h
@@ -8,6 +8,7 @@
 enum user_unwind_type {
 	USER_UNWIND_TYPE_AUTO,
 	USER_UNWIND_TYPE_FP,
+	USER_UNWIND_TYPE_SFRAME,
 };
 
 struct user_unwind_frame {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 9417309b7230..e3a08ee03fe4 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -39,6 +39,7 @@ typedef __s64	Elf64_Sxword;
 #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
 #define PT_GNU_RELRO	(PT_LOOS + 0x474e552)
 #define PT_GNU_PROPERTY	(PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME	(PT_LOOS + 0x474e554)
 
 
 /* ARM MTE memory tag segment type */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..336277ea9782 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,7 @@ struct prctl_mm_map {
 # define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK	0xc
 # define PR_RISCV_V_VSTATE_CTRL_MASK		0x1f
 
+#define PR_ADD_SFRAME			71
+#define PR_REMOVE_SFRAME		72
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..0ec13004d86c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -99,6 +99,7 @@
 #include <linux/stackprotector.h>
 #include <linux/user_events.h>
 #include <linux/iommu.h>
+#include <linux/sframe.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -924,6 +925,7 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	sframe_free_mm(mm);
 
 	free_mm(mm);
 }
@@ -1254,6 +1256,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+	mt_init(&mm->sframe_mt);
+#endif
+}
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
@@ -1285,6 +1294,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->pmd_huge_pte = NULL;
 #endif
 	mm_init_uprobes_state(mm);
+	mm_init_sframe(mm);
 	hugetlb_count_init(mm);
 
 	if (current->mm) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 420d9cb9cc8e..4f2d6f91814d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/sframe.h>
 
 #include <linux/nospec.h>
 
@@ -2739,6 +2740,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_RISCV_V_GET_CONTROL:
 		error = RISCV_V_GET_CONTROL();
 		break;
+	case PR_ADD_SFRAME:
+		if (arg5)
+			return -EINVAL;
+		error = sframe_add_section(arg2, arg3, arg4);
+		break;
+	case PR_REMOVE_SFRAME:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = sframe_remove_section(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index eb466d6a3295..6f202c5840cf 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
+obj-$(CONFIG_HAVE_USER_UNWIND_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..b167c19497e5
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,414 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/sframe.h>
+#include <linux/user_unwind.h>
+
+#include "sframe.h"
+
+#define SFRAME_FILENAME_LEN 32
+
+struct sframe_section {
+	struct rcu_head rcu;
+
+	unsigned long sframe_addr;
+	unsigned long text_addr;
+
+	unsigned long fdes_addr;
+	unsigned long fres_addr;
+	unsigned int  fdes_num;
+	signed char ra_off, fp_off;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+#define __SFRAME_GET_USER(out, user_ptr, type)				\
+({									\
+	type __tmp;							\
+	if (get_user(__tmp, (type *)user_ptr))				\
+		return -EFAULT;						\
+	user_ptr += sizeof(__tmp);					\
+	out = __tmp;							\
+})
+
+#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
+({									\
+	switch (size) {							\
+	case 1:								\
+		__SFRAME_GET_USER(out, user_ptr, s8);			\
+		break;							\
+	case 2:								\
+		__SFRAME_GET_USER(out, user_ptr, s16);			\
+		break;							\
+	case 4:								\
+		__SFRAME_GET_USER(out, user_ptr, s32);			\
+		break;							\
+	default:							\
+		return -EINVAL;						\
+	}								\
+})
+
+#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
+({									\
+	switch (size) {							\
+	case 1:								\
+		__SFRAME_GET_USER(out, user_ptr, u8);			\
+		break;							\
+	case 2:								\
+		__SFRAME_GET_USER(out, user_ptr, u16);			\
+		break;							\
+	case 4:								\
+		__SFRAME_GET_USER(out, user_ptr, u32);			\
+		break;							\
+	default:							\
+		return -EINVAL;						\
+	}								\
+})
+
+static unsigned char fre_type_to_size(unsigned char fre_type)
+{
+	if (fre_type > 2)
+		return 0;
+	return 1 << fre_type;
+}
+
+static unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+	if (off_size > 2)
+		return 0;
+	return 1 << off_size;
+}
+
+static int find_fde(struct sframe_section *sec, unsigned long ip,
+		    struct sframe_fde *fde)
+{
+	s32 func_off, ip_off;
+	struct sframe_fde __user *first, *last, *mid, *found = NULL;
+
+	ip_off = ip - sec->sframe_addr;
+
+	first = (void *)sec->fdes_addr;
+	last = first + sec->fdes_num - 1;
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+		if (get_user(func_off, (s32 *)mid))
+			return -EFAULT;
+		if (ip_off >= func_off) {
+			found = mid;
+			first = mid + 1;
+		} else
+			last = mid - 1;
+	}
+
+	if (!found)
+		return -EINVAL;
+
+	if (copy_from_user(fde, found, sizeof(*fde)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
+		    unsigned long ip, struct user_unwind_frame *frame)
+{
+	unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+	unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+	s32 fre_ip_off, cfa_off, ra_off, fp_off, ip_off;
+	unsigned char offset_count, offset_size;
+	unsigned char addr_size;
+	void __user *f, *last_f = NULL;
+	u8 fre_info;
+	int i;
+
+	addr_size = fre_type_to_size(fre_type);
+	if (!addr_size)
+		return -EINVAL;
+
+	ip_off = ip - sec->sframe_addr - fde->start_addr;
+
+	f = (void *)sec->fres_addr + fde->fres_off;
+
+	for (i = 0; i < fde->fres_num; i++) {
+
+		SFRAME_GET_USER_UNSIGNED(fre_ip_off, f, addr_size);
+
+		if (fde_type == SFRAME_FDE_TYPE_PCINC) {
+			if (fre_ip_off > ip_off)
+				break;
+		} else {
+			/* SFRAME_FDE_TYPE_PCMASK */
+#if 0 /* sframe v2 */
+			if (ip_off % fde->rep_size < fre_ip_off)
+				break;
+#endif
+		}
+
+		SFRAME_GET_USER_UNSIGNED(fre_info, f, 1);
+
+		offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
+		offset_size  = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
+
+		if (!offset_count || !offset_size)
+			return -EINVAL;
+
+		last_f = f;
+		f += offset_count * offset_size;
+	}
+
+	if (!last_f)
+		return -EINVAL;
+
+	f = last_f;
+
+	SFRAME_GET_USER_UNSIGNED(cfa_off, f, offset_size);
+	offset_count--;
+
+	ra_off = sec->ra_off;
+	if (!ra_off) {
+		if (!offset_count--)
+			return -EINVAL;
+		SFRAME_GET_USER_SIGNED(ra_off, f, offset_size);
+	}
+
+	fp_off = sec->fp_off;
+	if (!fp_off && offset_count) {
+		offset_count--;
+		SFRAME_GET_USER_SIGNED(fp_off, f, offset_size);
+	}
+
+	if (offset_count)
+		return -EINVAL;
+
+	frame->cfa_off = cfa_off;
+	frame->ra_off = ra_off;
+	frame->fp_off = fp_off;
+	frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
+
+	return 0;
+}
+
+int sframe_find(unsigned long ip, struct user_unwind_frame *frame)
+{
+	struct mm_struct *mm = current->mm;
+	struct sframe_section *sec;
+	struct sframe_fde fde;
+	int srcu_idx;
+	int ret = -EINVAL;
+
+	srcu_idx = srcu_read_lock(&sframe_srcu);
+
+	sec = mtree_load(&mm->sframe_mt, ip);
+	if (!sec) {
+		srcu_read_unlock(&sframe_srcu, srcu_idx);
+		return -EINVAL;
+	}
+
+
+	ret = find_fde(sec, ip, &fde);
+	if (ret)
+		goto err_unlock;
+
+	ret = find_fre(sec, &fde, ip, frame);
+	if (ret)
+		goto err_unlock;
+
+	srcu_read_unlock(&sframe_srcu, srcu_idx);
+	return 0;
+
+err_unlock:
+	srcu_read_unlock(&sframe_srcu, srcu_idx);
+	return ret;
+}
+
+static int get_sframe_file(unsigned long sframe_addr, struct sframe_file *file)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *sframe_vma, *text_vma, *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	mmap_read_lock(mm);
+
+	sframe_vma = vma_lookup(mm, sframe_addr);
+	if (!sframe_vma || !sframe_vma->vm_file)
+		goto err_unlock;
+
+	text_vma = NULL;
+
+	for_each_vma(vmi, vma) {
+		if (vma->vm_file != sframe_vma->vm_file)
+			continue;
+		if (vma->vm_flags & VM_EXEC) {
+			if (text_vma) {
+				/*
+				 * Multiple EXEC segments in a single file
+				 * aren't currently supported, is that a thing?
+				 */
+				WARN_ON_ONCE(1);
+				goto err_unlock;
+			}
+			text_vma = vma;
+		}
+	}
+
+	file->sframe_addr	= sframe_addr;
+	file->text_start	= text_vma->vm_start;
+	file->text_end		= text_vma->vm_end;
+
+	mmap_read_unlock(mm);
+	return 0;
+
+err_unlock:
+	mmap_read_unlock(mm);
+	return -EINVAL;
+}
+
+static int validate_sframe_addrs(struct sframe_file *file)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *text_vma;
+
+	mmap_read_lock(mm);
+
+	if (!vma_lookup(mm, file->sframe_addr))
+		goto err_unlock;
+
+	text_vma = vma_lookup(mm, file->text_start);
+	if (!text_vma || !(text_vma->vm_flags & VM_EXEC))
+		goto err_unlock;
+
+	if (vma_lookup(mm, file->text_end-1) != text_vma)
+		goto err_unlock;
+
+	mmap_read_unlock(mm);
+	return 0;
+
+err_unlock:
+	mmap_read_unlock(mm);
+	return -EINVAL;
+}
+
+int __sframe_add_section(struct sframe_file *file)
+{
+	struct maple_tree *sframe_mt = &current->mm->sframe_mt;
+	struct sframe_section *sec;
+	struct sframe_header shdr;
+	unsigned long header_end;
+	int ret;
+
+	if (copy_from_user(&shdr, (void *)file->sframe_addr, sizeof(shdr)))
+		return -EFAULT;
+
+	if (shdr.preamble.magic != SFRAME_MAGIC ||
+	    shdr.preamble.version != SFRAME_VERSION_1 ||
+	    !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
+	    shdr.fdes_off > shdr.fres_off) {
+		return -EINVAL;
+	}
+
+	header_end = file->sframe_addr + SFRAME_HDR_SIZE(shdr);
+
+	sec = kmalloc(sizeof(*sec), GFP_KERNEL);
+	if (!sec)
+		return -ENOMEM;
+
+	sec->sframe_addr	= file->sframe_addr;
+	sec->text_addr		= file->text_start;
+	sec->fdes_addr		= header_end + shdr.fdes_off;
+	sec->fres_addr		= header_end + shdr.fres_off;
+	sec->fdes_num		= shdr.num_fdes;
+	sec->ra_off		= shdr.cfa_fixed_ra_offset;
+	sec->fp_off		= shdr.cfa_fixed_fp_offset;
+
+	ret = mtree_insert_range(sframe_mt, file->text_start, file->text_end,
+				 sec, GFP_KERNEL);
+	if (ret) {
+		kfree(sec);
+		return ret;
+	}
+
+	return 0;
+}
+
+int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end)
+{
+	struct sframe_file file;
+	int ret;
+
+	if (!text_start || !text_end) {
+		ret = get_sframe_file(sframe_addr, &file);
+		if (ret)
+			return ret;
+	} else {
+		/*
+		 * This is mainly for generated code, for which the text isn't
+		 * file-backed so the user has to give the text bounds.
+		 */
+		file.sframe_addr	= sframe_addr;
+		file.text_start		= text_start;
+		file.text_end		= text_end;
+		ret = validate_sframe_addrs(&file);
+		if (ret)
+			return ret;
+	}
+
+	return __sframe_add_section(&file);
+}
+
+static void sframe_free_rcu(struct rcu_head *rcu)
+{
+	struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+	kfree(sec);
+}
+
+static int __sframe_remove_section(struct mm_struct *mm,
+				   struct sframe_section *sec)
+{
+	struct sframe_section *s;
+
+	s = mtree_erase(&mm->sframe_mt, sec->text_addr);
+	if (!s || WARN_ON_ONCE(s != sec))
+		return -EINVAL;
+
+	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_rcu);
+
+	return 0;
+}
+
+int sframe_remove_section(unsigned long sframe_addr)
+{
+	struct mm_struct *mm = current->mm;
+	struct sframe_section *sec;
+	unsigned long index = 0;
+
+	sec = mtree_load(&mm->sframe_mt, sframe_addr);
+	if (!sec)
+		return -EINVAL;
+
+	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+		if (sec->sframe_addr == sframe_addr)
+			return __sframe_remove_section(mm, sec);
+	}
+
+	return -EINVAL;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+	struct sframe_section *sec;
+	unsigned long index = 0;
+
+	if (!mm)
+		return;
+
+	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+		kfree(sec);
+
+	mtree_destroy(&mm->sframe_mt);
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..1f91b696daf5
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,217 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ *
+ * This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs
+ */
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1	1
+#define SFRAME_VERSION_2	2
+#define SFRAME_MAGIC		0xdee2
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED	0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER	0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG	    1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE    2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE	    3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1	0
+#define SFRAME_FRE_TYPE_ADDR2	1
+#define SFRAME_FRE_TYPE_ADDR4	2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions. The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC	0
+/*
+ * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * to look up a matching FRE. Typical use cases are pltN entries, trampolines
+ * etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK	1
+
+/**
+ * struct sframe_preamble - SFrame Preamble.
+ * @magic: Magic number (SFRAME_MAGIC).
+ * @version: Format version number (SFRAME_VERSION).
+ * @flags: Various flags.
+ */
+struct sframe_preamble {
+	u16 magic;
+	u8  version;
+	u8  flags;
+} __packed;
+
+/**
+ * struct sframe_header - SFrame Header.
+ * @preamble: SFrame preamble.
+ * @abi_arch: Identify the arch (including endianness) and ABI.
+ * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
+ *	  fixed for some ABIs (e.g., in AMD64 when -fno-omit-frame-pointer is
+ *	  used). When fixed, this field specifies the fixed stack frame offset
+ *	  and the individual FREs do not need to track it. When not fixed, it
+ *	  is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
+ *	  provide the applicable stack frame offset, if any.
+ * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
+ *	  some ABIs. When fixed, this field specifies the fixed stack frame
+ *	  offset and the individual FREs do not need to track it. When not
+ *	  fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID.
+ * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
+ *	  Some ABI/arch, in the future, may use this space for extending the
+ *	  information in SFrame header. Auxiliary header is contained in bytes
+ *	  sequentially following the sframe_header.
+ * @num_fdes: Number of SFrame FDEs in this SFrame section.
+ * @num_fres: Number of SFrame Frame Row Entries.
+ * @fre_len:  Number of bytes in the SFrame Frame Row Entry section.
+ * @fdes_off: Offset of SFrame Function Descriptor Entry section.
+ * @fres_off: Offset of SFrame Frame Row Entry section.
+ */
+struct sframe_header {
+	struct sframe_preamble preamble;
+	u8  abi_arch;
+	s8  cfa_fixed_fp_offset;
+	s8  cfa_fixed_ra_offset;
+	u8  auxhdr_len;
+	u32 num_fdes;
+	u32 num_fres;
+	u32 fre_len;
+	u32 fdes_off;
+	u32 fres_off;
+} __packed;
+
+#define SFRAME_HDR_SIZE(sframe_hdr)	\
+	((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
+
+/**
+ * struct sframe_fde - SFrame Function Descriptor Entry.
+ * @start_addr: Function start address. Encoded as a signed offset,
+ *	  relative to the current FDE.
+ * @size: Size of the function in bytes.
+ * @fres_off: Offset of the first SFrame Frame Row Entry of the function,
+ *	  relative to the beginning of the SFrame Frame Row Entry sub-section.
+ * @fres_num: Number of frame row entries for the function.
+ * @info: Additional information for deciphering the stack trace
+ *	  information for the function. Contains information about SFrame FRE
+ *	  type, SFrame FDE type, PAC authorization A/B key, etc.
+ * @rep_size: Block size for SFRAME_FDE_TYPE_PCMASK
+ * @padding: Unused
+ */
+struct sframe_fde {
+	s32 start_addr;
+	u32 size;
+	u32 fres_off;
+	u32 fres_num;
+	u8  info;
+#if 0 /* TODO sframe v2 */
+	u8  rep_size;
+	u16 padding;
+#endif
+} __packed;
+
+/*
+ * 'func_info' in SFrame FDE contains additional information for deciphering
+ * the stack trace information for the function. In V1, the information is
+ * organized as follows:
+ *   - 4-bits: Identify the FRE type used for the function.
+ *   - 1-bit: Identify the FDE type of the function - mask or inc.
+ *   - 1-bit: PAC authorization A/B key (aarch64).
+ *   - 2-bits: Unused.
+ * ---------------------------------------------------------------------
+ * |  Unused  |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
+ * |          |        Unused (amd64)       |           |              |
+ * ---------------------------------------------------------------------
+ * 8          6                             5           4              0
+ */
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
+#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
+	(((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+	 (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry. A single
+ * SFrame FRE has all offsets of the same size. Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B	  0
+#define SFRAME_FRE_OFFSET_2B	  1
+#define SFRAME_FRE_OFFSET_4B	  2
+
+/* An SFrame Frame Row Entry can be SP or FP based.  */
+#define SFRAME_BASE_REG_FP	0
+#define SFRAME_BASE_REG_SP	1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX   0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX    1
+/*
+ * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
+ * may or may not be tracked.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX    2
+
+/*
+ * 'fre_info' in SFrame FRE contains information about:
+ *   - 1 bit: base reg for CFA
+ *   - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
+ *   all three of CFA, FP and RA (fixed implicit order).
+ *   - 2 bits: information about size of the offsets (S) in bytes.
+ *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+ *     SFRAME_FRE_OFFSET_4B
+ *   - 1 bit: Mangled RA state bit (aarch64 only).
+ * ---------------------------------------------------------------
+ * | Mangled-RA (aarch64) |  Size of   |   Number of  | base_reg |
+ * |  Unused (amd64)      |  offsets   |    offsets   |          |
+ * ---------------------------------------------------------------
+ * 8                      7             5             1          0
+ */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
+	(((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+	 (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+	((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data)		  (((data) >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data)		  (((data) >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 8f9432306482..4194180df154 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -26,6 +26,11 @@ int user_unwind_next(struct user_unwind_state *state)
 	case USER_UNWIND_TYPE_FP:
 		frame = &fp_frame;
 		break;
+	case USER_UNWIND_TYPE_SFRAME:
+		ret = sframe_find(state->ip, frame);
+		if (ret)
+			goto the_end;
+		break;
 	default:
 		BUG();
 	}
@@ -64,10 +69,14 @@ int user_unwind_start(struct user_unwind_state *state,
 		return -EINVAL;
 	}
 
-	if (type == USER_UNWIND_TYPE_AUTO)
-		state->type = USER_UNWIND_TYPE_FP;
-	else
+	if (type == USER_UNWIND_TYPE_AUTO) {
+		state->type = sframe_enabled_current() ? USER_UNWIND_TYPE_SFRAME
+						       : USER_UNWIND_TYPE_FP;
+	} else {
+		if (type == USER_UNWIND_TYPE_SFRAME && !sframe_enabled_current())
+			return -EINVAL;
 		state->type = type;
+	}
 
 	state->sp = user_stack_pointer(regs);
 	state->ip = instruction_pointer(regs);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index cfd367822cdd..288885a39e12 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,6 +11,7 @@
 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
 #include <linux/iommu.h>
+#include <linux/sframe.h>
 #include <asm/mmu.h>
 
 #ifndef INIT_MM_CONTEXT
@@ -48,6 +49,7 @@ struct mm_struct init_mm = {
 	.pasid		= IOMMU_PASID_INVALID,
 #endif
 	INIT_MM_CONTEXT(init_mm)
+	INIT_MM_SFRAME
 };
 
 void setup_initial_init_mm(void *start_code, void *end_code,
-- 
2.41.0
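
As a reading aid for the sframe.h hunk above, here is a small user-space
sketch of how the two packed info bytes round-trip through the encode and
decode macros. The macros are copied from the patch; fre_offsets_len() is a
hypothetical helper that is not part of the series and only assumes that
the offset-size field encodes 1, 2 or 4 bytes as 0, 1 or 2.

```c
#include <stdint.h>

/* Constants and macros copied from the kernel/unwind/sframe.h hunk. */
#define SFRAME_AARCH64_PAUTH_KEY_A	0
#define SFRAME_FRE_TYPE_ADDR2		1
#define SFRAME_FDE_TYPE_PCMASK		1
#define SFRAME_FRE_OFFSET_2B		1
#define SFRAME_BASE_REG_SP		1

#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
	(((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
	 (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
#define SFRAME_FUNC_FRE_TYPE(data)	((data) & 0xf)
#define SFRAME_FUNC_FDE_TYPE(data)	(((data) >> 4) & 0x1)
#define SFRAME_FUNC_PAUTH_KEY(data)	(((data) >> 5) & 0x1)

#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
	(((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
	 (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
	((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
#define SFRAME_FRE_CFA_BASE_REG_ID(data)	((data) & 0x1)
#define SFRAME_FRE_OFFSET_COUNT(data)		(((data) >> 1) & 0xf)
#define SFRAME_FRE_OFFSET_SIZE(data)		(((data) >> 5) & 0x3)
#define SFRAME_FRE_MANGLED_RA_P(data)		(((data) >> 7) & 0x1)

/*
 * Hypothetical helper (not in the patch): number of bytes taken by the
 * variable-length offsets that follow the info byte of one FRE.  Each
 * offset is (1 << size-code) bytes wide.
 */
unsigned int fre_offsets_len(uint8_t fre_info)
{
	return SFRAME_FRE_OFFSET_COUNT(fre_info) <<
	       SFRAME_FRE_OFFSET_SIZE(fre_info);
}
```

With these definitions, SFRAME_FUNC_INFO(SFRAME_FDE_TYPE_PCMASK,
SFRAME_FRE_TYPE_ADDR2) evaluates to 0x11, and
SFRAME_FRE_INFO(SFRAME_BASE_REG_SP, 3, SFRAME_FRE_OFFSET_2B) evaluates to
0x27, for which fre_offsets_len() gives 6 bytes.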



* [PATCH RFC 08/10] perf/x86: Use user_unwind interface
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Simplify __perf_callchain_user() and prepare for sframe user space
unwinding by switching to the generic user unwind interface.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/events/core.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ae264437f794..5c41a11f058f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -29,6 +29,7 @@
 #include <linux/device.h>
 #include <linux/nospec.h>
 #include <linux/static_call.h>
+#include <linux/user_unwind.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
@@ -2856,8 +2857,7 @@ static inline int __perf_callchain_user32(struct pt_regs *regs,
 void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 			   struct pt_regs *regs, bool atomic)
 {
-	struct stack_frame frame;
-	const struct stack_frame __user *fp;
+	struct user_unwind_state state;
 
 	if (perf_guest_state()) {
 		/* TODO: We don't support guest os callchain now */
@@ -2870,8 +2870,6 @@ void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM))
 		return;
 
-	fp = (void __user *)regs->bp;
-
 	perf_callchain_store(entry, regs->ip);
 
 	if (atomic && !nmi_uaccess_okay())
@@ -2883,17 +2881,9 @@ void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	if (__perf_callchain_user32(regs, entry))
 		goto done;
 
-	while (entry->nr < entry->max_stack) {
-		if (!valid_user_frame(fp, sizeof(frame)))
-			break;
-
-		if (__get_user(frame.next_frame, &fp->next_frame))
-			break;
-		if (__get_user(frame.return_address, &fp->return_address))
-			break;
-
-		perf_callchain_store(entry, frame.return_address);
-		fp = (void __user *)frame.next_frame;
+	for_each_user_frame(state, USER_UNWIND_TYPE_AUTO) {
+		if (perf_callchain_store(entry, state.ip))
+			goto done;
 	}
 done:
 	if (atomic)
-- 
2.41.0



* [PATCH RFC 07/10] unwind/x86: Add HAVE_USER_UNWIND
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable HAVE_USER_UNWIND accordingly so the user unwind
interfaces can be used.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/user_unwind.h | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 arch/x86/include/asm/user_unwind.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cacf11ac4b10..95939cd54dfe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -278,6 +278,7 @@ config X86
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
+	select HAVE_USER_UNWIND
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_PARALLEL			if SMP && X86_64
 	select HOTPLUG_SMT			if SMP
diff --git a/arch/x86/include/asm/user_unwind.h b/arch/x86/include/asm/user_unwind.h
new file mode 100644
index 000000000000..caa6266abbb4
--- /dev/null
+++ b/arch/x86/include/asm/user_unwind.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_USER_UNWIND_H
+#define _ASM_X86_USER_UNWIND_H
+
+#define ARCH_INIT_USER_FP_FRAME						\
+	.ra_off		= sizeof(long) * -1,				\
+	.cfa_off	= sizeof(long) * 2,				\
+	.fp_off		= sizeof(long) * -2,				\
+	.use_fp		= true,
+
+#endif /* _ASM_X86_USER_UNWIND_H */
-- 
2.41.0
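
To make the ARCH_INIT_USER_FP_FRAME offsets in the patch above concrete,
the following user-space sketch replays the CFA/RA/FP arithmetic of
user_unwind_next() against a fake stack. Everything here is illustrative:
struct fake_mem, read_word(), unwind_step() and walk_frames() are
hypothetical names, get_user() is replaced by an array lookup, and the
offsets assume sizeof(long) == 8. Unlike the kernel code, the sketch
always reads both RA and FP rather than skipping zero offsets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct user_unwind_frame from the series. */
struct frame_desc {
	int32_t cfa_off;
	int32_t ra_off;
	int32_t fp_off;
	bool use_fp;
};

/* ARCH_INIT_USER_FP_FRAME with sizeof(long) == 8 (assumption). */
const struct frame_desc x86_64_fp_frame = {
	.ra_off = -8, .cfa_off = 16, .fp_off = -16, .use_fp = true,
};

/* Hypothetical stand-in for user memory: words[0] lives at 'base'. */
struct fake_mem {
	uint64_t base;
	const uint64_t *words;
	size_t nr_words;
};

struct unwind_state {
	uint64_t ip, sp, fp;
};

/* Stand-in for get_user(): fails on unaligned or out-of-range reads. */
bool read_word(const struct fake_mem *mem, uint64_t addr, uint64_t *val)
{
	size_t idx;

	if (addr < mem->base || (addr - mem->base) % 8)
		return false;
	idx = (addr - mem->base) / 8;
	if (idx >= mem->nr_words)
		return false;
	*val = mem->words[idx];
	return true;
}

/* One unwind step: compute the CFA, then load RA and FP relative to it. */
bool unwind_step(const struct fake_mem *mem, const struct frame_desc *frame,
		 struct unwind_state *state)
{
	uint64_t cfa, ra, fp;

	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
	if (!read_word(mem, cfa + frame->ra_off, &ra))
		return false;
	if (!read_word(mem, cfa + frame->fp_off, &fp))
		return false;

	state->sp = cfa;
	state->ip = ra;
	state->fp = fp;
	return true;
}

/* Collect return addresses until a read fails or 'max' is reached. */
size_t walk_frames(const struct fake_mem *mem, const struct frame_desc *frame,
		   struct unwind_state *state, uint64_t *out, size_t max)
{
	size_t n = 0;

	while (n < max && unwind_step(mem, frame, state))
		out[n++] = state->ip;
	return n;
}
```

With a frame pointer of 0x1010, the saved FP is read from 0x1010 and the
return address from 0x1018, which matches the classic push %rbp; mov
%rsp,%rbp prologue layout. The walk ends when a saved FP of 0 leads to a
read outside the fake stack.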



* [PATCH RFC 06/10] unwind: Introduce generic user space unwinding interfaces
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Introduce generic user space unwinder interfaces which will provide a
unified way for architectures to unwind different user space stack frame
types.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                |  3 ++
 include/linux/user_unwind.h | 32 +++++++++++++++
 kernel/Makefile             |  1 +
 kernel/unwind/Makefile      |  1 +
 kernel/unwind/user.c        | 77 +++++++++++++++++++++++++++++++++++++
 5 files changed, 114 insertions(+)
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 690c82212224..c4a08485835e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -428,6 +428,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 config HAVE_PERF_CALLCHAIN_DEFERRED
 	bool
 
+config HAVE_USER_UNWIND
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
new file mode 100644
index 000000000000..2812b88c95fd
--- /dev/null
+++ b/include/linux/user_unwind.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_USER_UNWIND_H
+#define _LINUX_USER_UNWIND_H
+
+#include <linux/types.h>
+#include <linux/sframe.h>
+
+enum user_unwind_type {
+	USER_UNWIND_TYPE_AUTO,
+	USER_UNWIND_TYPE_FP,
+};
+
+struct user_unwind_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct user_unwind_state {
+	unsigned long ip, sp, fp;
+	enum user_unwind_type type;
+	bool done;
+};
+
+extern int user_unwind_start(struct user_unwind_state *state, enum user_unwind_type);
+extern int user_unwind_next(struct user_unwind_state *state);
+
+#define for_each_user_frame(state, type) \
+	for (user_unwind_start(&state, type); !state.done; user_unwind_next(&state))
+
+#endif /* _LINUX_USER_UNWIND_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..bddf58b3b496 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..eb466d6a3295
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..8f9432306482
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/user_unwind.h>
+#include <linux/sframe.h>
+#include <linux/uaccess.h>
+#include <asm/user_unwind.h>
+
+static struct user_unwind_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+int user_unwind_next(struct user_unwind_state *state)
+{
+	struct user_unwind_frame _frame;
+	struct user_unwind_frame *frame = &_frame;
+	unsigned long cfa, fp, ra;
+	int ret = -EINVAL;
+
+	if (state->done)
+		return -EINVAL;
+
+	switch (state->type) {
+	case USER_UNWIND_TYPE_FP:
+		frame = &fp_frame;
+		break;
+	default:
+		BUG();
+	}
+
+	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+	if (frame->ra_off && get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+		goto the_end;
+
+	if (frame->fp_off && get_user(fp, (unsigned long *)(cfa + frame->fp_off)))
+		goto the_end;
+
+	state->sp = cfa;
+	state->ip = ra;
+	if (frame->fp_off)
+		state->fp = fp;
+
+	return 0;
+
+the_end:
+	state->done = true;
+	return ret;
+}
+
+int user_unwind_start(struct user_unwind_state *state,
+		      enum user_unwind_type type)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	might_sleep();
+
+	memset(state, 0, sizeof(*state));
+
+	if (!current->mm) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	if (type == USER_UNWIND_TYPE_AUTO)
+		state->type = USER_UNWIND_TYPE_FP;
+	else
+		state->type = type;
+
+	state->sp = user_stack_pointer(regs);
+	state->ip = instruction_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return user_unwind_next(state);
+}
-- 
2.41.0



* [PATCH RFC 05/10] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Enable deferred user space unwinding on x86.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig       |  1 +
 arch/x86/events/core.c | 47 ++++++++++++++++++++++++++++--------------
 2 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3762f41bb092..cacf11ac4b10 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -256,6 +256,7 @@ config X86
 	select HAVE_PERF_EVENTS_NMI
 	select HAVE_HARDLOCKUP_DETECTOR_PERF	if PERF_EVENTS && HAVE_PERF_EVENTS_NMI
 	select HAVE_PCI
+	select HAVE_PERF_CALLCHAIN_DEFERRED
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select MMU_GATHER_RCU_TABLE_FREE	if PARAVIRT
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 40ad1425ffa2..ae264437f794 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2816,8 +2816,8 @@ static unsigned long get_segment_base(unsigned int segment)
 
 #include <linux/compat.h>
 
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *entry)
+static inline int __perf_callchain_user32(struct pt_regs *regs,
+					  struct perf_callchain_entry_ctx *entry)
 {
 	/* 32-bit process in 64-bit kernel. */
 	unsigned long ss_base, cs_base;
@@ -2831,7 +2831,6 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 	ss_base = get_segment_base(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
-	pagefault_disable();
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
@@ -2844,19 +2843,18 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
-	pagefault_enable();
 	return 1;
 }
-#else
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *entry)
+#else /* !CONFIG_IA32_EMULATION */
+static inline int __perf_callchain_user32(struct pt_regs *regs,
+					  struct perf_callchain_entry_ctx *entry)
 {
-    return 0;
+	return 0;
 }
-#endif
+#endif /* CONFIG_IA32_EMULATION */
 
-void
-perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
+void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+			   struct pt_regs *regs, bool atomic)
 {
 	struct stack_frame frame;
 	const struct stack_frame __user *fp;
@@ -2876,13 +2874,15 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 
 	perf_callchain_store(entry, regs->ip);
 
-	if (!nmi_uaccess_okay())
+	if (atomic && !nmi_uaccess_okay())
 		return;
 
-	if (perf_callchain_user32(regs, entry))
-		return;
+	if (atomic)
+		pagefault_disable();
+
+	if (__perf_callchain_user32(regs, entry))
+		goto done;
 
-	pagefault_disable();
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
@@ -2895,7 +2895,22 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		perf_callchain_store(entry, frame.return_address);
 		fp = (void __user *)frame.next_frame;
 	}
-	pagefault_enable();
+done:
+	if (atomic)
+		pagefault_enable();
+}
+
+
+void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+			 struct pt_regs *regs)
+{
+	__perf_callchain_user(entry, regs, true);
+}
+
+void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry,
+				  struct pt_regs *regs)
+{
+	__perf_callchain_user(entry, regs, false);
 }
 
 /*
-- 
2.41.0



* [PATCH RFC 04/10] perf: Introduce deferred user callchains
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Instead of attempting to unwind user space from the NMI handler, defer
it to run in task context by sending a self-IPI and then scheduling the
unwind to run in the IRQ's exit task work before returning to user space.

This allows the user stack page to be paged in if needed, avoids
duplicate unwinds for kernel-bound workloads, and prepares for SFrame
unwinding (so .sframe sections can be paged in on demand).

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                    |  3 ++
 include/linux/perf_event.h      | 22 ++++++--
 include/uapi/linux/perf_event.h |  1 +
 kernel/bpf/stackmap.c           |  5 +-
 kernel/events/callchain.c       |  7 ++-
 kernel/events/core.c            | 90 ++++++++++++++++++++++++++++++---
 6 files changed, 115 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..690c82212224 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -425,6 +425,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config HAVE_PERF_CALLCHAIN_DEFERRED
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d8fa253b9df..2f232111dff2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -786,6 +786,7 @@ struct perf_event {
 	struct irq_work			pending_irq;
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
+	unsigned int			pending_unwind;
 
 	atomic_t			event_limit;
 
@@ -1113,7 +1114,10 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
-extern struct perf_callchain_entry *perf_callchain(struct perf_event *event, struct pt_regs *regs);
+extern void perf_callchain(struct perf_sample_data *data,
+			   struct perf_event *event, struct pt_regs *regs);
+extern void perf_callchain_deferred(struct perf_sample_data *data,
+				    struct perf_event *event, struct pt_regs *regs);
 
 static inline bool branch_sample_no_flags(const struct perf_event *event)
 {
@@ -1189,6 +1193,7 @@ struct perf_sample_data {
 	u64				data_page_size;
 	u64				code_page_size;
 	u64				aux_size;
+	bool				deferred;
 } ____cacheline_aligned;
 
 /* default value for data source */
@@ -1206,6 +1211,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->sample_flags = PERF_SAMPLE_PERIOD;
 	data->period = period;
 	data->dyn_size = 0;
+	data->deferred = false;
 
 	if (addr) {
 		data->addr = addr;
@@ -1219,7 +1225,11 @@ static inline void perf_sample_save_callchain(struct perf_sample_data *data,
 {
 	int size = 1;
 
-	data->callchain = perf_callchain(event, regs);
+	if (data->deferred)
+		perf_callchain_deferred(data, event, regs);
+	else
+		perf_callchain(data, event, regs);
+
 	size += data->callchain->nr;
 
 	data->dyn_size += size * sizeof(u64);
@@ -1534,12 +1544,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool add_mark);
+		   u32 max_stack, bool add_mark, bool defer_user);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
 extern void put_callchain_entry(int rctx);
 
+#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
+extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+#else
+static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
+#endif
+
 extern int sysctl_perf_event_max_stack;
 extern int sysctl_perf_event_max_contexts_per_stack;
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 39c6a250dd1b..9a1127af4cda 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1237,6 +1237,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV			= (__u64)-32,
 	PERF_CONTEXT_KERNEL		= (__u64)-128,
 	PERF_CONTEXT_USER		= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED	= (__u64)-640,
 
 	PERF_CONTEXT_GUEST		= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL	= (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index e4827ca5378d..fcdd26715b12 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth, false);
-
+	trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
@@ -420,7 +419,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   false);
+					   false, false);
 	if (unlikely(!trace))
 		goto err_fault;
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 2bee8b6fda0e..16571c8d6771 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool add_mark)
+		   u32 max_stack, bool add_mark, bool defer_user)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -207,6 +207,11 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_user) {
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5e41a3b70bcd..290e06b0071c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6751,6 +6751,12 @@ static void perf_pending_irq(struct irq_work *entry)
 	struct perf_event *event = container_of(entry, struct perf_event, pending_irq);
 	int rctx;
 
+	if (!is_software_event(event)) {
+		if (event->pending_unwind)
+			task_work_add(current, &event->pending_task, TWA_RESUME);
+		return;
+	}
+
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -6772,11 +6778,57 @@ static void perf_pending_irq(struct irq_work *entry)
 		perf_swevent_put_recursion_context(rctx);
 }
 
+static void perf_pending_task_unwind(struct perf_event *event)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	struct perf_output_handle handle;
+	struct perf_event_header header;
+	struct perf_sample_data data;
+	struct perf_callchain_entry *callchain;
+
+	callchain = kmalloc(sizeof(struct perf_callchain_entry) +
+			    (sizeof(__u64) * event->attr.sample_max_stack) +
+			    (sizeof(__u64) * 1) /* one context */,
+			    GFP_KERNEL);
+	if (!callchain)
+		return;
+
+	callchain->nr = 0;
+	data.callchain = callchain;
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	data.deferred = true;
+
+	perf_prepare_sample(&data, event, regs);
+
+	perf_prepare_header(&header, &data, event, regs);
+
+	if (perf_output_begin(&handle, &data, event, header.size))
+		goto done;
+
+	perf_output_sample(&handle, &header, &data, event);
+
+	perf_output_end(&handle);
+
+done:
+	kfree(callchain);
+}
+
+
 static void perf_pending_task(struct callback_head *head)
 {
 	struct perf_event *event = container_of(head, struct perf_event, pending_task);
 	int rctx;
 
+	if (!is_software_event(event)) {
+		if (event->pending_unwind) {
+			perf_pending_task_unwind(event);
+			event->pending_unwind = 0;
+		}
+		return;
+	}
+
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -7587,22 +7639,48 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
-struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+void perf_callchain(struct perf_sample_data *data, struct perf_event *event,
+		    struct pt_regs *regs)
 {
 	bool kernel = !event->attr.exclude_callchain_kernel;
 	bool user   = !event->attr.exclude_callchain_user;
 	const u32 max_stack = event->attr.sample_max_stack;
-	struct perf_callchain_entry *callchain;
+	bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED);
 
 	/* Disallow cross-task user callchains. */
 	user &= !event->ctx->task || event->ctx->task == current;
 
 	if (!kernel && !user)
-		return &__empty_callchain;
+		goto empty;
 
-	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
-	return callchain ?: &__empty_callchain;
+	data->callchain = get_perf_callchain(regs, kernel, user, max_stack, true, defer_user);
+	if (!data->callchain)
+		goto empty;
+
+	if (user && defer_user && !event->pending_unwind) {
+		event->pending_unwind = 1;
+		irq_work_queue(&event->pending_irq);
+	}
+
+	return;
+
+empty:
+	data->callchain = &__empty_callchain;
+}
+
+void perf_callchain_deferred(struct perf_sample_data *data,
+			     struct perf_event *event, struct pt_regs *regs)
+{
+	struct perf_callchain_entry_ctx ctx;
+
+	ctx.entry		= data->callchain;
+	ctx.max_stack		= event->attr.sample_max_stack;
+	ctx.nr			= 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
+
+	perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+	perf_callchain_user_deferred(&ctx, regs);
 }
 
 static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
-- 
2.41.0


^ permalink raw reply related

* [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never return NULL, so the 'if (regs)' check can be dropped.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/events/callchain.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index aa5f9d11c28d..2bee8b6fda0e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -202,20 +202,18 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 
 	if (user) {
 		if (!user_mode(regs)) {
-			if  (current->mm)
-				regs = task_pt_regs(current);
-			else
-				regs = NULL;
+			if (!current->mm)
+				goto exit_put;
+			regs = task_pt_regs(current);
 		}
 
-		if (regs) {
-			if (add_mark)
-				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+		if (add_mark)
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
-			perf_callchain_user(&ctx, regs);
-		}
+		perf_callchain_user(&ctx, regs);
 	}
 
+exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
-- 
2.41.0
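
For readers following the control flow in the patch above, here is a
minimal user-space sketch comparing the old and new shapes of the user
branch.  All toy_* names are hypothetical stand-ins, not kernel types,
and the model bakes in the commit message's claim that task_pt_regs()
never returns NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are not the kernel's structures. */
struct toy_regs { bool user_mode; };

struct toy_task {
	bool has_mm;               /* models current->mm != NULL */
	struct toy_regs user_regs; /* models task_pt_regs(current), never NULL */
};

/* Pre-patch shape: regs may become NULL, so the unwind is guarded. */
static bool old_unwinds_user(const struct toy_task *t,
			     const struct toy_regs *regs)
{
	if (!regs->user_mode)
		regs = t->has_mm ? &t->user_regs : NULL;

	return regs != NULL;	/* true means perf_callchain_user() would run */
}

/* Post-patch shape: bail out early for tasks without an mm. */
static bool new_unwinds_user(const struct toy_task *t,
			     const struct toy_regs *regs)
{
	if (!regs->user_mode) {
		if (!t->has_mm)
			return false;	/* models the 'goto exit_put' path */
		regs = &t->user_regs;
	}

	(void)regs;		/* regs is known to be non-NULL here */
	return true;
}
```

Under that assumption both functions agree for every combination of
user_mode and has_mm, which is the property the simplification relies
on.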


^ permalink raw reply related

* [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
make much sense to have 'crosstask' as an argument.  Instead, have
perf_callchain() adjust 'user' accordingly.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/perf_event.h | 2 +-
 kernel/bpf/stackmap.c      | 5 ++---
 kernel/events/callchain.c  | 6 +-----
 kernel/events/core.c       | 8 ++++----
 4 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f4b05954076c..2d8fa253b9df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1534,7 +1534,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b0b0fbff7c18..e4827ca5378d 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+	trace = get_perf_callchain(regs, kernel, user, max_depth, false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -421,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   false, false);
+					   false);
 	if (unlikely(!trace))
 		goto err_fault;
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1e135195250c..aa5f9d11c28d 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -209,9 +209,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 
 		if (regs) {
-			if (crosstask)
-				goto exit_put;
-
 			if (add_mark)
 				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
@@ -219,7 +216,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 	}
 
-exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0d62df7df4e..5e41a3b70bcd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7592,16 +7592,16 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
 	bool kernel = !event->attr.exclude_callchain_kernel;
 	bool user   = !event->attr.exclude_callchain_user;
-	/* Disallow cross-task user callchains. */
-	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
 
+	/* Disallow cross-task user callchains. */
+	user &= !event->ctx->task || event->ctx->task == current;
+
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
 	return callchain ?: &__empty_callchain;
 }
 
-- 
2.41.0
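
As an illustration of the 'user' adjustment that perf_callchain() now
does, here is a small stand-alone sketch of the same boolean expression.
The toy_task type and the function name are hypothetical; a NULL
ctx_task is assumed to model an event context that is not bound to a
task.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct toy_task { int pid; };

/*
 * Mirrors the expression from the patch:
 *   user &= !event->ctx->task || event->ctx->task == current;
 * User callchains stay enabled only when the context has no task or
 * the context's task is the one currently running.
 */
static bool user_callchain_allowed(bool user,
				   const struct toy_task *ctx_task,
				   const struct toy_task *current_task)
{
	user &= !ctx_task || ctx_task == current_task;
	return user;
}
```

With two distinct tasks a and b, the function keeps 'user' set for
(NULL, a) and (a, a), and clears it for (a, b), which is the cross-task
case the removed 'crosstask' argument used to flag.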


^ permalink raw reply related

* [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains
In-Reply-To: <cover.1699487758.git.jpoimboe@kernel.org>

The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/perf_event.h |  2 +-
 kernel/bpf/stackmap.c      |  4 ++--
 kernel/events/callchain.c  | 12 ++++++------
 kernel/events/core.c       |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index afb028c54f33..f4b05954076c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1533,7 +1533,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index d6b277482085..b0b0fbff7c18 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,7 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+	trace = get_perf_callchain(regs, kernel, user, max_depth,
 				   false, false);
 
 	if (unlikely(!trace))
@@ -420,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	else if (kernel && task)
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
-		trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+		trace = get_perf_callchain(regs, kernel, user, max_depth,
 					   false, false);
 	if (unlikely(!trace))
 		goto err_fault;
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392c..1e135195250c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -177,7 +177,7 @@ put_callchain_entry(int rctx)
 }
 
 struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
@@ -188,11 +188,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry     = entry;
-	ctx.max_stack = max_stack;
-	ctx.nr	      = entry->nr = init_nr;
-	ctx.contexts       = 0;
-	ctx.contexts_maxed = false;
+	ctx.entry		= entry;
+	ctx.max_stack		= max_stack;
+	ctx.nr			= entry->nr = 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
 
 	if (kernel && !user_mode(regs)) {
 		if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 683dc086ef10..b0d62df7df4e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7600,7 +7600,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, 0, kernel, user,
+	callchain = get_perf_callchain(regs, kernel, user,
 				       max_stack, crosstask, true);
 	return callchain ?: &__empty_callchain;
 }
-- 
2.41.0


^ permalink raw reply related

* [PATCH RFC 00/10] perf: user space sframe unwinding
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") format starting with binutils
2.40.

These patches add support for unwinding user space from the kernel using
SFrame with perf.  It should be easy to add user unwinding support for
other components like ftrace.

I tested it on Gentoo by recompiling everything with -Wa,-gsframe and
using a custom glibc patch (which I'll send in a reply to this email).

The unwinding itself seems to work well, though I still have a major
problem: how to tell the perf tool to stitch together the separate
kernel+user callchains into a single event?

Right now I have a hack which somehow causes the perf tool to overwrite
the kernel callchain with the user one.  I'm perf-clueless, so any ideas
or patches for a clean way to implement that would be most helpful.


Otherwise there were two main challenges:

1) Finding .sframe sections in shared/dlopened libraries

   The kernel has no visibility into the contents of shared libraries.
   This was solved by adding a PR_ADD_SFRAME option to prctl() which
   allows the runtime linker to manually provide the in-memory address
   of an .sframe section to the kernel.

2) Dealing with page faults

   Keeping all binaries' sframe data pinned would likely waste a lot of
   memory.  Instead, read it from user space on demand.  That can't be
   done from perf NMI context due to page faults, so defer the unwind to
   the next user exit.  Since the NMI handler doesn't do exit work,
   self-IPI and then schedule task work to be run on exit from the IPI.
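
The deferral described in 2) can be pictured with a toy user-space
state machine.  This is only a sketch of the ordering (NMI, then
self-IPI, then task work on return to user space); the toy_* names are
hypothetical and none of them are kernel APIs.

```c
#include <stdbool.h>

/* Toy model of the three-stage deferral; not kernel code. */
struct toy_event {
	bool pending_unwind;	/* set in NMI, cleared once the unwind ran */
	bool irq_work_queued;	/* stage 1: self-IPI requested */
	bool task_work_queued;	/* stage 2: work to run on return to user */
	int  unwinds_done;	/* counts completed deferred unwinds */
};

/* NMI context: page faults are not allowed, so only record the request. */
static void toy_nmi_sample(struct toy_event *e)
{
	if (e->pending_unwind)
		return;		/* a deferred unwind is already on its way */
	e->pending_unwind = true;
	e->irq_work_queued = true;
}

/* IRQ context (the self-IPI): hand over to task work. */
static void toy_irq_work(struct toy_event *e)
{
	if (!e->irq_work_queued)
		return;
	e->irq_work_queued = false;
	e->task_work_queued = true;
}

/* Task context on the way back to user space: faults are fine here. */
static void toy_return_to_user(struct toy_event *e)
{
	if (!e->task_work_queued)
		return;
	e->task_work_queued = false;
	e->unwinds_done++;	/* stands in for reading .sframe and unwinding */
	e->pending_unwind = false;
}
```

Two samples taken before the next return to user space result in a
single unwind in this model, which matches the pending_unwind check in
patch 04 as far as I can tell.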


Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design.  And to Steven for letting me do it ;-)


TODO:
- Stitch kernel+user events together in perf tool (help needed)
- Add arm64 support
- Add VDSO .sframe support
- Allow specifying FP vs sframe from perf tool?  Right now it's
  auto-detected, maybe that's enough
- Port ftrace and others to use sframe
- Support sframe v2
- Determine the impact of missing DRAP support (aligned stacks which
  SFrame doesn't currently support)
- Add debugging hooks



Josh Poimboeuf (10):
  perf: Remove get_perf_callchain() 'init_nr' argument
  perf: Remove get_perf_callchain() 'crosstask' argument
  perf: Simplify get_perf_callchain() user logic
  perf: Introduce deferred user callchains
  perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  unwind: Introduce generic user space unwinding interfaces
  unwind/x86: Add HAVE_USER_UNWIND
  perf/x86: Use user_unwind interface
  unwind: Introduce SFrame user space unwinding
  unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME

 arch/Kconfig                       |   9 +
 arch/x86/Kconfig                   |   3 +
 arch/x86/events/core.c             |  65 ++---
 arch/x86/include/asm/mmu.h         |   2 +-
 arch/x86/include/asm/user_unwind.h |  11 +
 fs/binfmt_elf.c                    |  46 +++-
 include/linux/mm_types.h           |   3 +
 include/linux/perf_event.h         |  24 +-
 include/linux/sframe.h             |  46 ++++
 include/linux/user_unwind.h        |  33 +++
 include/uapi/linux/elf.h           |   1 +
 include/uapi/linux/perf_event.h    |   1 +
 include/uapi/linux/prctl.h         |   3 +
 kernel/Makefile                    |   1 +
 kernel/bpf/stackmap.c              |   6 +-
 kernel/events/callchain.c          |  39 ++-
 kernel/events/core.c               |  96 ++++++-
 kernel/fork.c                      |  10 +
 kernel/sys.c                       |  11 +
 kernel/unwind/Makefile             |   2 +
 kernel/unwind/sframe.c             | 414 +++++++++++++++++++++++++++++
 kernel/unwind/sframe.h             | 217 +++++++++++++++
 kernel/unwind/user.c               |  86 ++++++
 mm/init-mm.c                       |   2 +
 24 files changed, 1060 insertions(+), 71 deletions(-)
 create mode 100644 arch/x86/include/asm/user_unwind.h
 create mode 100644 include/linux/sframe.h
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/user.c

-- 
2.41.0


^ permalink raw reply

* Re: [PATCH v4 12/53] perf bpf: Don't synthesize BPF events when disabled
From: Song Liu @ 2023-11-08 23:03 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Song Liu, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Adrian Hunter,
	Nick Terrell, Kan Liang, Andi Kleen, Kajol Jain, Athira Rajeev,
	Huacai Chen, Masami Hiramatsu, Vincent Whitchurch,
	Steinar H. Gunderson, Liam Howlett, Miguel Ojeda, Colin Ian King,
	Dmitrii Dolgov, Yang Jihong, Ming Wang, James Clark,
	K Prateek Nayak, Sean Christopherson, Leo Yan, Ravi Bangoria,
	German Gomez, Changbin Du, Paolo Bonzini, Li Dong, Sandipan Das,
	liuwenyu, linux-kernel, linux-perf-users
In-Reply-To: <ZUuz/8EC0orXCffn@kernel.org>

On Wed, Nov 8, 2023 at 8:15 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> Em Thu, Nov 02, 2023 at 10:56:54AM -0700, Ian Rogers escreveu:
> > If BPF sideband events are disabled on the command line, don't
> > synthesize BPF events too.
>
>
> Interesting, in 71184c6ab7e60fd5 ("perf record: Replace option
> --bpf-event with --no-bpf-event") we checked that, but only down at
> perf_event__synthesize_one_bpf_prog(), where we have:
>
>         if (!opts->no_bpf_event) {
>                 /* Synthesize PERF_RECORD_BPF_EVENT */
>                 *bpf_event = (struct perf_record_bpf_event)
>
>
> So we better remove that, now redundant check? I'll apply your patch as
> is and then we can remove that other check.
>
> Song, can I have your Acked-by or Reviewed-by, please?
>
> - Arnaldo
>
> > Signed-off-by: Ian Rogers <irogers@google.com>

Good catch!

Acked-by: Song Liu <song@kernel.org>

^ permalink raw reply

* Re: [PATCH v2 2/2] perf test: Add support for setting objdump binary via perf config
From: Arnaldo Carvalho de Melo @ 2023-11-08 21:10 UTC (permalink / raw)
  To: James Clark
  Cc: linux-perf-users, irogers, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	Yonghong Song, Fangrui Song, Kan Liang, Yang Jihong,
	Athira Rajeev, Ravi Bangoria, linux-kernel, llvm
In-Reply-To: <ZUv1TgveArYdvTsl@kernel.org>

Em Wed, Nov 08, 2023 at 05:53:34PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Nov 06, 2023 at 03:10:49PM +0000, James Clark escreveu:
> > Add a perf config variable that does the same thing as "perf test
> > --objdump <x>".
> > 
> > Also update the man page.
> 
> That is OK if one wants to change objdump just for testing.  As a
> follow-up improvement it may be interesting to allow that for the other
> tools that have --objdump, as well as to add this as a global option
> that affects all tools, no?
> 
> Anyway, applied both patches.

And added this to the last one:

Committer testing:

  # perf config test.objdump
  # perf test "object code reading"
   26: Object code reading                                             : Ok
  # perf config test.objdump=blah
  # perf config test.objdump
  test.objdump=blah
  # perf test "object code reading"
   26: Object code reading                                             : FAILED!
  # perf test -v "object code reading"
   26: Object code reading                                             :
  --- start ---
  test child forked, pid 600599
  Looking at the vmlinux_path (8 entries long)
  Using /proc/kcore for kernel data
  Using /proc/kallsyms for symbols
  Parsing event 'cycles'
  Using CPUID AuthenticAMD-25-21-0
  mmap size 528384B
  Reading object code for memory address: 0x4d9a02
  File is: /home/acme/bin/perf
  On file address is: 0xd9a02
  Objdump command is: blah -z -d --start-address=0x4d9a02 --stop-address=0x4d9a82 /home/acme/bin/perf
  objdump read too few bytes: 128
  Bytes read differ from those read by objdump
  buf1 (dso):
  0x48 0x85 0xff 0x74 0x29 0xe8 0x94 0xdf 0x07 0x00 0x8b 0x73 0x1c 0x48 0x8b 0x43 
  0x08 0xeb 0xa5 0x0f 0x1f 0x00 0x48 0x8b 0x45 0xe8 0x64 0x48 0x2b 0x04 0x25 0x28 
  0x00 0x00 0x00 0x75 0x0f 0x48 0x8b 0x5d 0xf8 0xc9 0xc3 0x0f 0x1f 0x00 0x48 0x8b 
  0x43 0x08 0xeb 0x84 0xe8 0xc5 0x3e 0xf3 0xff 0x0f 0x1f 0x44 0x00 0x00 0x55 0x48 
  0x89 0xe5 0x41 0x56 0x41 0x55 0x49 0x89 0xd5 0x41 0x54 0x49 0x89 0xfc 0x53 0x48 
  0x89 0xf3 0x48 0x83 0xec 0x30 0x48 0x8b 0x7e 0x20 0x64 0x48 0x8b 0x04 0x25 0x28 
  0x00 0x00 0x00 0x48 0x89 0x45 0xd8 0x31 0xc0 0x48 0x89 0x75 0xb0 0x48 0xc7 0x45 
  0xb8 0x00 0x00 0x00 0x00 0x48 0xc7 0x45 0xc0 0x00 0x00 0x00 0x00 0xe8 0xad 0xfa 
  
  buf2 (objdump):
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
  
  test child finished with -1
  ---- end ----
  Object code reading: FAILED!
  # perf config test.objdump=/usr/bin/objdump
  # perf config test.objdump
  test.objdump=/usr/bin/objdump
  # perf test "object code reading"
   26: Object code reading                                             : Ok
  #

Signed-off-by: James Clark <james.clark@arm.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
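
For reference, the "Objdump command is:" line in the log above can be
reproduced with a few lines of C.  This is a sketch, not the code the
test uses; toy_objdump_cmd() and its parameters are hypothetical, and
only the flag layout is taken from the log.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format an objdump invocation shaped like the one in the log.
 * Returns the snprintf() result: the length needed, excluding the
 * terminating NUL, so callers can detect truncation.
 */
static int toy_objdump_cmd(char *buf, size_t size, const char *objdump,
			   unsigned long long addr, size_t len,
			   const char *file)
{
	return snprintf(buf, size,
			"%s -z -d --start-address=0x%llx --stop-address=0x%llx %s",
			objdump, addr, addr + len, file);
}
```

With objdump set to "blah", addr 0x4d9a02 and len 128 this yields the
same start and stop addresses as the failing run, since 0x4d9a02 + 128
is 0x4d9a82.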

^ permalink raw reply

* Re: [PATCH v2 2/2] perf test: Add support for setting objdump binary via perf config
From: Arnaldo Carvalho de Melo @ 2023-11-08 20:53 UTC (permalink / raw)
  To: James Clark
  Cc: linux-perf-users, irogers, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	Yonghong Song, Fangrui Song, Kan Liang, Yang Jihong,
	Athira Rajeev, Ravi Bangoria, linux-kernel, llvm
In-Reply-To: <20231106151051.129440-3-james.clark@arm.com>

Em Mon, Nov 06, 2023 at 03:10:49PM +0000, James Clark escreveu:
> Add a perf config variable that does the same thing as "perf test
> --objdump <x>".
> 
> Also update the man page.

That is OK if one wants to change objdump just for testing.  As a
follow-up improvement it may be interesting to allow that for the other
tools that have --objdump, as well as to add this as a global option
that affects all tools, no?

Anyway, applied both patches.

- Arnaldo
 
> Signed-off-by: James Clark <james.clark@arm.com>
> ---
>  tools/perf/Documentation/perf-config.txt |  4 ++++
>  tools/perf/tests/builtin-test.c          | 12 ++++++++++++
>  2 files changed, 16 insertions(+)
> 
> diff --git a/tools/perf/Documentation/perf-config.txt b/tools/perf/Documentation/perf-config.txt
> index 0b4e79dbd3f6..16398babd1ef 100644
> --- a/tools/perf/Documentation/perf-config.txt
> +++ b/tools/perf/Documentation/perf-config.txt
> @@ -722,6 +722,10 @@ session-<NAME>.*::
>  		Defines new record session for daemon. The value is record's
>  		command line without the 'record' keyword.
>  
> +test.*::
> +
> +	test.objdump::
> +		objdump binary to use for disassembly and annotations.
>  
>  SEE ALSO
>  --------
> diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
> index a8d17dd50588..113e92119e1d 100644
> --- a/tools/perf/tests/builtin-test.c
> +++ b/tools/perf/tests/builtin-test.c
> @@ -14,6 +14,7 @@
>  #include <sys/wait.h>
>  #include <sys/stat.h>
>  #include "builtin.h"
> +#include "config.h"
>  #include "hist.h"
>  #include "intlist.h"
>  #include "tests.h"
> @@ -514,6 +515,15 @@ static int run_workload(const char *work, int argc, const char **argv)
>  	return -1;
>  }
>  
> +static int perf_test__config(const char *var, const char *value,
> +			     void *data __maybe_unused)
> +{
> +	if (!strcmp(var, "test.objdump"))
> +		test_objdump_path = value;
> +
> +	return 0;
> +}
> +
>  int cmd_test(int argc, const char **argv)
>  {
>  	const char *test_usage[] = {
> @@ -541,6 +551,8 @@ int cmd_test(int argc, const char **argv)
>          if (ret < 0)
>                  return ret;
>  
> +	perf_config(perf_test__config, NULL);
> +
>  	/* Unbuffered output */
>  	setvbuf(stdout, NULL, _IONBF, 0);
>  
> -- 
> 2.34.1
> 

-- 

- Arnaldo
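
The callback pattern in the quoted patch can be tried outside the perf
tree with a toy walker.  In this sketch toy_config_walk() is a
hypothetical stand-in for perf_config(), the entry struct is made up,
and the "objdump" default is an assumption; only the callback body
follows the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value pair standing in for a parsed config entry. */
struct toy_config_entry {
	const char *var;
	const char *value;
};

typedef int (*toy_config_fn)(const char *var, const char *value, void *data);

/* Assumed default; the real default is not shown in the quoted patch. */
static const char *toy_objdump_path = "objdump";

/* Same shape as perf_test__config() in the quoted patch. */
static int toy_test_config(const char *var, const char *value, void *data)
{
	(void)data;	/* unused, like the __maybe_unused argument */

	if (!strcmp(var, "test.objdump"))
		toy_objdump_path = value;

	return 0;
}

/* Toy replacement for perf_config(): call fn per entry, stop on error. */
static int toy_config_walk(const struct toy_config_entry *entries, size_t n,
			   toy_config_fn fn, void *data)
{
	for (size_t i = 0; i < n; i++) {
		int ret = fn(entries[i].var, entries[i].value, data);

		if (ret < 0)
			return ret;
	}

	return 0;
}
```

Note that the callback stores the value pointer rather than a copy, as
the patch does, so the string has to outlive its use in this model.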

^ permalink raw reply

* Re: [PATCH] perf test: Add option to change objdump binary
From: Arnaldo Carvalho de Melo @ 2023-11-08 20:50 UTC (permalink / raw)
  To: James Clark
  Cc: Ian Rogers, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	Kan Liang, Yang Jihong, Kajol Jain, Athira Rajeev, Ravi Bangoria,
	linux-kernel, llvm
In-Reply-To: <ZUv0NKS7d4vYQMxy@kernel.org>

Em Wed, Nov 08, 2023 at 05:48:52PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Nov 07, 2023 at 09:37:32AM +0000, James Clark escreveu:
> > On 03/11/2023 15:41, Ian Rogers wrote:
> > > There is also "perf config" for things like this.

> > > Reviewed-by: Ian Rogers <irogers@google.com>

> > That seems a bit better for this use case so I added it in V2.
 
> Have you posted it?

Yeah, b4 found it, never mind.

- Arnaldo

^ permalink raw reply

* Re: [PATCH] perf test: Add option to change objdump binary
From: Arnaldo Carvalho de Melo @ 2023-11-08 20:48 UTC (permalink / raw)
  To: James Clark
  Cc: Ian Rogers, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Nathan Chancellor, Nick Desaulniers, Tom Rix,
	Kan Liang, Yang Jihong, Kajol Jain, Athira Rajeev, Ravi Bangoria,
	linux-kernel, llvm
In-Reply-To: <47e3d265-e48e-238f-7528-b0eb4c250cb1@arm.com>

Em Tue, Nov 07, 2023 at 09:37:32AM +0000, James Clark escreveu:
> 
> 
> On 03/11/2023 15:41, Ian Rogers wrote:
> > On Fri, Nov 3, 2023 at 4:35 AM James Clark <james.clark@arm.com> wrote:
> > > 
> > > All of the other Perf subcommands that use objdump have an option to
> > > specify the binary, so add the same option to perf test.
> > > 
> > > This is useful if you have built the kernel with a different toolchain
> > > to the system one, where the system objdump may fail to disassemble
> > > vmlinux.
> > > 
> > > Now this can be fixed with something like this:
> > > 
> > >    $ perf test --objdump llvm-objdump "object code reading"
> > > 
> > > Signed-off-by: James Clark <james.clark@arm.com>
> > 
> > There is also "perf config" for things like this.
> > 
> > Reviewed-by: Ian Rogers <irogers@google.com>
> > 
> > Thanks,
> > Ian
> 
> That seems a bit better for this use case so I added it in V2.

Have you posted it?
 
> Thanks for the review.
> 
> James

-- 

- Arnaldo

^ permalink raw reply

