[PATCH 0/4] event synthesization multithreading for perf record

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] event synthesization multithreading for perf record
@ 2017-10-13 14:09 kan.liang
  2017-10-13 14:09 ` [PATCH 1/4] perf tools: pass thread info to process function kan.liang
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: kan.liang @ 2017-10-13 14:09 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The event synthesization multithreading is introduced in
("perf top optimization") https://lkml.org/lkml/2017/9/29/269
But it was not enabled for perf record. Because the process function
process_synthesized_event was not multithreading friendly.

The patch series temporarily stores the process result in the buffer of each
thread, which make the processing in parallel. Then it writes the buffer
one by one to the perf.data at the end of event synthesization.

The source code is also available at
https://github.com/kliang2/perf.git perf_record_opt

Usually, the event synthesization only happens once on either start or end.
With the snapshotting code, we synthesize events multiple times, once per
each new perf.data file. Both of the cases are verified.

Here are the latency test result on Knights Mill and Skylake server

The workload is to compile Linux kernel as below
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
Then, "sudo perf record -e cycles -a -- sleep 1"

The latency is the time cost of __machine__synthesize_threads or
its multithreading replacement, record__multithread_synthesize.

- Latency on Knights Mill (272 CPUs)

Original(s)     With patch(s)   Speedup
12.74           5.54            2.3X

- Latency on Skylake server (192 CPUs)

Original(s)     With patch(s)   Speedup
0.36            0.25            1.47X

Kan Liang (4):
  perf tools: pass thread info to process function
  perf tools: pass thread info in event synthesization
  perf record: event synthesization multithreading support
  perf record: add option to set the number of thread for event
    synthesize

 tools/perf/Documentation/perf-record.txt |   4 ++
 tools/perf/arch/x86/util/tsc.c           |   2 +-
 tools/perf/builtin-inject.c              |  12 +++-
 tools/perf/builtin-record.c              | 100 ++++++++++++++++++++++++++--
 tools/perf/builtin-sched.c               |  12 ++--
 tools/perf/builtin-stat.c                |   3 +-
 tools/perf/builtin-trace.c               |   3 +-
 tools/perf/tests/cpumap.c                |   6 +-
 tools/perf/tests/dwarf-unwind.c          |   6 +-
 tools/perf/tests/event_update.c          |  12 ++--
 tools/perf/tests/stat.c                  |   9 ++-
 tools/perf/tests/thread-map.c            |   3 +-
 tools/perf/util/auxtrace.c               |   2 +-
 tools/perf/util/event.c                  | 111 +++++++++++++++++++------------
 tools/perf/util/event.h                  |  19 ++++--
 tools/perf/util/header.c                 |  16 ++---
 tools/perf/util/intel-bts.c              |   3 +-
 tools/perf/util/intel-pt.c               |   3 +-
 tools/perf/util/session.c                |   4 +-
 19 files changed, 243 insertions(+), 87 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/4] perf tools: pass thread info to process function
  2017-10-13 14:09 [PATCH 0/4] event synthesization multithreading for perf record kan.liang
@ 2017-10-13 14:09 ` kan.liang
  2017-10-13 14:09 ` [PATCH 2/4] perf tools: pass thread info in event synthesization kan.liang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2017-10-13 14:09 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

For multithreading, the process function needs to know the thread
related information. E.g. saving the process result to the buffer or
file which belongs to specific thread.

Add struct thread_info parameter for process function.
Currently, it only includes thread index.

perf_event__repipe is shared by process function and event_op of
perf_tool in builtin-inject.c. Add dedicated process function
perf_event__repipe_threads.

No functional change.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/arch/x86/util/tsc.c  |  2 +-
 tools/perf/builtin-inject.c     | 12 +++++++++++-
 tools/perf/builtin-record.c     |  3 ++-
 tools/perf/builtin-sched.c      | 12 ++++++++----
 tools/perf/builtin-stat.c       |  3 ++-
 tools/perf/builtin-trace.c      |  3 ++-
 tools/perf/tests/cpumap.c       |  6 ++++--
 tools/perf/tests/dwarf-unwind.c |  3 ++-
 tools/perf/tests/event_update.c | 12 ++++++++----
 tools/perf/tests/stat.c         |  9 ++++++---
 tools/perf/tests/thread-map.c   |  3 ++-
 tools/perf/util/auxtrace.c      |  2 +-
 tools/perf/util/event.c         | 15 ++++++++-------
 tools/perf/util/event.h         | 10 ++++++++--
 tools/perf/util/header.c        | 16 ++++++++--------
 tools/perf/util/intel-bts.c     |  3 ++-
 tools/perf/util/intel-pt.c      |  3 ++-
 tools/perf/util/session.c       |  4 ++--
 18 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
index 2e5567c..0affc0f 100644
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -76,5 +76,5 @@ int perf_event__synth_time_conv(const struct perf_event_mmap_page *pc,
 	event.time_conv.time_shift = tc.time_shift;
 	event.time_conv.time_zero  = tc.time_zero;
 
-	return process(tool, &event, NULL, machine);
+	return process(tool, &event, NULL, machine, NULL);
 }
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 2b80329..67a6701 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -191,6 +191,15 @@ static int perf_event__repipe(struct perf_tool *tool,
 	return perf_event__repipe_synth(tool, event);
 }
 
+static int perf_event__repipe_threads(struct perf_tool *tool,
+				      union perf_event *event,
+				      struct perf_sample *sample,
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
+{
+	return perf_event__repipe(tool, event, sample, machine);
+}
+
 static int perf_event__drop(struct perf_tool *tool __maybe_unused,
 			    union perf_event *event __maybe_unused,
 			    struct perf_sample *sample __maybe_unused,
@@ -413,7 +422,8 @@ static int dso__inject_build_id(struct dso *dso, struct perf_tool *tool,
 	if (dso->kernel)
 		misc = PERF_RECORD_MISC_KERNEL;
 
-	err = perf_event__synthesize_build_id(tool, dso, misc, perf_event__repipe,
+	err = perf_event__synthesize_build_id(tool, dso, misc,
+					      perf_event__repipe_threads,
 					      machine);
 	if (err) {
 		pr_err("Can't synthesize build_id event for %s\n", dso->long_name);
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a6cbf16..f53c1163 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -123,7 +123,8 @@ static int record__write(struct record *rec, void *bf, size_t size)
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	struct record *rec = container_of(tool, struct record, tool);
 	return record__write(rec, event, event->header.size);
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index b7e8812..ed34a14 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1432,7 +1432,8 @@ static void perf_sched__sort_lat(struct perf_sched *sched)
 static int process_sched_wakeup_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
@@ -1603,7 +1604,8 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
 static int process_sched_switch_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 	int this_cpu = sample->cpu, err = 0;
@@ -1629,7 +1631,8 @@ static int process_sched_switch_event(struct perf_tool *tool,
 static int process_sched_runtime_event(struct perf_tool *tool,
 				       struct perf_evsel *evsel,
 				       struct perf_sample *sample,
-				       struct machine *machine)
+				       struct machine *machine,
+				       struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
@@ -1659,7 +1662,8 @@ static int perf_sched__process_fork_event(struct perf_tool *tool,
 static int process_sched_migrate_task_event(struct perf_tool *tool,
 					    struct perf_evsel *evsel,
 					    struct perf_sample *sample,
-					    struct machine *machine)
+					    struct machine *machine,
+					    struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index dd52541..80d5add 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -293,7 +293,8 @@ static inline int nsec_counter(struct perf_evsel *evsel)
 static int process_synthesized_event(struct perf_tool *tool __maybe_unused,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	if (perf_data_file__write(&perf_stat.file, event, event->header.size) < 0) {
 		pr_err("failed to write perf data, error: %m\n");
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index afef6fe..f737416 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1091,7 +1091,8 @@ static int trace__process_event(struct trace *trace, struct machine *machine,
 static int trace__tool_process(struct perf_tool *tool,
 			       union perf_event *event,
 			       struct perf_sample *sample,
-			       struct machine *machine)
+			       struct machine *machine,
+			       struct thread_info *thread __maybe_unused)
 {
 	struct trace *trace = container_of(tool, struct trace, tool);
 	return trace__process_event(trace, machine, event, sample);
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 1997022..fec51c7 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -11,7 +11,8 @@ struct machine;
 static int process_event_mask(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct cpu_map_event *map_event = &event->cpu_map;
 	struct cpu_map_mask *mask;
@@ -45,7 +46,8 @@ static int process_event_mask(struct perf_tool *tool __maybe_unused,
 static int process_event_cpus(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct cpu_map_event *map_event = &event->cpu_map;
 	struct cpu_map_entries *cpus;
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 9ba1d21..5ed2271 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -22,7 +22,8 @@
 static int mmap_handler(struct perf_tool *tool __maybe_unused,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine)
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused)
 {
 	return machine__process_mmap2_event(machine, event, sample);
 }
diff --git a/tools/perf/tests/event_update.c b/tools/perf/tests/event_update.c
index 9484da2..b5f4ab1 100644
--- a/tools/perf/tests/event_update.c
+++ b/tools/perf/tests/event_update.c
@@ -8,7 +8,8 @@
 static int process_event_unit(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event *) event;
 
@@ -21,7 +22,8 @@ static int process_event_unit(struct perf_tool *tool __maybe_unused,
 static int process_event_scale(struct perf_tool *tool __maybe_unused,
 			       union perf_event *event,
 			       struct perf_sample *sample __maybe_unused,
-			       struct machine *machine __maybe_unused)
+			       struct machine *machine __maybe_unused,
+			       struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event *) event;
 	struct event_update_event_scale *ev_data;
@@ -42,7 +44,8 @@ struct event_name {
 static int process_event_name(struct perf_tool *tool,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_name *tmp = container_of(tool, struct event_name, tool);
 	struct event_update_event *ev = (struct event_update_event*) event;
@@ -56,7 +59,8 @@ static int process_event_name(struct perf_tool *tool,
 static int process_event_cpus(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event*) event;
 	struct event_update_event_cpus *ev_data;
diff --git a/tools/perf/tests/stat.c b/tools/perf/tests/stat.c
index 7f988a9..846cbec 100644
--- a/tools/perf/tests/stat.c
+++ b/tools/perf/tests/stat.c
@@ -22,7 +22,8 @@ static bool has_term(struct stat_config_event *config,
 static int process_stat_config_event(struct perf_tool *tool __maybe_unused,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	struct stat_config_event *config = &event->stat_config;
 	struct perf_stat_config stat_config;
@@ -62,7 +63,8 @@ int test__synthesize_stat_config(struct test *test __maybe_unused, int subtest _
 static int process_stat_event(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct stat_event *st = &event->stat;
 
@@ -92,7 +94,8 @@ int test__synthesize_stat(struct test *test __maybe_unused, int subtest __maybe_
 static int process_stat_round_event(struct perf_tool *tool __maybe_unused,
 				    union perf_event *event,
 				    struct perf_sample *sample __maybe_unused,
-				    struct machine *machine __maybe_unused)
+				    struct machine *machine __maybe_unused,
+				    struct thread_info *thread __maybe_unused)
 {
 	struct stat_round_event *stat_round = &event->stat_round;
 
diff --git a/tools/perf/tests/thread-map.c b/tools/perf/tests/thread-map.c
index b3423c7..d2f42ad 100644
--- a/tools/perf/tests/thread-map.c
+++ b/tools/perf/tests/thread-map.c
@@ -52,7 +52,8 @@ int test__thread_map(struct test *test __maybe_unused, int subtest __maybe_unuse
 static int process_event(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct thread_map_event *map = &event->thread_map;
 	struct thread_map *threads;
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 5547457..c4ab2c8 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -887,7 +887,7 @@ int perf_event__synthesize_auxtrace_info(struct auxtrace_record *itr,
 	if (err)
 		goto out_free;
 
-	err = process(tool, ev, NULL, NULL);
+	err = process(tool, ev, NULL, NULL, NULL);
 out_free:
 	free(ev);
 	return err;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 47eff47..fd523ca7 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -102,7 +102,7 @@ static int perf_tool__process_synth_event(struct perf_tool *tool,
 	.cpumode   = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK,
 	};
 
-	return process(tool, event, &synth_sample, machine);
+	return process(tool, event, &synth_sample, machine, NULL);
 };
 
 /*
@@ -976,7 +976,7 @@ int perf_event__synthesize_thread_map2(struct perf_tool *tool,
 		strncpy((char *) &entry->comm, comm, sizeof(entry->comm));
 	}
 
-	err = process(tool, event, NULL, machine);
+	err = process(tool, event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1107,7 +1107,7 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
 	if (!event)
 		return -ENOMEM;
 
-	err = process(tool, (union perf_event *) event, NULL, machine);
+	err = process(tool, (union perf_event *) event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1145,7 +1145,7 @@ int perf_event__synthesize_stat_config(struct perf_tool *tool,
 		  "stat config terms unbalanced\n");
 #undef ADD
 
-	err = process(tool, (union perf_event *) event, NULL, machine);
+	err = process(tool, (union perf_event *) event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1170,7 +1170,7 @@ int perf_event__synthesize_stat(struct perf_tool *tool,
 	event.ena       = count->ena;
 	event.run       = count->run;
 
-	return process(tool, (union perf_event *) &event, NULL, machine);
+	return process(tool, (union perf_event *) &event, NULL, machine, NULL);
 }
 
 int perf_event__synthesize_stat_round(struct perf_tool *tool,
@@ -1187,7 +1187,7 @@ int perf_event__synthesize_stat_round(struct perf_tool *tool,
 	event.time = evtime;
 	event.type = type;
 
-	return process(tool, (union perf_event *) &event, NULL, machine);
+	return process(tool, (union perf_event *) &event, NULL, machine, NULL);
 }
 
 void perf_event__read_stat_config(struct perf_stat_config *config,
@@ -1476,7 +1476,8 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 int perf_event__process(struct perf_tool *tool __maybe_unused,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine)
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused)
 {
 	return machine__process_event(machine, event, sample);
 }
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d6cbb0a..200f3f8 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -659,10 +659,15 @@ struct cpu_map;
 struct perf_stat_config;
 struct perf_counts_values;
 
+struct thread_info {
+	int	idx;
+};
+
 typedef int (*perf_event__handler_t)(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample,
-				     struct machine *machine);
+				     struct machine *machine,
+				     struct thread_info *thread);
 
 int perf_event__synthesize_thread_map(struct perf_tool *tool,
 				      struct thread_map *threads,
@@ -751,7 +756,8 @@ int perf_event__process_exit(struct perf_tool *tool,
 int perf_event__process(struct perf_tool *tool,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine);
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused);
 
 struct addr_location;
 
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 605bbd5..c0183fb 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -2986,7 +2986,7 @@ int perf_event__synthesize_attr(struct perf_tool *tool,
 	ev->attr.header.size = (u16)size;
 
 	if (ev->attr.header.size == size)
-		err = process(tool, ev, NULL, NULL);
+		err = process(tool, ev, NULL, NULL, NULL);
 	else
 		err = -E2BIG;
 
@@ -3040,7 +3040,7 @@ int perf_event__synthesize_features(struct perf_tool *tool,
 		fe->header.type = PERF_RECORD_HEADER_FEATURE;
 		fe->header.size = ff.offset;
 
-		ret = process(tool, ff.buf, NULL, NULL);
+		ret = process(tool, ff.buf, NULL, NULL, NULL);
 		if (ret) {
 			free(ff.buf);
 			return ret;
@@ -3124,7 +3124,7 @@ perf_event__synthesize_event_update_unit(struct perf_tool *tool,
 		return -ENOMEM;
 
 	strncpy(ev->data, evsel->unit, size);
-	err = process(tool, (union perf_event *)ev, NULL, NULL);
+	err = process(tool, (union perf_event *)ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3144,7 +3144,7 @@ perf_event__synthesize_event_update_scale(struct perf_tool *tool,
 
 	ev_data = (struct event_update_event_scale *) ev->data;
 	ev_data->scale = evsel->scale;
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3163,7 +3163,7 @@ perf_event__synthesize_event_update_name(struct perf_tool *tool,
 		return -ENOMEM;
 
 	strncpy(ev->data, evsel->name, len);
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3194,7 +3194,7 @@ perf_event__synthesize_event_update_cpus(struct perf_tool *tool,
 				 evsel->own_cpus,
 				 type, max);
 
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3377,7 +3377,7 @@ int perf_event__synthesize_tracing_data(struct perf_tool *tool, int fd,
 	ev.tracing_data.header.size = sizeof(ev.tracing_data);
 	ev.tracing_data.size = aligned_size;
 
-	process(tool, &ev, NULL, NULL);
+	process(tool, &ev, NULL, NULL, NULL);
 
 	/*
 	 * The put function will copy all the tracing data
@@ -3455,7 +3455,7 @@ int perf_event__synthesize_build_id(struct perf_tool *tool,
 	ev.build_id.header.size = sizeof(ev.build_id) + len;
 	memcpy(&ev.build_id.filename, pos->long_name, pos->long_name_len);
 
-	err = process(tool, &ev, NULL, machine);
+	err = process(tool, &ev, NULL, machine, NULL);
 
 	return err;
 }
diff --git a/tools/perf/util/intel-bts.c b/tools/perf/util/intel-bts.c
index 218ee2b..8c15309 100644
--- a/tools/perf/util/intel-bts.c
+++ b/tools/perf/util/intel-bts.c
@@ -755,7 +755,8 @@ struct intel_bts_synth {
 static int intel_bts_event_synth(struct perf_tool *tool,
 				 union perf_event *event,
 				 struct perf_sample *sample __maybe_unused,
-				 struct machine *machine __maybe_unused)
+				 struct machine *machine __maybe_unused,
+				 struct thread_info *thread __maybe_unused)
 {
 	struct intel_bts_synth *intel_bts_synth =
 			container_of(tool, struct intel_bts_synth, dummy_tool);
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index b58f9fd..4858634 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2121,7 +2121,8 @@ struct intel_pt_synth {
 static int intel_pt_event_synth(struct perf_tool *tool,
 				union perf_event *event,
 				struct perf_sample *sample __maybe_unused,
-				struct machine *machine __maybe_unused)
+				struct machine *machine __maybe_unused,
+				struct thread_info *thread __maybe_unused)
 {
 	struct intel_pt_synth *intel_pt_synth =
 			container_of(tool, struct intel_pt_synth, dummy_tool);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ceac084..f044bad 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2211,7 +2211,7 @@ int perf_event__synthesize_id_index(struct perf_tool *tool,
 			struct perf_sample_id *sid;
 
 			if (i >= n) {
-				err = process(tool, ev, NULL, machine);
+				err = process(tool, ev, NULL, machine, NULL);
 				if (err)
 					goto out_err;
 				nr -= n;
@@ -2238,7 +2238,7 @@ int perf_event__synthesize_id_index(struct perf_tool *tool,
 	ev->id_index.header.size = sz;
 	ev->id_index.nr = nr;
 
-	err = process(tool, ev, NULL, machine);
+	err = process(tool, ev, NULL, machine, NULL);
 out_err:
 	free(ev);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/4] perf tools: pass thread info in event synthesization
  2017-10-13 14:09 [PATCH 0/4] event synthesization multithreading for perf record kan.liang
  2017-10-13 14:09 ` [PATCH 1/4] perf tools: pass thread info to process function kan.liang
@ 2017-10-13 14:09 ` kan.liang
  2017-10-13 14:09 ` [PATCH 3/4] perf record: event synthesization multithreading support kan.liang
  2017-10-13 14:09 ` [PATCH 4/4] perf record: add option to set the number of thread for event synthesize kan.liang
  3 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2017-10-13 14:09 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Pass the thread idx to process function, which is used by the following
patch.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/builtin-record.c     |  4 +-
 tools/perf/tests/dwarf-unwind.c |  3 +-
 tools/perf/util/event.c         | 98 ++++++++++++++++++++++++++---------------
 tools/perf/util/event.h         |  9 ++--
 4 files changed, 72 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f53c1163..4ede9bf 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -922,7 +922,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		tgid = perf_event__synthesize_comm(tool, event,
 						   rec->evlist->workload.pid,
 						   process_synthesized_event,
-						   machine);
+						   machine, NULL);
 		free(event);
 
 		if (tgid == -1)
@@ -942,7 +942,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		perf_event__synthesize_namespaces(tool, event,
 						  rec->evlist->workload.pid,
 						  tgid, process_synthesized_event,
-						  machine);
+						  machine, NULL);
 		free(event);
 
 		perf_evlist__start_workload(rec->evlist);
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 5ed2271..792c277 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -34,7 +34,8 @@ static int init_live_machine(struct machine *machine)
 	pid_t pid = getpid();
 
 	return perf_event__synthesize_mmap_events(NULL, &event, pid, pid,
-						  mmap_handler, machine, true, 500);
+						  mmap_handler, machine,
+						  true, 500, NULL);
 }
 
 #define MAX_STACK 8
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index fd523ca7..43e1dfa 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -90,7 +90,8 @@ static const char *perf_ns__name(unsigned int id)
 static int perf_tool__process_synth_event(struct perf_tool *tool,
 					  union perf_event *event,
 					  struct machine *machine,
-					  perf_event__handler_t process)
+					  perf_event__handler_t process,
+					  struct thread_info *thread)
 {
 	struct perf_sample synth_sample = {
 	.pid	   = -1,
@@ -102,7 +103,7 @@ static int perf_tool__process_synth_event(struct perf_tool *tool,
 	.cpumode   = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK,
 	};
 
-	return process(tool, event, &synth_sample, machine, NULL);
+	return process(tool, event, &synth_sample, machine, thread);
 };
 
 /*
@@ -219,14 +220,16 @@ static int perf_event__prepare_comm(union perf_event *event, pid_t pid,
 pid_t perf_event__synthesize_comm(struct perf_tool *tool,
 					 union perf_event *event, pid_t pid,
 					 perf_event__handler_t process,
-					 struct machine *machine)
+					 struct machine *machine,
+					 struct thread_info *thread)
 {
 	pid_t tgid, ppid;
 
 	if (perf_event__prepare_comm(event, pid, machine, &tgid, &ppid) != 0)
 		return -1;
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return tgid;
@@ -249,7 +252,8 @@ int perf_event__synthesize_namespaces(struct perf_tool *tool,
 				      union perf_event *event,
 				      pid_t pid, pid_t tgid,
 				      perf_event__handler_t process,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread)
 {
 	u32 idx;
 	struct perf_ns_link_info *ns_link_info;
@@ -278,7 +282,8 @@ int perf_event__synthesize_namespaces(struct perf_tool *tool,
 			(NR_NAMESPACES * sizeof(struct perf_ns_link_info)) +
 			machine->id_hdr_size);
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return 0;
@@ -288,7 +293,8 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 				       union perf_event *event,
 				       pid_t pid, pid_t tgid, pid_t ppid,
 				       perf_event__handler_t process,
-				       struct machine *machine)
+				       struct machine *machine,
+				       struct thread_info *thread)
 {
 	memset(&event->fork, 0, sizeof(event->fork) + machine->id_hdr_size);
 
@@ -310,7 +316,8 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 
 	event->fork.header.size = (sizeof(event->fork) + machine->id_hdr_size);
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return 0;
@@ -322,7 +329,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine,
 				       bool mmap_data,
-				       unsigned int proc_map_timeout)
+				       unsigned int proc_map_timeout,
+				       struct thread_info *thread)
 {
 	char filename[PATH_MAX];
 	FILE *fp;
@@ -444,7 +452,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		event->mmap2.pid = tgid;
 		event->mmap2.tid = pid;
 
-		if (perf_tool__process_synth_event(tool, event, machine, process) != 0) {
+		if (perf_tool__process_synth_event(tool, event, machine,
+						   process, thread) != 0) {
 			rc = -1;
 			break;
 		}
@@ -502,7 +511,8 @@ int perf_event__synthesize_modules(struct perf_tool *tool,
 
 		memcpy(event->mmap.filename, pos->dso->long_name,
 		       pos->dso->long_name_len + 1);
-		if (perf_tool__process_synth_event(tool, event, machine, process) != 0) {
+		if (perf_tool__process_synth_event(tool, event, machine,
+						   process, NULL) != 0) {
 			rc = -1;
 			break;
 		}
@@ -521,7 +531,8 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 				      struct perf_tool *tool,
 				      struct machine *machine,
 				      bool mmap_data,
-				      unsigned int proc_map_timeout)
+				      unsigned int proc_map_timeout,
+				      struct thread_info *thread)
 {
 	char filename[PATH_MAX];
 	DIR *tasks;
@@ -532,19 +543,22 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 	/* special case: only send one comm event using passed in pid */
 	if (!full) {
 		tgid = perf_event__synthesize_comm(tool, comm_event, pid,
-						   process, machine);
+						   process, machine, thread);
 
 		if (tgid == -1)
 			return -1;
 
-		if (perf_event__synthesize_namespaces(tool, namespaces_event, pid,
-						      tgid, process, machine) < 0)
+		if (perf_event__synthesize_namespaces(tool, namespaces_event,
+						      pid, tgid, process,
+						      machine, thread) < 0)
 			return -1;
 
 
-		return perf_event__synthesize_mmap_events(tool, mmap_event, pid, tgid,
-							  process, machine, mmap_data,
-							  proc_map_timeout);
+		return perf_event__synthesize_mmap_events(tool, mmap_event,
+							  pid, tgid, process,
+							  machine, mmap_data,
+							  proc_map_timeout,
+							  thread);
 	}
 
 	if (machine__is_default_guest(machine))
@@ -572,25 +586,30 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 					     &tgid, &ppid) != 0)
 			break;
 
-		if (perf_event__synthesize_fork(tool, fork_event, _pid, tgid,
-						ppid, process, machine) < 0)
+		if (perf_event__synthesize_fork(tool, fork_event, _pid,
+						tgid, ppid, process,
+						machine, thread) < 0)
 			break;
 
-		if (perf_event__synthesize_namespaces(tool, namespaces_event, _pid,
-						      tgid, process, machine) < 0)
+		if (perf_event__synthesize_namespaces(tool, namespaces_event,
+						      _pid, tgid, process,
+						      machine, thread) < 0)
 			break;
 
 		/*
 		 * Send the prepared comm event
 		 */
-		if (perf_tool__process_synth_event(tool, comm_event, machine, process) != 0)
+		if (perf_tool__process_synth_event(tool, comm_event, machine,
+						   process, thread) != 0)
 			break;
 
 		rc = 0;
 		if (_pid == pid) {
 			/* process the parent's maps too */
-			rc = perf_event__synthesize_mmap_events(tool, mmap_event, pid, tgid,
-						process, machine, mmap_data, proc_map_timeout);
+			rc = perf_event__synthesize_mmap_events(tool,
+						mmap_event, pid, tgid,
+						process, machine, mmap_data,
+						proc_map_timeout, thread);
 			if (rc)
 				break;
 		}
@@ -633,9 +652,10 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 	for (thread = 0; thread < threads->nr; ++thread) {
 		if (__event__synthesize_thread(comm_event, mmap_event,
 					       fork_event, namespaces_event,
-					       thread_map__pid(threads, thread), 0,
-					       process, tool, machine,
-					       mmap_data, proc_map_timeout)) {
+					       thread_map__pid(threads, thread),
+					       0, process, tool, machine,
+					       mmap_data, proc_map_timeout,
+					       NULL)) {
 			err = -1;
 			break;
 		}
@@ -658,10 +678,13 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 			/* if not, generate events for it */
 			if (need_leader &&
 			    __event__synthesize_thread(comm_event, mmap_event,
-						       fork_event, namespaces_event,
+						       fork_event,
+						       namespaces_event,
 						       comm_event->comm.pid, 0,
 						       process, tool, machine,
-						       mmap_data, proc_map_timeout)) {
+						       mmap_data,
+						       proc_map_timeout,
+						       NULL)) {
 				err = -1;
 				break;
 			}
@@ -684,8 +707,8 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
 					    bool mmap_data,
 					    unsigned int proc_map_timeout,
 					    struct dirent **dirent,
-					    int start,
-					    int num)
+					    int start, int num,
+					    struct thread_info *thread)
 {
 	union perf_event *comm_event, *mmap_event, *fork_event;
 	union perf_event *namespaces_event;
@@ -727,7 +750,7 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
 		__event__synthesize_thread(comm_event, mmap_event, fork_event,
 					   namespaces_event, pid, 1, process,
 					   tool, machine, mmap_data,
-					   proc_map_timeout);
+					   proc_map_timeout, thread);
 	}
 	err = 0;
 
@@ -751,6 +774,7 @@ struct synthesize_threads_arg {
 	struct dirent **dirent;
 	int num;
 	int start;
+	struct thread_info thread;
 };
 
 static void *synthesize_threads_worker(void *arg)
@@ -760,7 +784,7 @@ static void *synthesize_threads_worker(void *arg)
 	__perf_event__synthesize_threads(args->tool, args->process,
 					 args->machine, args->mmap_data,
 					 args->proc_map_timeout, args->dirent,
-					 args->start, args->num);
+					 args->start, args->num, &args->thread);
 	return NULL;
 }
 
@@ -799,7 +823,7 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 		err = __perf_event__synthesize_threads(tool, process,
 						       machine, mmap_data,
 						       proc_map_timeout,
-						       dirent, base, n);
+						       dirent, base, n, NULL);
 		goto free_dirent;
 	}
 	if (thread_nr > n)
@@ -822,6 +846,7 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 		args[i].mmap_data = mmap_data;
 		args[i].proc_map_timeout = proc_map_timeout;
 		args[i].dirent = dirent;
+		args[i].thread.idx = i;
 	}
 	for (i = 0; i < m; i++) {
 		args[i].num = num_per_thread + 1;
@@ -940,7 +965,8 @@ int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
 	event->mmap.len   = map->end - event->mmap.start;
 	event->mmap.pid   = machine->pid;
 
-	err = perf_tool__process_synth_event(tool, event, machine, process);
+	err = perf_tool__process_synth_event(tool, event, machine,
+					     process, NULL);
 	free(event);
 
 	return err;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 200f3f8..6e7b08b3 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -785,13 +785,15 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
 pid_t perf_event__synthesize_comm(struct perf_tool *tool,
 				  union perf_event *event, pid_t pid,
 				  perf_event__handler_t process,
-				  struct machine *machine);
+				  struct machine *machine,
+				  struct thread_info *thread);
 
 int perf_event__synthesize_namespaces(struct perf_tool *tool,
 				      union perf_event *event,
 				      pid_t pid, pid_t tgid,
 				      perf_event__handler_t process,
-				      struct machine *machine);
+				      struct machine *machine,
+				      struct thread_info *thread);
 
 int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       union perf_event *event,
@@ -799,7 +801,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine,
 				       bool mmap_data,
-				       unsigned int proc_map_timeout);
+				       unsigned int proc_map_timeout,
+				       struct thread_info *thread);
 
 size_t perf_event__fprintf_comm(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/4] perf record: event synthesization multithreading support
  2017-10-13 14:09 [PATCH 0/4] event synthesization multithreading for perf record kan.liang
  2017-10-13 14:09 ` [PATCH 1/4] perf tools: pass thread info to process function kan.liang
  2017-10-13 14:09 ` [PATCH 2/4] perf tools: pass thread info in event synthesization kan.liang
@ 2017-10-13 14:09 ` kan.liang
  2017-10-13 14:38   ` Arnaldo Carvalho de Melo
  2017-10-13 14:09 ` [PATCH 4/4] perf record: add option to set the number of thread for event synthesize kan.liang
  3 siblings, 1 reply; 9+ messages in thread
From: kan.liang @ 2017-10-13 14:09 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The process function process_synthesized_event writes the process
result to perf.data, which is not multithreading friendly.

Realloc buffer for each thread to temporarily keep the processing
result. Write them to the perf.data at the end of event synthesization.
The new method doesn't impact the final result, because
 - The order of the synthesized event is not important.
 - The number of synthesized event is limited. Usually, it only needs
   hundreds of Kilobyte to store all the synthesized event.
   It's unlikly failed because of lack of memory.

The threads number hard code to online CPU number. The following patch
will introduce an option to set it.

The multithreading synthesize is only available for per cpu monitoring.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/builtin-record.c | 86 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 81 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4ede9bf..1978021 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -62,6 +62,11 @@ struct switch_output {
 	bool		 set;
 };
 
+struct synthesize_buf {
+	void		*entry;
+	size_t		size;
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -80,6 +85,7 @@ struct record {
 	bool			timestamp_filename;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
+	struct synthesize_buf	*buf;
 };
 
 static volatile int auxtrace_record__snapshot_started;
@@ -120,14 +126,35 @@ static int record__write(struct record *rec, void *bf, size_t size)
 	return 0;
 }
 
+static int buf__write(struct record *rec, int idx, void *bf, size_t size)
+{
+	size_t target_size = rec->buf[idx].size;
+	void *target;
+
+	target = realloc(rec->buf[idx].entry, target_size + size);
+	if (target == NULL)
+		return -ENOMEM;
+
+	memcpy(target + target_size, bf, size);
+
+	rec->buf[idx].entry = target;
+	rec->buf[idx].size += size;
+
+	return 0;
+}
+
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
 				     struct machine *machine __maybe_unused,
-				     struct thread_info *thread __maybe_unused)
+				     struct thread_info *thread)
 {
 	struct record *rec = container_of(tool, struct record, tool);
-	return record__write(rec, event, event->header.size);
+
+	if (!perf_singlethreaded && thread)
+		return buf__write(rec, thread->idx, event, event->header.size);
+	else
+		return record__write(rec, event, event->header.size);
 }
 
 static int record__pushfn(void *to, void *bf, size_t size)
@@ -690,6 +717,48 @@ static const struct perf_event_mmap_page *record__pick_pc(struct record *rec)
 	return NULL;
 }
 
+static int record__multithread_synthesize(struct record *rec,
+					  struct machine *machine,
+					  struct perf_tool *tool,
+					  struct record_opts *opts)
+{
+	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+
+	if (nr_thread > 1) {
+		perf_set_multithreaded();
+
+		rec->buf = calloc(nr_thread, sizeof(struct synthesize_buf));
+		if (rec->buf == NULL) {
+			pr_err("Could not do multithread synthesize\n");
+			nr_thread = 1;
+			perf_set_singlethreaded();
+		}
+	}
+
+	err = __machine__synthesize_threads(machine, tool, &opts->target,
+					    rec->evlist->threads,
+					    process_synthesized_event,
+					    opts->sample_address,
+					    opts->proc_map_timeout, nr_thread);
+	if (err < 0)
+		goto free;
+
+	if (nr_thread > 1) {
+		for (i = 0; i < nr_thread; i++)
+			record__write(rec, rec->buf[i].entry, rec->buf[i].size);
+	}
+
+free:
+	if (nr_thread > 1) {
+		for (i = 0; i < nr_thread; i++)
+			free(rec->buf[i].entry);
+		free(rec->buf);
+		perf_set_singlethreaded();
+	}
+
+	return err;
+}
+
 static int record__synthesize(struct record *rec, bool tail)
 {
 	struct perf_session *session = rec->session;
@@ -766,9 +835,16 @@ static int record__synthesize(struct record *rec, bool tail)
 					 perf_event__synthesize_guest_os, tool);
 	}
 
-	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
-					    process_synthesized_event, opts->sample_address,
-					    opts->proc_map_timeout, 1);
+	/* multithreading synthesize is only available for cpu monitoring */
+	if (target__has_cpu(&opts->target))
+		err = record__multithread_synthesize(rec, machine, tool, opts);
+	else
+		err = __machine__synthesize_threads(machine, tool,
+						    &opts->target,
+						    rec->evlist->threads,
+						    process_synthesized_event,
+						    opts->sample_address,
+						    opts->proc_map_timeout, 1);
 out:
 	return err;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/4] perf record: add option to set the number of thread for event synthesize
  2017-10-13 14:09 [PATCH 0/4] event synthesization multithreading for perf record kan.liang
                   ` (2 preceding siblings ...)
  2017-10-13 14:09 ` [PATCH 3/4] perf record: event synthesization multithreading support kan.liang
@ 2017-10-13 14:09 ` kan.liang
  3 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2017-10-13 14:09 UTC (permalink / raw)
  To: acme, peterz, mingo, linux-kernel
  Cc: jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Using UINT_MAX to indicate the default thread#, which is the number of
online CPU.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/Documentation/perf-record.txt |  4 ++++
 tools/perf/builtin-record.c              | 13 +++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 68a1ffb..f759dc4 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -483,6 +483,10 @@ config terms. For example: 'cycles/overwrite/' and 'instructions/no-overwrite/'.
 
 Implies --tail-synthesize.
 
+--num-thread-synthesize::
+The number of threads to run event synthesize.
+By default, the number of threads equals to the online CPU number.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1978021..98bb547 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -86,6 +86,7 @@ struct record {
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	struct synthesize_buf	*buf;
+	unsigned int		nr_threads_synthesize;
 };
 
 static volatile int auxtrace_record__snapshot_started;
@@ -722,7 +723,12 @@ static int record__multithread_synthesize(struct record *rec,
 					  struct perf_tool *tool,
 					  struct record_opts *opts)
 {
-	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	int i, err, nr_thread;
+
+	if (rec->nr_threads_synthesize == UINT_MAX)
+		nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	else
+		nr_thread = rec->nr_threads_synthesize;
 
 	if (nr_thread > 1) {
 		perf_set_multithreaded();
@@ -836,7 +842,7 @@ static int record__synthesize(struct record *rec, bool tail)
 	}
 
 	/* multithreading synthesize is only available for cpu monitoring */
-	if (target__has_cpu(&opts->target))
+	if (target__has_cpu(&opts->target) && (rec->nr_threads_synthesize > 1))
 		err = record__multithread_synthesize(rec, machine, tool, opts);
 	else
 		err = __machine__synthesize_threads(machine, tool,
@@ -1521,6 +1527,7 @@ static struct record record = {
 		.mmap2		= perf_event__process_mmap2,
 		.ordered_events	= true,
 	},
+	.nr_threads_synthesize = UINT_MAX,
 };
 
 const char record_callchain_help[] = CALLCHAIN_RECORD_HELP
@@ -1662,6 +1669,8 @@ static struct option __record_options[] = {
 			  "signal"),
 	OPT_BOOLEAN(0, "dry-run", &dry_run,
 		    "Parse options then exit"),
+	OPT_UINTEGER(0, "num-thread-synthesize", &record.nr_threads_synthesize,
+			"number of thread to run event synthesize"),
 	OPT_END()
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] perf record: event synthesization multithreading support
  2017-10-13 14:09 ` [PATCH 3/4] perf record: event synthesization multithreading support kan.liang
@ 2017-10-13 14:38   ` Arnaldo Carvalho de Melo
  2017-10-13 14:58     ` Liang, Kan
  2017-10-13 20:06     ` Ingo Molnar
  0 siblings, 2 replies; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2017-10-13 14:38 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, linux-kernel, jolsa, wangnan0, hekuang, namhyung,
	alexander.shishkin, adrian.hunter, ak

Em Fri, Oct 13, 2017 at 07:09:26AM -0700, kan.liang@intel.com escreveu:
> From: Kan Liang <Kan.liang@intel.com>
> 
> The process function process_synthesized_event writes the process
> result to perf.data, which is not multithreading friendly.
> 
> Realloc buffer for each thread to temporarily keep the processing
> result. Write them to the perf.data at the end of event synthesization.
> The new method doesn't impact the final result, because
>  - The order of the synthesized event is not important.
>  - The number of synthesized event is limited. Usually, it only needs
>    hundreds of Kilobyte to store all the synthesized event.
>    It's unlikly failed because of lack of memory.

Why not write to a per cpu file and then at the end merge them? Leave
the memory management to the kernel, i.e. in most cases you may even not
end up touching the disk, when memory is plentiful, just rewind the per
event files and go on dumping to the main perf.data file.

At some point we may just don't do this merging, and keep per cpu files
all the way to perf report, etc. This would be a first foray into
that...

- Arnaldo
 
> The threads number hard code to online CPU number. The following patch
> will introduce an option to set it.
> 
> The multithreading synthesize is only available for per cpu monitoring.
> 
> Signed-off-by: Kan Liang <Kan.liang@intel.com>
> ---
>  tools/perf/builtin-record.c | 86 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 81 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 4ede9bf..1978021 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -62,6 +62,11 @@ struct switch_output {
>  	bool		 set;
>  };
>  
> +struct synthesize_buf {
> +	void		*entry;
> +	size_t		size;
> +};
> +
>  struct record {
>  	struct perf_tool	tool;
>  	struct record_opts	opts;
> @@ -80,6 +85,7 @@ struct record {
>  	bool			timestamp_filename;
>  	struct switch_output	switch_output;
>  	unsigned long long	samples;
> +	struct synthesize_buf	*buf;
>  };
>  
>  static volatile int auxtrace_record__snapshot_started;
> @@ -120,14 +126,35 @@ static int record__write(struct record *rec, void *bf, size_t size)
>  	return 0;
>  }
>  
> +static int buf__write(struct record *rec, int idx, void *bf, size_t size)
> +{
> +	size_t target_size = rec->buf[idx].size;
> +	void *target;
> +
> +	target = realloc(rec->buf[idx].entry, target_size + size);
> +	if (target == NULL)
> +		return -ENOMEM;
> +
> +	memcpy(target + target_size, bf, size);
> +
> +	rec->buf[idx].entry = target;
> +	rec->buf[idx].size += size;
> +
> +	return 0;
> +}
> +
>  static int process_synthesized_event(struct perf_tool *tool,
>  				     union perf_event *event,
>  				     struct perf_sample *sample __maybe_unused,
>  				     struct machine *machine __maybe_unused,
> -				     struct thread_info *thread __maybe_unused)
> +				     struct thread_info *thread)
>  {
>  	struct record *rec = container_of(tool, struct record, tool);
> -	return record__write(rec, event, event->header.size);
> +
> +	if (!perf_singlethreaded && thread)
> +		return buf__write(rec, thread->idx, event, event->header.size);
> +	else
> +		return record__write(rec, event, event->header.size);
>  }
>  
>  static int record__pushfn(void *to, void *bf, size_t size)
> @@ -690,6 +717,48 @@ static const struct perf_event_mmap_page *record__pick_pc(struct record *rec)
>  	return NULL;
>  }
>  
> +static int record__multithread_synthesize(struct record *rec,
> +					  struct machine *machine,
> +					  struct perf_tool *tool,
> +					  struct record_opts *opts)
> +{
> +	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
> +
> +	if (nr_thread > 1) {
> +		perf_set_multithreaded();
> +
> +		rec->buf = calloc(nr_thread, sizeof(struct synthesize_buf));
> +		if (rec->buf == NULL) {
> +			pr_err("Could not do multithread synthesize\n");
> +			nr_thread = 1;
> +			perf_set_singlethreaded();
> +		}
> +	}
> +
> +	err = __machine__synthesize_threads(machine, tool, &opts->target,
> +					    rec->evlist->threads,
> +					    process_synthesized_event,
> +					    opts->sample_address,
> +					    opts->proc_map_timeout, nr_thread);
> +	if (err < 0)
> +		goto free;
> +
> +	if (nr_thread > 1) {
> +		for (i = 0; i < nr_thread; i++)
> +			record__write(rec, rec->buf[i].entry, rec->buf[i].size);
> +	}
> +
> +free:
> +	if (nr_thread > 1) {
> +		for (i = 0; i < nr_thread; i++)
> +			free(rec->buf[i].entry);
> +		free(rec->buf);
> +		perf_set_singlethreaded();
> +	}
> +
> +	return err;
> +}
> +
>  static int record__synthesize(struct record *rec, bool tail)
>  {
>  	struct perf_session *session = rec->session;
> @@ -766,9 +835,16 @@ static int record__synthesize(struct record *rec, bool tail)
>  					 perf_event__synthesize_guest_os, tool);
>  	}
>  
> -	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
> -					    process_synthesized_event, opts->sample_address,
> -					    opts->proc_map_timeout, 1);
> +	/* multithreading synthesize is only available for cpu monitoring */
> +	if (target__has_cpu(&opts->target))
> +		err = record__multithread_synthesize(rec, machine, tool, opts);
> +	else
> +		err = __machine__synthesize_threads(machine, tool,
> +						    &opts->target,
> +						    rec->evlist->threads,
> +						    process_synthesized_event,
> +						    opts->sample_address,
> +						    opts->proc_map_timeout, 1);
>  out:
>  	return err;
>  }
> -- 
> 2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH 3/4] perf record: event synthesization multithreading support
  2017-10-13 14:38   ` Arnaldo Carvalho de Melo
@ 2017-10-13 14:58     ` Liang, Kan
  2017-10-13 15:09       ` Arnaldo Carvalho de Melo
  2017-10-13 20:06     ` Ingo Molnar
  1 sibling, 1 reply; 9+ messages in thread
From: Liang, Kan @ 2017-10-13 14:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: peterz@infradead.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, jolsa@kernel.org,
	wangnan0@huawei.com, hekuang@huawei.com, namhyung@kernel.org,
	alexander.shishkin@linux.intel.com, Hunter, Adrian,
	ak@linux.intel.com

> Em Fri, Oct 13, 2017 at 07:09:26AM -0700, kan.liang@intel.com escreveu:
> > From: Kan Liang <Kan.liang@intel.com>
> >
> > The process function process_synthesized_event writes the process
> > result to perf.data, which is not multithreading friendly.
> >
> > Realloc buffer for each thread to temporarily keep the processing
> > result. Write them to the perf.data at the end of event synthesization.
> > The new method doesn't impact the final result, because
> >  - The order of the synthesized event is not important.
> >  - The number of synthesized event is limited. Usually, it only needs
> >    hundreds of Kilobyte to store all the synthesized event.
> >    It's unlikly failed because of lack of memory.
> 
> Why not write to a per cpu file and then at the end merge them? 

I just thought merging from memory should be faster than merging from files.

> Leave
> the memory management to the kernel, i.e. in most cases you may even not
> end up touching the disk, when memory is plentiful, just rewind the per
> event files and go on dumping to the main perf.data file.
>

Agree. I will do it.

> At some point we may just don't do this merging, and keep per cpu files
> all the way to perf report, etc. This would be a first foray into
> that...
>

Yes, but it's a little bit complex.
I think I will do it in a separate improvement patch series later.
For now, I will do the file merge.

Thanks,
Kan
 
> - Arnaldo
> 
> > The threads number hard code to online CPU number. The following patch
> > will introduce an option to set it.
> >
> > The multithreading synthesize is only available for per cpu monitoring.
> >
> > Signed-off-by: Kan Liang <Kan.liang@intel.com>
> > ---
> >  tools/perf/builtin-record.c | 86
> ++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 81 insertions(+), 5 deletions(-)
> >
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index 4ede9bf..1978021 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -62,6 +62,11 @@ struct switch_output {
> >  	bool		 set;
> >  };
> >
> > +struct synthesize_buf {
> > +	void		*entry;
> > +	size_t		size;
> > +};
> > +
> >  struct record {
> >  	struct perf_tool	tool;
> >  	struct record_opts	opts;
> > @@ -80,6 +85,7 @@ struct record {
> >  	bool			timestamp_filename;
> >  	struct switch_output	switch_output;
> >  	unsigned long long	samples;
> > +	struct synthesize_buf	*buf;
> >  };
> >
> >  static volatile int auxtrace_record__snapshot_started;
> > @@ -120,14 +126,35 @@ static int record__write(struct record *rec, void
> *bf, size_t size)
> >  	return 0;
> >  }
> >
> > +static int buf__write(struct record *rec, int idx, void *bf, size_t size)
> > +{
> > +	size_t target_size = rec->buf[idx].size;
> > +	void *target;
> > +
> > +	target = realloc(rec->buf[idx].entry, target_size + size);
> > +	if (target == NULL)
> > +		return -ENOMEM;
> > +
> > +	memcpy(target + target_size, bf, size);
> > +
> > +	rec->buf[idx].entry = target;
> > +	rec->buf[idx].size += size;
> > +
> > +	return 0;
> > +}
> > +
> >  static int process_synthesized_event(struct perf_tool *tool,
> >  				     union perf_event *event,
> >  				     struct perf_sample *sample
> __maybe_unused,
> >  				     struct machine *machine
> __maybe_unused,
> > -				     struct thread_info *thread
> __maybe_unused)
> > +				     struct thread_info *thread)
> >  {
> >  	struct record *rec = container_of(tool, struct record, tool);
> > -	return record__write(rec, event, event->header.size);
> > +
> > +	if (!perf_singlethreaded && thread)
> > +		return buf__write(rec, thread->idx, event, event-
> >header.size);
> > +	else
> > +		return record__write(rec, event, event->header.size);
> >  }
> >
> >  static int record__pushfn(void *to, void *bf, size_t size)
> > @@ -690,6 +717,48 @@ static const struct perf_event_mmap_page
> *record__pick_pc(struct record *rec)
> >  	return NULL;
> >  }
> >
> > +static int record__multithread_synthesize(struct record *rec,
> > +					  struct machine *machine,
> > +					  struct perf_tool *tool,
> > +					  struct record_opts *opts)
> > +{
> > +	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
> > +
> > +	if (nr_thread > 1) {
> > +		perf_set_multithreaded();
> > +
> > +		rec->buf = calloc(nr_thread, sizeof(struct synthesize_buf));
> > +		if (rec->buf == NULL) {
> > +			pr_err("Could not do multithread synthesize\n");
> > +			nr_thread = 1;
> > +			perf_set_singlethreaded();
> > +		}
> > +	}
> > +
> > +	err = __machine__synthesize_threads(machine, tool, &opts->target,
> > +					    rec->evlist->threads,
> > +					    process_synthesized_event,
> > +					    opts->sample_address,
> > +					    opts->proc_map_timeout,
> nr_thread);
> > +	if (err < 0)
> > +		goto free;
> > +
> > +	if (nr_thread > 1) {
> > +		for (i = 0; i < nr_thread; i++)
> > +			record__write(rec, rec->buf[i].entry, rec->buf[i].size);
> > +	}
> > +
> > +free:
> > +	if (nr_thread > 1) {
> > +		for (i = 0; i < nr_thread; i++)
> > +			free(rec->buf[i].entry);
> > +		free(rec->buf);
> > +		perf_set_singlethreaded();
> > +	}
> > +
> > +	return err;
> > +}
> > +
> >  static int record__synthesize(struct record *rec, bool tail)
> >  {
> >  	struct perf_session *session = rec->session;
> > @@ -766,9 +835,16 @@ static int record__synthesize(struct record *rec,
> bool tail)
> >  					 perf_event__synthesize_guest_os,
> tool);
> >  	}
> >
> > -	err = __machine__synthesize_threads(machine, tool, &opts->target,
> rec->evlist->threads,
> > -					    process_synthesized_event, opts-
> >sample_address,
> > -					    opts->proc_map_timeout, 1);
> > +	/* multithreading synthesize is only available for cpu monitoring */
> > +	if (target__has_cpu(&opts->target))
> > +		err = record__multithread_synthesize(rec, machine, tool,
> opts);
> > +	else
> > +		err = __machine__synthesize_threads(machine, tool,
> > +						    &opts->target,
> > +						    rec->evlist->threads,
> > +						    process_synthesized_event,
> > +						    opts->sample_address,
> > +						    opts->proc_map_timeout,
> 1);
> >  out:
> >  	return err;
> >  }
> > --
> > 2.7.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] perf record: event synthesization multithreading support
  2017-10-13 14:58     ` Liang, Kan
@ 2017-10-13 15:09       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2017-10-13 15:09 UTC (permalink / raw)
  To: Liang, Kan
  Cc: peterz@infradead.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, jolsa@kernel.org,
	wangnan0@huawei.com, hekuang@huawei.com, namhyung@kernel.org,
	alexander.shishkin@linux.intel.com, Hunter, Adrian,
	ak@linux.intel.com

Em Fri, Oct 13, 2017 at 02:58:40PM +0000, Liang, Kan escreveu:
> > Em Fri, Oct 13, 2017 at 07:09:26AM -0700, kan.liang@intel.com escreveu:
> > > From: Kan Liang <Kan.liang@intel.com>
> > >
> > > The process function process_synthesized_event writes the process
> > > result to perf.data, which is not multithreading friendly.
> > >
> > > Realloc buffer for each thread to temporarily keep the processing
> > > result. Write them to the perf.data at the end of event synthesization.
> > > The new method doesn't impact the final result, because
> > >  - The order of the synthesized event is not important.
> > >  - The number of synthesized event is limited. Usually, it only needs
> > >    hundreds of Kilobyte to store all the synthesized event.
> > >    It's unlikly failed because of lack of memory.
> > 
> > Why not write to a per cpu file and then at the end merge them? 
> 
> I just thought merging from memory should be faster than merging from files.
> 
> > Leave
> > the memory management to the kernel, i.e. in most cases you may even not
> > end up touching the disk, when memory is plentiful, just rewind the per
> > event files and go on dumping to the main perf.data file.
 
> Agree. I will do it.
 
> > At some point we may just don't do this merging, and keep per cpu files
> > all the way to perf report, etc. This would be a first foray into
> > that...
 
> Yes, but it's a little bit complex.

Sure thing :-)

I think I handwaved an outline in a previous patch discussion, but
basically we already read and immediately process events from multiple
perf_mmap entries that are being produced by the kernel in 'perf top' we
need to continue doing that, but with a thread for each perf_mmap, and
then probably use the ordered_samples code with some algorithm to pick a
"PERF_RECORD_FINISHED_ROUND" at regular intervals to order the batch and
then feed it to the hists code, something along those lines.

So doing it first for 'perf top' seems better, then for 'record' and
'report' (that is what 'top' is) its just to unplug the live event
processing and instead dump for multiple files, for report then it
becomes the 'perf top' event processing part, i.e. one thread per file +
ordered_samples.

> I think I will do it in a separate improvement patch series later.
> For now, I will do the file merge.

Thanks

- Arnaldo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/4] perf record: event synthesization multithreading support
  2017-10-13 14:38   ` Arnaldo Carvalho de Melo
  2017-10-13 14:58     ` Liang, Kan
@ 2017-10-13 20:06     ` Ingo Molnar
  1 sibling, 0 replies; 9+ messages in thread
From: Ingo Molnar @ 2017-10-13 20:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: kan.liang, peterz, mingo, linux-kernel, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak


* Arnaldo Carvalho de Melo <acme@kernel.org> wrote:

> Em Fri, Oct 13, 2017 at 07:09:26AM -0700, kan.liang@intel.com escreveu:
> > From: Kan Liang <Kan.liang@intel.com>
> > 
> > The process function process_synthesized_event writes the process
> > result to perf.data, which is not multithreading friendly.
> > 
> > Realloc buffer for each thread to temporarily keep the processing
> > result. Write them to the perf.data at the end of event synthesization.
> > The new method doesn't impact the final result, because
> >  - The order of the synthesized event is not important.
> >  - The number of synthesized event is limited. Usually, it only needs
> >    hundreds of Kilobyte to store all the synthesized event.
> >    It's unlikly failed because of lack of memory.
> 
> Why not write to a per cpu file and then at the end merge them? Leave
> the memory management to the kernel, i.e. in most cases you may even not
> end up touching the disk, when memory is plentiful, just rewind the per
> event files and go on dumping to the main perf.data file.
> 
> At some point we may just don't do this merging, and keep per cpu files
> all the way to perf report, etc. This would be a first foray into
> that...

Yeah, that sounds like a really good zero-copy recording scheme, and I bet it 
would scale like a crazed bat out of hell!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-10-13 20:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-10-13 14:09 [PATCH 0/4] event synthesization multithreading for perf record kan.liang
2017-10-13 14:09 ` [PATCH 1/4] perf tools: pass thread info to process function kan.liang
2017-10-13 14:09 ` [PATCH 2/4] perf tools: pass thread info in event synthesization kan.liang
2017-10-13 14:09 ` [PATCH 3/4] perf record: event synthesization multithreading support kan.liang
2017-10-13 14:38   ` Arnaldo Carvalho de Melo
2017-10-13 14:58     ` Liang, Kan
2017-10-13 15:09       ` Arnaldo Carvalho de Melo
2017-10-13 20:06     ` Ingo Molnar
2017-10-13 14:09 ` [PATCH 4/4] perf record: add option to set the number of thread for event synthesize kan.liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox