[PATCH V2 0/5] event synthesization multithreading for perf record

* [PATCH V2 0/5] event synthesization multithreading for perf record
@ 2017-10-18 14:29 kan.liang
  2017-10-18 14:29 ` [PATCH V2 1/5] perf tools: pass thread info to process function kan.liang
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: kan.liang @ 2017-10-18 14:29 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Multithreaded event synthesization was introduced in the
"perf top optimization" series (https://lkml.org/lkml/2017/9/29/269),
but it was not enabled for perf record because the process function,
process_synthesized_event, was not multithreading friendly.

This patch series temporarily stores the processed results in per-thread
files, which allows the processing to run in parallel. At the end of event
synthesization, it dumps the files one by one into perf.data.
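
As a rough illustration of that idea (not the code in this series), the
sketch below gives each worker its own temporary file and appends the files
to the final output in index order once all workers are done. The names
synth_output, NR_SYNTH_THREADS and the helper functions are hypothetical,
and plain stdio is assumed in place of perf's own file helpers.

```c
#include <stdio.h>

#define NR_SYNTH_THREADS 4	/* hypothetical worker count */

/* Hypothetical per-thread state: one private temporary file per worker. */
struct synth_output {
	FILE *tmp[NR_SYNTH_THREADS];
};

/* Open one temporary file per worker; undo everything on failure. */
static int synth_output__open(struct synth_output *out)
{
	for (int i = 0; i < NR_SYNTH_THREADS; i++) {
		out->tmp[i] = tmpfile();
		if (!out->tmp[i]) {
			while (i-- > 0)
				fclose(out->tmp[i]);
			return -1;
		}
	}
	return 0;
}

/*
 * Called by worker idx only. Each worker touches just its own file,
 * so no locking is needed here.
 */
static int synth_output__write(struct synth_output *out, int idx,
			       const void *buf, size_t size)
{
	return fwrite(buf, 1, size, out->tmp[idx]) == size ? 0 : -1;
}

/*
 * Called once, after all workers have finished: append every
 * per-thread file to dst, one by one, in index order.
 */
static int synth_output__merge(struct synth_output *out, FILE *dst)
{
	char buf[4096];
	size_t n;

	for (int i = 0; i < NR_SYNTH_THREADS; i++) {
		rewind(out->tmp[i]);
		while ((n = fread(buf, 1, sizeof(buf), out->tmp[i])) > 0) {
			if (fwrite(buf, 1, n, dst) != n)
				return -1;
		}
		if (ferror(out->tmp[i]))
			return -1;
		fclose(out->tmp[i]);
		out->tmp[i] = NULL;
	}
	return 0;
}
```

The merge is sequential on purpose: only the synthesization itself runs in
parallel, while the final file is written by a single thread.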

The source code is also available at
https://github.com/kliang2/perf.git perf_record_opt

Usually, event synthesization happens only once, either at start or at end.
With the snapshotting code, we synthesize events multiple times, once per
new perf.data file. Both cases have been verified.

Here are the latency test results on Knights Mill and Skylake server.

The workload is a Linux kernel build, started as follows:
"sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
Then run "sudo perf record -e cycles -a -- sleep 1".

The latency is the time spent in __machine__synthesize_threads or
its multithreaded replacement, record__multithread_synthesize.

- Latency on Knights Mill (272 CPUs)

Original(s)     With patch(s)   Speedup
12.74           5.54            2.3X

- Latency on Skylake server (192 CPUs)

Original(s)     With patch(s)   Speedup
0.36            0.25            1.47X

Changes since V1:
 - Dump the synthesized results to per-thread files and merge them into
   perf.data at the end. (Arnaldo)

Kan Liang (5):
  perf tools: pass thread info to process function
  perf tools: pass thread info in event synthesization
  perf tools: expose copyfile_offset()
  perf record: synthesize event multithreading support
  perf record: add option to set the number of thread for event
    synthesize

 tools/perf/Documentation/perf-record.txt |   4 +
 tools/perf/arch/x86/util/tsc.c           |   2 +-
 tools/perf/builtin-inject.c              |  12 ++-
 tools/perf/builtin-record.c              | 129 ++++++++++++++++++++++++++++---
 tools/perf/builtin-sched.c               |  12 ++-
 tools/perf/builtin-stat.c                |   3 +-
 tools/perf/builtin-trace.c               |   3 +-
 tools/perf/tests/cpumap.c                |   6 +-
 tools/perf/tests/dwarf-unwind.c          |   6 +-
 tools/perf/tests/event_update.c          |  12 ++-
 tools/perf/tests/stat.c                  |   9 ++-
 tools/perf/tests/thread-map.c            |   3 +-
 tools/perf/util/auxtrace.c               |   2 +-
 tools/perf/util/event.c                  | 111 ++++++++++++++++----------
 tools/perf/util/event.h                  |  19 +++--
 tools/perf/util/header.c                 |  16 ++--
 tools/perf/util/intel-bts.c              |   3 +-
 tools/perf/util/intel-pt.c               |   3 +-
 tools/perf/util/session.c                |   4 +-
 tools/perf/util/util.c                   |   2 +-
 tools/perf/util/util.h                   |   2 +
 21 files changed, 271 insertions(+), 92 deletions(-)

-- 
2.7.4


* [PATCH V2 1/5] perf tools: pass thread info to process function
  2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
@ 2017-10-18 14:29 ` kan.liang
  2017-10-18 14:29 ` [PATCH V2 2/5] perf tools: pass thread info in event synthesization kan.liang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: kan.liang @ 2017-10-18 14:29 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

For multithreading, the process function needs thread-related
information, e.g. to save the processed result to the buffer or file
that belongs to a specific thread.

Add a struct thread_info parameter to the process function.
Currently, it only includes the thread index.
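
To illustrate how a process function might use the new parameter, here is a
minimal, simplified sketch. Only struct thread_info matches this patch;
synth_bufs, process_event_sketch and the fixed-size buffers are hypothetical
stand-ins, and treating a NULL thread as "use slot 0" is an assumption made
for the example, not something this patch defines.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYNTH_THREADS	8	/* hypothetical upper bound */
#define SYNTH_BUF_SIZE		256	/* hypothetical buffer size */

/* Same layout as the struct added by this patch: only the thread index. */
struct thread_info {
	int	idx;
};

/* Hypothetical per-thread buffers, standing in for per-thread files. */
struct synth_bufs {
	char	data[MAX_SYNTH_THREADS][SYNTH_BUF_SIZE];
	size_t	used[MAX_SYNTH_THREADS];
};

/*
 * Simplified process function. The thread index selects the buffer to
 * append to. Assumption: a NULL thread means the caller is not
 * multithreaded, so slot 0 is used.
 */
static int process_event_sketch(struct synth_bufs *bufs, const void *event,
				size_t size, struct thread_info *thread)
{
	int idx = thread ? thread->idx : 0;

	if (idx < 0 || idx >= MAX_SYNTH_THREADS)
		return -1;
	if (size > SYNTH_BUF_SIZE - bufs->used[idx])
		return -1;

	memcpy(bufs->data[idx] + bufs->used[idx], event, size);
	bufs->used[idx] += size;
	return 0;
}
```

Because every thread writes only to the slot named by its own index, the
handler needs no lock around the copy.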

perf_event__repipe is shared by the process function and the event_op of
perf_tool in builtin-inject.c. Add a dedicated process function,
perf_event__repipe_threads.

No functional change.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/arch/x86/util/tsc.c  |  2 +-
 tools/perf/builtin-inject.c     | 12 +++++++++++-
 tools/perf/builtin-record.c     |  3 ++-
 tools/perf/builtin-sched.c      | 12 ++++++++----
 tools/perf/builtin-stat.c       |  3 ++-
 tools/perf/builtin-trace.c      |  3 ++-
 tools/perf/tests/cpumap.c       |  6 ++++--
 tools/perf/tests/dwarf-unwind.c |  3 ++-
 tools/perf/tests/event_update.c | 12 ++++++++----
 tools/perf/tests/stat.c         |  9 ++++++---
 tools/perf/tests/thread-map.c   |  3 ++-
 tools/perf/util/auxtrace.c      |  2 +-
 tools/perf/util/event.c         | 15 ++++++++-------
 tools/perf/util/event.h         | 10 ++++++++--
 tools/perf/util/header.c        | 16 ++++++++--------
 tools/perf/util/intel-bts.c     |  3 ++-
 tools/perf/util/intel-pt.c      |  3 ++-
 tools/perf/util/session.c       |  4 ++--
 18 files changed, 79 insertions(+), 42 deletions(-)

diff --git a/tools/perf/arch/x86/util/tsc.c b/tools/perf/arch/x86/util/tsc.c
index 2e5567c..0affc0f 100644
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@@ -76,5 +76,5 @@ int perf_event__synth_time_conv(const struct perf_event_mmap_page *pc,
 	event.time_conv.time_shift = tc.time_shift;
 	event.time_conv.time_zero  = tc.time_zero;
 
-	return process(tool, &event, NULL, machine);
+	return process(tool, &event, NULL, machine, NULL);
 }
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 2b80329..67a6701 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -191,6 +191,15 @@ static int perf_event__repipe(struct perf_tool *tool,
 	return perf_event__repipe_synth(tool, event);
 }
 
+static int perf_event__repipe_threads(struct perf_tool *tool,
+				      union perf_event *event,
+				      struct perf_sample *sample,
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
+{
+	return perf_event__repipe(tool, event, sample, machine);
+}
+
 static int perf_event__drop(struct perf_tool *tool __maybe_unused,
 			    union perf_event *event __maybe_unused,
 			    struct perf_sample *sample __maybe_unused,
@@ -413,7 +422,8 @@ static int dso__inject_build_id(struct dso *dso, struct perf_tool *tool,
 	if (dso->kernel)
 		misc = PERF_RECORD_MISC_KERNEL;
 
-	err = perf_event__synthesize_build_id(tool, dso, misc, perf_event__repipe,
+	err = perf_event__synthesize_build_id(tool, dso, misc,
+					      perf_event__repipe_threads,
 					      machine);
 	if (err) {
 		pr_err("Can't synthesize build_id event for %s\n", dso->long_name);
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a6cbf16..f53c1163 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -123,7 +123,8 @@ static int record__write(struct record *rec, void *bf, size_t size)
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	struct record *rec = container_of(tool, struct record, tool);
 	return record__write(rec, event, event->header.size);
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index b7e8812..ed34a14 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1432,7 +1432,8 @@ static void perf_sched__sort_lat(struct perf_sched *sched)
 static int process_sched_wakeup_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
@@ -1603,7 +1604,8 @@ static int map_switch_event(struct perf_sched *sched, struct perf_evsel *evsel,
 static int process_sched_switch_event(struct perf_tool *tool,
 				      struct perf_evsel *evsel,
 				      struct perf_sample *sample,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 	int this_cpu = sample->cpu, err = 0;
@@ -1629,7 +1631,8 @@ static int process_sched_switch_event(struct perf_tool *tool,
 static int process_sched_runtime_event(struct perf_tool *tool,
 				       struct perf_evsel *evsel,
 				       struct perf_sample *sample,
-				       struct machine *machine)
+				       struct machine *machine,
+				       struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
@@ -1659,7 +1662,8 @@ static int perf_sched__process_fork_event(struct perf_tool *tool,
 static int process_sched_migrate_task_event(struct perf_tool *tool,
 					    struct perf_evsel *evsel,
 					    struct perf_sample *sample,
-					    struct machine *machine)
+					    struct machine *machine,
+					    struct thread_info *thread __maybe_unused)
 {
 	struct perf_sched *sched = container_of(tool, struct perf_sched, tool);
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index dd52541..80d5add 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -293,7 +293,8 @@ static inline int nsec_counter(struct perf_evsel *evsel)
 static int process_synthesized_event(struct perf_tool *tool __maybe_unused,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	if (perf_data_file__write(&perf_stat.file, event, event->header.size) < 0) {
 		pr_err("failed to write perf data, error: %m\n");
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index afef6fe..f737416 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1091,7 +1091,8 @@ static int trace__process_event(struct trace *trace, struct machine *machine,
 static int trace__tool_process(struct perf_tool *tool,
 			       union perf_event *event,
 			       struct perf_sample *sample,
-			       struct machine *machine)
+			       struct machine *machine,
+			       struct thread_info *thread __maybe_unused)
 {
 	struct trace *trace = container_of(tool, struct trace, tool);
 	return trace__process_event(trace, machine, event, sample);
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 1997022..fec51c7 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -11,7 +11,8 @@ struct machine;
 static int process_event_mask(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct cpu_map_event *map_event = &event->cpu_map;
 	struct cpu_map_mask *mask;
@@ -45,7 +46,8 @@ static int process_event_mask(struct perf_tool *tool __maybe_unused,
 static int process_event_cpus(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct cpu_map_event *map_event = &event->cpu_map;
 	struct cpu_map_entries *cpus;
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 9ba1d21..5ed2271 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -22,7 +22,8 @@
 static int mmap_handler(struct perf_tool *tool __maybe_unused,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine)
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused)
 {
 	return machine__process_mmap2_event(machine, event, sample);
 }
diff --git a/tools/perf/tests/event_update.c b/tools/perf/tests/event_update.c
index 9484da2..b5f4ab1 100644
--- a/tools/perf/tests/event_update.c
+++ b/tools/perf/tests/event_update.c
@@ -8,7 +8,8 @@
 static int process_event_unit(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event *) event;
 
@@ -21,7 +22,8 @@ static int process_event_unit(struct perf_tool *tool __maybe_unused,
 static int process_event_scale(struct perf_tool *tool __maybe_unused,
 			       union perf_event *event,
 			       struct perf_sample *sample __maybe_unused,
-			       struct machine *machine __maybe_unused)
+			       struct machine *machine __maybe_unused,
+			       struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event *) event;
 	struct event_update_event_scale *ev_data;
@@ -42,7 +44,8 @@ struct event_name {
 static int process_event_name(struct perf_tool *tool,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_name *tmp = container_of(tool, struct event_name, tool);
 	struct event_update_event *ev = (struct event_update_event*) event;
@@ -56,7 +59,8 @@ static int process_event_name(struct perf_tool *tool,
 static int process_event_cpus(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct event_update_event *ev = (struct event_update_event*) event;
 	struct event_update_event_cpus *ev_data;
diff --git a/tools/perf/tests/stat.c b/tools/perf/tests/stat.c
index 7f988a9..846cbec 100644
--- a/tools/perf/tests/stat.c
+++ b/tools/perf/tests/stat.c
@@ -22,7 +22,8 @@ static bool has_term(struct stat_config_event *config,
 static int process_stat_config_event(struct perf_tool *tool __maybe_unused,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
-				     struct machine *machine __maybe_unused)
+				     struct machine *machine __maybe_unused,
+				     struct thread_info *thread __maybe_unused)
 {
 	struct stat_config_event *config = &event->stat_config;
 	struct perf_stat_config stat_config;
@@ -62,7 +63,8 @@ int test__synthesize_stat_config(struct test *test __maybe_unused, int subtest _
 static int process_stat_event(struct perf_tool *tool __maybe_unused,
 			      union perf_event *event,
 			      struct perf_sample *sample __maybe_unused,
-			      struct machine *machine __maybe_unused)
+			      struct machine *machine __maybe_unused,
+			      struct thread_info *thread __maybe_unused)
 {
 	struct stat_event *st = &event->stat;
 
@@ -92,7 +94,8 @@ int test__synthesize_stat(struct test *test __maybe_unused, int subtest __maybe_
 static int process_stat_round_event(struct perf_tool *tool __maybe_unused,
 				    union perf_event *event,
 				    struct perf_sample *sample __maybe_unused,
-				    struct machine *machine __maybe_unused)
+				    struct machine *machine __maybe_unused,
+				    struct thread_info *thread __maybe_unused)
 {
 	struct stat_round_event *stat_round = &event->stat_round;
 
diff --git a/tools/perf/tests/thread-map.c b/tools/perf/tests/thread-map.c
index b3423c7..d2f42ad 100644
--- a/tools/perf/tests/thread-map.c
+++ b/tools/perf/tests/thread-map.c
@@ -52,7 +52,8 @@ int test__thread_map(struct test *test __maybe_unused, int subtest __maybe_unuse
 static int process_event(struct perf_tool *tool __maybe_unused,
 			 union perf_event *event,
 			 struct perf_sample *sample __maybe_unused,
-			 struct machine *machine __maybe_unused)
+			 struct machine *machine __maybe_unused,
+			 struct thread_info *thread __maybe_unused)
 {
 	struct thread_map_event *map = &event->thread_map;
 	struct thread_map *threads;
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 5547457..c4ab2c8 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -887,7 +887,7 @@ int perf_event__synthesize_auxtrace_info(struct auxtrace_record *itr,
 	if (err)
 		goto out_free;
 
-	err = process(tool, ev, NULL, NULL);
+	err = process(tool, ev, NULL, NULL, NULL);
 out_free:
 	free(ev);
 	return err;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 47eff47..fd523ca7 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -102,7 +102,7 @@ static int perf_tool__process_synth_event(struct perf_tool *tool,
 	.cpumode   = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK,
 	};
 
-	return process(tool, event, &synth_sample, machine);
+	return process(tool, event, &synth_sample, machine, NULL);
 };
 
 /*
@@ -976,7 +976,7 @@ int perf_event__synthesize_thread_map2(struct perf_tool *tool,
 		strncpy((char *) &entry->comm, comm, sizeof(entry->comm));
 	}
 
-	err = process(tool, event, NULL, machine);
+	err = process(tool, event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1107,7 +1107,7 @@ int perf_event__synthesize_cpu_map(struct perf_tool *tool,
 	if (!event)
 		return -ENOMEM;
 
-	err = process(tool, (union perf_event *) event, NULL, machine);
+	err = process(tool, (union perf_event *) event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1145,7 +1145,7 @@ int perf_event__synthesize_stat_config(struct perf_tool *tool,
 		  "stat config terms unbalanced\n");
 #undef ADD
 
-	err = process(tool, (union perf_event *) event, NULL, machine);
+	err = process(tool, (union perf_event *) event, NULL, machine, NULL);
 
 	free(event);
 	return err;
@@ -1170,7 +1170,7 @@ int perf_event__synthesize_stat(struct perf_tool *tool,
 	event.ena       = count->ena;
 	event.run       = count->run;
 
-	return process(tool, (union perf_event *) &event, NULL, machine);
+	return process(tool, (union perf_event *) &event, NULL, machine, NULL);
 }
 
 int perf_event__synthesize_stat_round(struct perf_tool *tool,
@@ -1187,7 +1187,7 @@ int perf_event__synthesize_stat_round(struct perf_tool *tool,
 	event.time = evtime;
 	event.type = type;
 
-	return process(tool, (union perf_event *) &event, NULL, machine);
+	return process(tool, (union perf_event *) &event, NULL, machine, NULL);
 }
 
 void perf_event__read_stat_config(struct perf_stat_config *config,
@@ -1476,7 +1476,8 @@ size_t perf_event__fprintf(union perf_event *event, FILE *fp)
 int perf_event__process(struct perf_tool *tool __maybe_unused,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine)
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused)
 {
 	return machine__process_event(machine, event, sample);
 }
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d6cbb0a..200f3f8 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -659,10 +659,15 @@ struct cpu_map;
 struct perf_stat_config;
 struct perf_counts_values;
 
+struct thread_info {
+	int	idx;
+};
+
 typedef int (*perf_event__handler_t)(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample,
-				     struct machine *machine);
+				     struct machine *machine,
+				     struct thread_info *thread);
 
 int perf_event__synthesize_thread_map(struct perf_tool *tool,
 				      struct thread_map *threads,
@@ -751,7 +756,8 @@ int perf_event__process_exit(struct perf_tool *tool,
 int perf_event__process(struct perf_tool *tool,
 			union perf_event *event,
 			struct perf_sample *sample,
-			struct machine *machine);
+			struct machine *machine,
+			struct thread_info *thread __maybe_unused);
 
 struct addr_location;
 
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 605bbd5..c0183fb 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -2986,7 +2986,7 @@ int perf_event__synthesize_attr(struct perf_tool *tool,
 	ev->attr.header.size = (u16)size;
 
 	if (ev->attr.header.size == size)
-		err = process(tool, ev, NULL, NULL);
+		err = process(tool, ev, NULL, NULL, NULL);
 	else
 		err = -E2BIG;
 
@@ -3040,7 +3040,7 @@ int perf_event__synthesize_features(struct perf_tool *tool,
 		fe->header.type = PERF_RECORD_HEADER_FEATURE;
 		fe->header.size = ff.offset;
 
-		ret = process(tool, ff.buf, NULL, NULL);
+		ret = process(tool, ff.buf, NULL, NULL, NULL);
 		if (ret) {
 			free(ff.buf);
 			return ret;
@@ -3124,7 +3124,7 @@ perf_event__synthesize_event_update_unit(struct perf_tool *tool,
 		return -ENOMEM;
 
 	strncpy(ev->data, evsel->unit, size);
-	err = process(tool, (union perf_event *)ev, NULL, NULL);
+	err = process(tool, (union perf_event *)ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3144,7 +3144,7 @@ perf_event__synthesize_event_update_scale(struct perf_tool *tool,
 
 	ev_data = (struct event_update_event_scale *) ev->data;
 	ev_data->scale = evsel->scale;
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3163,7 +3163,7 @@ perf_event__synthesize_event_update_name(struct perf_tool *tool,
 		return -ENOMEM;
 
 	strncpy(ev->data, evsel->name, len);
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3194,7 +3194,7 @@ perf_event__synthesize_event_update_cpus(struct perf_tool *tool,
 				 evsel->own_cpus,
 				 type, max);
 
-	err = process(tool, (union perf_event*) ev, NULL, NULL);
+	err = process(tool, (union perf_event *) ev, NULL, NULL, NULL);
 	free(ev);
 	return err;
 }
@@ -3377,7 +3377,7 @@ int perf_event__synthesize_tracing_data(struct perf_tool *tool, int fd,
 	ev.tracing_data.header.size = sizeof(ev.tracing_data);
 	ev.tracing_data.size = aligned_size;
 
-	process(tool, &ev, NULL, NULL);
+	process(tool, &ev, NULL, NULL, NULL);
 
 	/*
 	 * The put function will copy all the tracing data
@@ -3455,7 +3455,7 @@ int perf_event__synthesize_build_id(struct perf_tool *tool,
 	ev.build_id.header.size = sizeof(ev.build_id) + len;
 	memcpy(&ev.build_id.filename, pos->long_name, pos->long_name_len);
 
-	err = process(tool, &ev, NULL, machine);
+	err = process(tool, &ev, NULL, machine, NULL);
 
 	return err;
 }
diff --git a/tools/perf/util/intel-bts.c b/tools/perf/util/intel-bts.c
index 218ee2b..8c15309 100644
--- a/tools/perf/util/intel-bts.c
+++ b/tools/perf/util/intel-bts.c
@@ -755,7 +755,8 @@ struct intel_bts_synth {
 static int intel_bts_event_synth(struct perf_tool *tool,
 				 union perf_event *event,
 				 struct perf_sample *sample __maybe_unused,
-				 struct machine *machine __maybe_unused)
+				 struct machine *machine __maybe_unused,
+				 struct thread_info *thread __maybe_unused)
 {
 	struct intel_bts_synth *intel_bts_synth =
 			container_of(tool, struct intel_bts_synth, dummy_tool);
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index b58f9fd..4858634 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2121,7 +2121,8 @@ struct intel_pt_synth {
 static int intel_pt_event_synth(struct perf_tool *tool,
 				union perf_event *event,
 				struct perf_sample *sample __maybe_unused,
-				struct machine *machine __maybe_unused)
+				struct machine *machine __maybe_unused,
+				struct thread_info *thread __maybe_unused)
 {
 	struct intel_pt_synth *intel_pt_synth =
 			container_of(tool, struct intel_pt_synth, dummy_tool);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ceac084..f044bad 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2211,7 +2211,7 @@ int perf_event__synthesize_id_index(struct perf_tool *tool,
 			struct perf_sample_id *sid;
 
 			if (i >= n) {
-				err = process(tool, ev, NULL, machine);
+				err = process(tool, ev, NULL, machine, NULL);
 				if (err)
 					goto out_err;
 				nr -= n;
@@ -2238,7 +2238,7 @@ int perf_event__synthesize_id_index(struct perf_tool *tool,
 	ev->id_index.header.size = sz;
 	ev->id_index.nr = nr;
 
-	err = process(tool, ev, NULL, machine);
+	err = process(tool, ev, NULL, machine, NULL);
 out_err:
 	free(ev);
 
-- 
2.7.4


* [PATCH V2 2/5] perf tools: pass thread info in event synthesization
  2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
  2017-10-18 14:29 ` [PATCH V2 1/5] perf tools: pass thread info to process function kan.liang
@ 2017-10-18 14:29 ` kan.liang
  2017-10-18 14:29 ` [PATCH V2 3/5] perf tools: expose copyfile_offset() kan.liang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: kan.liang @ 2017-10-18 14:29 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Pass the thread info (currently only the thread index) through the event
synthesization functions to the process function. It is used by the
following patch.
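
The pattern is plain parameter forwarding. The simplified sketch below shows
it with hypothetical names (handler_t, synthesize_comm_sketch, record_idx)
and an int standing in for the event; it is not the perf code itself.

```c
#include <stddef.h>

struct thread_info {
	int	idx;
};

/* Hypothetical, simplified handler type: event plus thread info. */
typedef int (*handler_t)(int event, struct thread_info *thread);

/* Lowest layer: hands the event and the caller's thread info to the handler. */
static int process_synth_event_sketch(int event, handler_t process,
				      struct thread_info *thread)
{
	return process(event, thread);
}

/* Middle layer: forwards the thread info unchanged. */
static int synthesize_comm_sketch(int pid, handler_t process,
				  struct thread_info *thread)
{
	return process_synth_event_sketch(pid, process, thread);
}

/* Example handler: remembers which worker delivered the last event. */
static int last_idx = -1;

static int record_idx(int event, struct thread_info *thread)
{
	(void)event;
	last_idx = thread ? thread->idx : -1;
	return 0;
}
```

Callers that are not multithreaded can keep passing NULL, as the
__cmd_record() call sites in this patch do.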

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/builtin-record.c     |  4 +-
 tools/perf/tests/dwarf-unwind.c |  3 +-
 tools/perf/util/event.c         | 98 ++++++++++++++++++++++++++---------------
 tools/perf/util/event.h         |  9 ++--
 4 files changed, 72 insertions(+), 42 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f53c1163..4ede9bf 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -922,7 +922,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		tgid = perf_event__synthesize_comm(tool, event,
 						   rec->evlist->workload.pid,
 						   process_synthesized_event,
-						   machine);
+						   machine, NULL);
 		free(event);
 
 		if (tgid == -1)
@@ -942,7 +942,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		perf_event__synthesize_namespaces(tool, event,
 						  rec->evlist->workload.pid,
 						  tgid, process_synthesized_event,
-						  machine);
+						  machine, NULL);
 		free(event);
 
 		perf_evlist__start_workload(rec->evlist);
diff --git a/tools/perf/tests/dwarf-unwind.c b/tools/perf/tests/dwarf-unwind.c
index 5ed2271..792c277 100644
--- a/tools/perf/tests/dwarf-unwind.c
+++ b/tools/perf/tests/dwarf-unwind.c
@@ -34,7 +34,8 @@ static int init_live_machine(struct machine *machine)
 	pid_t pid = getpid();
 
 	return perf_event__synthesize_mmap_events(NULL, &event, pid, pid,
-						  mmap_handler, machine, true, 500);
+						  mmap_handler, machine,
+						  true, 500, NULL);
 }
 
 #define MAX_STACK 8
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index fd523ca7..43e1dfa 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -90,7 +90,8 @@ static const char *perf_ns__name(unsigned int id)
 static int perf_tool__process_synth_event(struct perf_tool *tool,
 					  union perf_event *event,
 					  struct machine *machine,
-					  perf_event__handler_t process)
+					  perf_event__handler_t process,
+					  struct thread_info *thread)
 {
 	struct perf_sample synth_sample = {
 	.pid	   = -1,
@@ -102,7 +103,7 @@ static int perf_tool__process_synth_event(struct perf_tool *tool,
 	.cpumode   = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK,
 	};
 
-	return process(tool, event, &synth_sample, machine, NULL);
+	return process(tool, event, &synth_sample, machine, thread);
 };
 
 /*
@@ -219,14 +220,16 @@ static int perf_event__prepare_comm(union perf_event *event, pid_t pid,
 pid_t perf_event__synthesize_comm(struct perf_tool *tool,
 					 union perf_event *event, pid_t pid,
 					 perf_event__handler_t process,
-					 struct machine *machine)
+					 struct machine *machine,
+					 struct thread_info *thread)
 {
 	pid_t tgid, ppid;
 
 	if (perf_event__prepare_comm(event, pid, machine, &tgid, &ppid) != 0)
 		return -1;
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return tgid;
@@ -249,7 +252,8 @@ int perf_event__synthesize_namespaces(struct perf_tool *tool,
 				      union perf_event *event,
 				      pid_t pid, pid_t tgid,
 				      perf_event__handler_t process,
-				      struct machine *machine)
+				      struct machine *machine,
+				      struct thread_info *thread)
 {
 	u32 idx;
 	struct perf_ns_link_info *ns_link_info;
@@ -278,7 +282,8 @@ int perf_event__synthesize_namespaces(struct perf_tool *tool,
 			(NR_NAMESPACES * sizeof(struct perf_ns_link_info)) +
 			machine->id_hdr_size);
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return 0;
@@ -288,7 +293,8 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 				       union perf_event *event,
 				       pid_t pid, pid_t tgid, pid_t ppid,
 				       perf_event__handler_t process,
-				       struct machine *machine)
+				       struct machine *machine,
+				       struct thread_info *thread)
 {
 	memset(&event->fork, 0, sizeof(event->fork) + machine->id_hdr_size);
 
@@ -310,7 +316,8 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 
 	event->fork.header.size = (sizeof(event->fork) + machine->id_hdr_size);
 
-	if (perf_tool__process_synth_event(tool, event, machine, process) != 0)
+	if (perf_tool__process_synth_event(tool, event, machine,
+					   process, thread) != 0)
 		return -1;
 
 	return 0;
@@ -322,7 +329,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine,
 				       bool mmap_data,
-				       unsigned int proc_map_timeout)
+				       unsigned int proc_map_timeout,
+				       struct thread_info *thread)
 {
 	char filename[PATH_MAX];
 	FILE *fp;
@@ -444,7 +452,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		event->mmap2.pid = tgid;
 		event->mmap2.tid = pid;
 
-		if (perf_tool__process_synth_event(tool, event, machine, process) != 0) {
+		if (perf_tool__process_synth_event(tool, event, machine,
+						   process, thread) != 0) {
 			rc = -1;
 			break;
 		}
@@ -502,7 +511,8 @@ int perf_event__synthesize_modules(struct perf_tool *tool,
 
 		memcpy(event->mmap.filename, pos->dso->long_name,
 		       pos->dso->long_name_len + 1);
-		if (perf_tool__process_synth_event(tool, event, machine, process) != 0) {
+		if (perf_tool__process_synth_event(tool, event, machine,
+						   process, NULL) != 0) {
 			rc = -1;
 			break;
 		}
@@ -521,7 +531,8 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 				      struct perf_tool *tool,
 				      struct machine *machine,
 				      bool mmap_data,
-				      unsigned int proc_map_timeout)
+				      unsigned int proc_map_timeout,
+				      struct thread_info *thread)
 {
 	char filename[PATH_MAX];
 	DIR *tasks;
@@ -532,19 +543,22 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 	/* special case: only send one comm event using passed in pid */
 	if (!full) {
 		tgid = perf_event__synthesize_comm(tool, comm_event, pid,
-						   process, machine);
+						   process, machine, thread);
 
 		if (tgid == -1)
 			return -1;
 
-		if (perf_event__synthesize_namespaces(tool, namespaces_event, pid,
-						      tgid, process, machine) < 0)
+		if (perf_event__synthesize_namespaces(tool, namespaces_event,
+						      pid, tgid, process,
+						      machine, thread) < 0)
 			return -1;
 
 
-		return perf_event__synthesize_mmap_events(tool, mmap_event, pid, tgid,
-							  process, machine, mmap_data,
-							  proc_map_timeout);
+		return perf_event__synthesize_mmap_events(tool, mmap_event,
+							  pid, tgid, process,
+							  machine, mmap_data,
+							  proc_map_timeout,
+							  thread);
 	}
 
 	if (machine__is_default_guest(machine))
@@ -572,25 +586,30 @@ static int __event__synthesize_thread(union perf_event *comm_event,
 					     &tgid, &ppid) != 0)
 			break;
 
-		if (perf_event__synthesize_fork(tool, fork_event, _pid, tgid,
-						ppid, process, machine) < 0)
+		if (perf_event__synthesize_fork(tool, fork_event, _pid,
+						tgid, ppid, process,
+						machine, thread) < 0)
 			break;
 
-		if (perf_event__synthesize_namespaces(tool, namespaces_event, _pid,
-						      tgid, process, machine) < 0)
+		if (perf_event__synthesize_namespaces(tool, namespaces_event,
+						      _pid, tgid, process,
+						      machine, thread) < 0)
 			break;
 
 		/*
 		 * Send the prepared comm event
 		 */
-		if (perf_tool__process_synth_event(tool, comm_event, machine, process) != 0)
+		if (perf_tool__process_synth_event(tool, comm_event, machine,
+						   process, thread) != 0)
 			break;
 
 		rc = 0;
 		if (_pid == pid) {
 			/* process the parent's maps too */
-			rc = perf_event__synthesize_mmap_events(tool, mmap_event, pid, tgid,
-						process, machine, mmap_data, proc_map_timeout);
+			rc = perf_event__synthesize_mmap_events(tool,
+						mmap_event, pid, tgid,
+						process, machine, mmap_data,
+						proc_map_timeout, thread);
 			if (rc)
 				break;
 		}
@@ -633,9 +652,10 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 	for (thread = 0; thread < threads->nr; ++thread) {
 		if (__event__synthesize_thread(comm_event, mmap_event,
 					       fork_event, namespaces_event,
-					       thread_map__pid(threads, thread), 0,
-					       process, tool, machine,
-					       mmap_data, proc_map_timeout)) {
+					       thread_map__pid(threads, thread),
+					       0, process, tool, machine,
+					       mmap_data, proc_map_timeout,
+					       NULL)) {
 			err = -1;
 			break;
 		}
@@ -658,10 +678,13 @@ int perf_event__synthesize_thread_map(struct perf_tool *tool,
 			/* if not, generate events for it */
 			if (need_leader &&
 			    __event__synthesize_thread(comm_event, mmap_event,
-						       fork_event, namespaces_event,
+						       fork_event,
+						       namespaces_event,
 						       comm_event->comm.pid, 0,
 						       process, tool, machine,
-						       mmap_data, proc_map_timeout)) {
+						       mmap_data,
+						       proc_map_timeout,
+						       NULL)) {
 				err = -1;
 				break;
 			}
@@ -684,8 +707,8 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
 					    bool mmap_data,
 					    unsigned int proc_map_timeout,
 					    struct dirent **dirent,
-					    int start,
-					    int num)
+					    int start, int num,
+					    struct thread_info *thread)
 {
 	union perf_event *comm_event, *mmap_event, *fork_event;
 	union perf_event *namespaces_event;
@@ -727,7 +750,7 @@ static int __perf_event__synthesize_threads(struct perf_tool *tool,
 		__event__synthesize_thread(comm_event, mmap_event, fork_event,
 					   namespaces_event, pid, 1, process,
 					   tool, machine, mmap_data,
-					   proc_map_timeout);
+					   proc_map_timeout, thread);
 	}
 	err = 0;
 
@@ -751,6 +774,7 @@ struct synthesize_threads_arg {
 	struct dirent **dirent;
 	int num;
 	int start;
+	struct thread_info thread;
 };
 
 static void *synthesize_threads_worker(void *arg)
@@ -760,7 +784,7 @@ static void *synthesize_threads_worker(void *arg)
 	__perf_event__synthesize_threads(args->tool, args->process,
 					 args->machine, args->mmap_data,
 					 args->proc_map_timeout, args->dirent,
-					 args->start, args->num);
+					 args->start, args->num, &args->thread);
 	return NULL;
 }
 
@@ -799,7 +823,7 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 		err = __perf_event__synthesize_threads(tool, process,
 						       machine, mmap_data,
 						       proc_map_timeout,
-						       dirent, base, n);
+						       dirent, base, n, NULL);
 		goto free_dirent;
 	}
 	if (thread_nr > n)
@@ -822,6 +846,7 @@ int perf_event__synthesize_threads(struct perf_tool *tool,
 		args[i].mmap_data = mmap_data;
 		args[i].proc_map_timeout = proc_map_timeout;
 		args[i].dirent = dirent;
+		args[i].thread.idx = i;
 	}
 	for (i = 0; i < m; i++) {
 		args[i].num = num_per_thread + 1;
@@ -940,7 +965,8 @@ int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
 	event->mmap.len   = map->end - event->mmap.start;
 	event->mmap.pid   = machine->pid;
 
-	err = perf_tool__process_synth_event(tool, event, machine, process);
+	err = perf_tool__process_synth_event(tool, event, machine,
+					     process, NULL);
 	free(event);
 
 	return err;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 200f3f8..6e7b08b3 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -785,13 +785,15 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type,
 pid_t perf_event__synthesize_comm(struct perf_tool *tool,
 				  union perf_event *event, pid_t pid,
 				  perf_event__handler_t process,
-				  struct machine *machine);
+				  struct machine *machine,
+				  struct thread_info *thread);
 
 int perf_event__synthesize_namespaces(struct perf_tool *tool,
 				      union perf_event *event,
 				      pid_t pid, pid_t tgid,
 				      perf_event__handler_t process,
-				      struct machine *machine);
+				      struct machine *machine,
+				      struct thread_info *thread);
 
 int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       union perf_event *event,
@@ -799,7 +801,8 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       perf_event__handler_t process,
 				       struct machine *machine,
 				       bool mmap_data,
-				       unsigned int proc_map_timeout);
+				       unsigned int proc_map_timeout,
+				       struct thread_info *thread);
 
 size_t perf_event__fprintf_comm(union perf_event *event, FILE *fp);
 size_t perf_event__fprintf_mmap(union perf_event *event, FILE *fp);
-- 
2.7.4

* [PATCH V2 3/5] perf tools: expose copyfile_offset()
  2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
  2017-10-18 14:29 ` [PATCH V2 1/5] perf tools: pass thread info to process function kan.liang
  2017-10-18 14:29 ` [PATCH V2 2/5] perf tools: pass thread info in event synthesization kan.liang
@ 2017-10-18 14:29 ` kan.liang
  2017-10-18 14:29 ` [PATCH V2 4/5] perf record: synthesize event multithreading support kan.liang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: kan.liang @ 2017-10-18 14:29 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Expose copyfile_offset() so that the following patch can use it to merge
the per-thread files into perf.data.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/util/util.c | 2 +-
 tools/perf/util/util.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 97e0c8e..a003ce4 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -188,7 +188,7 @@ static int slow_copyfile(const char *from, const char *to, struct nsinfo *nsi)
 	return err;
 }
 
-static int copyfile_offset(int ifd, loff_t off_in, int ofd, loff_t off_out, u64 size)
+int copyfile_offset(int ifd, loff_t off_in, int ofd, loff_t off_out, u64 size)
 {
 	void *ptr;
 	loff_t pgoff;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 6c7e6cc..0e1358f 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -5,6 +5,7 @@
 /* glibc 2.20 deprecates _BSD_SOURCE in favour of _DEFAULT_SOURCE */
 #define _DEFAULT_SOURCE 1
 
+#include <fcntl.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdlib.h>
@@ -34,6 +35,7 @@ bool lsdir_no_dot_filter(const char *name, struct dirent *d);
 int copyfile(const char *from, const char *to);
 int copyfile_mode(const char *from, const char *to, mode_t mode);
 int copyfile_ns(const char *from, const char *to, struct nsinfo *nsi);
+int copyfile_offset(int fromfd, loff_t from_ofs, int tofd, loff_t to_ofs, u64 size);
 
 ssize_t readn(int fd, void *buf, size_t n);
 ssize_t writen(int fd, const void *buf, size_t n);
-- 
2.7.4

* [PATCH V2 4/5] perf record: synthesize event multithreading support
  2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
                   ` (2 preceding siblings ...)
  2017-10-18 14:29 ` [PATCH V2 3/5] perf tools: expose copyfile_offset() kan.liang
@ 2017-10-18 14:29 ` kan.liang
  2017-10-18 15:37   ` Jiri Olsa
                     ` (2 more replies)
  2017-10-18 14:29 ` [PATCH V2 5/5] perf record: add option to set the number of thread for event synthesize kan.liang
  2017-10-18 17:04 ` [PATCH V2 0/5] event synthesization multithreading for perf record Ingo Molnar
  5 siblings, 3 replies; 12+ messages in thread
From: kan.liang @ 2017-10-18 14:29 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The process function process_synthesized_event() writes its result
directly to perf.data, which is not multithreading friendly.

Create a per-thread file to temporarily hold each thread's processing
result, and append these files to perf.data at the end of event
synthesization. The new method doesn't change the final result, because
the order of the synthesized events is not important.

The number of threads is hard-coded to the number of online CPUs. The
following patch introduces an option to set it.

Multithreaded synthesization is only available for per-CPU monitoring.
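
For illustration only, the merge step can be sketched outside of perf as
below. This is a minimal sketch, not the code in this patch: it uses plain
read()/write() instead of copyfile_offset(), and append_fd() and
merge_parts() are hypothetical names. Unlike the loop in the diff, it
stops at the first fstat() or copy error.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Append the whole content of in_fd to the end of out_fd. 0 or -1. */
int append_fd(int out_fd, int in_fd)
{
	char buf[4096];
	ssize_t n;

	if (lseek(in_fd, 0, SEEK_SET) < 0 || lseek(out_fd, 0, SEEK_END) < 0)
		return -1;

	while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		/* write() may be short, so loop until the chunk is out */
		while (off < n) {
			ssize_t w = write(out_fd, buf + off, n - off);

			if (w < 0)
				return -1;
			off += w;
		}
	}
	return n < 0 ? -1 : 0;
}

/*
 * Merge nr per-thread files into out_fd, one by one, skipping empty ones.
 * Returns the number of bytes appended, or -1 on the first error.
 */
long long merge_parts(int out_fd, const int *part_fds, int nr)
{
	long long total = 0;
	struct stat st;
	int i;

	for (i = 0; i < nr; i++) {
		if (fstat(part_fds[i], &st) < 0)
			return -1;
		if (st.st_size == 0)
			continue;
		if (append_fd(out_fd, part_fds[i]) < 0)
			return -1;
		total += st.st_size;
	}
	return total;
}
```

The returned byte count plays the role of the size passed to
update_bytes_written() in the patch.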

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/builtin-record.c | 115 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 106 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4ede9bf..bbe8009 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -50,9 +50,12 @@
 #include <signal.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
+#include <sys/stat.h>
 #include <asm/bug.h>
 #include <linux/time64.h>
 
+#define SYNTHESIZED_PATH	"perf.synthesized"
+
 struct switch_output {
 	bool		 enabled;
 	bool		 signal;
@@ -80,6 +83,7 @@ struct record {
 	bool			timestamp_filename;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
+	struct perf_data_file	*synthesized_file;
 };
 
 static volatile int auxtrace_record__snapshot_started;
@@ -105,6 +109,14 @@ static bool switch_output_time(struct record *rec)
 	       trigger_is_ready(&switch_output_trigger);
 }
 
+static void update_bytes_written(struct record *rec, size_t size)
+{
+	rec->bytes_written += size;
+
+	if (switch_output_size(rec))
+		trigger_hit(&switch_output_trigger);
+}
+
 static int record__write(struct record *rec, void *bf, size_t size)
 {
 	if (perf_data_file__write(rec->session->file, bf, size) < 0) {
@@ -112,10 +124,7 @@ static int record__write(struct record *rec, void *bf, size_t size)
 		return -1;
 	}
 
-	rec->bytes_written += size;
-
-	if (switch_output_size(rec))
-		trigger_hit(&switch_output_trigger);
+	update_bytes_written(rec, size);
 
 	return 0;
 }
@@ -124,10 +133,15 @@ static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
 				     struct machine *machine __maybe_unused,
-				     struct thread_info *thread __maybe_unused)
+				     struct thread_info *thread)
 {
 	struct record *rec = container_of(tool, struct record, tool);
-	return record__write(rec, event, event->header.size);
+
+	if (!perf_singlethreaded && thread)
+		return (perf_data_file__write(&rec->synthesized_file[thread->idx],
+					      event, event->header.size) < 0) ? -1 : 0;
+	else
+		return record__write(rec, event, event->header.size);
 }
 
 static int record__pushfn(void *to, void *bf, size_t size)
@@ -690,6 +704,82 @@ static const struct perf_event_mmap_page *record__pick_pc(struct record *rec)
 	return NULL;
 }
 
+static int record__multithread_synthesize(struct record *rec,
+					  struct machine *machine,
+					  struct perf_tool *tool,
+					  struct record_opts *opts)
+{
+	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	char name[PATH_MAX];
+	struct stat st;
+
+	if (nr_thread <= 1)
+		return __machine__synthesize_threads(machine, tool,
+						     &opts->target,
+						     rec->evlist->threads,
+						     process_synthesized_event,
+						     opts->sample_address,
+						     opts->proc_map_timeout,
+						     1);
+
+	rec->synthesized_file = calloc(nr_thread, sizeof(struct perf_data_file));
+	if (rec->synthesized_file == NULL) {
+		pr_debug("Could not do multithread synthesize."
+			 "Roll back to single thread\n");
+		nr_thread = 1;
+	} else {
+		perf_set_multithreaded();
+		for (i = 0; i < nr_thread; i++) {
+			snprintf(name, sizeof(name), "%s.%d",
+				 SYNTHESIZED_PATH, i);
+			rec->synthesized_file[i].path = name;
+			err = perf_data_file__open(&rec->synthesized_file[i]);
+			if (err) {
+				pr_err("Failed to open file %s\n",
+				       rec->synthesized_file[i].path);
+				goto free;
+			}
+		}
+	}
+
+	err = __machine__synthesize_threads(machine, tool, &opts->target,
+					    rec->evlist->threads,
+					    process_synthesized_event,
+					    opts->sample_address,
+					    opts->proc_map_timeout, nr_thread);
+	if (err < 0)
+		goto free;
+
+	if (nr_thread > 1) {
+		int fd_from, fd_to;
+
+		fd_to = rec->session->file->fd;
+		for (i = 0; i < nr_thread; i++) {
+			fd_from = rec->synthesized_file[i].fd;
+
+			fstat(fd_from, &st);
+			if (st.st_size == 0)
+				continue;
+			err = copyfile_offset(fd_from, 0, fd_to,
+					      lseek(fd_to, 0, SEEK_END),
+					      st.st_size);
+			update_bytes_written(rec, st.st_size);
+		}
+	}
+
+free:
+	if (nr_thread > 1) {
+		for (i = 0; i < nr_thread; i++) {
+			if (rec->synthesized_file[i].fd > 0)
+				perf_data_file__close(&rec->synthesized_file[i]);
+		}
+		free(rec->synthesized_file);
+		perf_set_singlethreaded();
+	}
+
+	return err;
+}
+
 static int record__synthesize(struct record *rec, bool tail)
 {
 	struct perf_session *session = rec->session;
@@ -766,9 +856,16 @@ static int record__synthesize(struct record *rec, bool tail)
 					 perf_event__synthesize_guest_os, tool);
 	}
 
-	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
-					    process_synthesized_event, opts->sample_address,
-					    opts->proc_map_timeout, 1);
+	/* multithreading synthesize is only available for cpu monitoring */
+	if (target__has_cpu(&opts->target))
+		err = record__multithread_synthesize(rec, machine, tool, opts);
+	else
+		err = __machine__synthesize_threads(machine, tool,
+						    &opts->target,
+						    rec->evlist->threads,
+						    process_synthesized_event,
+						    opts->sample_address,
+						    opts->proc_map_timeout, 1);
 out:
 	return err;
 }
-- 
2.7.4

* Re: [PATCH V2 4/5] perf record: synthesize event multithreading support
  2017-10-18 14:29 ` [PATCH V2 4/5] perf record: synthesize event multithreading support kan.liang
@ 2017-10-18 15:37   ` Jiri Olsa
  2017-10-18 15:39   ` Jiri Olsa
  2017-10-18 15:43   ` Jiri Olsa
  2 siblings, 0 replies; 12+ messages in thread
From: Jiri Olsa @ 2017-10-18 15:37 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak

On Wed, Oct 18, 2017 at 07:29:32AM -0700, kan.liang@intel.com wrote:

SNIP

> +static int record__multithread_synthesize(struct record *rec,
> +					  struct machine *machine,
> +					  struct perf_tool *tool,
> +					  struct record_opts *opts)
> +{
> +	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
> +	char name[PATH_MAX];
> +	struct stat st;
> +
> +	if (nr_thread <= 1)
> +		return __machine__synthesize_threads(machine, tool,
> +						     &opts->target,
> +						     rec->evlist->threads,
> +						     process_synthesized_event,
> +						     opts->sample_address,
> +						     opts->proc_map_timeout,
> +						     1);
> +
> +	rec->synthesized_file = calloc(nr_thread, sizeof(struct perf_data_file));
> +	if (rec->synthesized_file == NULL) {
> +		pr_debug("Could not do multithread synthesize."
> +			 "Roll back to single thread\n");
> +		nr_thread = 1;

I guess we could fail here anyway.. it'd simplify the code below ;-)

jirka

> +	} else {
> +		perf_set_multithreaded();
> +		for (i = 0; i < nr_thread; i++) {
> +			snprintf(name, sizeof(name), "%s.%d",
> +				 SYNTHESIZED_PATH, i);
> +			rec->synthesized_file[i].path = name;
> +			err = perf_data_file__open(&rec->synthesized_file[i]);
> +			if (err) {
> +				pr_err("Failed to open file %s\n",
> +				       rec->synthesized_file[i].path);
> +				goto free;
> +			}
> +		}
> +	}
> +
> +	err = __machine__synthesize_threads(machine, tool, &opts->target,
> +					    rec->evlist->threads,
> +					    process_synthesized_event,
> +					    opts->sample_address,
> +					    opts->proc_map_timeout, nr_thread);
> +	if (err < 0)
> +		goto free;
> +
> +	if (nr_thread > 1) {
> +		int fd_from, fd_to;
> +
> +		fd_to = rec->session->file->fd;
> +		for (i = 0; i < nr_thread; i++) {
> +			fd_from = rec->synthesized_file[i].fd;
> +
> +			fstat(fd_from, &st);
> +			if (st.st_size == 0)
> +				continue;
> +			err = copyfile_offset(fd_from, 0, fd_to,
> +					      lseek(fd_to, 0, SEEK_END),
> +					      st.st_size);
> +			update_bytes_written(rec, st.st_size);
> +		}
> +	}
> +
> +free:
> +	if (nr_thread > 1) {
> +		for (i = 0; i < nr_thread; i++) {
> +			if (rec->synthesized_file[i].fd > 0)
> +				perf_data_file__close(&rec->synthesized_file[i]);
> +		}
> +		free(rec->synthesized_file);
> +		perf_set_singlethreaded();
> +	}
> +
> +	return err;
> +}

SNIP

* Re: [PATCH V2 4/5] perf record: synthesize event multithreading support
  2017-10-18 14:29 ` [PATCH V2 4/5] perf record: synthesize event multithreading support kan.liang
  2017-10-18 15:37   ` Jiri Olsa
@ 2017-10-18 15:39   ` Jiri Olsa
  2017-10-18 15:43   ` Jiri Olsa
  2 siblings, 0 replies; 12+ messages in thread
From: Jiri Olsa @ 2017-10-18 15:39 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak

On Wed, Oct 18, 2017 at 07:29:32AM -0700, kan.liang@intel.com wrote:

SNIP

> +static int record__multithread_synthesize(struct record *rec,
> +					  struct machine *machine,
> +					  struct perf_tool *tool,
> +					  struct record_opts *opts)
> +{
> +	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
> +	char name[PATH_MAX];
> +	struct stat st;
> +
> +	if (nr_thread <= 1)
> +		return __machine__synthesize_threads(machine, tool,
> +						     &opts->target,
> +						     rec->evlist->threads,
> +						     process_synthesized_event,
> +						     opts->sample_address,
> +						     opts->proc_map_timeout,
> +						     1);

          ^^^^^^^^^^^^^^^^^^

> +
>  static int record__synthesize(struct record *rec, bool tail)
>  {
>  	struct perf_session *session = rec->session;
> @@ -766,9 +856,16 @@ static int record__synthesize(struct record *rec, bool tail)
>  					 perf_event__synthesize_guest_os, tool);
>  	}
>  
> -	err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
> -					    process_synthesized_event, opts->sample_address,
> -					    opts->proc_map_timeout, 1);
> +	/* multithreading synthesize is only available for cpu monitoring */
> +	if (target__has_cpu(&opts->target))
> +		err = record__multithread_synthesize(rec, machine, tool, opts);
> +	else
> +		err = __machine__synthesize_threads(machine, tool,
> +						    &opts->target,
> +						    rec->evlist->threads,
> +						    process_synthesized_event,
> +						    opts->sample_address,
> +						    opts->proc_map_timeout, 1);

hum, this could be checked together with nr_thread in record__multithread_synthesize

jirka

* Re: [PATCH V2 4/5] perf record: synthesize event multithreading support
  2017-10-18 14:29 ` [PATCH V2 4/5] perf record: synthesize event multithreading support kan.liang
  2017-10-18 15:37   ` Jiri Olsa
  2017-10-18 15:39   ` Jiri Olsa
@ 2017-10-18 15:43   ` Jiri Olsa
  2017-10-18 16:57     ` Liang, Kan
  2 siblings, 1 reply; 12+ messages in thread
From: Jiri Olsa @ 2017-10-18 15:43 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak

On Wed, Oct 18, 2017 at 07:29:32AM -0700, kan.liang@intel.com wrote:

SNIP

> +	rec->synthesized_file = calloc(nr_thread, sizeof(struct perf_data_file));
> +	if (rec->synthesized_file == NULL) {
> +		pr_debug("Could not do multithread synthesize."
> +			 "Roll back to single thread\n");
> +		nr_thread = 1;
> +	} else {
> +		perf_set_multithreaded();
> +		for (i = 0; i < nr_thread; i++) {
> +			snprintf(name, sizeof(name), "%s.%d",
> +				 SYNTHESIZED_PATH, i);

hum, I think we want some uniq temp names in here..

> +			rec->synthesized_file[i].path = name;
> +			err = perf_data_file__open(&rec->synthesized_file[i]);
> +			if (err) {
> +				pr_err("Failed to open file %s\n",
> +				       rec->synthesized_file[i].path);
> +				goto free;
> +			}
> +		}
> +	}
> +
> +	err = __machine__synthesize_threads(machine, tool, &opts->target,
> +					    rec->evlist->threads,
> +					    process_synthesized_event,
> +					    opts->sample_address,
> +					    opts->proc_map_timeout, nr_thread);
> +	if (err < 0)
> +		goto free;
> +
> +	if (nr_thread > 1) {
> +		int fd_from, fd_to;
> +
> +		fd_to = rec->session->file->fd;
> +		for (i = 0; i < nr_thread; i++) {
> +			fd_from = rec->synthesized_file[i].fd;
> +
> +			fstat(fd_from, &st);
> +			if (st.st_size == 0)
> +				continue;
> +			err = copyfile_offset(fd_from, 0, fd_to,
> +					      lseek(fd_to, 0, SEEK_END),
> +					      st.st_size);
> +			update_bytes_written(rec, st.st_size);
> +		}
> +	}
> +
> +free:
> +	if (nr_thread > 1) {
> +		for (i = 0; i < nr_thread; i++) {
> +			if (rec->synthesized_file[i].fd > 0)
> +				perf_data_file__close(&rec->synthesized_file[i]);

also those files should be removed

thanks,
jirka

> +		}
> +		free(rec->synthesized_file);
> +		perf_set_singlethreaded();
> +	}
> +
> +	return err;
> +}
> +

SNIP

* RE: [PATCH V2 4/5] perf record: synthesize event multithreading support
  2017-10-18 15:43   ` Jiri Olsa
@ 2017-10-18 16:57     ` Liang, Kan
  0 siblings, 0 replies; 12+ messages in thread
From: Liang, Kan @ 2017-10-18 16:57 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: acme@kernel.org, mingo@redhat.com, linux-kernel@vger.kernel.org,
	peterz@infradead.org, jolsa@kernel.org, wangnan0@huawei.com,
	hekuang@huawei.com, namhyung@kernel.org,
	alexander.shishkin@linux.intel.com, Hunter, Adrian,
	ak@linux.intel.com

> On Wed, Oct 18, 2017 at 07:29:32AM -0700, kan.liang@intel.com wrote:
> 
> SNIP
> 
> > +	rec->synthesized_file = calloc(nr_thread, sizeof(struct perf_data_file));
> > +	if (rec->synthesized_file == NULL) {
> > +		pr_debug("Could not do multithread synthesize."
> > +			 "Roll back to single thread\n");
> > +		nr_thread = 1;
> > +	} else {
> > +		perf_set_multithreaded();
> > +		for (i = 0; i < nr_thread; i++) {
> > +			snprintf(name, sizeof(name), "%s.%d",
> > +				 SYNTHESIZED_PATH, i);
> 
> hum, I think we want some uniq temp names in here..

Any suggestions for the name?
perf.tmp.X?
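
One option, sketched here purely as an illustration and not something this
series implements, would be to let mkstemp() pick the unique part of the
name. The helper name open_synth_tmp() and the "perf.synthesized.N.XXXXXX"
pattern are assumptions. Each file gets its own path buffer, so the names
stay valid after the loop moves on.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a uniquely named temporary file for
 * thread idx under dir. On success, returns an open fd and leaves the
 * final name in path. Returns -1 if path is too small or mkstemp() fails.
 */
int open_synth_tmp(const char *dir, int idx, char *path, size_t len)
{
	int n = snprintf(path, len, "%s/perf.synthesized.%d.XXXXXX", dir, idx);

	if (n < 0 || (size_t)n >= len)
		return -1;

	/* mkstemp() replaces the XXXXXX suffix and opens the file with O_EXCL */
	return mkstemp(path);
}
```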

> 
> > +			rec->synthesized_file[i].path = name;
> > +			err = perf_data_file__open(&rec->synthesized_file[i]);
> > +			if (err) {
> > +				pr_err("Failed to open file %s\n",
> > +				       rec->synthesized_file[i].path);
> > +				goto free;
> > +			}
> > +		}
> > +	}
> > +
> > +	err = __machine__synthesize_threads(machine, tool, &opts->target,
> > +					    rec->evlist->threads,
> > +					    process_synthesized_event,
> > +					    opts->sample_address,
> > +					    opts->proc_map_timeout, nr_thread);
> > +	if (err < 0)
> > +		goto free;
> > +
> > +	if (nr_thread > 1) {
> > +		int fd_from, fd_to;
> > +
> > +		fd_to = rec->session->file->fd;
> > +		for (i = 0; i < nr_thread; i++) {
> > +			fd_from = rec->synthesized_file[i].fd;
> > +
> > +			fstat(fd_from, &st);
> > +			if (st.st_size == 0)
> > +				continue;
> > +			err = copyfile_offset(fd_from, 0, fd_to,
> > +					      lseek(fd_to, 0, SEEK_END),
> > +					      st.st_size);
> > +			update_bytes_written(rec, st.st_size);
> > +		}
> > +	}
> > +
> > +free:
> > +	if (nr_thread > 1) {
> > +		for (i = 0; i < nr_thread; i++) {
> > +			if (rec->synthesized_file[i].fd > 0)
> > +				perf_data_file__close(&rec->synthesized_file[i]);
> 
> also those files should be removed

Sure.
I will modify perf_data_file__close(file, bool remove) to do it.
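
A rough sketch of what such a close-and-remove helper might look like,
written against a stand-in structure rather than the real
struct perf_data_file. Both struct tmp_file and tmp_file__close() are
hypothetical names used only for this example.

```c
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for the fd/path pair kept per temporary file (hypothetical). */
struct tmp_file {
	int fd;
	const char *path;
};

/*
 * Close the file and, if remove_file is set, also unlink it.
 * Returns 0 on success, -1 if either step fails.
 */
int tmp_file__close(struct tmp_file *file, bool remove_file)
{
	int err = close(file->fd);

	file->fd = -1;
	if (remove_file && unlink(file->path) < 0)
		err = -1;

	return err;
}
```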

Thanks,
Kan

* [PATCH V2 5/5] perf record: add option to set the number of thread for event synthesize
  2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
                   ` (3 preceding siblings ...)
  2017-10-18 14:29 ` [PATCH V2 4/5] perf record: synthesize event multithreading support kan.liang
@ 2017-10-18 14:29 ` kan.liang
  2017-10-18 17:04 ` [PATCH V2 0/5] event synthesization multithreading for perf record Ingo Molnar
  5 siblings, 0 replies; 12+ messages in thread
From: kan.liang @ 2017-10-18 14:29 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, jolsa, wangnan0, hekuang, namhyung, alexander.shishkin,
	adrian.hunter, ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

Add the --num-thread-synthesize option. UINT_MAX indicates the default
number of threads, which is the number of online CPUs.
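
As a standalone illustration of the sentinel handling (a sketch mirroring
the logic in the diff below, with a hypothetical helper name and extra
clamping that the patch itself does not have):

```c
#include <limits.h>
#include <unistd.h>

/*
 * Hypothetical helper: turn the option value into a thread count.
 * UINT_MAX means "not set by the user", so fall back to the number of
 * online CPUs. Any other value is taken as-is (clamped to INT_MAX).
 */
int resolve_nr_threads(unsigned int requested)
{
	long n;

	if (requested != UINT_MAX)
		return requested > INT_MAX ? INT_MAX : (int)requested;

	n = sysconf(_SC_NPROCESSORS_ONLN);
	return n < 1 ? 1 : (int)n;	/* sysconf() can return -1 on error */
}
```

A result of 0 or 1 corresponds to the single-threaded path in
record__synthesize().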

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---
 tools/perf/Documentation/perf-record.txt |  4 ++++
 tools/perf/builtin-record.c              | 13 +++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 68a1ffb..f759dc4 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -483,6 +483,10 @@ config terms. For example: 'cycles/overwrite/' and 'instructions/no-overwrite/'.
 
 Implies --tail-synthesize.
 
+--num-thread-synthesize::
+The number of threads used to synthesize events.
+By default, the number of threads equals the number of online CPUs.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index bbe8009..0b7873a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -84,6 +84,7 @@ struct record {
 	struct switch_output	switch_output;
 	unsigned long long	samples;
 	struct perf_data_file	*synthesized_file;
+	unsigned int		nr_threads_synthesize;
 };
 
 static volatile int auxtrace_record__snapshot_started;
@@ -709,10 +710,15 @@ static int record__multithread_synthesize(struct record *rec,
 					  struct perf_tool *tool,
 					  struct record_opts *opts)
 {
-	int i, err, nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	int i, err, nr_thread;
 	char name[PATH_MAX];
 	struct stat st;
 
+	if (rec->nr_threads_synthesize == UINT_MAX)
+		nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
+	else
+		nr_thread = rec->nr_threads_synthesize;
+
 	if (nr_thread <= 1)
 		return __machine__synthesize_threads(machine, tool,
 						     &opts->target,
@@ -857,7 +863,7 @@ static int record__synthesize(struct record *rec, bool tail)
 	}
 
 	/* multithreading synthesize is only available for cpu monitoring */
-	if (target__has_cpu(&opts->target))
+	if (target__has_cpu(&opts->target) && (rec->nr_threads_synthesize > 1))
 		err = record__multithread_synthesize(rec, machine, tool, opts);
 	else
 		err = __machine__synthesize_threads(machine, tool,
@@ -1542,6 +1548,7 @@ static struct record record = {
 		.mmap2		= perf_event__process_mmap2,
 		.ordered_events	= true,
 	},
+	.nr_threads_synthesize = UINT_MAX,
 };
 
 const char record_callchain_help[] = CALLCHAIN_RECORD_HELP
@@ -1683,6 +1690,8 @@ static struct option __record_options[] = {
 			  "signal"),
 	OPT_BOOLEAN(0, "dry-run", &dry_run,
 		    "Parse options then exit"),
+	OPT_UINTEGER(0, "num-thread-synthesize", &record.nr_threads_synthesize,
+			"number of thread to run event synthesize"),
 	OPT_END()
 };
 
-- 
2.7.4

* Re: [PATCH V2 0/5] event synthesization multithreading for perf record
  2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
                   ` (4 preceding siblings ...)
  2017-10-18 14:29 ` [PATCH V2 5/5] perf record: add option to set the number of thread for event synthesize kan.liang
@ 2017-10-18 17:04 ` Ingo Molnar
  2017-10-20 20:19   ` Liang, Kan
  5 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2017-10-18 17:04 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, peterz, jolsa, wangnan0, hekuang,
	namhyung, alexander.shishkin, adrian.hunter, ak


* kan.liang@intel.com <kan.liang@intel.com> wrote:

> From: Kan Liang <Kan.liang@intel.com>
> 
> The event synthesization multithreading is introduced in
> ("perf top optimization") https://lkml.org/lkml/2017/9/29/269
> But it was not enabled for perf record. Because the process function
> process_synthesized_event was not multithreading friendly.
> 
> The patch series temporarily stores the process result in per-thread file,
> which make the processing in parallel. Then it dumps the file one by one to
> the perf.data at the end of event synthesization.
> 
> The source code is also available at
> https://github.com/kliang2/perf.git perf_record_opt
> 
> Usually, the event synthesization only happens once on either start or end.
> With the snapshotting code, we synthesize events multiple times, once per
> each new perf.data file. Both of the cases are verified.
> 
> Here are the latency test result on Knights Mill and Skylake server
> 
> The workload is to compile Linux kernel as below
> "sudo nice make -j$(grep -c '^processor' /proc/cpuinfo)"
> Then, "sudo perf record -e cycles -a -- sleep 1"
> 
> The latency is the time cost of __machine__synthesize_threads or
> its multithreading replacement, record__multithread_synthesize.
> 
> - Latency on Knights Mill (272 CPUs)
> 
> Original(s)     With patch(s)   Speedup
> 12.74           5.54            2.3X
> 
> - Latency on Skylake server (192 CPUs)
> 
> Original(s)     With patch(s)   Speedup
> 0.36            0.25            1.47X

Btw., just as an interesting experiment, could you try to measure how it performs 
to create just the per-CPU files, and *not* dump them into a single file?

I.e. how much faster will it get if the serialization at the end is avoided?

Of course nothing can read such per-CPU files yet, so this is just for scalability 
measurement.
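
To make the proposed measurement concrete, here is a minimal sketch. It is an
illustration only, not code from the patch series and not perf's
implementation. It assumes that each worker writes fixed-size dummy records to
its own private file, and it times the parallel phase separately from the
optional merge into a single file. The names write_worker_file and
run_experiment, the 64-byte record, and the file names are all hypothetical.

```python
import os
import shutil
import tempfile
import threading
import time


def write_worker_file(path, n_records, record):
    """Each worker writes only its own file, so no locking is needed."""
    with open(path, "wb") as f:
        for _ in range(n_records):
            f.write(record)


def run_experiment(n_workers=4, n_records=1000, merge=True):
    """Return (parallel_seconds, merge_seconds, merged_bytes)."""
    record = b"x" * 64  # hypothetical stand-in for one synthesized event
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"worker-{i}.data")
                 for i in range(n_workers)]

        # Phase 1: the parallel part, one private file per worker.
        start = time.perf_counter()
        threads = [
            threading.Thread(target=write_worker_file,
                             args=(p, n_records, record))
            for p in paths
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        parallel_seconds = time.perf_counter() - start

        # Phase 2: the optional serial part, appending each file in turn
        # to a single output file.
        merge_seconds = 0.0
        merged_bytes = 0
        if merge:
            start = time.perf_counter()
            merged_path = os.path.join(tmpdir, "merged.data")
            with open(merged_path, "wb") as out:
                for p in paths:
                    with open(p, "rb") as src:
                        shutil.copyfileobj(src, out)
            merge_seconds = time.perf_counter() - start
            merged_bytes = os.path.getsize(merged_path)

    return parallel_seconds, merge_seconds, merged_bytes
```

Calling run_experiment(merge=False) gives the time for the per-worker files
alone; comparing it with the sum of the first two values from
run_experiment(merge=True) shows how much the final serialization costs in
this toy setup.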

Thanks,

	Ingo


* RE: [PATCH V2 0/5] event synthesization multithreading for perf record
  2017-10-18 17:04 ` [PATCH V2 0/5] event synthesization multithreading for perf record Ingo Molnar
@ 2017-10-20 20:19   ` Liang, Kan
  0 siblings, 0 replies; 12+ messages in thread
From: Liang, Kan @ 2017-10-20 20:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: acme@kernel.org, mingo@redhat.com, linux-kernel@vger.kernel.org,
	peterz@infradead.org, jolsa@kernel.org, wangnan0@huawei.com,
	hekuang@huawei.com, namhyung@kernel.org,
	alexander.shishkin@linux.intel.com, Hunter, Adrian,
	ak@linux.intel.com

> 
> * kan.liang@intel.com <kan.liang@intel.com> wrote:
> 
> > From: Kan Liang <Kan.liang@intel.com>
> >
> > The event synthesization multithreading is introduced in ("perf top
> > optimization") https://lkml.org/lkml/2017/9/29/269
> > But it was not enabled for perf record. Because the process function
> > process_synthesized_event was not multithreading friendly.
> >
> > The patch series temporarily stores the process result in per-thread
> > file, which make the processing in parallel. Then it dumps the file
> > one by one to the perf.data at the end of event synthesization.
> >
> > The source code is also available at
> > https://github.com/kliang2/perf.git perf_record_opt
> >
> > Usually, the event synthesization only happens once on either start or end.
> > With the snapshotting code, we synthesize events multiple times, once
> > per each new perf.data file. Both of the cases are verified.
> >
> > Here are the latency test result on Knights Mill and Skylake server
> >
> > The workload is to compile Linux kernel as below "sudo nice make
> > -j$(grep -c '^processor' /proc/cpuinfo)"
> > Then, "sudo perf record -e cycles -a -- sleep 1"
> >
> > The latency is the time cost of __machine__synthesize_threads or its
> > multithreading replacement, record__multithread_synthesize.
> >
> > - Latency on Knights Mill (272 CPUs)
> >
> > Original(s)     With patch(s)   Speedup
> > 12.74           5.54            2.3X
> >
> > - Latency on Skylake server (192 CPUs)
> >
> > Original(s)     With patch(s)   Speedup
> > 0.36            0.25            1.47X
> 
> Btw., just as an interesting experiment, could you try to measure how it
> performs to create just the per-CPU files, and *not* dump them into a single
> file?
>

Sure, please find the experiment results in the cover letter of the V3 patch series.

Thanks,
Kan
 
> I.e. how much faster will it get if the serialization at the end is avoided?
> 
> Of course nothing can read such per-CPU files yet, so this is just for scalability
> measurement.
> 
> Thanks,
> 
> 	Ingo


Thread overview: 12+ messages
2017-10-18 14:29 [PATCH V2 0/5] event synthesization multithreading for perf record kan.liang
2017-10-18 14:29 ` [PATCH V2 1/5] perf tools: pass thread info to process function kan.liang
2017-10-18 14:29 ` [PATCH V2 2/5] perf tools: pass thread info in event synthesization kan.liang
2017-10-18 14:29 ` [PATCH V2 3/5] perf tools: expose copyfile_offset() kan.liang
2017-10-18 14:29 ` [PATCH V2 4/5] perf record: synthesize event multithreading support kan.liang
2017-10-18 15:37   ` Jiri Olsa
2017-10-18 15:39   ` Jiri Olsa
2017-10-18 15:43   ` Jiri Olsa
2017-10-18 16:57     ` Liang, Kan
2017-10-18 14:29 ` [PATCH V2 5/5] perf record: add option to set the number of thread for event synthesize kan.liang
2017-10-18 17:04 ` [PATCH V2 0/5] event synthesization multithreading for perf record Ingo Molnar
2017-10-20 20:19   ` Liang, Kan
