* [RFC] perf: Dwarf cfi based user callchains
@ 2010-10-13  5:06 Frederic Weisbecker
  2010-10-13  5:06 ` [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available Frederic Weisbecker
                   ` (9 more replies)
  0 siblings, 10 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

Hi,

This series brings DWARF CFI based callchains to userspace apps that are
built without frame pointers.

To test it, you can try:

perf record -g dwarf,24000 -e cycles:u ./hackbench 2
perf report

It seems to work but there are of course many things to improve:

- only profile userspace in that mode for now (the :u flag as above).
  The reason is that if you also profile the kernel, the user callchains
  will often start in the vdso if the user made a syscall, and the vdso
  doesn't have CFI information, so we get stuck there. I need to find a
  solution for that, like doing a single frame pointer deref on the first
  entry (vdso) and continuing with dwarf, but for that I need to know
  whether we came from a syscall. Not sure yet how I'll handle that.

- it only works with .eh_frame. I think there is another ELF section that
  is almost the same, with a few differences. I don't remember its name
  right now, but that needs a look.

- it's slow. A first step to make it faster is to support binary search
  through .eh_frame_hdr. This will probably be one of the next things
  I'll focus on. The whole thing probably also needs more caching and so on.

- it only supports x86-32. I need to split the arch specific code from
  the generic code and add at least x86-64 support.

- there are still some callchains that are not unwound. I need to investigate.
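
The binary search mentioned above for .eh_frame_hdr could look roughly like
the sketch below. It is only an illustration, not code from this series: it
assumes the search table has already been decoded into an array sorted by
start address, and the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded entry of the .eh_frame_hdr search table. */
struct fde_table_entry {
	uint64_t initial_loc;	/* first address covered by the FDE */
	uint64_t fde_addr;	/* where the FDE lives in .eh_frame */
};

/*
 * Return the last entry whose initial_loc is <= ip, or NULL if ip lies
 * below the first entry. The table must be sorted by initial_loc.
 * The caller still has to check ip against the address range of the FDE.
 */
static const struct fde_table_entry *
fde_table_search(const struct fde_table_entry *table, size_t nr, uint64_t ip)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (table[mid].initial_loc <= ip)
			lo = mid + 1;
		else
			hi = mid;
	}

	return lo ? &table[lo - 1] : NULL;
}
```

This replaces a linear walk over every FDE with a logarithmic number of
comparisons per unwound frame.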

This can be found in:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
	perf/unwind-v1

Thanks,
	Frederic
---

Frederic Weisbecker (9):
      uaccess: Make copy_from_user_nmi() globally available
      perf: Add ability to dump user regs
      perf: Add ability to dump part of the user stack
      perf: Don't record frame pointer based user stacktraces if we dump stack and regs
      perf: Support for dwarf mode callchain on perf record
      perf: Build with dwarf cfi
      perf: Support for error passed over pointers
      perf: Add libunwind dependency for dwarf cfi unwinding
      perf: Support for dwarf cfi unwinding on post processing


 arch/x86/include/asm/uaccess.h      |    5 +
 arch/x86/kernel/cpu/perf_event.c    |    4 +-
 include/asm-generic/uaccess.h       |    4 +
 include/linux/perf_event.h          |   15 +-
 kernel/perf_event.c                 |  182 +++++-
 tools/perf/Makefile                 |   23 +-
 tools/perf/builtin-record.c         |   76 +++-
 tools/perf/builtin-report.c         |    9 +-
 tools/perf/feature-tests.mak        |   14 +
 tools/perf/perf.h                   |    5 +
 tools/perf/util/callchain.c         |   35 +-
 tools/perf/util/callchain.h         |   19 +-
 tools/perf/util/event.c             |   29 +
 tools/perf/util/event.h             |    7 +
 tools/perf/util/include/linux/err.h |   24 +
 tools/perf/util/unwind.c            | 1077 +++++++++++++++++++++++++++++++++++
 16 files changed, 1485 insertions(+), 43 deletions(-)


* [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  7:15   ` Peter Zijlstra
  2010-10-13  5:06 ` [RFC PATCH 2/9] perf: Add ability to dump user regs Frederic Weisbecker
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

In order to support user stack dumps safely in perf samples from
generic code, export copy_from_user_nmi() from x86 and make it
generally available. For most archs it will map to
__copy_from_user_inatomic(), but on x86 we need to take care not
to fault from NMI context.

Since perf is the only user for now, leave the overridden x86
implementation in the perf source file.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 arch/x86/include/asm/uaccess.h   |    5 +++++
 arch/x86/kernel/cpu/perf_event.c |    4 ++--
 include/asm-generic/uaccess.h    |    4 ++++
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index abd3e0e..901daf3 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -229,6 +229,11 @@ extern void __put_user_2(void);
 extern void __put_user_4(void);
 extern void __put_user_8(void);
 
+extern unsigned long
+__copy_from_user_nmi(void *to, const void __user *from, unsigned long n);
+
+#define copy_from_user_nmi(to, from, n)	__copy_from_user_nmi(to, from, n)
+
 #ifdef CONFIG_X86_WP_WORKS_OK
 
 /**
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e2513f2..d3278da 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -45,8 +45,8 @@ do {								\
 /*
  * best effort, GUP based copy_from_user() that assumes IRQ or NMI context
  */
-static unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
+unsigned long
+__copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
 {
 	unsigned long offset, addr = (unsigned long)from;
 	int type = in_nmi() ? KM_NMI : KM_IRQ0;
diff --git a/include/asm-generic/uaccess.h b/include/asm-generic/uaccess.h
index b218b85..a8d5525 100644
--- a/include/asm-generic/uaccess.h
+++ b/include/asm-generic/uaccess.h
@@ -240,6 +240,10 @@ extern int __get_user_bad(void) __attribute__((noreturn));
 #define __copy_to_user_inatomic __copy_to_user
 #endif
 
+#ifndef copy_from_user_nmi
+#define copy_from_user_nmi __copy_from_user_inatomic
+#endif
+
 static inline long copy_from_user(void *to,
 		const void __user * from, unsigned long n)
 {
-- 
1.6.2.3
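
The "arch override with generic fallback" pattern used by this patch can be
tried out in userspace with the sketch below. All the demo_* names and the
DEMO_ARCH_HAS_NMI_COPY switch are hypothetical stand-ins; only the #ifndef
fallback idea is taken from the patch.

```c
#include <string.h>

/* Stand-in for an arch header that provides its own implementation. */
#ifdef DEMO_ARCH_HAS_NMI_COPY
static unsigned long
demo_arch_copy_nmi(void *to, const void *from, unsigned long n)
{
	memcpy(to, from, n);
	return n;
}
#define demo_copy_nmi(to, from, n)	demo_arch_copy_nmi(to, from, n)
#endif

/* Stand-in for the generic helper; returns the bytes copied (demo only). */
static unsigned long
demo_copy_inatomic(void *to, const void *from, unsigned long n)
{
	memcpy(to, from, n);
	return n;
}

/* Generic header: only define the name if the arch did not. */
#ifndef demo_copy_nmi
#define demo_copy_nmi demo_copy_inatomic
#endif
```

Callers only ever use demo_copy_nmi() and do not need to know which
implementation was selected.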



* [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
  2010-10-13  5:06 ` [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  7:20   ` Peter Zijlstra
  2010-10-13  5:06 ` [RFC PATCH 3/9] perf: Add ability to dump part of the user stack Frederic Weisbecker
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

Add a new PERF_SAMPLE_UREGS flag to the perf sample types. It dumps the
user space register context as it was before the task entered the
kernel for whatever reason.

This will be useful for implementing DWARF CFI based stack unwinding
on top of samples.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 include/linux/perf_event.h |    4 +++-
 kernel/perf_event.c        |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61b1e2d..71b108b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_UREGS			= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
@@ -932,6 +933,7 @@ struct perf_sample_data {
 	u64				period;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct pt_regs			*uregs;
 };
 
 static inline
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 64507ea..f1c0d72 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1993,6 +1993,19 @@ exit_put:
 	return entry;
 }
 
+static struct pt_regs *perf_sample_uregs(struct pt_regs *regs)
+{
+	if (!user_mode(regs)) {
+		if (current->mm)
+			regs = task_pt_regs(current);
+		else
+			regs = NULL;
+	}
+
+	return regs;
+}
+
+
 /*
  * Initialize the perf_event context in a task_struct:
  */
@@ -3607,6 +3620,25 @@ void perf_output_sample(struct perf_output_handle *handle,
 			perf_output_put(handle, raw);
 		}
 	}
+
+	if (sample_type & PERF_SAMPLE_UREGS) {
+		u64 size;
+
+		if (data->uregs) {
+			size = round_up(sizeof(*data->uregs), sizeof(u64));
+			perf_output_put(handle, size);
+			perf_output_copy(handle, data->uregs, sizeof(*data->uregs));
+
+			if (sizeof(*data->uregs) < size) {
+				u64 pad = 0;
+				int pad_size = size - sizeof(*data->uregs);
+				perf_output_copy(handle, &pad, pad_size);
+			}
+		} else {
+			size = 0;
+			perf_output_put(handle, size);
+		}
+	}
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
@@ -3694,6 +3726,16 @@ void perf_prepare_sample(struct perf_event_header *header,
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_UREGS) {
+		int size = sizeof(u64);
+
+		data->uregs = perf_sample_uregs(regs);
+		if (data->uregs)
+			size += round_up(sizeof(*data->uregs), sizeof(u64));
+
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event, int nmi,
-- 
1.6.2.3
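
On the consumer side, the layout written above (a u64 size, which is 0 when
no user registers are available, followed by that many bytes of padded
register data) could be read back with something like the sketch below. The
function name and its interface are hypothetical and only meant to
illustrate the layout; this is not the parsing code from the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical reader for the PERF_SAMPLE_UREGS part of a sample.
 * Returns the number of bytes consumed, or 0 if the buffer is too short.
 * On success *regs points at the register data (NULL if size is 0).
 */
static size_t parse_uregs(const unsigned char *buf, size_t len,
			  const unsigned char **regs, uint64_t *regs_size)
{
	uint64_t size;

	if (len < sizeof(size))
		return 0;
	memcpy(&size, buf, sizeof(size));

	/* The size field already includes the padding to 8 bytes. */
	if (size > len - sizeof(size))
		return 0;

	*regs = size ? buf + sizeof(size) : NULL;
	*regs_size = size;

	return sizeof(size) + (size_t)size;
}
```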



* [RFC PATCH 3/9] perf: Add ability to dump part of the user stack
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
  2010-10-13  5:06 ` [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available Frederic Weisbecker
  2010-10-13  5:06 ` [RFC PATCH 2/9] perf: Add ability to dump user regs Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  7:22   ` Peter Zijlstra
  2010-10-13  5:06 ` [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs Frederic Weisbecker
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

Being able to dump part of the user stack, starting from the
stack pointer, will be useful for doing post-mortem DWARF CFI based
stack unwinding.

This is done through the new ustack_dump_size perf attribute. If it
is non-zero, the user stack will be dumped in each sample, up to the
requested size in bytes.

The longer the dump, the deeper the retrieved callchain can be.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 include/linux/perf_event.h |   11 +++-
 kernel/perf_event.c        |  123 ++++++++++++++++++++++++++++++++++++--------
 2 files changed, 110 insertions(+), 24 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 71b108b..0b1b039 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -227,6 +227,12 @@ struct perf_event_attr {
 	__u32			bp_type;
 	__u64			bp_addr;
 	__u64			bp_len;
+
+	__u64			__reserved_2;
+	__u64			__reserved_3;
+
+	__u32			ustack_dump_size;
+	__u32			__reserved_4;
 };
 
 /*
@@ -1061,8 +1067,9 @@ extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size,
 			     int nmi, int sample);
 extern void perf_output_end(struct perf_output_handle *handle);
-extern void perf_output_copy(struct perf_output_handle *handle,
-			     const void *buf, unsigned int len);
+extern unsigned int
+perf_output_copy(struct perf_output_handle *handle,
+		 const void *buf, unsigned int len);
 extern int perf_swevent_get_recursion_context(void);
 extern void perf_swevent_put_recursion_context(int rctx);
 extern void perf_event_enable(struct perf_event *event);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index f1c0d72..0a1f0df 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3332,28 +3332,43 @@ out:
 	preempt_enable();
 }
 
-__always_inline void perf_output_copy(struct perf_output_handle *handle,
-		      const void *buf, unsigned int len)
-{
-	do {
-		unsigned long size = min_t(unsigned long, handle->size, len);
-
-		memcpy(handle->addr, buf, size);
-
-		len -= size;
-		handle->addr += size;
-		buf += size;
-		handle->size -= size;
-		if (!handle->size) {
-			struct perf_buffer *buffer = handle->buffer;
-
-			handle->page++;
-			handle->page &= buffer->nr_pages - 1;
-			handle->addr = buffer->data_pages[handle->page];
-			handle->size = PAGE_SIZE << page_order(buffer);
-		}
-	} while (len);
-}
+static int memcpy_common(void *dst, const void *src, size_t n)
+{
+	memcpy(dst, src, n);
+
+	return n;
+}
+
+#define DEFINE_PERF_OUTPUT_COPY(func_name, memcpy_func)				\
+__always_inline unsigned int func_name(struct perf_output_handle *handle,	\
+				       const void *buf, unsigned int len)	\
+{										\
+	unsigned long size, written;						\
+										\
+	do {									\
+		size = min_t(unsigned long, handle->size, len);			\
+										\
+		written = memcpy_func(handle->addr, buf, size);			\
+										\
+		len -= written;							\
+		handle->addr += written;					\
+		buf += written;							\
+		handle->size -= written;					\
+		if (!handle->size) {						\
+			struct perf_buffer *buffer = handle->buffer;		\
+										\
+			handle->page++;						\
+			handle->page &= buffer->nr_pages - 1;			\
+			handle->addr = buffer->data_pages[handle->page];	\
+			handle->size = PAGE_SIZE << page_order(buffer);		\
+		}								\
+	} while (len && written == size);					\
+										\
+	return len;								\
+}
+
+DEFINE_PERF_OUTPUT_COPY(perf_output_copy, memcpy_common)
+DEFINE_PERF_OUTPUT_COPY(perf_output_copy_user_nmi, copy_from_user_nmi)
 
 int perf_output_begin(struct perf_output_handle *handle,
 		      struct perf_event *event, unsigned int size,
@@ -3639,6 +3654,44 @@ void perf_output_sample(struct perf_output_handle *handle,
 			perf_output_put(handle, size);
 		}
 	}
+
+	if (event->attr.ustack_dump_size) {
+		unsigned long sp;
+		unsigned int rem;
+		u64 size, dyn_size;
+
+		/* Case of a kernel thread, nothing to dump */
+		if (!data->uregs) {
+			size = 0;
+			perf_output_put(handle, size);
+
+			return;
+		}
+
+		/*
+		 * Static size: we always dump the size requested by the user
+		 * because most of the time, the top of the user stack is not
+		 * paged out. Perhaps we should force ustack_dump_size
+		 * to be % 8.
+		 */
+		size = event->attr.ustack_dump_size;
+		size = round_up(size, sizeof(u64));
+		perf_output_put(handle, size);
+
+		/* CHECKME: might be missing on some archs */
+		sp = user_stack_pointer(data->uregs);
+		rem = perf_output_copy_user_nmi(handle, (void *)sp, size);
+		dyn_size = size - rem;
+
+		/* What couldn't be dumped is zero padded */
+		while (rem--) {
+			char zero = 0;
+			perf_output_put(handle, zero);
+		}
+
+		/* Dynamic size: whole dump - padding */
+		perf_output_put(handle, dyn_size);
+	}
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
@@ -3736,6 +3789,32 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 		header->size += size;
 	}
+
+	if (event->attr.ustack_dump_size) {
+		if (!(sample_type & PERF_SAMPLE_UREGS))
+			data->uregs = perf_sample_uregs(regs);
+
+		/*
+		 * A first field that tells the _static_ size of the dump. 0 if
+		 * there is nothing to dump (ie: we are in a kernel thread)
+		 * otherwise the requested size.
+		 */
+		header->size += sizeof(u64);
+
+		/*
+		 * If there is something to dump, add space for the dump itself
+		 * and for the field that tells the _dynamic_ size, which is
+		 * how many have been actually dumped. What couldn't be dumped
+		 * will be zero-padded.
+		 */
+		if (data->uregs) {
+			u64 size = event->attr.ustack_dump_size;
+
+			size = round_up(size, sizeof(u64));
+			header->size += size;
+			header->size += sizeof(u64);
+		}
+	}
 }
 
 static void perf_event_output(struct perf_event *event, int nmi,
-- 
1.6.2.3
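
To illustrate the record layout produced above (a u64 static size, then
that many bytes of zero-padded stack, then a u64 dynamic size; or just a
zero static size for kernel threads), here is a sketch of a reader. The
struct and function names are hypothetical, and it assumes the reader runs
with the same endianness as the writer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one user stack dump inside a sample. */
struct ustack_dump {
	const unsigned char *data;	/* NULL if nothing was dumped */
	uint64_t size;			/* static size, padding included */
	uint64_t dyn_size;		/* bytes really copied from the stack */
};

/* Returns the number of bytes consumed, or 0 on a malformed record. */
static size_t parse_ustack_dump(const unsigned char *buf, size_t len,
				struct ustack_dump *out)
{
	uint64_t size, dyn_size;

	out->data = NULL;
	out->size = 0;
	out->dyn_size = 0;

	if (len < sizeof(size))
		return 0;
	memcpy(&size, buf, sizeof(size));

	/* Kernel thread: only the zero static size was written. */
	if (!size)
		return sizeof(size);

	/* Need room for both size fields plus the dump itself. */
	if (len < 2 * sizeof(size) || size > len - 2 * sizeof(size))
		return 0;

	memcpy(&dyn_size, buf + sizeof(size) + size, sizeof(dyn_size));
	if (dyn_size > size)
		return 0;

	out->data = buf + sizeof(size);
	out->size = size;
	out->dyn_size = dyn_size;

	return 2 * sizeof(size) + (size_t)size;
}
```

An unwinder would then treat only the first dyn_size bytes of the dump as
valid stack memory and ignore the zero padding.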



* [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2010-10-13  5:06 ` [RFC PATCH 3/9] perf: Add ability to dump part of the user stack Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  7:23   ` Peter Zijlstra
  2010-10-13  5:06 ` [RFC PATCH 5/9] perf: Support for dwarf mode callchain on perf record Frederic Weisbecker
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

If we record the user stack and regs, then don't record the frame
pointer based user callchain: the user callchain will be built from
the DWARF CFI information instead.

I'm not sure this is a good thing. I don't even know why I'm sending
this patch.

User regs and stack dumps might be selected for uses other than
callchains in the future, so this probably shouldn't mess with the
regular callchain code. Instead we should probably have an
exclude_callchain_user attribute, which could also be useful to filter
out user callchains when people don't want them.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 kernel/perf_event.c |   17 +++++++++++++----
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 0a1f0df..018a098 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1958,7 +1958,8 @@ put_callchain_entry(int rctx)
 	put_recursion_context(__get_cpu_var(callchain_recursion), rctx);
 }
 
-static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
+static struct perf_callchain_entry *
+perf_callchain(struct pt_regs *regs, struct perf_event *event, u64 sample_type)
 {
 	int rctx;
 	struct perf_callchain_entry *entry;
@@ -1983,8 +1984,16 @@ static struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
 	}
 
 	if (regs) {
-		perf_callchain_store(entry, PERF_CONTEXT_USER);
-		perf_callchain_user(entry, regs);
+		/*
+		 * CHECKME: if user regs and user stack dumps are used
+		 * for something other than callchains one day, we may
+		 * want an exclude_callchain_user?
+		 */
+		if (!(sample_type & PERF_SAMPLE_UREGS &&
+		      event->attr.ustack_dump_size)) {
+			perf_callchain_store(entry, PERF_CONTEXT_USER);
+			perf_callchain_user(entry, regs);
+		}
 	}
 
 exit_put:
@@ -3760,7 +3769,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 
-		data->callchain = perf_callchain(regs);
+		data->callchain = perf_callchain(regs, event, sample_type);
 
 		if (data->callchain)
 			size += data->callchain->nr;
-- 
1.6.2.3
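
The condition added to perf_callchain() can be read as the small predicate
below. This is a standalone sketch for illustration: the function name and
the DEMO_SAMPLE_UREGS constant are hypothetical (the constant mirrors the
bit value that patch 2 gives to PERF_SAMPLE_UREGS).

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SAMPLE_UREGS	(1U << 11)

/*
 * The frame pointer based user callchain is skipped only when both the
 * user regs and a user stack dump were requested, since that combination
 * is what the post-processing unwinder needs.
 */
static bool want_fp_user_callchain(uint64_t sample_type,
				   uint32_t ustack_dump_size)
{
	bool dwarf_mode = (sample_type & DEMO_SAMPLE_UREGS) &&
			  ustack_dump_size;

	return !dwarf_mode;
}
```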



* [RFC PATCH 5/9] perf: Support for dwarf mode callchain on perf record
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2010-10-13  5:06 ` [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  5:06 ` [RFC PATCH 6/9] perf: Build with dwarf cfi Frederic Weisbecker
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

"perf record -g" is the command to record frame pointer based
callchains. This patch extends the "-g" option to support the
DWARF CFI mode.

The new behaviour is:

- "perf record -g" records frame pointer based callchains
as it did before.

- "perf record -g fp" is the same as "-g" alone.

- "perf record -g dwarf" records frame pointer based kernel
callchains but ignores the user part. Instead it dumps the
user regs and, by default, 4000 bytes of user stack into each
sample.

- "perf record -g dwarf,x" does the same but dumps x bytes of
user stack instead of the default 4000. The larger the size,
the deeper the callchain can be, but the higher the profiling
overhead and the bigger the output file.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 tools/perf/builtin-record.c |   76 ++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ff77b80..132d710 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -32,6 +32,12 @@ enum write_mode_t {
 	WRITE_APPEND
 };
 
+enum call_graph_mode {
+	CALLCHAIN_NONE,
+	CALLCHAIN_FP,
+	CALLCHAIN_DWARF
+};
+
 static int			*fd[MAX_NR_CPUS][MAX_COUNTERS];
 
 static u64			user_interval			= ULLONG_MAX;
@@ -56,7 +62,6 @@ static int			thread_num			=      0;
 static pid_t			child_pid			=     -1;
 static bool			no_inherit			=  false;
 static enum write_mode_t	write_mode			= WRITE_FORCE;
-static bool			call_graph			=  false;
 static bool			inherit_stat			=  false;
 static bool			no_samples			=  false;
 static bool			sample_address			=  false;
@@ -76,6 +81,10 @@ static off_t			post_processing_offset;
 static struct perf_session	*session;
 static const char		*cpu_list;
 
+static char			callchain_default_opt[]		=      "fp";
+static unsigned long		stack_dump_size			=	4000;
+static enum call_graph_mode	call_graph			= CALLCHAIN_NONE;
+
 struct mmap_data {
 	int			counter;
 	void			*base;
@@ -274,9 +283,15 @@ static void create_counter(int counter, int cpu)
 		attr->mmap_data = track;
 	}
 
-	if (call_graph)
+	if (call_graph) {
 		attr->sample_type	|= PERF_SAMPLE_CALLCHAIN;
 
+		if (call_graph == CALLCHAIN_DWARF) {
+			attr->sample_type |= PERF_SAMPLE_UREGS;
+			attr->ustack_dump_size = stack_dump_size;
+		}
+	}
+
 	if (system_wide)
 		attr->sample_type	|= PERF_SAMPLE_CPU;
 
@@ -779,6 +794,58 @@ out_delete_session:
 	return err;
 }
 
+static int
+parse_callchain_opt(const struct option *opt __used, const char *arg,
+		    int unset)
+{
+	char *tok;
+
+	/*
+	 * --no-call-graph
+	 */
+	if (unset)
+		return 0;
+
+	if (!arg)
+		return 0;
+
+	tok = strtok((char *)arg, ",");
+	if (!tok)
+		return -1;
+
+	/* get the output mode */
+	if (!strncmp(tok, "fp", strlen(arg)))
+		call_graph = CALLCHAIN_FP;
+
+	else if (!strncmp(tok, "dwarf", strlen(arg)))
+		call_graph = CALLCHAIN_DWARF;
+
+	else if (!strncmp(tok, "none", strlen(arg)))
+		return 0;
+
+	else
+		return -1;
+
+	/* get the stack dump size */
+	tok = strtok(NULL, ",");
+	if (!tok)
+		return 0;
+
+	/* No stack dump size if we record using frame pointers */
+	if (call_graph == CALLCHAIN_FP) {
+		fprintf(stderr, "Stack dump size is only necessary for -g dwarf\n");
+		return -1;
+	}
+
+	stack_dump_size = strtoul(tok, NULL, 0);
+	if (stack_dump_size == ULONG_MAX) {
+		perror("Incorrect stack dump size\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 static const char * const record_usage[] = {
 	"perf record [<options>] [<command>]",
 	"perf record [<options>] -- <command> [<options>]",
@@ -816,8 +883,9 @@ static const struct option options[] = {
 		    "child tasks do not inherit counters"),
 	OPT_UINTEGER('F', "freq", &user_freq, "profile at this frequency"),
 	OPT_UINTEGER('m', "mmap-pages", &mmap_pages, "number of mmap data pages"),
-	OPT_BOOLEAN('g', "call-graph", &call_graph,
-		    "do call-graph (stack chain/backtrace) recording"),
+	OPT_CALLBACK_DEFAULT('g', "call-graph", NULL, "mode,dump_size",
+		     "do call-graph (stack chain/backtrace) recording. "
+		     "Default: fp", &parse_callchain_opt, callchain_default_opt),
 	OPT_INCR('v', "verbose", &verbose,
 		    "be more verbose (show counter open errors, etc)"),
 	OPT_BOOLEAN('s', "stat", &inherit_stat,
-- 
1.6.2.3
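
As a possible alternative to the strtok() based parsing above, here is a
sketch that parses the "mode[,dump_size]" argument without modifying the
input string, matches the mode names exactly and uses the end pointer of
strtoul() to reject trailing garbage. The demo_* names are hypothetical,
and it is stricter than the patch in that "none,123" is rejected.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum demo_mode { DEMO_NONE, DEMO_FP, DEMO_DWARF };

/* Returns 0 on success, -1 on a malformed argument. */
static int demo_parse_callchain(const char *arg, enum demo_mode *mode,
				unsigned long *dump_size)
{
	const char *comma = strchr(arg, ',');
	size_t mode_len = comma ? (size_t)(comma - arg) : strlen(arg);
	unsigned long val;
	char *end;

	if (mode_len == 2 && !strncmp(arg, "fp", 2))
		*mode = DEMO_FP;
	else if (mode_len == 5 && !strncmp(arg, "dwarf", 5))
		*mode = DEMO_DWARF;
	else if (mode_len == 4 && !strncmp(arg, "none", 4))
		*mode = DEMO_NONE;
	else
		return -1;

	if (!comma)
		return 0;

	/* A dump size only makes sense in dwarf mode. */
	if (*mode != DEMO_DWARF)
		return -1;

	errno = 0;
	val = strtoul(comma + 1, &end, 0);
	if (errno || end == comma + 1 || *end != '\0')
		return -1;

	*dump_size = val;
	return 0;
}
```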



* [RFC PATCH 6/9] perf: Build with dwarf cfi
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2010-10-13  5:06 ` [RFC PATCH 5/9] perf: Support for dwarf mode callchain on perf record Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  5:06 ` [RFC PATCH 7/9] perf: Support for error passed over pointers Frederic Weisbecker
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

Build perf with the DWARF call frame information so that we can
unwind callchains in perf itself.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 tools/perf/Makefile |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index d1db0f6..28de60b 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -224,7 +224,7 @@ ifndef PERF_DEBUG
   CFLAGS_OPTIMIZE = -O6
 endif
 
-CFLAGS = -ggdb3 -Wall -Wextra -std=gnu99 -Werror $(CFLAGS_OPTIMIZE) -D_FORTIFY_SOURCE=2 $(EXTRA_WARNINGS) $(EXTRA_CFLAGS)
+CFLAGS = -ggdb3 -funwind-tables -Wall -Wextra -std=gnu99 -Werror $(CFLAGS_OPTIMIZE) -D_FORTIFY_SOURCE=2 $(EXTRA_WARNINGS) $(EXTRA_CFLAGS)
 EXTLIBS = -lpthread -lrt -lelf -lm
 ALL_CFLAGS = $(CFLAGS) -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64
 ALL_LDFLAGS = $(LDFLAGS)
-- 
1.6.2.3



* [RFC PATCH 7/9] perf: Support for error passed over pointers
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2010-10-13  5:06 ` [RFC PATCH 6/9] perf: Build with dwarf cfi Frederic Weisbecker
@ 2010-10-13  5:06 ` Frederic Weisbecker
  2010-10-13  5:07 ` [RFC PATCH 8/9] perf: Add libunwind dependency for dwarf cfi unwinding Frederic Weisbecker
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:06 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

Add a linux/err.h header that carries the ERR_PTR/PTR_ERR/IS_ERR
helpers so that they also become usable in perf.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 tools/perf/Makefile                 |    1 +
 tools/perf/util/include/linux/err.h |   24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/util/include/linux/err.h

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 28de60b..1146195 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -375,6 +375,7 @@ LIB_H += util/include/linux/prefetch.h
 LIB_H += util/include/linux/rbtree.h
 LIB_H += util/include/linux/string.h
 LIB_H += util/include/linux/types.h
+LIB_H += util/include/linux/err.h
 LIB_H += util/include/asm/asm-offsets.h
 LIB_H += util/include/asm/bug.h
 LIB_H += util/include/asm/byteorder.h
diff --git a/tools/perf/util/include/linux/err.h b/tools/perf/util/include/linux/err.h
new file mode 100644
index 0000000..4e6dc36
--- /dev/null
+++ b/tools/perf/util/include/linux/err.h
@@ -0,0 +1,24 @@
+#ifndef PERF_ERR_H
+#define PERF_ERR_H
+
+
+#define MAX_ERRNO	4095
+
+#define IS_ERR_VALUE(x) ((x) >= (unsigned long)-MAX_ERRNO)
+
+static inline void *ERR_PTR(long err)
+{
+	return (void *) err;
+}
+
+static inline long PTR_ERR(const void *ptr)
+{
+	return (long) ptr;
+}
+
+static inline long IS_ERR(const void *ptr)
+{
+	return IS_ERR_VALUE((unsigned long)ptr);
+}
+
+#endif /* PERF_ERR_H */
-- 
1.6.2.3



* [RFC PATCH 8/9] perf: Add libunwind dependency for dwarf cfi unwinding
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2010-10-13  5:06 ` [RFC PATCH 7/9] perf: Support for error passed over pointers Frederic Weisbecker
@ 2010-10-13  5:07 ` Frederic Weisbecker
  2010-10-13  5:07 ` [RFC PATCH 9/9] perf: Support for dwarf cfi unwinding on post processing Frederic Weisbecker
  2010-10-13 15:13 ` [RFC] perf: Dwarf cfi based user callchains Frank Ch. Eigler
  9 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:07 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

libunwind is not mandatory to build perf, but it is required to get
dwarf cfi unwinding support.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 tools/perf/Makefile          |   19 +++++++++++++++++++
 tools/perf/feature-tests.mak |   14 ++++++++++++++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 1146195..ca4cde1 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -515,6 +515,20 @@ ifneq ($(call try-cc,$(SOURCE_DWARF),$(FLAGS_DWARF)),y)
 endif # Dwarf support
 endif # NO_DWARF
 
+
+# Only x86-32 is supported for now
+ifneq ($(ARCH),x86)
+	NO_LIBUNWIND := 1
+endif
+
+ifndef NO_LIBUNWIND
+FLAGS_UNWIND=$(ALL_CFLAGS) -lunwind-x86 -lunwind-ptrace $(ALL_LDFLAGS) $(EXTLIBS)
+ifneq ($(call try-cc,$(SOURCE_LIBUNWIND),$(FLAGS_UNWIND)),y)
+	msg := $(warning No libunwind found. Please install libunwind >= 0.99);
+	NO_LIBUNWIND := 1
+endif # Libunwind support
+endif # NO_LIBUNWIND
+
 -include arch/$(ARCH)/Makefile
 
 ifeq ($(uname_S),Darwin)
@@ -561,6 +575,11 @@ else
 endif # PERF_HAVE_DWARF_REGS
 endif # NO_DWARF
 
+ifndef NO_LIBUNWIND
+	BASIC_CFLAGS += -DLIBUNWIND_SUPPORT
+	EXTLIBS += -lunwind-ptrace -lunwind-x86
+endif
+
 ifdef NO_NEWT
 	BASIC_CFLAGS += -DNO_NEWT_SUPPORT
 else
diff --git a/tools/perf/feature-tests.mak b/tools/perf/feature-tests.mak
index b253db6..367b213 100644
--- a/tools/perf/feature-tests.mak
+++ b/tools/perf/feature-tests.mak
@@ -90,6 +90,20 @@ int main(void)
 endef
 endif
 
+ifndef NO_LIBUNWIND
+define SOURCE_LIBUNWIND
+#include <libunwind.h>
+#include <stdlib.h>
+
+int main(void)
+{
+	unw_addr_space_t addr_space;
+	addr_space = unw_create_addr_space(NULL, 0);
+	return 0;
+}
+endef
+endif
+
 define SOURCE_BFD
 #include <bfd.h>
 
-- 
1.6.2.3



* [RFC PATCH 9/9] perf: Support for dwarf cfi unwinding on post processing
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2010-10-13  5:07 ` [RFC PATCH 8/9] perf: Add libunwind dependency for dwarf cfi unwinding Frederic Weisbecker
@ 2010-10-13  5:07 ` Frederic Weisbecker
  2010-10-13 15:13 ` [RFC] perf: Dwarf cfi based user callchains Frank Ch. Eigler
  9 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13  5:07 UTC (permalink / raw)
  To: LKML
  Cc: LKML, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian,
	Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt,
	Robert Richter

This brings support for dwarf cfi unwinding to perf post
processing. Call frame information is retrieved and then passed
to libunwind, which requests memory and register contents from the
application. We use the stack and regs dumped with the samples to
retrieve the context state necessary for this unwinding.
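
To illustrate how a memory request can be answered from a stack dump, here
is a minimal sketch, assuming the dump starts at the sampled stack pointer
and holds 32-bit words. struct stack_dump and stack_dump_read() are
hypothetical names for this example; they are simplified stand-ins, not
the structures or accessors used in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the user stack dump attached to a sample */
struct stack_dump {
	uint64_t	size;	/* number of bytes copied from the user stack */
	const char	*data;	/* bytes starting at the sampled stack pointer */
};

/*
 * Read one 32-bit word at user address 'addr', given that the dump
 * starts at stack pointer 'sp'. Returns 0 on success, or -1 if the
 * word is not fully contained in the dump.
 */
static int stack_dump_read(const struct stack_dump *dump, uint32_t sp,
			   uint32_t addr, uint32_t *val)
{
	uint64_t offset;

	/* Addresses below the stack pointer were not dumped */
	if (addr < sp)
		return -1;

	offset = (uint64_t)addr - sp;
	if (offset + sizeof(*val) > dump->size)
		return -1;

	/* memcpy avoids unaligned access and aliasing problems */
	memcpy(val, dump->data + offset, sizeof(*val));

	return 0;
}
```

Anything outside the dumped range has to come from somewhere else, for
example from the mapped file backing that address.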

No specific options are needed for perf report to handle this,
just run it as usual.

Only x86-32 is supported for now.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Robert Richter <robert.richter@amd.com>
---
 tools/perf/Makefile         |    1 +
 tools/perf/builtin-report.c |    9 +-
 tools/perf/perf.h           |    5 +
 tools/perf/util/callchain.c |   35 ++-
 tools/perf/util/callchain.h |   19 +-
 tools/perf/util/event.c     |   29 ++
 tools/perf/util/event.h     |    7 +
 tools/perf/util/unwind.c    | 1077 +++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 1175 insertions(+), 7 deletions(-)
 create mode 100644 tools/perf/util/unwind.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index ca4cde1..469d694 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -465,6 +465,7 @@ LIB_OBJS += $(OUTPUT)util/hist.o
 LIB_OBJS += $(OUTPUT)util/probe-event.o
 LIB_OBJS += $(OUTPUT)util/util.o
 LIB_OBJS += $(OUTPUT)util/cpumap.o
+LIB_OBJS += $(OUTPUT)util/unwind.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
 
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5de405d..b48e218 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -30,6 +30,8 @@
 #include "util/sort.h"
 #include "util/hist.h"
 
+#include <linux/err.h>
+
 static char		const *input_name = "perf.data";
 
 static bool		force, use_tui, use_stdio;
@@ -87,6 +89,7 @@ static int perf_session__add_hist_entry(struct perf_session *self,
 	struct hist_entry *he;
 	struct hists *hists;
 	struct perf_event_attr *attr;
+	struct dwarf_callchain *dc;
 
 	if ((sort__has_parent || symbol_conf.use_callchain) && data->callchain) {
 		syms = perf_session__resolve_callchain(self, al->thread,
@@ -95,6 +98,10 @@ static int perf_session__add_hist_entry(struct perf_session *self,
 			return -ENOMEM;
 	}
 
+	dc = callchain_unwind(self, al->thread, data);
+	if (IS_ERR(dc))
+		return PTR_ERR(dc);
+
 	attr = perf_header__find_attr(data->id, &self->header);
 	if (attr)
 		hists = perf_session__hists_findnew(self, data->id, attr->type, attr->config);
@@ -108,7 +115,7 @@ static int perf_session__add_hist_entry(struct perf_session *self,
 	err = 0;
 	if (symbol_conf.use_callchain) {
 		err = callchain_append(he->callchain, data->callchain, syms,
-				       data->period);
+				       dc, data->period);
 		if (err)
 			goto out_free_syms;
 	}
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index ef7aa0a..2f7a6eb 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -132,6 +132,11 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+struct raw_stack_dump {
+	u64 size;
+	char stack[0];
+};
+
 extern bool perf_host, perf_guest;
 
 #endif
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index e12d539..54ca387 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -365,8 +365,10 @@ append_chain(struct callchain_node *root, struct resolved_chain *chain,
 }
 
 static void filter_context(struct ip_callchain *old, struct resolved_chain *new,
-			   struct map_symbol *syms)
+			   struct map_symbol *syms,
+			   struct dwarf_callchain *dwarf_chain)
 {
+	struct dwarf_callchain_entry *entry, *tmp;
 	int i, j = 0;
 
 	for (i = 0; i < (int)old->nr; i++) {
@@ -379,23 +381,46 @@ static void filter_context(struct ip_callchain *old, struct resolved_chain *new,
 	}
 
 	new->nr = j;
+
+	if (!dwarf_chain)
+		return;
+
+	list_for_each_entry_safe(entry, tmp, &dwarf_chain->chain_head, list) {
+		new->ips[j].ip = entry->ip;
+		new->ips[j].ms = entry->ms;
+		j++;
+
+		list_del(&entry->list);
+		free(entry);
+	}
+
+	new->nr += dwarf_chain->nb;
+
+	free(dwarf_chain);
 }
 
 
 int callchain_append(struct callchain_root *root, struct ip_callchain *chain,
-		     struct map_symbol *syms, u64 period)
+		     struct map_symbol *syms,
+		     struct dwarf_callchain *dwarf_chain, u64 period)
 {
 	struct resolved_chain *filtered;
+	int entries;
+
+	entries = chain->nr;
+
+	if (dwarf_chain)
+		entries += dwarf_chain->nb;
 
-	if (!chain->nr)
+	if (!entries)
 		return 0;
 
 	filtered = zalloc(sizeof(*filtered) +
-			  chain->nr * sizeof(struct resolved_ip));
+			  entries * sizeof(struct resolved_ip));
 	if (!filtered)
 		return -ENOMEM;
 
-	filter_context(chain, filtered, syms);
+	filter_context(chain, filtered, syms, dwarf_chain);
 
 	if (!filtered->nr)
 		goto end;
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index c15fb8c..9fb55b4 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -31,6 +31,17 @@ struct callchain_root {
 	struct callchain_node	node;
 };
 
+struct dwarf_callchain_entry {
+	struct list_head	list;
+	u64			ip;
+	struct map_symbol	ms;
+};
+
+struct dwarf_callchain {
+	int			nb;
+	struct list_head	chain_head;
+};
+
 struct callchain_param;
 
 typedef void (*sort_chain_func_t)(struct rb_root *, struct callchain_root *,
@@ -68,8 +79,14 @@ static inline u64 cumul_hits(struct callchain_node *node)
 
 int register_callchain_param(struct callchain_param *param);
 int callchain_append(struct callchain_root *root, struct ip_callchain *chain,
-		     struct map_symbol *syms, u64 period);
+		     struct map_symbol *syms,
+		     struct dwarf_callchain *dwarf_chain, u64 period);
+
 int callchain_merge(struct callchain_root *dst, struct callchain_root *src);
 
 bool ip_callchain__valid(struct ip_callchain *chain, const event_t *event);
+
+struct dwarf_callchain *callchain_unwind(struct perf_session *session,
+					 struct thread *thread,
+					 struct sample_data *data);
 #endif	/* __PERF_CALLCHAIN_H */
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index dab9e75..5e9013b 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -766,6 +766,8 @@ out_filtered:
 	return 0;
 }
 
+struct pt_regs;
+
 int event__parse_sample(const event_t *event, u64 type, struct sample_data *data)
 {
 	const u64 *array = event->sample.array;
@@ -830,6 +832,33 @@ int event__parse_sample(const event_t *event, u64 type, struct sample_data *data
 		data->raw_size = *p;
 		p++;
 		data->raw_data = p;
+		array += 1 + (data->raw_size * sizeof(u64));
+	}
+
+	if (type & PERF_SAMPLE_UREGS) {
+		u64 size = *array;
+		array++;
+		if (!size) {
+			data->regs = NULL;
+		} else {
+			data->regs = (struct pt_regs *)array;
+			array += size / sizeof(*array);
+		}
+
+		size = *array++;
+
+		/*
+		 * FIXME: Even if user regs and user stack dumps are
+		 * always paired right now, one might be recorded without the
+		 * other.
+		 */
+		if (!size) {
+			data->stack.size = 0;
+		} else {
+			data->stack.data = (char *)array;
+			array += size / sizeof(*array);
+			data->stack.size = *array;
+		}
 	}
 
 	return 0;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 8e790da..9c43cd0 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -61,6 +61,11 @@ struct sample_event {
 	u64 array[];
 };
 
+struct user_stack_dump {
+	u64 size;
+	char *data;
+};
+
 struct sample_data {
 	u64 ip;
 	u32 pid, tid;
@@ -73,6 +78,8 @@ struct sample_data {
 	u32 raw_size;
 	void *raw_data;
 	struct ip_callchain *callchain;
+	struct pt_regs *regs;
+	struct user_stack_dump stack;
 };
 
 #define BUILD_ID_SIZE 20
diff --git a/tools/perf/util/unwind.c b/tools/perf/util/unwind.c
new file mode 100644
index 0000000..cd45486
--- /dev/null
+++ b/tools/perf/util/unwind.c
@@ -0,0 +1,1077 @@
+/*
+ * Post mortem Dwarf CFI based unwinding on top of regs and stack dumps.
+ *
+ * Much of this code has been borrowed from, or heavily inspired by, parts of
+ * the libunwind 0.99 code, which are (amongst other contributors I may have
+ * forgotten):
+ *
+ * Copyright (C) 2002-2007 Hewlett-Packard Co
+ *	Contributed by David Mosberger-Tang <davidm@hpl.hp.com>
+ *
+ * And the bugs have been added by:
+ *
+ * Copyright (C) 2010, Frederic Weisbecker <fweisbec@gmail.com>
+ *
+ */
+
+#include "util.h"
+#include <elf.h>
+#include <fcntl.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <linux/list.h>
+#include <linux/err.h>
+#include "thread.h"
+#include "session.h"
+
+
+#ifdef LIBUNWIND_SUPPORT
+
+#include <libunwind-ptrace.h>
+#include <libunwind.h>
+
+struct pt_regs {
+	unsigned long ebx;
+	unsigned long ecx;
+	unsigned long edx;
+	unsigned long esi;
+	unsigned long edi;
+	unsigned long ebp;
+	unsigned long eax;
+	unsigned long ds;
+	unsigned long es;
+	unsigned long fs;
+	unsigned long gs;
+	unsigned long orig_ax;
+	unsigned long eip;
+	unsigned long cs;
+	unsigned long flags;
+	unsigned long esp;
+	unsigned long ss;
+};
+
+#define DW_EH_PE_FORMAT_MASK	0x0f	/* format of the encoded value */
+#define DW_EH_PE_APPL_MASK	0x70	/* how the value is to be applied */
+/*
+ * Flag bit.  If set, the resulting pointer is the address of the word
+ * that contains the final address.
+ */
+#define DW_EH_PE_indirect	0x80
+
+/* Pointer-encoding formats: */
+#define DW_EH_PE_omit		0xff
+#define DW_EH_PE_ptr		0x00	/* pointer-sized unsigned value */
+#define DW_EH_PE_uleb128	0x01	/* unsigned LE base-128 value */
+#define DW_EH_PE_udata2		0x02	/* unsigned 16-bit value */
+#define DW_EH_PE_udata4		0x03	/* unsigned 32-bit value */
+#define DW_EH_PE_udata8		0x04	/* unsigned 64-bit value */
+#define DW_EH_PE_sleb128	0x09	/* signed LE base-128 value */
+#define DW_EH_PE_sdata2		0x0a	/* signed 16-bit value */
+#define DW_EH_PE_sdata4		0x0b	/* signed 32-bit value */
+#define DW_EH_PE_sdata8		0x0c	/* signed 64-bit value */
+
+/* Pointer-encoding application: */
+#define DW_EH_PE_absptr		0x00	/* absolute value */
+#define DW_EH_PE_pcrel		0x10	/* rel. to addr. of encoded value */
+#define DW_EH_PE_textrel	0x20	/* text-relative (GCC-specific???) */
+#define DW_EH_PE_datarel	0x30	/* data-relative */
+/*
+ * The following are not documented by LSB v1.3, yet they are used by
+ * GCC, presumably they aren't documented by LSB since they aren't
+ * used on Linux:
+ */
+#define DW_EH_PE_funcrel	0x40	/* start-of-procedure-relative */
+#define DW_EH_PE_aligned	0x50	/* aligned pointer */
+
+struct dwarf_instr_addr {
+	u64			start;
+	u64			end;
+	struct dso		*dso;
+	struct list_head	list;
+};
+
+struct unwind_info {
+	struct sample_data	*sample;
+	struct perf_session	*session;
+	struct thread		*thread;
+	struct list_head	dia_head;
+};
+
+static int
+resolve_section_name(int fd, Elf32_Ehdr *ehdr, int idx, char *buf, int size)
+{
+	Elf32_Shdr shdr;
+	int offset;
+	int old;
+	int i;
+	char c;
+
+	old = lseek(fd, 0, SEEK_CUR);
+	offset = ehdr->e_shoff + (ehdr->e_shstrndx * ehdr->e_shentsize);
+	lseek(fd, offset, SEEK_SET);
+	if (read(fd, &shdr, ehdr->e_shentsize) == -1)
+		return -errno;
+
+	offset = shdr.sh_offset + idx;
+	lseek(fd, offset, SEEK_SET);
+
+	for (i = 0; i < size - 1; i++) {
+		if (read(fd, &c, 1) == -1)
+			return -errno;
+		if (!c)
+			break;
+		buf[i] = c;
+	}
+
+	buf[i] = 0;
+	lseek(fd, old, SEEK_SET);
+
+	return 0;
+}
+
+
+static int eh_frame_section(int fd, Elf32_Ehdr *ehdr, Elf32_Shdr *shdr)
+{
+	int i, err;
+
+	lseek(fd, ehdr->e_shoff + ehdr->e_shentsize, SEEK_SET);
+
+	for (i = 1; i < ehdr->e_shnum; i++) {
+		char buf[128];
+
+		if (read(fd, shdr, ehdr->e_shentsize) == -1)
+			return -errno;
+
+		err = resolve_section_name(fd, ehdr, shdr->sh_name, buf, sizeof(buf));
+		if (err)
+			return err;
+
+		if (!strcmp(buf, ".eh_frame"))
+			return 0;
+	}
+
+	return -ENOENT;
+}
+
+static int parse_elf_headers(int fd, Elf32_Ehdr *ehdr)
+{
+	if (read(fd, ehdr, sizeof(*ehdr)) == -1)
+		return -errno;
+
+	if (ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
+		ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
+		ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
+		ehdr->e_ident[EI_MAG3] != ELFMAG3) {
+
+		return -EINVAL;
+	}
+
+	if (!ehdr->e_shoff)
+		return -ENOENT;
+
+	return 0;
+}
+
+static int dwarf_read_uleb128(int fd, u64 *val)
+{
+	u64 shift = 0;
+	unsigned char byte;
+
+	*val = 0;
+
+	do {
+		if (read(fd, &byte, sizeof(byte)) == -1)
+			return -errno;
+
+		*val |= (u64)(byte & 0x7f) << shift;
+		shift += 7;
+	} while (byte & 0x80);
+
+	return 0;
+}
+
+static s64 dwarf_read_sleb128(int fd, s64 *val)
+{
+	s64 shift = 0;
+	unsigned char byte;
+
+	*val = 0;
+
+	do {
+		if (read(fd, &byte, sizeof(byte)) == -1)
+			return -errno;
+		*val |= (s64)((u64)(byte & 0x7f) << shift);
+		shift += 7;
+	} while (byte & 0x80);
+
+	if (shift < 8 * sizeof(*val) && (byte & 0x40) != 0)
+		/* sign-extend negative value */
+		*val |= (s64)(~0ULL << shift);
+
+	return 0;
+}
+
+static int dwarf_read_encoded_pointer(int fd, unsigned char encoding,
+				      u64 drop __used, u64 *val)
+{
+	u64 base = lseek(fd, 0, SEEK_CUR);
+
+	if (encoding == DW_EH_PE_omit || encoding == DW_EH_PE_aligned) {
+		pr_err("Unsupported dwarf encoding\n");
+		return -ENOSYS;
+	}
+
+	switch (encoding & DW_EH_PE_FORMAT_MASK) {
+	case DW_EH_PE_ptr: {
+		unsigned long lval;
+
+		if (read(fd, &lval, sizeof(lval)) == -1)
+			return -errno;
+		*val = lval;
+		break;
+	}
+	case DW_EH_PE_sdata4: {
+		s32 s32val;
+		s64 s64val;
+
+		if (read(fd, &s32val, sizeof(s32val)) == -1)
+			return -errno;
+		s64val = s32val;
+		*val = *(u64 *)&s64val;
+		break;
+	}
+	default:
+		pr_err("Unsupported encoded pointer: %d\n", encoding & DW_EH_PE_FORMAT_MASK);
+		return -EINVAL;
+	}
+
+	switch (encoding & DW_EH_PE_APPL_MASK) {
+	case DW_EH_PE_absptr:
+		break;
+	case DW_EH_PE_pcrel:
+		*val += base;
+		break;
+	default:
+		pr_err("Unsupported DW_EH_PE_APPL_MASK: %d\n", encoding & DW_EH_PE_APPL_MASK);
+		return -EINVAL;
+	}
+
+	if (encoding & DW_EH_PE_indirect) {
+		int prev_offset = lseek(fd, 0, SEEK_CUR);
+		unsigned long lval;
+
+		lseek(fd, *val, SEEK_SET);
+		if (read(fd, &lval, sizeof(unsigned long)) == -1)
+			return -errno;
+		*val = lval;
+		lseek(fd, prev_offset, SEEK_SET);
+	}
+
+	return 0;
+}
+
+struct cie {
+	u32		length;
+	u64		ext_length;
+	union {
+			u32 id32;
+			u64 id64;
+	};
+	u8		version;
+	u64		code_align;
+	s64		data_align;
+	u8		ret_column;
+	u64		aug_length;
+	u8		lsda_encoding;
+	u8		fde_encoding; /* Should have defaults */
+	u8		handler_encoding;
+	u8		have_abi_marker;
+	u64		handler;
+	u64		instr;
+	u64		end;
+};
+
+static int parse_cie(int fd, struct cie *cie)
+{
+	char aug_str[10];
+	int size, end, base;
+	int err;
+	int i;
+
+	memset(cie, 0, sizeof(*cie));
+
+	if (read(fd, &cie->length, sizeof(cie->length)) == -1)
+		return -errno;
+
+	if (cie->length == 0xffffffff) {
+		if (read(fd, &cie->ext_length, sizeof(cie->ext_length)) == -1)
+			return -errno;
+
+		size = cie->ext_length;
+		base = lseek(fd, 0, SEEK_CUR);
+		if (read(fd, &cie->id64, sizeof(cie->id64)) == -1)
+			return -errno;
+
+		if (cie->id64)
+			return -EINVAL;
+	} else {
+		base = lseek(fd, 0, SEEK_CUR);
+		if (read(fd, &cie->id32, sizeof(cie->id32)) == -1)
+			return -errno;
+		size = cie->length;
+		if (cie->id32)
+			return -EINVAL;
+	}
+	end = base + size;
+
+	if (read(fd, &cie->version, sizeof(cie->version)) == -1)
+		return -errno;
+
+	/* Should be else in 64 bits? */
+	if (cie->version != 1)
+		return -EINVAL;
+
+	memset(aug_str, 0, sizeof(aug_str));
+	for (i = 0; i < (int)sizeof(aug_str); i++) {
+		char c;
+
+		if (read(fd, &c, 1) == -1)
+			return -errno;
+
+		aug_str[i] = c;
+		if (!c)
+			break;
+	}
+
+	if (!strcmp("eh", aug_str))
+		lseek(fd, 4, SEEK_CUR);
+
+	err = dwarf_read_uleb128(fd, &cie->code_align);
+	if (err)
+		return err;
+
+	err = dwarf_read_sleb128(fd, &cie->data_align);
+	if (err)
+		return err;
+
+
+	if (read(fd, &cie->ret_column, sizeof(cie->ret_column)) == -1)
+		return -errno;
+
+	if (aug_str[0] == 'z') {
+		err = dwarf_read_uleb128(fd, &cie->aug_length);
+		if (err)
+			return err;
+	}
+
+
+	for (i = 1; aug_str[i]; i++) {
+		switch (aug_str[i]) {
+		case 'L':
+			if (read(fd, &cie->lsda_encoding, sizeof(cie->lsda_encoding)) == -1)
+				return -errno;
+			break;
+		case 'R':
+			if (read(fd, &cie->fde_encoding, sizeof(cie->fde_encoding)) == -1)
+				return -errno;
+			break;
+		case 'P':
+			if (read(fd, &cie->handler_encoding, sizeof(cie->handler_encoding)) == -1)
+				return -errno;
+
+			err = dwarf_read_encoded_pointer(fd, cie->handler_encoding, 0,
+								&cie->handler);
+			if (err)
+				return err;
+			break;
+		case 'S':
+			if (read(fd, &cie->have_abi_marker, sizeof(cie->have_abi_marker)) == -1)
+				return -errno;
+			break;
+		default:
+			break;
+		}
+	}
+	cie->instr = lseek(fd, 0, SEEK_CUR);
+	cie->end = end;
+
+	return 0;
+}
+
+struct fde {
+	u32	length;
+	u64	ext_length;
+	union {
+		u32	cie_offset32;
+		u64	cie_offset64;
+	};
+	u64	pc_begin;
+	u64	pc_range;
+	u64	aug_length;
+	u64	aug_end;
+	u64	lsda;
+	u16	abi;
+	u16	tag;
+	u64	end;
+};
+
+static int parse_fde(int fd, struct fde *fde, struct cie *cie, int fde_end)
+{
+	int ip_range_encoding;
+	int err;
+
+	memset(fde, 0, sizeof(*fde));
+	ip_range_encoding = cie->fde_encoding & DW_EH_PE_FORMAT_MASK;
+
+	err = dwarf_read_encoded_pointer(fd, cie->fde_encoding, 0, &fde->pc_begin);
+	if (err)
+		return err;
+	err = dwarf_read_encoded_pointer(fd, ip_range_encoding, 0, &fde->pc_range);
+	if (err)
+		return err;
+	fde->pc_range += fde->pc_begin;
+
+	if (cie->aug_length) {
+		err = dwarf_read_uleb128(fd, &fde->aug_length);
+		if (err)
+			return err;
+		fde->aug_end = lseek(fd, 0, SEEK_CUR) + fde->aug_length;
+	}
+
+	err = dwarf_read_encoded_pointer(fd, cie->lsda_encoding, 0, &fde->lsda);
+	if (err)
+		return err;
+
+	if (cie->have_abi_marker) {
+		if (read(fd, &fde->abi, sizeof(fde->abi)) == -1)
+			return -EINVAL;
+		if (read(fd, &fde->tag, sizeof(fde->tag)) == -1)
+			return -EINVAL;
+	}
+
+	if (!cie->aug_length)
+		fde->aug_end = lseek(fd, 0, SEEK_CUR);
+	fde->end = fde_end;
+
+	return 0;
+}
+
+struct dwarf_cie_info {
+	unw_word_t cie_instr_start;	/* start addr. of CIE "initial_instructions" */
+	unw_word_t cie_instr_end;	/* end addr. of CIE "initial_instructions" */
+	unw_word_t fde_instr_start;	/* start addr. of FDE "instructions" */
+	unw_word_t fde_instr_end;	/* end addr. of FDE "instructions" */
+	unw_word_t code_align;		/* code-alignment factor */
+	unw_word_t data_align;		/* data-alignment factor */
+	unw_word_t ret_addr_column;	/* column of return-address register */
+	unw_word_t handler;		/* address of personality-routine */
+	uint16_t abi;
+	uint16_t tag;
+	uint8_t fde_encoding;
+	uint8_t lsda_encoding;
+	unsigned int sized_augmentation : 1;
+	unsigned int have_abi_marker : 1;
+};
+
+static int
+cfi_match_fill(unw_word_t addr, struct cie *cie, struct fde *fde,
+		unw_proc_info_t *pi, int need_unwind_info)
+{
+	struct dwarf_cie_info *dci;
+
+	if (addr < fde->pc_begin || addr >= fde->pc_range)
+		return -1;
+
+	pi->start_ip = fde->pc_begin;
+	pi->end_ip = fde->pc_range;
+	pi->lsda = fde->lsda;
+	pi->handler = cie->handler;
+	pi->format = UNW_INFO_FORMAT_TABLE;
+
+	if (!need_unwind_info)
+		return 0;
+
+	dci = calloc(1, sizeof(*dci));
+	if (!dci)
+		return -ENOMEM;
+
+	dci->cie_instr_start = cie->instr;
+	dci->cie_instr_end = cie->end;
+	dci->fde_instr_start = fde->aug_end;
+	dci->fde_instr_end = fde->end;
+	dci->code_align = cie->code_align;
+	dci->data_align = cie->data_align;
+	dci->ret_addr_column = cie->ret_column;
+	dci->handler = pi->handler;
+	dci->abi = fde->abi;
+	dci->tag = fde->tag;
+	dci->fde_encoding = cie->fde_encoding;
+	dci->lsda_encoding = cie->lsda_encoding;
+	dci->sized_augmentation = !!cie->aug_length;
+	dci->have_abi_marker = cie->have_abi_marker;
+	pi->unwind_info_size = sizeof(*dci);
+	pi->unwind_info = dci;
+
+	return 0;
+}
+
+static int get_cfi_info(int fd, unw_word_t addr, Elf32_Shdr *eh_shdr,
+			unw_proc_info_t *pi, int need_unwind_info)
+{
+	int start, end;
+
+	start = eh_shdr->sh_offset;
+	end = start + eh_shdr->sh_size;
+
+	lseek(fd, start, SEEK_SET);
+
+	/*
+	 * For now, do a slow linear search of the fde matching that address.
+	 * Support for binary search across .eh_frame_hdr will come after.
+	 */
+	for (;;) {
+		int base, val_offset;
+		struct cie cie;
+		struct fde fde;
+		u32 size32, val32;
+		s64 size, val;
+
+		base = lseek(fd, 0, SEEK_CUR);
+		if (base >= end)
+			break;
+
+		if (read(fd, &size32, sizeof(size32)) == -1)
+			return -EINVAL;
+
+		if (size32 == 0xffffffff) {
+			if (read(fd, &size, sizeof(size)) == -1)
+				return -EINVAL;
+			val_offset = lseek(fd, 0, SEEK_CUR);
+			if (read(fd, &val, sizeof(val)) == -1)
+				return -EINVAL;
+		} else {
+			val_offset = lseek(fd, 0, SEEK_CUR);
+			if (read(fd, &val32, sizeof(val32)) == -1)
+				return -EINVAL;
+			val = val32;
+			size = size32;
+		}
+
+		if (val) {
+			int offset, err;
+
+			offset = lseek(fd, 0, SEEK_CUR);
+			lseek(fd, val_offset - val, SEEK_SET);
+			if (parse_cie(fd, &cie)) {
+				pr_debug("Incorrect cie %llx %llx\n", val_offset - val, val);
+				break;
+			}
+
+			lseek(fd, offset, SEEK_SET);
+			if (parse_fde(fd, &fde, &cie, size + val_offset)) {
+				pr_debug("Incorrect fde\n");
+				break;
+			}
+
+			/*
+			 * -1 means no match: keep scanning. Anything else is final.
+			 */
+			err = cfi_match_fill(addr, &cie, &fde, pi, need_unwind_info);
+			if (err != -1)
+				return err;
+		}
+
+		lseek(fd, val_offset + size, SEEK_SET);
+	}
+
+	return -ENOENT;
+}
+
+static void find_address_location(unw_word_t ip, struct unwind_info *ui,
+				  struct addr_location *al)
+{
+	thread__find_addr_map(ui->thread, ui->session, PERF_RECORD_MISC_USER,
+			   MAP__FUNCTION, ui->thread->pid, ip, al);
+}
+
+static int track_dwarf_instr_addr(unw_proc_info_t *pi, struct unwind_info *ui,
+				  struct addr_location *al)
+{
+	struct dwarf_instr_addr *dia_cie, *dia_fde;
+	struct dwarf_cie_info *dci;
+
+	dia_cie = malloc(sizeof(*dia_cie));
+	if (!dia_cie)
+		return -ENOMEM;
+
+	dci = (struct dwarf_cie_info *)pi->unwind_info;
+
+	dia_cie->start = dci->cie_instr_start - 1;
+	dia_cie->end = dci->cie_instr_end;
+	dia_cie->dso = al->map->dso;
+
+	list_add_tail(&dia_cie->list, &ui->dia_head);
+
+	dia_fde = malloc(sizeof(*dia_fde));
+	if (!dia_fde) {
+		list_del(&dia_cie->list);
+		free(dia_cie);
+		return -ENOMEM;
+	}
+
+	dia_fde->start = dci->fde_instr_start - 1;
+	dia_fde->end = dci->fde_instr_end;
+	dia_fde->dso = al->map->dso;
+
+	list_add_tail(&dia_fde->list, &ui->dia_head);
+
+	return 0;
+}
+
+static int
+find_proc_info(unw_addr_space_t as __used, unw_word_t ip, unw_proc_info_t *pi,
+	      int need_unwind_info, void *arg)
+{
+	int err;
+	int fd;
+	unw_word_t addr;
+	char *path;
+	Elf32_Ehdr ehdr;
+	Elf32_Shdr eh_shdr;
+	struct addr_location al;
+	struct unwind_info *ui = arg;
+
+	find_address_location(ip, ui, &al);
+
+	if (!al.map || !al.map->dso)
+		return -EINVAL;
+
+	path = al.map->dso->long_name;
+	if (!path)
+		return -ENOENT;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		/* open() failed: there is no descriptor to close */
+		return -errno;
+	}
+
+	err = parse_elf_headers(fd, &ehdr);
+	if (err) {
+		close(fd);
+		return err;
+	}
+
+	err = eh_frame_section(fd, &ehdr, &eh_shdr);
+	if (err) {
+		close(fd);
+		return err;
+	}
+
+	if (ehdr.e_type == ET_DYN)
+		addr = al.map->map_ip(al.map, ip);
+	else
+		addr = ip;
+
+	err = get_cfi_info(fd, addr, &eh_shdr, pi, need_unwind_info);
+	if (err) {
+		close(fd);
+		return err;
+	}
+
+	if (need_unwind_info) {
+		err = track_dwarf_instr_addr(pi, ui, &al);
+		close(fd);
+		return err;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+static int access_fpreg(unw_addr_space_t __used as, unw_regnum_t __used num,
+			unw_fpreg_t __used *val, int __used __write,
+			void __used *arg)
+{
+	pr_warning("Unwind: fpreg not supported yet\n");
+
+	return -1;
+}
+
+static int get_dyn_info_list_addr(unw_addr_space_t __used as,
+				  unw_word_t __used *dil_addr,
+				  void __used *arg)
+{
+	return -UNW_ENOINFO;
+}
+
+static int resume(unw_addr_space_t __used as, unw_cursor_t __used *cu,
+		  void __used *arg)
+{
+	pr_warning("Unwind: resume\n");
+
+	return 0;
+}
+
+static int
+get_proc_name(unw_addr_space_t __used as, unw_word_t __used addr,
+		char __used *bufp, size_t __used buf_len,
+		unw_word_t __used *offp, void __used *arg)
+{
+	*offp = 0;
+
+	return 0;
+}
+
+static int access_dso_mem(struct unwind_info *ui, unw_word_t addr,
+			  unw_word_t *valp)
+{
+	struct thread *thread = ui->thread;
+	struct perf_session *session = ui->session;
+	struct addr_location al;
+	int fd;
+	u64 offset;
+
+	thread__find_addr_map(thread, session, PERF_RECORD_MISC_USER,
+			   MAP__FUNCTION, thread->pid, addr, &al);
+	if (!al.map) {
+		pr_debug("unwind: no map found for %lx\n", (unsigned long)addr);
+		return -1;
+	}
+
+	offset = al.map->map_ip(al.map, addr);
+	fd = open(al.map->dso->long_name, O_RDONLY);
+	if (fd < 0) {
+		const char *name;
+
+		name = al.map ? al.map->dso->long_name : "unknown";
+		pr_debug("unwind: Can't open dso %s\n", name);
+
+		return -1;
+	}
+
+	if (lseek(fd, offset, SEEK_SET) == -1) {
+		close(fd);
+		pr_err("unwind: Can't seek\n");
+		return -1;
+	}
+	if (read(fd, valp, sizeof(*valp)) == -1) {
+		close(fd);
+		return -errno;
+	}
+	close(fd);
+
+	pr_debug("access mem offset: %llx va: %lx val: %lx\n",
+			offset, (unsigned long)addr, (unsigned long)*valp);
+
+	return 0;
+}
+
+static int access_dwarf_instr(struct unwind_info *ui, unw_word_t addr,
+			  unw_word_t *valp)
+{
+	struct dwarf_instr_addr *dia;
+	int found = 0;
+	int fd;
+
+	/*
+	 * This is quite crappy. There may be conflicts between dso addresses
+	 * here. Probably we only need to keep track of the last dso here.
+	 */
+	list_for_each_entry(dia, &ui->dia_head, list) {
+		if (addr >= dia->start && addr < dia->end) {
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return -ENOENT;
+
+	fd = open(dia->dso->long_name, O_RDONLY);
+	if (fd < 0)
+		return -EINVAL;
+
+	/* 'found' doubles as the return value so that fd is always closed */
+	lseek(fd, addr, SEEK_SET);
+	found = read(fd, valp, sizeof(*valp)) == -1 ? -errno : 0;
+
+	close(fd);
+
+	return found;
+}
+
+static int access_mem(unw_addr_space_t __used as,
+                      unw_word_t addr, unw_word_t *valp,
+                      int __write, void *arg)
+{
+	struct unwind_info *ui = arg;
+	struct user_stack_dump *stack = &ui->sample->stack;
+	unsigned long start, end;
+	unw_word_t *val;
+	int offset;
+	int ret;
+
+	/* Don't support write, probably not needed */
+	if (__write || !stack || !ui->sample->regs) {
+		*valp = 0;
+		return 0;
+	}
+
+	start = ui->sample->regs->esp;
+	end = start + stack->size;
+
+	ret = access_dwarf_instr(ui, addr, valp);
+	if (!ret)
+		return 0;
+
+	if (addr < start || addr + sizeof(unw_word_t) >= end) {
+		ret = access_dso_mem(ui, addr, valp);
+		if (ret) {
+			pr_debug("access_mem %p not inside range %p-%p\n",
+				(void *)addr, (void *)start, (void *)end);
+			*valp = 0;
+			return ret;
+		}
+		return 0;
+	}
+
+	offset = addr - start;
+	val = (void *)&stack->data[offset];
+	*valp = *val;
+
+	pr_debug("access_mem %p %lx\n", (void *)addr, (unsigned long)*valp);
+
+	return 0;
+}
+
+static int access_reg(unw_addr_space_t __used as,
+                      unw_regnum_t regnum, unw_word_t *valp,
+                      int __write, void *arg)
+{
+	struct unwind_info *ui = arg;
+
+	/* Don't support write, I suspect we don't need it */
+	if (__write) {
+		pr_err("access_reg w %d\n", regnum);
+		return 0;
+	}
+
+	if (!ui->sample->regs) {
+		*valp = 0;
+		return 0;
+	}
+
+	switch (regnum) {
+	case UNW_X86_EAX:
+		*valp = ui->sample->regs->eax;
+		break;
+	case UNW_X86_EDX:
+		*valp = ui->sample->regs->edx;
+		break;
+	case UNW_X86_ECX:
+		*valp = ui->sample->regs->ecx;
+		break;
+	case UNW_X86_EBX:
+		*valp = ui->sample->regs->ebx;
+		break;
+	case UNW_X86_ESI:
+		*valp = ui->sample->regs->esi;
+		break;
+	case UNW_X86_EDI:
+		*valp = ui->sample->regs->edi;
+		break;
+	case UNW_X86_EBP:
+		*valp = ui->sample->regs->ebp;
+		break;
+	case UNW_X86_ESP:
+		*valp = ui->sample->regs->esp;
+		break;
+	case UNW_X86_EIP:
+		*valp = ui->sample->regs->eip;
+		break;
+	default:
+		pr_err("can't read reg %d\n", regnum);
+		return -1;
+	}
+
+	pr_debug("reg: %d val: %lx\n", regnum, (unsigned long)*valp);
+
+	return 0;
+}
+
+static void put_unwind_info(unw_addr_space_t __used as, unw_proc_info_t *pi,
+			    void *arg)
+{
+	struct unwind_info *ui = arg;
+	struct dwarf_instr_addr *dia, *tmp;
+
+	if (pi->unwind_info) {
+		free(pi->unwind_info);
+		pi->unwind_info = NULL;
+	}
+
+	list_for_each_entry_safe(dia, tmp, &ui->dia_head, list) {
+		list_del(&dia->list);
+		free(dia);
+	}
+}
+
+static unw_accessors_t accessors = {
+	.find_proc_info		= find_proc_info,
+	.put_unwind_info	= put_unwind_info,
+	.get_dyn_info_list_addr	= get_dyn_info_list_addr,
+	.access_mem		= access_mem,
+	.access_reg		= access_reg,
+	.access_fpreg		= access_fpreg,
+	.resume			= resume,
+	.get_proc_name		= get_proc_name,
+};
+
+static int
+append_dwarf_chain(struct dwarf_callchain *callchain,
+		   struct addr_location *al)
+{
+	struct dwarf_callchain_entry *entry;
+
+	entry = calloc(1, sizeof(*entry));
+	if (!entry)
+		return -ENOMEM;
+
+	entry->ip = al->addr;
+	entry->ms.map = al->map;
+	entry->ms.sym = al->sym;
+
+	list_add_tail(&entry->list, &callchain->chain_head);
+
+	callchain->nb++;
+
+	return 0;
+}
+
+
+static void callchain_unwind_release(struct dwarf_callchain *callchain)
+{
+	struct dwarf_callchain_entry *entry, *tmp;
+
+	list_for_each_entry_safe(entry, tmp, &callchain->chain_head, list) {
+		list_del(&entry->list);
+		free(entry);
+	}
+
+	free(callchain);
+}
+
+
+struct dwarf_callchain *callchain_unwind(struct perf_session *session,
+					 struct thread *thread,
+					 struct sample_data *data)
+{
+	struct dwarf_callchain *callchain;
+	unw_addr_space_t addr_space;
+	struct addr_location al;
+	struct unwind_info ui;
+	unw_cursor_t c;
+	int ret;
+
+	if (!data->regs)
+		return NULL;
+
+	callchain = malloc(sizeof(*callchain));
+	if (!callchain)
+		return ERR_PTR(-ENOMEM);
+
+	callchain->nb = 0;
+	INIT_LIST_HEAD(&callchain->chain_head);
+
+	thread__find_addr_location(thread, session,
+				   PERF_RECORD_MISC_USER,
+				   MAP__FUNCTION, thread->pid,
+				   data->regs->eip, &al, NULL);
+
+	ret = append_dwarf_chain(callchain, &al);
+	if (ret)
+		goto fail;
+
+	memset(&ui, 0, sizeof(ui));
+	addr_space = unw_create_addr_space(&accessors, 0);
+	if (!addr_space) {
+		pr_err("Can't create unwind address space\n");
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	pr_debug("\n----- %s -----\n", al.map ? al.map->dso->long_name : "Unknown");
+	ui.sample = data;
+	ui.thread = thread;
+	ui.session = session;
+	INIT_LIST_HEAD(&ui.dia_head);
+
+	ret = unw_init_remote(&c, addr_space, &ui);
+	switch (ret) {
+	case UNW_EINVAL:
+		pr_err("Unwind: only supports local\n");
+		break;
+	case UNW_EUNSPEC:
+		pr_err("Unwind: unspecified error\n");
+		break;
+	case UNW_EBADREG:
+		pr_err("Unwind: register unavailable\n");
+		break;
+	default:
+		break;
+	}
+
+	if (ret)
+		goto fail_addrspace;
+
+	if (al.sym)
+		pr_debug("%s:ip = %llx\n", al.sym->name, al.addr);
+	else
+		pr_debug("ip = %llx\n", al.addr);
+
+	while (unw_step(&c) > 0) {
+		unw_word_t ip;
+		char name[250];
+		unsigned int offset;
+
+		unw_get_reg(&c, UNW_REG_IP, &ip);
+
+		thread__find_addr_location(thread, session,
+				   PERF_RECORD_MISC_USER,
+				   MAP__FUNCTION, thread->pid,
+				   ip, &al, NULL);
+
+		unw_get_proc_name(&c, name, sizeof(name), &offset);
+		if (al.sym)
+			pr_debug("%s:ip = %lx\n", al.sym->name, (unsigned long)ip);
+		else
+			pr_debug("ip = %lx (%llx)\n", (unsigned long)ip,
+					al.map ? al.map->map_ip(al.map, ip) : (u64)ip);
+
+		ret = append_dwarf_chain(callchain, &al);
+		if (ret)
+			goto fail_addrspace;
+	}
+
+	unw_destroy_addr_space(addr_space);
+
+	return callchain;
+
+fail_addrspace:
+	unw_destroy_addr_space(addr_space);
+fail:
+	callchain_unwind_release(callchain);
+	return ERR_PTR(ret);
+}
+
+#else /* LIBUNWIND_SUPPORT */
+
+struct dwarf_callchain *callchain_unwind(struct perf_session *session __used,
+					 struct thread *thread __used,
+					 struct sample_data *data __used)
+{
+	return NULL;
+}
+
+#endif
-- 
1.6.2.3



* Re: [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available
  2010-10-13  5:06 ` [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available Frederic Weisbecker
@ 2010-10-13  7:15   ` Peter Zijlstra
  2010-10-13 14:47     ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-13  7:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> In order to support user stack dump safely in perf samples from
> generic code, export copy_from_user_nmi() from x86 and make it
> generally available. For most archs it will map to
> copy_from_user_inatomic, but for x86 we need to take care of
> not faulting from NMIs.
> 
> Since perf is the first user for now, leave the overridden x86
> implementation in the perf source file.

It might make sense to call it copy_from_user_gup() because that's
basically what it does; it doesn't rely on NMI context anymore, it's just
NMI-safe.

It's a best-effort software page table walk, and with the stacked
kmap_atomic bits Andrew took it should work from any context.




* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-13  5:06 ` [RFC PATCH 2/9] perf: Add ability to dump user regs Frederic Weisbecker
@ 2010-10-13  7:20   ` Peter Zijlstra
  2010-10-13 14:56     ` Frederic Weisbecker
                       ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-13  7:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
> user space context as it was before the user entered the kernel for
> whatever reason.
> 
> This is going to be useful to bring Dwarf CFI based stack unwinding
> on top of samples.

This doesn't address any of the issues that were raised previously.

There's a reason we don't have PERF_SAMPLE_*REGS like things.

See: http://lkml.org/lkml/2010/3/3/308


* Re: [RFC PATCH 3/9] perf: Add ability to dump part of the user stack
  2010-10-13  5:06 ` [RFC PATCH 3/9] perf: Add ability to dump part of the user stack Frederic Weisbecker
@ 2010-10-13  7:22   ` Peter Zijlstra
  2010-10-13 15:01     ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-13  7:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> +++ b/include/linux/perf_event.h
> @@ -227,6 +227,12 @@ struct perf_event_attr {
>         __u32                   bp_type;
>         __u64                   bp_addr;
>         __u64                   bp_len;
> +
> +       __u64                   __reserved_2;
> +       __u64                   __reserved_3;
> +
> +       __u32                   ustack_dump_size;
> +       __u32                   __reserved_4;
>  }; 

Why add those two __u64 fields?


* Re: [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs
  2010-10-13  5:06 ` [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs Frederic Weisbecker
@ 2010-10-13  7:23   ` Peter Zijlstra
  2010-10-13 15:02     ` Frederic Weisbecker
  2010-10-16  0:19     ` Frederic Weisbecker
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-13  7:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> User and kernel stack might be selected for other uses than callchain
> in the future, this probably shouldn't mess with the regular callchain
> code. Instead we should probably have an exclude_callchain_user
> attribute, that could be also useful to filter out user callchains
> when people don't want them. 

Probably ;-)


* Re: [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available
  2010-10-13  7:15   ` Peter Zijlstra
@ 2010-10-13 14:47     ` Frederic Weisbecker
  2010-10-13 20:43       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13 14:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, Oct 13, 2010 at 09:15:56AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> > In order to support user stack dump safely in perf samples from
> > generic code, export copy_from_user_nmi() from x86 and make it
> > generally available. For most archs it will map to
> > copy_from_user_inatomic, but for x86 we need to take care of
> > not faulting from NMIs.
> > 
> > Since perf is the first user for now, leave the overridden x86
> > implementation in the perf source file.
> 
> It might make sense to call it copy_from_user_gup() because that's
> basically what it does; it doesn't rely on NMI context anymore, it's just
> NMI-safe.
> 
> It's a best-effort software page table walk, and with the stacked
> kmap_atomic bits Andrew took it should work from any context.


Ok, I'll do the rename. Does that work on any arch?

Thanks.



* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-13  7:20   ` Peter Zijlstra
@ 2010-10-13 14:56     ` Frederic Weisbecker
  2010-10-13 14:58     ` Frederic Weisbecker
  2010-10-14 11:06     ` Stephane Eranian
  2 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Wed, Oct 13, 2010 at 09:20:53AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> > Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
> > user space context as it was before the user entered the kernel for
> > whatever reason.
> > 
> > This is going to be useful to bring Dwarf CFI based stack unwinding
> > on top of samples.
> 
> This doesn't address any of the issues that were raised previously.
> 
> There's a reason we don't have PERF_SAMPLE_*REGS like things.
> 
> See: http://lkml.org/lkml/2010/3/3/308


Right, we indeed have no way currently to know where these regs came
from. So we need to dump the kernel arch information.

Probably we should do that from the kernel, so that compat archs
really can't get it wrong.

But I suspect we should do it once and not on every sample, so it
must not be a sample type.

A silly idea would be to implement a minimal counter whose count
value gives a code that identifies the arch. No sample at all for this
event, just a count.

Hmm?



* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-13  7:20   ` Peter Zijlstra
  2010-10-13 14:56     ` Frederic Weisbecker
@ 2010-10-13 14:58     ` Frederic Weisbecker
  2010-10-14 11:06     ` Stephane Eranian
  2 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Wed, Oct 13, 2010 at 09:20:53AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> > Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
> > user space context as it was before the user entered the kernel for
> > whatever reason.
> > 
> > This is going to be useful to bring Dwarf CFI based stack unwinding
> > on top of samples.
> 
> This doesn't address any of the issues that were raised previously.
> 
> There's a reason we don't have PERF_SAMPLE_*REGS like things.
> 
> See: http://lkml.org/lkml/2010/3/3/308


But the best would be to get it from sysfs I think.



* Re: [RFC PATCH 3/9] perf: Add ability to dump part of the user stack
  2010-10-13  7:22   ` Peter Zijlstra
@ 2010-10-13 15:01     ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, Oct 13, 2010 at 09:22:17AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> > +++ b/include/linux/perf_event.h
> > @@ -227,6 +227,12 @@ struct perf_event_attr {
> >         __u32                   bp_type;
> >         __u64                   bp_addr;
> >         __u64                   bp_len;
> > +
> > +       __u64                   __reserved_2;
> > +       __u64                   __reserved_3;
> > +
> > +       __u32                   ustack_dump_size;
> > +       __u32                   __reserved_4;
> >  }; 
> 
> Why add those two __u64 fields?


Because I suspect one day the breakpoint fields will need to be
extended. Some archs have quite fancy features for breakpoints
that include value comparison for example. So a field could be
used for the comparison operator and the other for the value
to be compared against.



* Re: [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs
  2010-10-13  7:23   ` Peter Zijlstra
@ 2010-10-13 15:02     ` Frederic Weisbecker
  2010-10-16  0:19     ` Frederic Weisbecker
  1 sibling, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-13 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, Oct 13, 2010 at 09:23:06AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> > User and kernel stack might be selected for other uses than callchain
> > in the future, this probably shouldn't mess with the regular callchain
> > code. Instead we should probably have an exclude_callchain_user
> > attribute, that could be also useful to filter out user callchains
> > when people don't want them. 
> 
> Probably ;-)


Will change that then :)

Thanks.



* Re: [RFC] perf: Dwarf cfi based user callchains
  2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
                   ` (8 preceding siblings ...)
  2010-10-13  5:07 ` [RFC PATCH 9/9] perf: Support for dwarf cfi unwinding on post processing Frederic Weisbecker
@ 2010-10-13 15:13 ` Frank Ch. Eigler
  2010-10-20 15:35   ` Frederic Weisbecker
  9 siblings, 1 reply; 33+ messages in thread
From: Frank Ch. Eigler @ 2010-10-13 15:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter


Frederic Weisbecker <fweisbec@gmail.com> writes:

> [...]
> This brings dwarf cfi based callchain for userspace apps that don't have
> frame pointers.

Interesting approach!

> [...]
> - it's slow. A first improvement to make it faster is to support binary
>   search from .eh_frame_hdr. 

In systemtap land, we did find a dramatic improvement from this too.

Have you measured the cost of transcribing potentially large chunks
of the user stacks?  We did not seriously evaluate this path, since we
encounter megabyte+ stacks in larger userspace apps, and copying THAT
out seemed absurd.


- FChE


* Re: [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available
  2010-10-13 14:47     ` Frederic Weisbecker
@ 2010-10-13 20:43       ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-13 20:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, 2010-10-13 at 16:47 +0200, Frederic Weisbecker wrote:
> Ok, I'll do the rename. Does that work on any arch?

On any arch that implements __get_user_pages_fast() which is only x86
atm, but any arch that supports get_user_pages_fast() could do it.




* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-13  7:20   ` Peter Zijlstra
  2010-10-13 14:56     ` Frederic Weisbecker
  2010-10-13 14:58     ` Frederic Weisbecker
@ 2010-10-14 11:06     ` Stephane Eranian
  2010-10-14 11:20       ` Frederic Weisbecker
  2 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2010-10-14 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

Hi,



On Wed, Oct 13, 2010 at 9:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
>> Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
>> user space context as it was before the user entered the kernel for
>> whatever reason.
>>
>> This is going to be useful to bring Dwarf CFI based stack unwinding
>> on top of samples.
>
> This doesn't address any of the issues that were raised previously.
>
> There's a reason we don't have PERF_SAMPLE_*REGS like things.
>
We definitely need to find a solution to this problem. It is important
to export this kind of information to users when using PEBS, for instance.

What is exported depends on what is monitored and not just the ABI
of the kernel. On a 64-bit kernel, you may capture samples from
i386 or x86_64. Somehow the record needs to be self describing.

What about something like:
struct  {
      int type; /* 32-bit, 64-bit */
      int nr;    /* number of regs */
      struct {
          int reg_name; /* taken from an enum with all possible regs */
          u64 reg_value;
      } [0]
};
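
A minimal, compilable sketch of the self-describing record proposed above, for illustration only. The enum values, the `sketch_` names, the fixed-width field types and the lookup helper are all assumptions; the thread only fixes the general shape (a type, a count, then name/value pairs).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register ids; the thread does not define this enum. */
enum sketch_reg {
	SKETCH_REG_IP,
	SKETCH_REG_SP,
	SKETCH_REG_BP,
	SKETCH_REG_AX,
};

/* Hypothetical encoding of the "type" field (32-bit vs 64-bit task). */
enum sketch_abi {
	SKETCH_ABI_32 = 1,
	SKETCH_ABI_64 = 2,
};

struct sketch_reg_entry {
	int32_t  reg_name;	/* one of enum sketch_reg */
	uint64_t reg_value;
};

struct sketch_regs_record {
	int32_t type;		/* one of enum sketch_abi */
	int32_t nr;		/* number of entries that follow */
	struct sketch_reg_entry regs[];	/* C99 flexible array member */
};

/* Find a register by name: returns 0 and fills *val, or -1 if absent. */
static int sketch_find_reg(const struct sketch_regs_record *rec,
			   int reg_name, uint64_t *val)
{
	for (int32_t i = 0; i < rec->nr; i++) {
		if (rec->regs[i].reg_name == reg_name) {
			*val = rec->regs[i].reg_value;
			return 0;
		}
	}
	return -1;
}
```

Because every entry carries its own name, a reader needs no outside knowledge of which registers were captured; the cost is the extra name field in every sample.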


* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-14 11:06     ` Stephane Eranian
@ 2010-10-14 11:20       ` Frederic Weisbecker
  2010-10-15  8:39         ` Stephane Eranian
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-14 11:20 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Thu, Oct 14, 2010 at 01:06:30PM +0200, Stephane Eranian wrote:
> Hi,
> 
> 
> 
> On Wed, Oct 13, 2010 at 9:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> >> Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
> >> user space context as it was before the user entered the kernel for
> >> whatever reason.
> >>
> >> This is going to be useful to bring Dwarf CFI based stack unwinding
> >> on top of samples.
> >
> > This doesn't address any of the issues that were raised previously.
> >
> > There's a reason we don't have PERF_SAMPLE_*REGS like things.
> >
> We definitely need to find a solution to this problem. It is important
> to export this kind of information to users when using PEBS, for instance.



Would you need to export only a part of the regs for cases like PEBS?



> 
> What is exported depends on what is monitored and not just the ABI
> of the kernel. On a 64-bit kernel, you may capture samples from
> i386 or x86_64. Somehow the record needs to be self describing.
> 
> What about something like:
> struct  {
>       int type; /* 32-bit, 64-bit */
>       int nr;    /* number of regs */
>       struct {
>           int reg_name; /* taken from an enum with all possible regs */
>           u64 reg_value;
>       } [0]
> };



Yeah but in this case we can probably avoid embedding all the regnames
in every dump. They can be retrieved from what we asked for in the attrs,
which could be a u64 bitmap that tells which regs we want? (That's only
if we want per-reg granularity.)
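
A rough sketch of how a reader could decode such a dump, assuming the sample stores only the values of the requested registers, packed in ascending bit order of the u64 bitmap. That packing order and the `regmask_` names are assumptions made for illustration; nothing in the thread settles the layout.

```c
#include <stdint.h>

/*
 * Position of register bit `reg` inside a dump that holds one value per
 * bit set in `mask`, lowest bit first. Returns -1 if `reg` was not
 * requested.
 */
static int regmask_slot(uint64_t mask, unsigned int reg)
{
	int slot = 0;

	if (reg >= 64 || !(mask & (1ULL << reg)))
		return -1;

	/* Count the requested registers that sort before `reg`. */
	for (unsigned int i = 0; i < reg; i++)
		if (mask & (1ULL << i))
			slot++;

	return slot;
}

/* Read one register out of the packed dump: 0 on success, -1 if absent. */
static int regmask_read(uint64_t mask, const uint64_t *dump,
			unsigned int reg, uint64_t *val)
{
	int slot = regmask_slot(mask, reg);

	if (slot < 0)
		return -1;

	*val = dump[slot];
	return 0;
}
```

With this layout the names cost nothing per sample: the bitmap is stored once with the event attributes and each dump is just an array of values.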



* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-14 11:20       ` Frederic Weisbecker
@ 2010-10-15  8:39         ` Stephane Eranian
  2010-10-15 22:58           ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2010-10-15  8:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

Frederic,

On Thu, Oct 14, 2010 at 1:20 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Thu, Oct 14, 2010 at 01:06:30PM +0200, Stephane Eranian wrote:
>> Hi,
>>
>>
>>
>> On Wed, Oct 13, 2010 at 9:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
>> >> Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
>> >> user space context as it was before the user entered the kernel for
>> >> whatever reason.
>> >>
>> >> This is going to be useful to bring Dwarf CFI based stack unwinding
>> >> on top of samples.
>> >
>> > This doesn't address any of the issues that were raised previously.
>> >
>> > There's a reason we don't have PERF_SAMPLE_*REGS like things.
>> >
>> We definitely need to find a solution to this problem. It is important
>> to export this kind of information to users when using PEBS, for instance.
>
>
>
> Would you need to export only a part of the regs for cases like PEBS?
>
Yes, PEBS does not capture the entire state.

Here is what you get on Intel Core:
        u64 flags, ip;
        u64 ax, bx, cx, dx;
        u64 si, di, bp, sp;
        u64 r8,  r9,  r10, r11;
        u64 r12, r13, r14, r15;

In 32-bit, the rXX are zero and would not need to be
exposed.

>
>
>>
>> What is exported depends on what is monitored and not just the ABI
>> of the kernel. On a 64-bit kernel, you may capture samples from
>> i386 or x86_64. Somehow the record needs to be self describing.
>>
>> What about something like:
>> struct  {
>>       int type; /* 32-bit, 64-bit */
>>       int nr;    /* number of regs */
>>       struct {
>>           int reg_name; /* taken from an enum with all possible regs */
>>           u64 reg_value;
>>       } [0]
>> };
>
>
>
> Yeah but in this case we can probably avoid embedding all the regnames
> in every dump. They can be retrieved from what we asked for in the attrs,
> which could be a u64 bitmap that tells which regs we want? (That's only
> if we want per-reg granularity.)
>
Yes, that's another possibility and it may be better because if you want
only one register, e.g., EAX, then you can ask for it. That would limit the
memory consumption in the sampling buffer.

You need a bitmask to name the registers you want. If the register is not
accessible in the sampling mode you're requesting, then you should get
an error.

I would not necessarily use the attr.sample_type because we might run out
of bits on architectures with lots of registers. Another reason is that
the register names are arch-specific, unlike what's in attr.sample_type.
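
The "you should get an error" rule could be checked along these lines. This is only a sketch: the bit positions, the `SK_` names and the choice of `-EINVAL` are assumptions; the PEBS set is simply the register list given earlier in this mail, with `cs` added as an example of a register outside it.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit positions; the thread does not define a register enum. */
enum sk_reg {
	SK_REG_FLAGS, SK_REG_IP,
	SK_REG_AX, SK_REG_BX, SK_REG_CX, SK_REG_DX,
	SK_REG_SI, SK_REG_DI, SK_REG_BP, SK_REG_SP,
	SK_REG_R8, SK_REG_R9, SK_REG_R10, SK_REG_R11,
	SK_REG_R12, SK_REG_R13, SK_REG_R14, SK_REG_R15,
	SK_REG_CS,	/* first register that is not in the PEBS list */
};

#define SK_REG_BIT(r)	(1ULL << (r))

/* All bits below SK_REG_CS, i.e. flags through r15. */
#define SK_PEBS_REGS	(SK_REG_BIT(SK_REG_CS) - 1)

/*
 * Reject a request that names any register the chosen sampling mode
 * cannot provide. Returns 0 if every requested bit is available.
 */
static int sk_validate_regs(uint64_t requested, uint64_t available)
{
	if (requested & ~available)
		return -EINVAL;

	return 0;
}
```

Doing this check once, when the event is set up, means a bad request fails immediately instead of producing samples with missing registers.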


* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-15  8:39         ` Stephane Eranian
@ 2010-10-15 22:58           ` Frederic Weisbecker
  2010-10-17 10:07             ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-15 22:58 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Fri, Oct 15, 2010 at 10:39:43AM +0200, Stephane Eranian wrote:
> Frederic,
> 
> On Thu, Oct 14, 2010 at 1:20 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Thu, Oct 14, 2010 at 01:06:30PM +0200, Stephane Eranian wrote:
> >> Hi,
> >>
> >>
> >>
> >> On Wed, Oct 13, 2010 at 9:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> >> >> Add new PERF_SAMPLE_UREGS to perf sample type. This will dump the
> >> >> user space context as it was before the user entered the kernel for
> >> >> whatever reason.
> >> >>
> >> >> This is going to be useful to bring Dwarf CFI based stack unwinding
> >> >> on top of samples.
> >> >
> >> > This doesn't address any of the issues that were raised previously.
> >> >
> >> > There's a reason we don't have PERF_SAMPLE_*REGS like things.
> >> >
> >> We definitely need to find a solution to this problem. It is important
> >> to export this kind of information to users when using PEBS, for instance.
> >
> >
> >
> > Would you need to export only a part of the regs for cases like PEBS?
> >
> Yes, PEBS does not capture the entire state.
> 
> Here is what you get on Intel Core:
>         u64 flags, ip;
>         u64 ax, bx, cx, dx;
>         u64 si, di, bp, sp;
>         u64 r8,  r9,  r10, r11;
>         u64 r12, r13, r14, r15;



Ok, that seems to cover most of the state. I guess few people care
about cs, ds, es, fs, gs, most of the time.



> 
> In 32-bit, the rXX are zero and would not need to be
> exposed.
> 
> >
> >
> >>
> >> What is exported depends on what is monitored and not just the ABI
> >> of the kernel. On a 64-bit kernel, you may capture samples from
> >> i386 or x86_64. Somehow the record needs to be self describing.
> >>
> >> What about something like:
> >> struct  {
> >>       int type; /* 32-bit, 64-bit */
> >>       int nr;    /* number of regs */
> >>       struct {
> >>           int reg_name; /* taken from an enum with all possible regs */
> >>           u64 reg_value;
> >>       } [0]
> >> };
> >
> >
> >
> > Yeah but in this case we can probably avoid embedding all the regnames
> > in every dump. They can be retrieved from what we asked for in the attrs,
> > which could be a u64 bitmap that tells which regs we want? (That's only
> > if we want per-reg granularity.)
> >
> Yes, that's another possibility and it may be better because if you want
> only one register, e.g., EAX, then you can ask for it. That would limit the
> memory consumption in the sampling buffer.


But then I wonder who needs EAX only? If you need eax, then you almost
certainly need most of the other general registers.

Maybe we can group them by "family"? Like one group for general registers
(r0 - r15), one for segment registers (cs - gs) and one for eflags.
We can perhaps isolate single groups for stack pointer, frame pointer and
instruction pointer.

But I can't imagine every possible use of the regs dump, so maybe we should
just have one flag per register to allow maximum flexibility
and not bother further.

Hm?
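
The two options need not exclude each other: a family flag can be plain shorthand that expands to a per-register bitmask. A small sketch, assuming a 32-bit x86 register set; every name and bit position below is hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-register bit positions. */
enum grp_reg {
	GRP_REG_AX, GRP_REG_BX, GRP_REG_CX, GRP_REG_DX,
	GRP_REG_SI, GRP_REG_DI,
	GRP_REG_BP, GRP_REG_SP, GRP_REG_IP, GRP_REG_FLAGS,
	GRP_REG_CS, GRP_REG_DS, GRP_REG_ES, GRP_REG_FS, GRP_REG_GS,
};

#define GRP_BIT(r)	(1ULL << (r))

/* Hypothetical family flags, one per group mentioned above. */
enum grp_family {
	GRP_GENERAL = 1u << 0,
	GRP_SEGMENT = 1u << 1,
	GRP_FLAGS   = 1u << 2,
	GRP_SP      = 1u << 3,
	GRP_FP      = 1u << 4,
	GRP_IP      = 1u << 5,
};

/* Expand a set of family flags into the per-register bitmask. */
static uint64_t grp_expand(unsigned int families)
{
	uint64_t mask = 0;

	if (families & GRP_GENERAL)
		mask |= GRP_BIT(GRP_REG_AX) | GRP_BIT(GRP_REG_BX) |
			GRP_BIT(GRP_REG_CX) | GRP_BIT(GRP_REG_DX) |
			GRP_BIT(GRP_REG_SI) | GRP_BIT(GRP_REG_DI);
	if (families & GRP_SEGMENT)
		mask |= GRP_BIT(GRP_REG_CS) | GRP_BIT(GRP_REG_DS) |
			GRP_BIT(GRP_REG_ES) | GRP_BIT(GRP_REG_FS) |
			GRP_BIT(GRP_REG_GS);
	if (families & GRP_FLAGS)
		mask |= GRP_BIT(GRP_REG_FLAGS);
	if (families & GRP_SP)
		mask |= GRP_BIT(GRP_REG_SP);
	if (families & GRP_FP)
		mask |= GRP_BIT(GRP_REG_BP);
	if (families & GRP_IP)
		mask |= GRP_BIT(GRP_REG_IP);

	return mask;
}
```

A dwarf unwinder would then ask for something like `GRP_SP | GRP_FP | GRP_IP`, while anyone needing finer control could still build the per-register mask by hand.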


> You need a bitmask to name the registers you want. If the register is not
> accessible in the sampling mode you're requesting, then you should get
> an error.


Yep.


 
> I would not necessarily use the attr.sample_type because we might run out
> of bits on architectures with lots of registers. Another reason is that
> the register names are arch-specific, unlike what's in attr.sample_type.


Yeah we really need a new field for that, a u64 so that we have enough bits
for every possible register set (at least I hope...).



* Re: [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs
  2010-10-13  7:23   ` Peter Zijlstra
  2010-10-13 15:02     ` Frederic Weisbecker
@ 2010-10-16  0:19     ` Frederic Weisbecker
  1 sibling, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-16  0:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, Oct 13, 2010 at 09:23:06AM +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-13 at 07:06 +0200, Frederic Weisbecker wrote:
> > User and kernel stack might be selected for other uses than callchain
> > in the future, this probably shouldn't mess with the regular callchain
> > code. Instead we should probably have an exclude_callchain_user
> > attribute, that could be also useful to filter out user callchains
> > when people don't want them. 
> 
> Probably ;-)


There is another solution that would solve my vdso problem
in the meantime.

The problem with vdso is that if we entered the kernel with
a syscall, the first user entry in the callchain will be a
vdso address. But vdso doesn't have dwarf information so we
can't unwind further.

One solution would be having max_callchain_user as an attribute.
If we do a normal frame pointer based callchain, set it to -1
and you won't have limitations in your callchain. Or you
can set it to n so that your callchains are bound to a maximum
depth.
If you don't want user callchains, set it to 0.
If you do a dwarf based callchain, set it to 2, so that if userspace
was in a vdso, we just deref the frame pointer to find what called
the vdso, and then we can start the dwarf unwinding from there.

I think I'll try that.
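
To make the proposed semantics concrete, here is a userspace sketch of a depth-limited frame pointer walk. It assumes the sampled ip counts as the first entry, so that a limit of 2 yields the vdso ip plus the one return address found by dereferencing the frame pointer. The `cc_` names and the in-memory frame layout are illustrative only, not the kernel's code.

```c
#include <stddef.h>
#include <stdint.h>

/* A frame as left by a frame-pointer prologue: saved fp, then return ip. */
struct cc_frame {
	const struct cc_frame *next;	/* caller's frame, NULL at the end */
	uint64_t ret_addr;
};

/*
 * Record the user callchain starting at `ip`, following `fp`.
 * max_user < 0: no limit; 0: no user callchain; n > 0: at most n entries.
 * Returns the number of entries written to `ips` (never more than `cap`).
 */
static size_t cc_user_callchain(uint64_t ip, const struct cc_frame *fp,
				int max_user, uint64_t *ips, size_t cap)
{
	size_t limit = max_user < 0 ? cap : (size_t)max_user;
	size_t n = 0;

	if (limit > cap)
		limit = cap;

	/* The sampled ip itself is the first entry. */
	if (n < limit)
		ips[n++] = ip;

	/* Each frame pointer deref adds the caller's return address. */
	while (fp && n < limit) {
		ips[n++] = fp->ret_addr;
		fp = fp->next;
	}

	return n;
}
```

With the limit set to 2, the second entry is the address that called into the vdso, which is where the dwarf unwinder could take over.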



* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-15 22:58           ` Frederic Weisbecker
@ 2010-10-17 10:07             ` Peter Zijlstra
  2010-10-18 10:01               ` Stephane Eranian
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-17 10:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Stephane Eranian, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Sat, 2010-10-16 at 00:58 +0200, Frederic Weisbecker wrote:
> > Yes, PEBS does not capture the entire state.
> > 
> > Here is what you get on Intel Core:
> >         u64 flags, ip;
> >         u64 ax, bx, cx, dx;
> >         u64 si, di, bp, sp;
> >         u64 r8,  r9,  r10, r11;
> >         u64 r12, r13, r14, r15;

> Ok, that seems to cover most of the state. I guess few people care
> about cs, ds, es, fs, gs, most of the time. 

Yeah, except if you want to profile wine or something like that ;-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-17 10:07             ` Peter Zijlstra
@ 2010-10-18 10:01               ` Stephane Eranian
  2010-10-18 22:35                 ` Frederic Weisbecker
  0 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2010-10-18 10:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Sun, Oct 17, 2010 at 12:07 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, 2010-10-16 at 00:58 +0200, Frederic Weisbecker wrote:
>> > Yes, PEBS does not capture the entire state.
>> >
>> > Here is what you get on Intel Core:
>> >         u64 flags, ip;
>> >         u64 ax, bx, cx, dx;
>> >         u64 si, di, bp, sp;
>> >         u64 r8,  r9,  r10, r11;
>> >         u64 r12, r13, r14, r15;
>
>> Ok, that seems to cover most of the state. I guess few people care
>> about cs, ds, es, fs, gs, most of the time.
>
> Yeah, except if you want to profile wine or something like that ;-)
>
That means that if you want the segment registers, then you cannot
use PEBS. I think you could catch that when the event is created.

The other problem here is how to name registers at the API level.
You would be introducing architecture-specific register names
in perf_event.h. There is no such thing today.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-18 10:01               ` Stephane Eranian
@ 2010-10-18 22:35                 ` Frederic Weisbecker
  2010-10-20  9:24                   ` Stephane Eranian
  0 siblings, 1 reply; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-18 22:35 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Mon, Oct 18, 2010 at 12:01:18PM +0200, Stephane Eranian wrote:
> On Sun, Oct 17, 2010 at 12:07 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, 2010-10-16 at 00:58 +0200, Frederic Weisbecker wrote:
> >> > Yes, PEBS does not capture the entire state.
> >> >
> >> > Here is what you get on Intel Core:
> >> >         u64 flags, ip;
> >> >         u64 ax, bx, cx, dx;
> >> >         u64 si, di, bp, sp;
> >> >         u64 r8,  r9,  r10, r11;
> >> >         u64 r12, r13, r14, r15;
> >
> >> Ok, that seems to cover most of the state. I guess few people care
> >> about cs, ds, es, fs, gs, most of the time.
> >
> > Yeah, except if you want to profile wine or something like that ;-)
> >
> That means that if you want the segment registers, then you cannot
> use PEBS. I think you could catch that when the event is created.
> 
> The other problem here is how to name registers at the API level.
> You would be introducing architecture-specific register names
> in perf_event.h. There is no such a thing today.


That can go into an asm/perf_regs.h or something. It's up to the
arch to name its registers.
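
As a rough illustration of what such an arch header could contain, here
is a sketch of x86 register indices plus a request bitmask, with the
PEBS restriction discussed above checked up front. All SKETCH_* names
and sketch_check_regs() are hypothetical and not an existing kernel API.
The PEBS set simply mirrors the register list quoted in this thread for
Intel Core.

```c
#include <stdint.h>

/* Hypothetical per-arch register indices, one bit each in a u64 mask. */
enum sketch_x86_reg {
	SKETCH_REG_AX, SKETCH_REG_BX, SKETCH_REG_CX, SKETCH_REG_DX,
	SKETCH_REG_SI, SKETCH_REG_DI, SKETCH_REG_BP, SKETCH_REG_SP,
	SKETCH_REG_IP, SKETCH_REG_FLAGS,
	SKETCH_REG_CS, SKETCH_REG_DS, SKETCH_REG_ES,
	SKETCH_REG_FS, SKETCH_REG_GS,
	SKETCH_REG_R8, SKETCH_REG_R9, SKETCH_REG_R10, SKETCH_REG_R11,
	SKETCH_REG_R12, SKETCH_REG_R13, SKETCH_REG_R14, SKETCH_REG_R15,
	SKETCH_REG_MAX
};

#define SKETCH_REG_BIT(r)	(UINT64_C(1) << (r))
#define SKETCH_ALL_REGS		(SKETCH_REG_BIT(SKETCH_REG_MAX) - 1)
#define SKETCH_SEGMENT_REGS	(SKETCH_REG_BIT(SKETCH_REG_CS) | \
				 SKETCH_REG_BIT(SKETCH_REG_DS) | \
				 SKETCH_REG_BIT(SKETCH_REG_ES) | \
				 SKETCH_REG_BIT(SKETCH_REG_FS) | \
				 SKETCH_REG_BIT(SKETCH_REG_GS))
/* Everything except the segment registers, per the list in the thread. */
#define SKETCH_PEBS_REGS	(SKETCH_ALL_REGS & ~SKETCH_SEGMENT_REGS)

/*
 * Validate a requested register mask when the event is created.
 * Returns 0 if the request can be satisfied, -1 otherwise.
 */
static int sketch_check_regs(uint64_t mask, int uses_pebs)
{
	if (mask & ~SKETCH_ALL_REGS)
		return -1;	/* unknown register bits */
	if (uses_pebs && (mask & ~SKETCH_PEBS_REGS))
		return -1;	/* PEBS record has no segment registers */
	return 0;
}
```

Keeping the enum in an arch header means the generic perf_event.h only
has to carry an opaque u64 mask.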


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-18 22:35                 ` Frederic Weisbecker
@ 2010-10-20  9:24                   ` Stephane Eranian
  2010-10-20 16:13                     ` Frederic Weisbecker
  2010-10-20 16:19                     ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Stephane Eranian @ 2010-10-20  9:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Tue, Oct 19, 2010 at 12:35 AM, Frederic Weisbecker
<fweisbec@gmail.com> wrote:
> On Mon, Oct 18, 2010 at 12:01:18PM +0200, Stephane Eranian wrote:
>> On Sun, Oct 17, 2010 at 12:07 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Sat, 2010-10-16 at 00:58 +0200, Frederic Weisbecker wrote:
>> >> > Yes, PEBS does not capture the entire state.
>> >> >
>> >> > Here is what you get on Intel Core:
>> >> >         u64 flags, ip;
>> >> >         u64 ax, bx, cx, dx;
>> >> >         u64 si, di, bp, sp;
>> >> >         u64 r8,  r9,  r10, r11;
>> >> >         u64 r12, r13, r14, r15;
>> >
>> >> Ok, that seems to cover most of the state. I guess few people care
>> >> about cs, ds, es, fs, gs, most of the time.
>> >
>> > Yeah, except if you want to profile wine or something like that ;-)
>> >
>> That means that if you want the segment registers, then you cannot
>> use PEBS. I think you could catch that when the event is created.
>>
>> The other problem here is how to name registers at the API level.
>> You would be introducing architecture-specific register names
>> in perf_event.h. There is no such a thing today.
>
>
> That can go into an asm/perf_regs.h or something. It's up to the
> arch to name its registers.
>
I am fine with that.

Starting with Nehalem,  there is a PEBS mode where HW captures
not just actual register state but also information about cache misses
such as the data address, miss latency, data source. Those are
stored in the PEBS record as u64. I believe we could also expose
this thru this register bitmask mechanism. Of course, you'd get a
failure if PEBS is not programmed correctly.

The alternative would be to invent yet another generic abstraction
to sample cache misses. Note that PEBS cache miss sampling
cannot be attached to an existing generic cache miss event. It
uses a dedicated event which does not count all cache misses.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] perf: Dwarf cfi based user callchains
  2010-10-13 15:13 ` [RFC] perf: Dwarf cfi based user callchains Frank Ch. Eigler
@ 2010-10-20 15:35   ` Frederic Weisbecker
  0 siblings, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-20 15:35 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras,
	Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter

On Wed, Oct 13, 2010 at 11:13:27AM -0400, Frank Ch. Eigler wrote:
> 
> Frederic Weisbecker <fweisbec@gmail.com> writes:
> 
> > [...]
> > This brings dwarf cfi based callchain for userspace apps that don't have
> > frame pointers.
> 
> Interesting approach!
> 
> > [...]
> > - it's slow. A first improvement to make it faster is to support binary
> >   search from .eh_frame_hdr. 
> 
> In systemtap land, we did find a dramatic improvement from this too.


Yeah, walking .eh_frame linearly is only good for testing and
debugging.
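
For reference, the binary search from .eh_frame_hdr mentioned above
could look roughly like the sketch below. It assumes the search table
has already been decoded into an array of (initial location, FDE
address) pairs sorted by initial location, which is how the table is
laid out in .eh_frame_hdr. The struct and function names are
hypothetical, and decoding the pointer encodings is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One decoded entry of the .eh_frame_hdr binary search table. */
struct sketch_hdr_entry {
	uint64_t initial_loc;	/* first address covered by the FDE */
	uint64_t fde_addr;	/* where the FDE lives in .eh_frame */
};

/*
 * Return the last entry whose initial_loc is <= pc, or NULL if pc is
 * below the first entry. The table must be sorted by initial_loc.
 * The caller still has to check pc against the FDE's address range.
 */
static const struct sketch_hdr_entry *
sketch_find_fde(const struct sketch_hdr_entry *tab, size_t n, uint64_t pc)
{
	size_t lo = 0, hi = n;

	/* Invariant: entries below lo start at or before pc, entries
	 * from hi upward start after pc. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tab[mid].initial_loc <= pc)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo ? &tab[lo - 1] : NULL;
}
```

This turns the per-frame lookup from a walk over every FDE into
O(log n) comparisons.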


 
> Have you measured the cost of transcribing of potentially large chunks
> of the user stacks?  We did not seriously evaluate this path, since we
> encounter megabyte+ stacks in larger userspace apps, and copying THAT
> out seemed absurd.


Actually that's quite affordable. And you don't need to dump the whole
stack of a process on every sample, that would indeed be absurd.
A small chunk, starting from the stack pointer, is enough.

What is interesting is that you can play with several different
dump sizes: the larger the dump, the deeper you'll be able to
unwind, but that also means more profiling overhead.

I can't measure significant throughput issues with hackbench
for example:


Normal run:

	$ time ./hackbench 10
	Time: 3.415

	real	0m3.506s
	user	0m0.257s
	sys	0m6.519s


With perf record (default is the cycles counter with a 1000 Hz sampling frequency):

	$ time ./perf record ./hackbench 10
	Time: 3.584
	[ perf record: Woken up 1 times to write data ]
	[ perf record: Captured and wrote 0.381 MB perf.data (~16654 samples) ]

	real	0m3.748s
	user	0m0.028s
	sys	0m0.022s


With perf record + frame pointer based callchain capture

	$ time ./perf record -g fp ./hackbench 10
	Time: 3.666
	[ perf record: Woken up 3 times to write data ]
	[ perf record: Captured and wrote 1.426 MB perf.data (~62281 samples) ]

	real	0m3.834s
	user	0m0.033s
	sys	0m0.046s

With perf record + 4096 bytes of stack dump in every sample


	$ time ./perf record -g dwarf,4096 ./hackbench 10
	Time: 3.697
	[ perf record: Woken up 15 times to write data ]
	[ perf record: Captured and wrote 5.156 MB perf.data (~225251 samples) ]

	real	0m3.931s
	user	0m0.026s
	sys	0m0.135s

With perf record + 20000 bytes of stack dump in every sample

	$ time ./perf record -g dwarf,20000 ./hackbench 10
	Time: 3.847
	[ perf record: Woken up 9 times to write data ]
	[ perf record: Captured and wrote 13.349 MB perf.data (~583219 samples) ]

	real	0m4.559s
	user	0m0.028s
	sys	0m0.329s


So there are no big differences. Maybe hackbench is not a good example
to highlight the possible impact though. And some tuning with higher
frequencies would make the difference more visible.

Perhaps the real impact is more on the amount of data to record: the
data file obviously tends to grow quickly.
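
To put the runs above on one scale, the slowdown relative to the normal
run can be computed from the "Time:" values hackbench reports. The
helper below is just that arithmetic, and sketch_slowdown_pct() is a
hypothetical name. Applied to the figures above it gives roughly 5%
(plain record), 7% (fp), 8% (dwarf,4096) and 13% (dwarf,20000).

```c
/*
 * Relative slowdown of a measured run against a baseline, in percent.
 * Both arguments are wall-clock times in seconds, baseline must be > 0.
 */
static double sketch_slowdown_pct(double baseline, double measured)
{
	return (measured - baseline) / baseline * 100.0;
}
```

For example, sketch_slowdown_pct(3.415, 3.847) covers the dwarf,20000
case.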


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-20  9:24                   ` Stephane Eranian
@ 2010-10-20 16:13                     ` Frederic Weisbecker
  2010-10-20 16:19                     ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Frederic Weisbecker @ 2010-10-20 16:13 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Wed, Oct 20, 2010 at 11:24:42AM +0200, Stephane Eranian wrote:
> On Tue, Oct 19, 2010 at 12:35 AM, Frederic Weisbecker
> <fweisbec@gmail.com> wrote:
> > On Mon, Oct 18, 2010 at 12:01:18PM +0200, Stephane Eranian wrote:
> >> On Sun, Oct 17, 2010 at 12:07 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Sat, 2010-10-16 at 00:58 +0200, Frederic Weisbecker wrote:
> >> >> > Yes, PEBS does not capture the entire state.
> >> >> >
> >> >> > Here is what you get on Intel Core:
> >> >> >         u64 flags, ip;
> >> >> >         u64 ax, bx, cx, dx;
> >> >> >         u64 si, di, bp, sp;
> >> >> >         u64 r8,  r9,  r10, r11;
> >> >> >         u64 r12, r13, r14, r15;
> >> >
> >> >> Ok, that seems to cover most of the state. I guess few people care
> >> >> about cs, ds, es, fs, gs, most of the time.
> >> >
> >> > Yeah, except if you want to profile wine or something like that ;-)
> >> >
> >> That means that if you want the segment registers, then you cannot
> >> use PEBS. I think you could catch that when the event is created.
> >>
> >> The other problem here is how to name registers at the API level.
> >> You would be introducing architecture-specific register names
> >> in perf_event.h. There is no such a thing today.
> >
> >
> > That can go into an asm/perf_regs.h or something. It's up to the
> > arch to name its registers.
> >
> I am fine with that.
> 
> Starting with Nehalem,  there is a PEBS mode where HW captures
> not just actual register state but also information about cache misses
> such as the data address, miss latency, data source. Those are
> stored in the PEBS record as u64. I believe we could also expose
> this thru this register bitmask mechanism. Of course, you'd get a
> failure if PEBS is not programmed correctly.



I'm not sure the registers are the right place for that. This is
too oriented toward a specific mechanism.

I would rather put that into a PERF_SAMPLE_RAW dump or a specific pebs
sample.

The problem with PERF_SAMPLE_RAW is that perf tools always assume
it's trace event content. They should check which event they are
looking at before making that assumption.

We'd need to look at the event that triggered the sample to
interpret the raw sample. That's fixable.
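
A minimal sketch of that fix on the tools side, assuming the tool can
classify the event that produced the sample. The enums and
sketch_raw_format_for() are hypothetical and do not mirror existing
perf tools code. The point is only that the raw payload format is
chosen from the event instead of being assumed.

```c
/* Hypothetical classification of the event that triggered a sample. */
enum sketch_event_kind {
	SKETCH_EVENT_TRACEPOINT,
	SKETCH_EVENT_PEBS_CACHE_MISS,
	SKETCH_EVENT_OTHER
};

/* Hypothetical ways to interpret a PERF_SAMPLE_RAW payload. */
enum sketch_raw_format {
	SKETCH_RAW_TRACE_EVENT,	/* parse as trace event fields */
	SKETCH_RAW_PEBS_RECORD,	/* parse as extra PEBS u64 fields */
	SKETCH_RAW_OPAQUE	/* unknown layout, do not parse */
};

/* Pick the raw payload format from the event, never by default. */
static enum sketch_raw_format
sketch_raw_format_for(enum sketch_event_kind kind)
{
	switch (kind) {
	case SKETCH_EVENT_TRACEPOINT:
		return SKETCH_RAW_TRACE_EVENT;
	case SKETCH_EVENT_PEBS_CACHE_MISS:
		return SKETCH_RAW_PEBS_RECORD;
	default:
		return SKETCH_RAW_OPAQUE;
	}
}
```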



> The alternative would be to invent yet another generic abstraction
> to sample cache misses. Note that PEBS cache miss sampling
> cannot be attached to an existing generic cache miss event. It
> uses a dedicated event which does not count all cache misses.

Then perhaps that should be abstracted into a different event, yeah.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 2/9] perf: Add ability to dump user regs
  2010-10-20  9:24                   ` Stephane Eranian
  2010-10-20 16:13                     ` Frederic Weisbecker
@ 2010-10-20 16:19                     ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2010-10-20 16:19 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Frederic Weisbecker, LKML, Ingo Molnar, Arnaldo Carvalho de Melo,
	Paul Mackerras, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu,
	Steven Rostedt, Robert Richter, David Miller

On Wed, 2010-10-20 at 11:24 +0200, Stephane Eranian wrote:
> 
> The alternative would be to invent yet another generic abstraction
> to sample cache misses. Note that PEBS cache miss sampling
> cannot be attached to an existing generic cache miss event. It
> uses a dedicated event which does not count all cache misses. 

Yeah, the IBS trainwreck comes to mind, I wish the PMU designers
wouldn't create such a mess.

I really dislike using PERF_SAMPLE_RAW for things like this, because it
is fully arch dependent.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2010-10-20 16:19 UTC | newest]

Thread overview: 33+ messages
2010-10-13  5:06 [RFC] perf: Dwarf cfi based user callchains Frederic Weisbecker
2010-10-13  5:06 ` [RFC PATCH 1/9] uaccess: Make copy_from_user_nmi() globally available Frederic Weisbecker
2010-10-13  7:15   ` Peter Zijlstra
2010-10-13 14:47     ` Frederic Weisbecker
2010-10-13 20:43       ` Peter Zijlstra
2010-10-13  5:06 ` [RFC PATCH 2/9] perf: Add ability to dump user regs Frederic Weisbecker
2010-10-13  7:20   ` Peter Zijlstra
2010-10-13 14:56     ` Frederic Weisbecker
2010-10-13 14:58     ` Frederic Weisbecker
2010-10-14 11:06     ` Stephane Eranian
2010-10-14 11:20       ` Frederic Weisbecker
2010-10-15  8:39         ` Stephane Eranian
2010-10-15 22:58           ` Frederic Weisbecker
2010-10-17 10:07             ` Peter Zijlstra
2010-10-18 10:01               ` Stephane Eranian
2010-10-18 22:35                 ` Frederic Weisbecker
2010-10-20  9:24                   ` Stephane Eranian
2010-10-20 16:13                     ` Frederic Weisbecker
2010-10-20 16:19                     ` Peter Zijlstra
2010-10-13  5:06 ` [RFC PATCH 3/9] perf: Add ability to dump part of the user stack Frederic Weisbecker
2010-10-13  7:22   ` Peter Zijlstra
2010-10-13 15:01     ` Frederic Weisbecker
2010-10-13  5:06 ` [RFC PATCH 4/9] perf: Don't record frame pointer based user stacktraces if we dump stack and regs Frederic Weisbecker
2010-10-13  7:23   ` Peter Zijlstra
2010-10-13 15:02     ` Frederic Weisbecker
2010-10-16  0:19     ` Frederic Weisbecker
2010-10-13  5:06 ` [RFC PATCH 5/9] perf: Support for dwarf mode callchain on perf record Frederic Weisbecker
2010-10-13  5:06 ` [RFC PATCH 6/9] perf: Build with dwarf cfi Frederic Weisbecker
2010-10-13  5:06 ` [RFC PATCH 7/9] perf: Support for error passed over pointers Frederic Weisbecker
2010-10-13  5:07 ` [RFC PATCH 8/9] perf: Add libunwind dependency for dwarf cfi unwinding Frederic Weisbecker
2010-10-13  5:07 ` [RFC PATCH 9/9] perf: Support for dwarf cfi unwinding on post processing Frederic Weisbecker
2010-10-13 15:13 ` [RFC] perf: Dwarf cfi based user callchains Frank Ch. Eigler
2010-10-20 15:35   ` Frederic Weisbecker
