* [PATCH] perf x86_64: Fix rsp register for system call fast path
@ 2012-10-01 17:31 Jiri Olsa
2012-10-02 10:44 ` Peter Zijlstra
0 siblings, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2012-10-01 17:31 UTC (permalink / raw)
To: linux-kernel
Cc: Jiri Olsa, Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
Peter Zijlstra, Arnaldo Carvalho de Melo, Oleg Nesterov
The user level rsp register value attached to the sample is crucial
for proper user stack dump and for proper dwarf backtrace post unwind.
But currently, if the event happens within the system call fast path,
we don't store the proper rsp register value in the event sample.
The reason is that the syscall fast path stores just a minimal set of
registers in the task's struct pt_regs area. The rsp itself is stored
in the per cpu variable 'old_rsp'.
This patch fixes the rsp register value based on:
- 'old_rsp' per cpu variable
(updated within the syscall fast path)
- guess on how we got into the kernel - syscall or interrupt
(via pt_regs::orig_ax value)
We can use the 'old_rsp' value only if we are inside a syscall.
Thanks to Oleg who outlined this solution!
The above guess introduces 2 race windows (fully described within the patch
comments), where we might get an incorrect user level rsp value stored in the
sample. However, in comparison with the system call fast path length, we still
get much more precise rsp values than without the patch.
Note that as we are now changing the pt_regs, we use statically allocated
pt_regs inside the sample data instead of the task pt_regs pointer.
Example of syscall fast path dwarf backtrace unwind:
(perf record -e cycles -g dwarf ls; perf report --stdio)
Before the patch applied:
--23.76%-- preempt_schedule_irq
retint_kernel
tty_ldisc_deref
tty_write
vfs_write
sys_write
system_call_fastpath
__GI___libc_write
0x6
With the patch applied:
--12.37%-- finish_task_switch
__schedule
preempt_schedule
queue_work
schedule_work
tty_flip_buffer_push
pty_write
n_tty_write
tty_write
vfs_write
sys_write
system_call_fastpath
__GI___libc_write
_IO_file_write@@GLIBC_2.2.5
new_do_write
_IO_do_write@@GLIBC_2.2.5
_IO_file_overflow@@GLIBC_2.2.5
print_current_files
main
__libc_start_main
_start
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
arch/x86/kernel/cpu/perf_event.c | 47 ++++++++++++++++++++++
include/linux/perf_event.h | 6 +-
kernel/events/core.c | 80 +++++++++++++++++++------------------
3 files changed, 91 insertions(+), 42 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 915b876..11d62ff 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -34,6 +34,7 @@
#include <asm/timer.h>
#include <asm/desc.h>
#include <asm/ldt.h>
+#include <asm/syscall.h>
#include "perf_event.h"
@@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
}
+#ifdef CONFIG_X86_64
+__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
+{
+ int kernel = !user_mode(regs);
+
+ if (kernel) {
+ if (current->mm)
+ regs = task_pt_regs(current);
+ else
+ regs = NULL;
+ }
+
+ if (regs) {
+ memcpy(oregs, regs, sizeof(*regs));
+
+ /*
+ * If the perf event was triggered within the kernel code
+ * path, then it was either syscall or interrupt. While
+ * interrupt stores almost all user registers, the syscall
+ * fast path does not. At this point we can at least set
+ * rsp register right, which is crucial for dwarf unwind.
+ *
+ * The syscall_get_nr function returns -1 (orig_ax) for
+ * interrupt, and positive value for syscall.
+ *
+ * We have two race windows in here:
+ *
+ * 1) Few instructions from syscall entry until old_rsp is
+ * set.
+ *
+ * 2) In syscall/interrupt path from entry until the orig_ax
+ * is set.
+ *
+ * Above described race windows are fractional opposed to
+ * the syscall fast path, so we get much better results
+ * fixing rsp this way.
+ */
+ if (kernel && (syscall_get_nr(current, regs) >= 0))
+ oregs->sp = this_cpu_read(old_rsp);
+
+ }
+
+ return regs ? 1 : 0;
+}
+#endif
+
/*
* callchain support
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 599afc4..451dcc5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -696,7 +696,7 @@ struct perf_branch_stack {
struct perf_regs_user {
__u64 abi;
- struct pt_regs *regs;
+ struct pt_regs regs;
};
struct task_struct;
@@ -1190,8 +1190,8 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
data->raw = NULL;
data->br_stack = NULL;
data->period = period;
- data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
- data->regs_user.regs = NULL;
+ /* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */
+ memset(&data->regs_user, 0, sizeof(data->regs_user));
data->stack_user_size = 0;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7b9df35..5beb963 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3780,8 +3780,7 @@ perf_output_sample_regs(struct perf_output_handle *handle,
}
}
-static void perf_sample_regs_user(struct perf_regs_user *regs_user,
- struct pt_regs *regs)
+__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
{
if (!user_mode(regs)) {
if (current->mm)
@@ -3790,10 +3789,19 @@ static void perf_sample_regs_user(struct perf_regs_user *regs_user,
regs = NULL;
}
- if (regs) {
- regs_user->regs = regs;
- regs_user->abi = perf_reg_abi(current);
- }
+ if (regs)
+ memcpy(oregs, regs, sizeof(*regs));
+
+ return regs ? 1 : 0;
+}
+
+static int perf_sample_regs_user(struct perf_regs_user *regs_user,
+ struct pt_regs *regs)
+{
+ if (arch_sample_regs_user(&regs_user->regs, regs))
+ regs_user->abi = perf_reg_abi(current);
+
+ return regs_user->abi;
}
/*
@@ -3819,10 +3827,6 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size,
{
u64 task_size;
- /* No regs, no stack pointer, no dump. */
- if (!regs)
- return 0;
-
/*
* Check if we fit in with the requested stack size into the:
* - TASK_SIZE
@@ -3854,33 +3858,30 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size,
static void
perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
- struct pt_regs *regs)
+ struct perf_regs_user *uregs)
{
- /* Case of a kernel thread, nothing to dump */
- if (!regs) {
- u64 size = 0;
- perf_output_put(handle, size);
- } else {
+ /*
+ * We dump:
+ * static size
+ * - the size requested by user or the best one we can fit
+ * in to the sample max size
+ * - zero (and final data) if there's nothing to dump
+ * data
+ * - user stack dump data
+ * dynamic size
+ * - the actual dumped size
+ */
+
+ /* Static size. */
+ perf_output_put(handle, dump_size);
+
+ if (dump_size) {
unsigned long sp;
unsigned int rem;
u64 dyn_size;
- /*
- * We dump:
- * static size
- * - the size requested by user or the best one we can fit
- * in to the sample max size
- * data
- * - user stack dump data
- * dynamic size
- * - the actual dumped size
- */
-
- /* Static size. */
- perf_output_put(handle, dump_size);
-
/* Data. */
- sp = perf_user_stack_pointer(regs);
+ sp = perf_user_stack_pointer(&uregs->regs);
rem = __output_copy_user(handle, (void *) sp, dump_size);
dyn_size = dump_size - rem;
@@ -4164,7 +4165,7 @@ void perf_output_sample(struct perf_output_handle *handle,
if (abi) {
u64 mask = event->attr.sample_regs_user;
perf_output_sample_regs(handle,
- data->regs_user.regs,
+ &data->regs_user.regs,
mask);
}
}
@@ -4172,7 +4173,7 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_STACK_USER)
perf_output_sample_ustack(handle,
data->stack_user_size,
- data->regs_user.regs);
+ &data->regs_user);
}
void perf_prepare_sample(struct perf_event_header *header,
@@ -4229,9 +4230,7 @@ void perf_prepare_sample(struct perf_event_header *header,
/* regs dump ABI info */
int size = sizeof(u64);
- perf_sample_regs_user(&data->regs_user, regs);
-
- if (data->regs_user.regs) {
+ if (perf_sample_regs_user(&data->regs_user, regs)) {
u64 mask = event->attr.sample_regs_user;
size += hweight64(mask) * sizeof(u64);
}
@@ -4247,14 +4246,17 @@ void perf_prepare_sample(struct perf_event_header *header,
* up the rest of the sample size.
*/
struct perf_regs_user *uregs = &data->regs_user;
- u16 stack_size = event->attr.sample_stack_user;
+ u64 sample_stack_user = event->attr.sample_stack_user;
+ u16 stack_size = 0;
u16 size = sizeof(u64);
if (!uregs->abi)
perf_sample_regs_user(uregs, regs);
- stack_size = perf_sample_ustack_size(stack_size, header->size,
- uregs->regs);
+ if (uregs->abi)
+ stack_size = perf_sample_ustack_size(sample_stack_user,
+ header->size,
+ &uregs->regs);
/*
* If there is something to dump, add space for the dump
--
1.7.7.6
^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] perf x86_64: Fix rsp register for system call fast path
  2012-10-01 17:31 [PATCH] perf x86_64: Fix rsp register for system call fast path Jiri Olsa
@ 2012-10-02 10:44 ` Peter Zijlstra
  2012-10-02 14:58   ` [PATCHv2] " Jiri Olsa
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2012-10-02 10:44 UTC (permalink / raw)
To: Jiri Olsa
Cc: linux-kernel, Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov

On Mon, 2012-10-01 at 19:31 +0200, Jiri Olsa wrote:
> @@ -696,7 +696,7 @@ struct perf_branch_stack {
>
>  struct perf_regs_user {
>  	__u64		abi;
> -	struct pt_regs	*regs;
> +	struct pt_regs	regs;
>  };

That's somewhat unfortunate but unavoidable I guess, can't go modify
pt_regs.

> +	if (uregs->abi)
> +		stack_size = perf_sample_ustack_size(sample_stack_user,
> +						     header->size,
> +

just a style nit, please add {} for all multi-line single stmt
constructs like that, even though not strictly required.

It reduces the possible confusion between multi-line and multi-statement
and reads easier.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCHv2] perf x86_64: Fix rsp register for system call fast path
  2012-10-02 10:44 ` Peter Zijlstra
@ 2012-10-02 14:58   ` Jiri Olsa
  2012-10-02 15:49     ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2012-10-02 14:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Frederic Weisbecker, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov

On Tue, Oct 02, 2012 at 12:44:04PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-10-01 at 19:31 +0200, Jiri Olsa wrote:
> > @@ -696,7 +696,7 @@ struct perf_branch_stack {
> >
> >  struct perf_regs_user {
> >  	__u64		abi;
> > -	struct pt_regs	*regs;
> > +	struct pt_regs	regs;
> >  };
>
> That's somewhat unfortunate but unavoidable I guess, can't go modify
> pt_regs.
>
> > +	if (uregs->abi)
> > +		stack_size = perf_sample_ustack_size(sample_stack_user,
> > +						     header->size,
> > +
>
> just a style nit, please add {} for all multi-line single stmt
> constructs like that, even though not strictly required.
>
> It reduces the possible confusion between multi-line and multi-statement
> and reads easier.

fixed, new version is attached

thanks,
jirka

---
The user level rsp register value attached to the sample is crucial
for proper user stack dump and for proper dwarf backtrace post unwind.

But currently, if the event happens within the system call fast path,
we don't store the proper rsp register value in the event sample.

The reason is that the syscall fast path stores just a minimal set of
registers in the task's struct pt_regs area. The rsp itself is stored
in the per cpu variable 'old_rsp'.

This patch fixes the rsp register value based on:
  - the 'old_rsp' per cpu variable
    (updated within the syscall fast path)
  - a guess on how we got into the kernel - syscall or interrupt
    (via pt_regs::orig_ax value)

We can use the 'old_rsp' value only if we are inside a syscall.
Thanks to Oleg who outlined this solution!

The above guess introduces 2 race windows (fully described within the patch
comments), where we might get an incorrect user level rsp value stored in the
sample. However, in comparison with the system call fast path length, we still
get much more precise rsp values than without the patch.

Note that as we are now changing the pt_regs, we use statically allocated
pt_regs inside the sample data instead of the task pt_regs pointer.

Example of syscall fast path dwarf backtrace unwind:
(perf record -e cycles -g dwarf ls; perf report --stdio)

Before the patch applied:

    --23.76%-- preempt_schedule_irq
               retint_kernel
               tty_ldisc_deref
               tty_write
               vfs_write
               sys_write
               system_call_fastpath
               __GI___libc_write
               0x6

With the patch applied:

    --12.37%-- finish_task_switch
               __schedule
               preempt_schedule
               queue_work
               schedule_work
               tty_flip_buffer_push
               pty_write
               n_tty_write
               tty_write
               vfs_write
               sys_write
               system_call_fastpath
               __GI___libc_write
               _IO_file_write@@GLIBC_2.2.5
               new_do_write
               _IO_do_write@@GLIBC_2.2.5
               _IO_file_overflow@@GLIBC_2.2.5
               print_current_files
               main
               __libc_start_main
               _start

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/cpu/perf_event.c |   47 ++++++++++++++++++++++
 include/linux/perf_event.h       |    6 +-
 kernel/events/core.c             |   81 +++++++++++++++++++------------------
 3 files changed, 92 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 915b876..11d62ff 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -34,6 +34,7 @@
 #include <asm/timer.h>
 #include <asm/desc.h>
 #include <asm/ldt.h>
+#include <asm/syscall.h>

 #include "perf_event.h"

@@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
 }

+#ifdef CONFIG_X86_64
+__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
+{
+	int kernel = !user_mode(regs);
+
+	if (kernel) {
+		if (current->mm)
+			regs = task_pt_regs(current);
+		else
+			regs = NULL;
+	}
+
+	if (regs) {
+		memcpy(oregs, regs, sizeof(*regs));
+
+		/*
+		 * If the perf event was triggered within the kernel code
+		 * path, then it was either syscall or interrupt. While
+		 * interrupt stores almost all user registers, the syscall
+		 * fast path does not. At this point we can at least set
+		 * rsp register right, which is crucial for dwarf unwind.
+		 *
+		 * The syscall_get_nr function returns -1 (orig_ax) for
+		 * interrupt, and positive value for syscall.
+		 *
+		 * We have two race windows in here:
+		 *
+		 * 1) Few instructions from syscall entry until old_rsp is
+		 *    set.
+		 *
+		 * 2) In syscall/interrupt path from entry until the orig_ax
+		 *    is set.
+		 *
+		 * Above described race windows are fractional opposed to
+		 * the syscall fast path, so we get much better results
+		 * fixing rsp this way.
+		 */
+		if (kernel && (syscall_get_nr(current, regs) >= 0))
+			oregs->sp = this_cpu_read(old_rsp);
+
+	}
+
+	return regs ? 1 : 0;
+}
+#endif
+
 /*
  * callchain support
  */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 599afc4..451dcc5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -696,7 +696,7 @@ struct perf_branch_stack {

 struct perf_regs_user {
 	__u64		abi;
-	struct pt_regs	*regs;
+	struct pt_regs	regs;
 };

 struct task_struct;

@@ -1190,8 +1190,8 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->raw = NULL;
 	data->br_stack = NULL;
 	data->period = period;
-	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
-	data->regs_user.regs = NULL;
+	/* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */
+	memset(&data->regs_user, 0, sizeof(data->regs_user));
 	data->stack_user_size = 0;
 }

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7b9df35..bfe9b42 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3780,8 +3780,7 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }

-static void perf_sample_regs_user(struct perf_regs_user *regs_user,
-				  struct pt_regs *regs)
+__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
 {
 	if (!user_mode(regs)) {
 		if (current->mm)
@@ -3790,10 +3789,19 @@ static void perf_sample_regs_user(struct perf_regs_user *regs_user,
 			regs = NULL;
 	}

-	if (regs) {
-		regs_user->regs = regs;
-		regs_user->abi = perf_reg_abi(current);
-	}
+	if (regs)
+		memcpy(oregs, regs, sizeof(*regs));
+
+	return regs ? 1 : 0;
+}
+
+static int perf_sample_regs_user(struct perf_regs_user *regs_user,
+				 struct pt_regs *regs)
+{
+	if (arch_sample_regs_user(&regs_user->regs, regs))
+		regs_user->abi = perf_reg_abi(current);
+
+	return regs_user->abi;
 }

 /*
@@ -3819,10 +3827,6 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size,
 {
 	u64 task_size;

-	/* No regs, no stack pointer, no dump. */
-	if (!regs)
-		return 0;
-
 	/*
 	 * Check if we fit in with the requested stack size into the:
 	 *	- TASK_SIZE
@@ -3854,33 +3858,30 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size,

 static void
 perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
-			  struct pt_regs *regs)
+			  struct perf_regs_user *uregs)
 {
-	/* Case of a kernel thread, nothing to dump */
-	if (!regs) {
-		u64 size = 0;
-		perf_output_put(handle, size);
-	} else {
+	/*
+	 * We dump:
+	 * static size
+	 *   - the size requested by user or the best one we can fit
+	 *     in to the sample max size
+	 *   - zero (and final data) if there's nothing to dump
+	 * data
+	 *   - user stack dump data
+	 * dynamic size
+	 *   - the actual dumped size
+	 */
+
+	/* Static size. */
+	perf_output_put(handle, dump_size);
+
+	if (dump_size) {
 		unsigned long sp;
 		unsigned int rem;
 		u64 dyn_size;

-		/*
-		 * We dump:
-		 * static size
-		 *   - the size requested by user or the best one we can fit
-		 *     in to the sample max size
-		 * data
-		 *   - user stack dump data
-		 * dynamic size
-		 *   - the actual dumped size
-		 */
-
-		/* Static size. */
-		perf_output_put(handle, dump_size);
-
 		/* Data. */
-		sp = perf_user_stack_pointer(regs);
+		sp = perf_user_stack_pointer(&uregs->regs);
 		rem = __output_copy_user(handle, (void *) sp, dump_size);
 		dyn_size = dump_size - rem;

@@ -4164,7 +4165,7 @@ void perf_output_sample(struct perf_output_handle *handle,
 		if (abi) {
 			u64 mask = event->attr.sample_regs_user;
 			perf_output_sample_regs(handle,
-						data->regs_user.regs,
+						&data->regs_user.regs,
 						mask);
 		}
 	}
@@ -4172,7 +4173,7 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_STACK_USER)
 		perf_output_sample_ustack(handle,
 					  data->stack_user_size,
-					  data->regs_user.regs);
+					  &data->regs_user);
 }

 void perf_prepare_sample(struct perf_event_header *header,
@@ -4229,9 +4230,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 		/* regs dump ABI info */
 		int size = sizeof(u64);

-		perf_sample_regs_user(&data->regs_user, regs);
-
-		if (data->regs_user.regs) {
+		if (perf_sample_regs_user(&data->regs_user, regs)) {
 			u64 mask = event->attr.sample_regs_user;
 			size += hweight64(mask) * sizeof(u64);
 		}
@@ -4247,14 +4246,18 @@ void perf_prepare_sample(struct perf_event_header *header,
 		 * up the rest of the sample size.
 		 */
 		struct perf_regs_user *uregs = &data->regs_user;
-		u16 stack_size = event->attr.sample_stack_user;
+		u64 sample_stack_user = event->attr.sample_stack_user;
+		u16 stack_size = 0;
 		u16 size = sizeof(u64);

 		if (!uregs->abi)
 			perf_sample_regs_user(uregs, regs);

-		stack_size = perf_sample_ustack_size(stack_size, header->size,
-						     uregs->regs);
+		if (uregs->abi) {
+			stack_size = perf_sample_ustack_size(sample_stack_user,
+							     header->size,
+							     &uregs->regs);
+		}

 		/*
 		 * If there is something to dump, add space for the dump
--
1.7.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path
  2012-10-02 14:58 ` [PATCHv2] " Jiri Olsa
@ 2012-10-02 15:49   ` Frederic Weisbecker
  2012-10-02 16:06     ` Jiri Olsa
  0 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2012-10-02 15:49 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov

On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 915b876..11d62ff 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -34,6 +34,7 @@
>  #include <asm/timer.h>
>  #include <asm/desc.h>
>  #include <asm/ldt.h>
> +#include <asm/syscall.h>
>
>  #include "perf_event.h"
>
> @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>  	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
>  }
>
> +#ifdef CONFIG_X86_64
> +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
> +{
> +	int kernel = !user_mode(regs);
> +
> +	if (kernel) {
> +		if (current->mm)
> +			regs = task_pt_regs(current);
> +		else
> +			regs = NULL;
> +	}

Shouldn't the above stay in generic code?

> +
> +	if (regs) {
> +		memcpy(oregs, regs, sizeof(*regs));
> +
> +		/*
> +		 * If the perf event was triggered within the kernel code
> +		 * path, then it was either syscall or interrupt. While
> +		 * interrupt stores almost all user registers, the syscall
> +		 * fast path does not. At this point we can at least set
> +		 * rsp register right, which is crucial for dwarf unwind.
> +		 *
> +		 * The syscall_get_nr function returns -1 (orig_ax) for
> +		 * interrupt, and positive value for syscall.
> +		 *
> +		 * We have two race windows in here:
> +		 *
> +		 * 1) Few instructions from syscall entry until old_rsp is
> +		 *    set.
> +		 *
> +		 * 2) In syscall/interrupt path from entry until the orig_ax
> +		 *    is set.
> +		 *
> +		 * Above described race windows are fractional opposed to
> +		 * the syscall fast path, so we get much better results
> +		 * fixing rsp this way.

That said, a race is there already: if the syscall is interrupted before
SAVE_ARGS and co.

I'm trying to scratch my head to find a solution to detect the race and
bail out instead of recording erroneous values but I can't find one.

Anyway this is still better than what we have now.

Another solution could be to force syscall slow path and have some variable
set there that tells us we are in a syscall and every regs have been saved.

But we probably don't want to force syscall slow path...

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path
  2012-10-02 15:49 ` Frederic Weisbecker
@ 2012-10-02 16:06   ` Jiri Olsa
  2012-10-02 16:16     ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2012-10-02 16:06 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov

On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote:
> On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 915b876..11d62ff 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -34,6 +34,7 @@
> >  #include <asm/timer.h>
> >  #include <asm/desc.h>
> >  #include <asm/ldt.h>
> > +#include <asm/syscall.h>
> >
> >  #include "perf_event.h"
> >
> > @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> >  	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
> >  }
> >
> > +#ifdef CONFIG_X86_64
> > +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
> > +{
> > +	int kernel = !user_mode(regs);
> > +
> > +	if (kernel) {
> > +		if (current->mm)
> > +			regs = task_pt_regs(current);
> > +		else
> > +			regs = NULL;
> > +	}
>
> Shouldn't the above stay in generic code?

could be.. I guess I thought that having the regs retrieval
plus the fixup at the same place feels better/compact ;)

but could change that if needed

SNIP

> That said, a race is there already: if the syscall is interrupted before
> SAVE_ARGS and co.

yep

> I'm trying to scratch my head to find a solution to detect the race and
> bail out instead of recording erroneous values but I can't find one.
>
> Anyway this is still better than what we have now.
>
> Another solution could be to force syscall slow path and have some variable
> set there that tells us we are in a syscall and every regs have been saved.
>
> But we probably don't want to force syscall slow path...

I was trying something like that as well, but the one I sent looks
far less hacky to me.. :)

jirka

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCHv2] perf x86_64: Fix rsp register for system call fast path
  2012-10-02 16:06 ` Jiri Olsa
@ 2012-10-02 16:16   ` Frederic Weisbecker
  2012-10-03 12:29     ` [PATCHv3] " Jiri Olsa
  0 siblings, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2012-10-02 16:16 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov

On Tue, Oct 02, 2012 at 06:06:26PM +0200, Jiri Olsa wrote:
> On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote:
> > On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:
> > > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > > index 915b876..11d62ff 100644
> > > --- a/arch/x86/kernel/cpu/perf_event.c
> > > +++ b/arch/x86/kernel/cpu/perf_event.c
> > > @@ -34,6 +34,7 @@
> > >  #include <asm/timer.h>
> > >  #include <asm/desc.h>
> > >  #include <asm/ldt.h>
> > > +#include <asm/syscall.h>
> > >
> > >  #include "perf_event.h"
> > >
> > > @@ -1699,6 +1700,52 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> > >  	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
> > >  }
> > >
> > > +#ifdef CONFIG_X86_64
> > > +__weak int arch_sample_regs_user(struct pt_regs *oregs, struct pt_regs *regs)
> > > +{
> > > +	int kernel = !user_mode(regs);
> > > +
> > > +	if (kernel) {
> > > +		if (current->mm)
> > > +			regs = task_pt_regs(current);
> > > +		else
> > > +			regs = NULL;
> > > +	}
> >
> > Shouldn't the above stay in generic code?
>
> could be.. I guess I thought that having the regs retrieval
> plus the fixup at the same place feels better/compact ;)
>
> but could change that if needed

Yeah please.

> > I'm trying to scratch my head to find a solution to detect the race and
> > bail out instead of recording erroneous values but I can't find one.
> >
> > Anyway this is still better than what we have now.
> >
> > Another solution could be to force syscall slow path and have some variable
> > set there that tells us we are in a syscall and every regs have been saved.
> >
> > But we probably don't want to force syscall slow path...
>
> I was trying something like that as well, but the one I sent looks
> far less hacky to me.. :)

Actually it's more hacky because it's less deterministic.
But it's more simple, and doesn't hurt performances. Ok, let's start
with that.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* [PATCHv3] perf x86_64: Fix rsp register for system call fast path
  2012-10-02 16:16 ` Frederic Weisbecker
@ 2012-10-03 12:29   ` Jiri Olsa
  2012-10-03 12:35     ` Frederic Weisbecker
  0 siblings, 1 reply; 14+ messages in thread
From: Jiri Olsa @ 2012-10-03 12:29 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, Oleg Nesterov

On Tue, Oct 02, 2012 at 06:16:47PM +0200, Frederic Weisbecker wrote:
> On Tue, Oct 02, 2012 at 06:06:26PM +0200, Jiri Olsa wrote:
> > On Tue, Oct 02, 2012 at 05:49:26PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Oct 02, 2012 at 04:58:15PM +0200, Jiri Olsa wrote:

SNIP

> > could be.. I guess I thought that having the regs retrieval
> > plus the fixup at the same place feels better/compact ;)
> >
> > but could change that if needed
>
> Yeah please.

attached new version

> Actually it's more hacky because it's less deterministic.
> But it's more simple, and doesn't hurt performances.

new version does the regs copy only if it recognize the syscall,
so it should be little faster than previous version ;)

thanks,
jirka

---
The user level rsp register value attached to the sample is crucial
for proper user stack dump and for proper dwarf backtrace post unwind.

But currently, if the event happens within the system call fast path,
we don't store the proper rsp register value in the event sample.

The reason is that the syscall fast path stores just a minimal set of
registers in the task's struct pt_regs area. The rsp itself is stored
in the per cpu variable 'old_rsp'.

This patch fixes the rsp register value based on:
  - the 'old_rsp' per cpu variable
    (updated within the syscall fast path)
  - a guess on how we got into the kernel - syscall or interrupt
    (via pt_regs::orig_ax value)

We can use the 'old_rsp' value only if we are inside a syscall.
Thanks to Oleg who outlined this solution!

The above guess introduces 2 race windows (fully described within the patch
comments), where we might get an incorrect user level rsp value stored in the
sample. However, in comparison with the system call fast path length, we still
get much more precise rsp values than without the patch.

Note that we use statically allocated pt_regs inside the sample data
when we need to change it. In other cases we still use the pt_regs
pointer.

Example of syscall fast path dwarf backtrace unwind:
(perf record -e cycles -g dwarf ls; perf report --stdio)

Before the patch applied:

    --23.76%-- preempt_schedule_irq
               retint_kernel
               tty_ldisc_deref
               tty_write
               vfs_write
               sys_write
               system_call_fastpath
               __GI___libc_write
               0x6

With the patch applied:

    --12.37%-- finish_task_switch
               __schedule
               preempt_schedule
               queue_work
               schedule_work
               tty_flip_buffer_push
               pty_write
               n_tty_write
               tty_write
               vfs_write
               sys_write
               system_call_fastpath
               __GI___libc_write
               _IO_file_write@@GLIBC_2.2.5
               new_do_write
               _IO_do_write@@GLIBC_2.2.5
               _IO_file_overflow@@GLIBC_2.2.5
               print_current_files
               main
               __libc_start_main
               _start

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/cpu/perf_event.c |   38 ++++++++++++++++++++
 include/linux/perf_event.h       |    5 ++-
 kernel/events/core.c             |   73 ++++++++++++++++++++------------------
 3 files changed, 79 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 915b876..0ba525b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -34,6 +34,7 @@
 #include <asm/timer.h>
 #include <asm/desc.h>
 #include <asm/ldt.h>
+#include <asm/syscall.h>

 #include "perf_event.h"

@@ -1699,6 +1700,43 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
 	userpg->time_offset = this_cpu_read(cyc2ns_offset) - now;
 }

+#ifdef CONFIG_X86_64
+__weak void
+arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel)
+{
+	/*
+	 * If the perf event was triggered within the kernel code
+	 * path, then it was either syscall or interrupt. While
+	 * interrupt stores almost all user registers, the syscall
+	 * fast path does not. At this point we can at least set
+	 * rsp register right, which is crucial for dwarf unwind.
+	 *
+	 * The syscall_get_nr function returns -1 (orig_ax) for
+	 * interrupt, and positive value for syscall.
+	 *
+	 * We have two race windows in here:
+	 *
+	 * 1) Few instructions from syscall entry until old_rsp is
+	 *    set.
+	 *
+	 * 2) In syscall/interrupt path from entry until the orig_ax
+	 *    is set.
+	 *
+	 * Above described race windows are fractional opposed to
+	 * the syscall fast path, so we get much better results
+	 * fixing rsp this way.
+	 */
+	if (kernel && (syscall_get_nr(current, uregs->regs) >= 0)) {
+		/* Make a copy and link it to regs pointer. */
+		memcpy(&uregs->regs_copy, uregs->regs, sizeof(*uregs->regs));
+		uregs->regs = &uregs->regs_copy;
+
+		/* And fix the rsp. */
+		uregs->regs->sp = this_cpu_read(old_rsp);
+	}
+}
+#endif
+
 /*
  * callchain support
  */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 599afc4..d8ad615 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -697,6 +697,7 @@ struct perf_branch_stack {
 struct perf_regs_user {
 	__u64		abi;
 	struct pt_regs	*regs;
+	struct pt_regs	regs_copy;
 };

 struct task_struct;

@@ -1190,8 +1191,8 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->raw = NULL;
 	data->br_stack = NULL;
 	data->period = period;
-	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
-	data->regs_user.regs = NULL;
+	/* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */
+	memset(&data->regs_user, 0, sizeof(data->regs_user));
 	data->stack_user_size = 0;
 }

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7b9df35..aa4e2fc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3780,10 +3780,15 @@ perf_output_sample_regs(struct perf_output_handle *handle,
 	}
 }

-static void perf_sample_regs_user(struct perf_regs_user *regs_user,
-				  struct pt_regs *regs)
+__weak void
+arch_sample_regs_user_fixup(struct perf_regs_user *regs_user, int kernel) { }
+
+static int perf_sample_regs_user(struct perf_regs_user *regs_user,
+				 struct pt_regs *regs)
 {
-	if (!user_mode(regs)) {
+	int kernel = !user_mode(regs);
+
+	if (kernel) {
 		if (current->mm)
 			regs = task_pt_regs(current);
 		else
@@ -3793,7 +3798,10 @@ static void perf_sample_regs_user(struct perf_regs_user *regs_user,
 	if (regs) {
 		regs_user->regs = regs;
 		regs_user->abi = perf_reg_abi(current);
+		arch_sample_regs_user_fixup(regs_user, kernel);
 	}
+
+	return regs_user->abi;
 }

 /*
@@ -3819,10 +3827,6 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size,
 {
 	u64 task_size;

-	/* No regs, no stack pointer, no dump. */
-	if (!regs)
-		return 0;
-
 	/*
 	 * Check if we fit in with the requested stack size into the:
 	 *	- TASK_SIZE
@@ -3854,33 +3858,30 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size,

 static void
 perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size,
-			  struct pt_regs *regs)
+			  struct perf_regs_user *uregs)
 {
-	/* Case of a kernel thread, nothing to dump */
-	if (!regs) {
-		u64 size = 0;
-		perf_output_put(handle, size);
-	} else {
+	/*
+	 * We dump:
+	 * static size
+	 *   - the size requested by user or the best one we can fit
+	 *     in to the sample max size
+	 *   - zero (and final data) if there's nothing to dump
+	 * data
+	 *   - user stack dump data
+	 * dynamic size
+	 *   - the actual dumped size
+	 */
+
+	/* Static size. */
+	perf_output_put(handle, dump_size);
+
+	if (dump_size) {
 		unsigned long sp;
 		unsigned int rem;
 		u64 dyn_size;

-		/*
-		 * We dump:
-		 * static size
-		 *   - the size requested by user or the best one we can fit
-		 *     in to the sample max size
-		 * data
-		 *   - user stack dump data
-		 * dynamic size
-		 *   - the actual dumped size
-		 */
-
-		/* Static size. */
-		perf_output_put(handle, dump_size);
-
 		/* Data. */
-		sp = perf_user_stack_pointer(regs);
+		sp = perf_user_stack_pointer(uregs->regs);
 		rem = __output_copy_user(handle, (void *) sp, dump_size);
 		dyn_size = dump_size - rem;

@@ -4172,7 +4173,7 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_STACK_USER)
 		perf_output_sample_ustack(handle,
 					  data->stack_user_size,
-					  data->regs_user.regs);
+					  &data->regs_user);
 }

 void perf_prepare_sample(struct perf_event_header *header,
@@ -4229,9 +4230,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 		/* regs dump ABI info */
 		int size = sizeof(u64);

-		perf_sample_regs_user(&data->regs_user, regs);
-
-		if (data->regs_user.regs) {
+		if (perf_sample_regs_user(&data->regs_user, regs)) {
 			u64 mask = event->attr.sample_regs_user;
 			size += hweight64(mask) * sizeof(u64);
 		}
@@ -4247,14 +4246,18 @@ void perf_prepare_sample(struct perf_event_header *header,
 		 * up the rest of the sample size.
 		 */
 		struct perf_regs_user *uregs = &data->regs_user;
-		u16 stack_size = event->attr.sample_stack_user;
+		u64 sample_stack_user = event->attr.sample_stack_user;
+		u16 stack_size = 0;
 		u16 size = sizeof(u64);

 		if (!uregs->abi)
 			perf_sample_regs_user(uregs, regs);

-		stack_size = perf_sample_ustack_size(stack_size, header->size,
-						     uregs->regs);
+		if (uregs->abi) {
+			stack_size = perf_sample_ustack_size(sample_stack_user,
+							     header->size,
+							     uregs->regs);
+		}

 		/*
 		 * If there is something to dump, add space for the dump
--
1.7.7.6

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: [PATCHv3] perf x86_64: Fix rsp register for system call fast path 2012-10-03 12:29 ` [PATCHv3] " Jiri Olsa @ 2012-10-03 12:35 ` Frederic Weisbecker 2012-10-03 13:13 ` [PATCHv4] " Jiri Olsa 0 siblings, 1 reply; 14+ messages in thread From: Frederic Weisbecker @ 2012-10-03 12:35 UTC (permalink / raw) To: Jiri Olsa Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo, Oleg Nesterov On Wed, Oct 03, 2012 at 02:29:47PM +0200, Jiri Olsa wrote: > +#ifdef CONFIG_X86_64 > +__weak void Only annotate with __weak the default implementation you want to be overridden. Here you want it to actually override the default __weak version. > +arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel) > +{ > + /* > + * If the perf event was triggered within the kernel code > + * path, then it was either syscall or interrupt. While > + * interrupt stores almost all user registers, the syscall > + * fast path does not. At this point we can at least set > + * rsp register right, which is crucial for dwarf unwind. > + * > + * The syscall_get_nr function returns -1 (orig_ax) for > + * interrupt, and positive value for syscall. > + * > + * We have two race windows in here: > + * > + * 1) Few instructions from syscall entry until old_rsp is > + * set. > + * > + * 2) In syscall/interrupt path from entry until the orig_ax > + * is set. > + * > + * Above described race windows are fractional opposed to > + * the syscall fast path, so we get much better results > + * fixing rsp this way. > + */ > + if (kernel && (syscall_get_nr(current, uregs->regs) >= 0)) { > + /* Make a copy and link it to regs pointer. */ > + memcpy(&uregs->regs_copy, uregs->regs, sizeof(*uregs->regs)); > + uregs->regs = &uregs->regs_copy; > + > + /* And fix the rsp. */ > + uregs->regs->sp = this_cpu_read(old_rsp); > + } > +} > +#endif ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCHv4] perf x86_64: Fix rsp register for system call fast path 2012-10-03 12:35 ` Frederic Weisbecker @ 2012-10-03 13:13 ` Jiri Olsa 2012-10-03 13:22 ` Peter Zijlstra 0 siblings, 1 reply; 14+ messages in thread From: Jiri Olsa @ 2012-10-03 13:13 UTC (permalink / raw) To: Frederic Weisbecker Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo, Oleg Nesterov On Wed, Oct 03, 2012 at 02:35:05PM +0200, Frederic Weisbecker wrote: > On Wed, Oct 03, 2012 at 02:29:47PM +0200, Jiri Olsa wrote: > > +#ifdef CONFIG_X86_64 > > +__weak void > > Only annotate with __weak the default implementation you want to be > overridden. Here you want it to actually override the default __weak version. new one attached thanks, jirka --- The user level rsp register value attached to the sample is crucial for proper user stack dump and for proper dwarf backtrace post unwind. But currently, if the event happens within the system call fast path, we don't store a proper rsp register value in the event sample. The reason is that the syscall fast path stores just a minimal set of registers to the task's struct pt_regs area. The rsp itself is stored in the per cpu variable 'old_rsp'. This patch fixes this rsp register value based on the: - 'old_rsp' per cpu variable (updated within the syscall fast path) - guess on how we got into the kernel - syscall or interrupt (via pt_regs::orig_ax value) We can use the 'old_rsp' value only if we are inside the syscall. Thanks to Oleg who outlined this solution! Above guess introduces 2 race windows (fully described within the patch comments), where we might get incorrect user level rsp value stored in the sample. However, in comparison with system call fast path length, we still get much more precise rsp values than without the patch. Note that we use statically allocated pt_regs inside the sample data when we need to change it. In other cases we still use the pt_regs pointer.
Example of syscall fast path dwarf backtrace unwind: (perf record -e cycles -g dwarf ls; perf report --stdio) Before the patch applied: --23.76%-- preempt_schedule_irq retint_kernel tty_ldisc_deref tty_write vfs_write sys_write system_call_fastpath __GI___libc_write 0x6 With the patch applied: --12.37%-- finish_task_switch __schedule preempt_schedule queue_work schedule_work tty_flip_buffer_push pty_write n_tty_write tty_write vfs_write sys_write system_call_fastpath __GI___libc_write _IO_file_write@@GLIBC_2.2.5 new_do_write _IO_do_write@@GLIBC_2.2.5 _IO_file_overflow@@GLIBC_2.2.5 print_current_files main __libc_start_main _start Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> --- arch/x86/kernel/cpu/perf_event.c | 37 +++++++++++++++++++ include/linux/perf_event.h | 5 ++- kernel/events/core.c | 73 ++++++++++++++++++++------------------ 3 files changed, 78 insertions(+), 37 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 915b876..834fe96 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -34,6 +34,7 @@ #include <asm/timer.h> #include <asm/desc.h> #include <asm/ldt.h> +#include <asm/syscall.h> #include "perf_event.h" @@ -1699,6 +1700,42 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now) userpg->time_offset = this_cpu_read(cyc2ns_offset) - now; } +#ifdef CONFIG_X86_64 +void arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel) +{ + /* + * If the perf event was triggered within the kernel code + * path, then it was either syscall or interrupt. While + * interrupt stores almost all user registers, the syscall + * fast path does not. 
At this point we can at least set + * rsp register right, which is crucial for dwarf unwind. + * + * The syscall_get_nr function returns -1 (orig_ax) for + * interrupt, and positive value for syscall. + * + * We have two race windows in here: + * + * 1) Few instructions from syscall entry until old_rsp is + * set. + * + * 2) In syscall/interrupt path from entry until the orig_ax + * is set. + * + * Above described race windows are fractional opposed to + * the syscall fast path, so we get much better results + * fixing rsp this way. + */ + if (kernel && (syscall_get_nr(current, uregs->regs) >= 0)) { + /* Make a copy and link it to regs pointer. */ + memcpy(&uregs->regs_copy, uregs->regs, sizeof(*uregs->regs)); + uregs->regs = &uregs->regs_copy; + + /* And fix the rsp. */ + uregs->regs->sp = this_cpu_read(old_rsp); + } +} +#endif + /* * callchain support */ diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 599afc4..d8ad615 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -697,6 +697,7 @@ struct perf_branch_stack { struct perf_regs_user { __u64 abi; struct pt_regs *regs; + struct pt_regs regs_copy; }; struct task_struct; @@ -1190,8 +1191,8 @@ static inline void perf_sample_data_init(struct perf_sample_data *data, data->raw = NULL; data->br_stack = NULL; data->period = period; - data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE; - data->regs_user.regs = NULL; + /* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. 
*/ + memset(&data->regs_user, 0, sizeof(data->regs_user)); data->stack_user_size = 0; } diff --git a/kernel/events/core.c b/kernel/events/core.c index 7b9df35..aa4e2fc 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3780,10 +3780,15 @@ perf_output_sample_regs(struct perf_output_handle *handle, } } -static void perf_sample_regs_user(struct perf_regs_user *regs_user, - struct pt_regs *regs) +__weak void +arch_sample_regs_user_fixup(struct perf_regs_user *regs_user, int kernel) { } + +static int perf_sample_regs_user(struct perf_regs_user *regs_user, + struct pt_regs *regs) { - if (!user_mode(regs)) { + int kernel = !user_mode(regs); + + if (kernel) { if (current->mm) regs = task_pt_regs(current); else @@ -3793,7 +3798,10 @@ static void perf_sample_regs_user(struct perf_regs_user *regs_user, if (regs) { regs_user->regs = regs; regs_user->abi = perf_reg_abi(current); + arch_sample_regs_user_fixup(regs_user, kernel); } + + return regs_user->abi; } /* @@ -3819,10 +3827,6 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size, { u64 task_size; - /* No regs, no stack pointer, no dump. */ - if (!regs) - return 0; - /* * Check if we fit in with the requested stack size into the: * - TASK_SIZE @@ -3854,33 +3858,30 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size, static void perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size, - struct pt_regs *regs) + struct perf_regs_user *uregs) { - /* Case of a kernel thread, nothing to dump */ - if (!regs) { - u64 size = 0; - perf_output_put(handle, size); - } else { + /* + * We dump: + * static size + * - the size requested by user or the best one we can fit + * in to the sample max size + * - zero (and final data) if there's nothing to dump + * data + * - user stack dump data + * dynamic size + * - the actual dumped size + */ + + /* Static size. 
*/ + perf_output_put(handle, dump_size); + + if (dump_size) { unsigned long sp; unsigned int rem; u64 dyn_size; - /* - * We dump: - * static size - * - the size requested by user or the best one we can fit - * in to the sample max size - * data - * - user stack dump data - * dynamic size - * - the actual dumped size - */ - - /* Static size. */ - perf_output_put(handle, dump_size); - /* Data. */ - sp = perf_user_stack_pointer(regs); + sp = perf_user_stack_pointer(uregs->regs); rem = __output_copy_user(handle, (void *) sp, dump_size); dyn_size = dump_size - rem; @@ -4172,7 +4173,7 @@ void perf_output_sample(struct perf_output_handle *handle, if (sample_type & PERF_SAMPLE_STACK_USER) perf_output_sample_ustack(handle, data->stack_user_size, - data->regs_user.regs); + &data->regs_user); } void perf_prepare_sample(struct perf_event_header *header, @@ -4229,9 +4230,7 @@ void perf_prepare_sample(struct perf_event_header *header, /* regs dump ABI info */ int size = sizeof(u64); - perf_sample_regs_user(&data->regs_user, regs); - - if (data->regs_user.regs) { + if (perf_sample_regs_user(&data->regs_user, regs)) { u64 mask = event->attr.sample_regs_user; size += hweight64(mask) * sizeof(u64); } @@ -4247,14 +4246,18 @@ void perf_prepare_sample(struct perf_event_header *header, * up the rest of the sample size. */ struct perf_regs_user *uregs = &data->regs_user; - u16 stack_size = event->attr.sample_stack_user; + u64 sample_stack_user = event->attr.sample_stack_user; + u16 stack_size = 0; u16 size = sizeof(u64); if (!uregs->abi) perf_sample_regs_user(uregs, regs); - stack_size = perf_sample_ustack_size(stack_size, header->size, - uregs->regs); + if (uregs->abi) { + stack_size = perf_sample_ustack_size(sample_stack_user, + header->size, + uregs->regs); + } /* * If there is something to dump, add space for the dump -- 1.7.7.6 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCHv4] perf x86_64: Fix rsp register for system call fast path 2012-10-03 13:13 ` [PATCHv4] " Jiri Olsa @ 2012-10-03 13:22 ` Peter Zijlstra 2012-10-03 13:30 ` Jiri Olsa 0 siblings, 1 reply; 14+ messages in thread From: Peter Zijlstra @ 2012-10-03 13:22 UTC (permalink / raw) To: Jiri Olsa Cc: Frederic Weisbecker, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo, Oleg Nesterov On Wed, 2012-10-03 at 15:13 +0200, Jiri Olsa wrote: > @@ -1190,8 +1191,8 @@ static inline void perf_sample_data_init(struct > perf_sample_data *data, > data->raw = NULL; > data->br_stack = NULL; > data->period = period; > - data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE; > - data->regs_user.regs = NULL; > + /* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */ > + memset(&data->regs_user, 0, sizeof(data->regs_user)); > data->stack_user_size = 0; > } Hmm, this will slow down all events, regardless of whether they use any of that stuff or not. Since the one user actually does something like: data->regs_user = *pt_regs; except it does a memcpy() for some obscure reason, it really doesn't matter what is in there when uninitialized, right? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4] perf x86_64: Fix rsp register for system call fast path 2012-10-03 13:22 ` Peter Zijlstra @ 2012-10-03 13:30 ` Jiri Olsa 2012-10-04 10:38 ` [PATH 0/2] perf: x86_64 rsp related changes Jiri Olsa 0 siblings, 1 reply; 14+ messages in thread From: Jiri Olsa @ 2012-10-03 13:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Frederic Weisbecker, linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo, Oleg Nesterov On Wed, Oct 03, 2012 at 03:22:17PM +0200, Peter Zijlstra wrote: > On Wed, 2012-10-03 at 15:13 +0200, Jiri Olsa wrote: > > @@ -1190,8 +1191,8 @@ static inline void perf_sample_data_init(struct > > perf_sample_data *data, > > data->raw = NULL; > > data->br_stack = NULL; > > data->period = period; > > - data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE; > > - data->regs_user.regs = NULL; > > + /* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */ > > + memset(&data->regs_user, 0, sizeof(data->regs_user)); > > data->stack_user_size = 0; > > } > > Hmm, this will slow down all events, regardless of whether they use any > of that stuff or not. Since the one user actually does something like: > > data->regs_user = *pt_regs; > > except it does a memcpy() for some obscure reason, it really doesn't > matter what is in there when uninitialized, right? right, the init can stay as it was jirka ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATH 0/2] perf: x86_64 rsp related changes 2012-10-03 13:30 ` Jiri Olsa @ 2012-10-04 10:38 ` Jiri Olsa 2012-10-04 10:38 ` [PATCH 1/2] perf x86_64: Fix rsp register for system call fast path Jiri Olsa 2012-10-04 10:38 ` [PATCH 2/2] perf: Simplify the sample's user regs/stack retrieval Jiri Olsa 0 siblings, 2 replies; 14+ messages in thread From: Jiri Olsa @ 2012-10-04 10:38 UTC (permalink / raw) To: linux-kernel Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras, Peter Zijlstra, Arnaldo Carvalho de Melo, Oleg Nesterov, Jiri Olsa On Wed, Oct 03, 2012 at 03:30:07PM +0200, Jiri Olsa wrote: > On Wed, Oct 03, 2012 at 03:22:17PM +0200, Peter Zijlstra wrote: > > On Wed, 2012-10-03 at 15:13 +0200, Jiri Olsa wrote: > > > @@ -1190,8 +1191,8 @@ static inline void perf_sample_data_init(struct > > > perf_sample_data *data, > > > data->raw = NULL; > > > data->br_stack = NULL; > > > data->period = period; > > > - data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE; > > > - data->regs_user.regs = NULL; > > > + /* Sets abi to PERF_SAMPLE_REGS_ABI_NONE. */ > > > + memset(&data->regs_user, 0, sizeof(data->regs_user)); > > > data->stack_user_size = 0; > > > } > > > > Hmm, this will slow down all events, regardless of whether they use any > > of that stuff or not. Since the one user actually does something like: > > > > data->regs_user = *pt_regs; > > > > except it does a memcpy() for some obscure reason, it really doesn't > > matter what is in there when uninitialized, right? 
> > right, the init can stay as it was > > jirka I made the change and I split it into 2 patches, should be more readable 1/2 perf x86_64: Fix rsp register for system call fast path 2/2 perf: Simplify the sample's user regs/stack retrieval thanks, jirka Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> --- arch/x86/kernel/cpu/perf_event.c | 37 +++++++++++++++++++ include/linux/perf_event.h | 1 + kernel/events/core.c | 74 ++++++++++++++++++++------------------ 3 files changed, 77 insertions(+), 35 deletions(-) ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] perf x86_64: Fix rsp register for system call fast path 2012-10-04 10:38 ` [PATH 0/2] perf: x86_64 rsp related changes Jiri Olsa @ 2012-10-04 10:38 ` Jiri Olsa 2012-10-04 10:38 ` [PATCH 2/2] perf: Simplify the sample's user regs/stack retrieval Jiri Olsa 1 sibling, 0 replies; 14+ messages in thread From: Jiri Olsa @ 2012-10-04 10:38 UTC (permalink / raw) To: linux-kernel Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras, Peter Zijlstra, Arnaldo Carvalho de Melo, Oleg Nesterov, Jiri Olsa The user level rsp register value attached to the sample is crucial for proper user stack dump and for proper dwarf backtrace post unwind. But currently, if the event happens within the system call fast path, we don't store a proper rsp register value in the event sample. The reason is that the syscall fast path stores just a minimal set of registers to the task's struct pt_regs area. The rsp itself is stored in the per cpu variable 'old_rsp'. This patch fixes this rsp register value based on the: - 'old_rsp' per cpu variable (updated within the syscall fast path) - guess on how we got into the kernel - syscall or interrupt (via pt_regs::orig_ax value) We can use the 'old_rsp' value only if we are inside the syscall. Thanks to Oleg who outlined this solution! Above guess introduces 2 race windows (fully described within the patch comments), where we might get incorrect user level rsp value stored in the sample. However, in comparison with system call fast path length, we still get much more precise rsp values than without the patch. Note that we use statically allocated pt_regs inside the sample data when we need to change it. In other cases we still use the pt_regs pointer.
Example of syscall fast path dwarf backtrace unwind: (perf record -e cycles -g dwarf ls; perf report --stdio) Before the patch applied: --23.76%-- preempt_schedule_irq retint_kernel tty_ldisc_deref tty_write vfs_write sys_write system_call_fastpath __GI___libc_write 0x6 With the patch applied: --12.37%-- finish_task_switch __schedule preempt_schedule queue_work schedule_work tty_flip_buffer_push pty_write n_tty_write tty_write vfs_write sys_write system_call_fastpath __GI___libc_write _IO_file_write@@GLIBC_2.2.5 new_do_write _IO_do_write@@GLIBC_2.2.5 _IO_file_overflow@@GLIBC_2.2.5 print_current_files main __libc_start_main _start Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Olsa <jolsa@redhat.com> --- arch/x86/kernel/cpu/perf_event.c | 37 +++++++++++++++++++++++++++++++++++++ include/linux/perf_event.h | 1 + kernel/events/core.c | 10 ++++++++-- 3 files changed, 46 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 915b876..834fe96 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -34,6 +34,7 @@ #include <asm/timer.h> #include <asm/desc.h> #include <asm/ldt.h> +#include <asm/syscall.h> #include "perf_event.h" @@ -1699,6 +1700,42 @@ void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now) userpg->time_offset = this_cpu_read(cyc2ns_offset) - now; } +#ifdef CONFIG_X86_64 +void arch_sample_regs_user_fixup(struct perf_regs_user *uregs, int kernel) +{ + /* + * If the perf event was triggered within the kernel code + * path, then it was either syscall or interrupt. While + * interrupt stores almost all user registers, the syscall + * fast path does not. 
At this point we can at least set + * rsp register right, which is crucial for dwarf unwind. + * + * The syscall_get_nr function returns -1 (orig_ax) for + * interrupt, and positive value for syscall. + * + * We have two race windows in here: + * + * 1) Few instructions from syscall entry until old_rsp is + * set. + * + * 2) In syscall/interrupt path from entry until the orig_ax + * is set. + * + * Above described race windows are fractional opposed to + * the syscall fast path, so we get much better results + * fixing rsp this way. + */ + if (kernel && (syscall_get_nr(current, uregs->regs) >= 0)) { + /* Make a copy and link it to regs pointer. */ + memcpy(&uregs->regs_copy, uregs->regs, sizeof(*uregs->regs)); + uregs->regs = &uregs->regs_copy; + + /* And fix the rsp. */ + uregs->regs->sp = this_cpu_read(old_rsp); + } +} +#endif + /* * callchain support */ diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 599afc4..817e192 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -697,6 +697,7 @@ struct perf_branch_stack { struct perf_regs_user { __u64 abi; struct pt_regs *regs; + struct pt_regs regs_copy; }; struct task_struct; diff --git a/kernel/events/core.c b/kernel/events/core.c index 7b9df35..71329d6 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3780,10 +3780,15 @@ perf_output_sample_regs(struct perf_output_handle *handle, } } +__weak void +arch_sample_regs_user_fixup(struct perf_regs_user *regs_user, int kernel) { } + static void perf_sample_regs_user(struct perf_regs_user *regs_user, - struct pt_regs *regs) + struct pt_regs *regs) { - if (!user_mode(regs)) { + int kernel = !user_mode(regs); + + if (kernel) { if (current->mm) regs = task_pt_regs(current); else @@ -3793,6 +3798,7 @@ static void perf_sample_regs_user(struct perf_regs_user *regs_user, if (regs) { regs_user->regs = regs; regs_user->abi = perf_reg_abi(current); + arch_sample_regs_user_fixup(regs_user, kernel); } } -- 1.7.7.6 ^ 
permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] perf: Simplify the sample's user regs/stack retrieval 2012-10-04 10:38 ` [PATH 0/2] perf: x86_64 rsp related changes Jiri Olsa 2012-10-04 10:38 ` [PATCH 1/2] perf x86_64: Fix rsp register for system call fast path Jiri Olsa @ 2012-10-04 10:38 ` Jiri Olsa 1 sibling, 0 replies; 14+ messages in thread From: Jiri Olsa @ 2012-10-04 10:38 UTC (permalink / raw) To: linux-kernel Cc: Frederic Weisbecker, Ingo Molnar, Paul Mackerras, Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa The perf_sample_regs_user function now returns result of the user register retrieval, which simplifies surrounding code a bit. The perf_output_sample_ustack now combines initial dump size output for both (stack dump un/available) cases. Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Jiri Olsa <jolsa@redhat.com> --- kernel/events/core.c | 64 ++++++++++++++++++++++++------------------------- 1 files changed, 31 insertions(+), 33 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 71329d6..e392dcb 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3783,11 +3783,15 @@ perf_output_sample_regs(struct perf_output_handle *handle, __weak void arch_sample_regs_user_fixup(struct perf_regs_user *regs_user, int kernel) { } -static void perf_sample_regs_user(struct perf_regs_user *regs_user, +static int perf_sample_regs_user(struct perf_regs_user *regs_user, struct pt_regs *regs) { int kernel = !user_mode(regs); + /* Already defined. */ + if (regs_user->abi) + return 1; + if (kernel) { if (current->mm) regs = task_pt_regs(current); @@ -3800,6 +3804,8 @@ static void perf_sample_regs_user(struct perf_regs_user *regs_user, regs_user->abi = perf_reg_abi(current); arch_sample_regs_user_fixup(regs_user, kernel); } + + return regs ? 
1 : 0; } /* @@ -3825,10 +3831,6 @@ perf_sample_ustack_size(u16 stack_size, u16 header_size, { u64 task_size; - /* No regs, no stack pointer, no dump. */ - if (!regs) - return 0; - /* * Check if we fit in with the requested stack size into the: * - TASK_SIZE @@ -3862,29 +3864,26 @@ static void perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size, struct pt_regs *regs) { - /* Case of a kernel thread, nothing to dump */ - if (!regs) { - u64 size = 0; - perf_output_put(handle, size); - } else { + /* + * We dump: + * static size + * - the size requested by user or the best one we can fit + * in to the sample max size + * - zero (and final data) if there's nothing to dump + * data + * - user stack dump data + * dynamic size + * - the actual dumped size + */ + + /* Static size. */ + perf_output_put(handle, dump_size); + + if (dump_size) { unsigned long sp; unsigned int rem; u64 dyn_size; - /* - * We dump: - * static size - * - the size requested by user or the best one we can fit - * in to the sample max size - * data - * - user stack dump data - * dynamic size - * - the actual dumped size - */ - - /* Static size. */ - perf_output_put(handle, dump_size); - /* Data. */ sp = perf_user_stack_pointer(regs); rem = __output_copy_user(handle, (void *) sp, dump_size); @@ -4235,9 +4234,7 @@ void perf_prepare_sample(struct perf_event_header *header, /* regs dump ABI info */ int size = sizeof(u64); - perf_sample_regs_user(&data->regs_user, regs); - - if (data->regs_user.regs) { + if (perf_sample_regs_user(&data->regs_user, regs)) { u64 mask = event->attr.sample_regs_user; size += hweight64(mask) * sizeof(u64); } @@ -4253,14 +4250,15 @@ void perf_prepare_sample(struct perf_event_header *header, * up the rest of the sample size. 
*/ struct perf_regs_user *uregs = &data->regs_user; - u16 stack_size = event->attr.sample_stack_user; + u64 sample_stack_user = event->attr.sample_stack_user; + u16 stack_size = 0; u16 size = sizeof(u64); - if (!uregs->abi) - perf_sample_regs_user(uregs, regs); - - stack_size = perf_sample_ustack_size(stack_size, header->size, - uregs->regs); + if (perf_sample_regs_user(uregs, regs)) { + stack_size = perf_sample_ustack_size(sample_stack_user, + header->size, + uregs->regs); + } /* * If there is something to dump, add space for the dump -- 1.7.7.6 ^ permalink raw reply related [flat|nested] 14+ messages in thread