linux-kernel.vger.kernel.org archive mirror
* [RFC] Full syscall argument decode in "perf trace"
@ 2013-09-17 15:10 Denys Vlasenko
  2013-09-17 17:52 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Denys Vlasenko @ 2013-09-17 15:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Tom Zanussi, Steven Rostedt,
	Ingo Molnar, Jiri Olsa, Masami Hiramatsu, Oleg Nesterov,
	linux-kernel

Hi,

I'm trying to figure out how to extend "perf trace".

Currently, it shows syscall names and arguments, and only those.
That means syscalls such as open(2) are shown as:

    open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3

The problem, of course, is that the user wants to see the filename
itself, not the address of its first byte.

To improve that, we need to fetch the pointed-to data.
There are two approaches to this: extending the
"raw_syscalls:sys_{enter,exit}" tracepoints so that they return this data,
or selectively stopping the traced process when it reaches the tracepoint.


The first solution is attractive performance-wise, but requires a lot
of new code: for *ALL* syscalls we will need to know which arguments are
pointers, how large their pointed-to data structures are, and (remember
readv and friends!) some of the pointed-to structures themselves
contain pointers which reference even more data.

If we want to go this way, do we want to encode all this knowledge in the
kernel? If yes, how? If no, in what form would userspace (perf trace)
configure the tracepoint wrt which syscalls' arguments to copy to the trace buffer?


The second solution is to pause the traced process, let "perf trace" fetch
its data (e.g. via process_vm_readv(2)), and unpause it.

The dead-simple approach ("pause on every sys_{enter,exit}") would be
no faster than strace. To make any sense, the pausing needs, at a minimum,
to be conditional: there is no need to stop on syscalls which do not
have indirect data (e.g. close(2), dup2(2), ...).

Optimizing further, we can choose a few typical syscalls such as [f]stat(2)
and write(2), and apply solution #1 ("dump data to the trace buffer and don't
pause") to them.
For example, fstat(fd, &statbuf) does not need to stop on sys_enter at all,
and on exit it only needs to copy a fixed number of bytes of statbuf to the
trace buffer, avoiding the need to pause.

If we want to go this way, how do you guys think this should be implemented?

IIUC tracepoints weren't meant to be able to influence execution,
so "pause the current process when the tracepoint
is triggered" is a new feature. Does it look acceptable?
How to go about implementing it? Something like an ad-hoc extension field in
struct perf_event_attr to enable it?
Specifically, a new field or flag could enable this:
perf_event_open -> perf_event_alloc(... overflow_handler_which_conditionally_stops_current ...)

The "pausing" - what should it be, exactly? In ancient times, strace
chose to simply use SIGSTOP for similar needs, and it ended up interfering
with tracing of real SIGSTOPs. I guess we don't want to repeat that. Then
how? More specifically: when "perf trace" reads the trace buffer and sees
"process FOO paused in sys_exit from readv", how should it kick
process FOO to unpause it?

**end of brain dump**

Comments? Suggestions?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-17 15:10 Denys Vlasenko
@ 2013-09-17 17:52 ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-09-17 17:52 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tom Zanussi, Steven Rostedt, Ingo Molnar, Jiri Olsa,
	Masami Hiramatsu, Oleg Nesterov, linux-kernel

Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
> I'm trying to figure out how to extend "perf trace".
 
> Currently, it shows syscall names and arguments, and only them.
> Meaning that syscalls such as open(2) are shown as:
 
>     open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
 
> The problem is, of course, that user wants to see the filename
> per se, not the address of its first byte.
 
> To improve that, we need to fetch the pointed-to data.
> There are two approaches to this: extending
> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
> or selectively stopping the traced process when it reaches the tracepoint.

We don't want to stop the process at all, this is one of the major
advantages of 'perf trace' over 'strace'.

Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
tracepoint to vfs_getname to use that.
 
> First solution is attractive performance-wise, but requires a lot
> of new code: *ALL* syscalls will need to know which arguments are pointers,
> how large their pointed-to data structures are, and (remember
> readv and friends!) some of pointed-to structures themselves
> contain pointers which reference even more data.

Well, we can look at DWARF to get the function signatures and types,
librarize 'perf probe', and insert probes in the syscalls we want
decoded.

That is for the cases where we don't have a tracepoint or where adding a new
tracepoint is not an option.

And this all with what we have in the kernel right now.

Also, for 'perf trace', look at my perf/core branch, where we have more
syscall arg beautifiers and the machinery that is being put in place to
allow that.

Longer term we could have something like DTrace's CTF: a more
compact, type-only ELF section that always goes with the kernel, like the
CFI we have in binaries these days.
 
- Arnaldo


* Re: [RFC] Full syscall argument decode in "perf trace"
@ 2013-09-17 19:06 Arnaldo Carvalho de Melo
  2013-09-18 11:35 ` Denys Vlasenko
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-09-17 19:06 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tom Zanussi, Steven Rostedt, Ingo Molnar, Jiri Olsa,
	Masami Hiramatsu, Oleg Nesterov, linux-kernel

Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
> I'm trying to figure out how to extend "perf trace".
 
> Currently, it shows syscall names and arguments, and only them.
> Meaning that syscalls such as open(2) are shown as:
 
>     open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
 
> The problem is, of course, that user wants to see the filename
> per se, not the address of its first byte.
 
> To improve that, we need to fetch the pointed-to data.
> There are two approaches to this: extending
> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
> or selectively stopping the traced process when it reaches the tracepoint.

We don't want to stop the process at all, this is one of the major
advantages of 'perf trace' over 'strace'.

Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
tracepoint to vfs_getname to use that.
 
> First solution is attractive performance-wise, but requires a lot
> of new code: *ALL* syscalls will need to know which arguments are pointers,
> how large their pointed-to data structures are, and (remember
> readv and friends!) some of pointed-to structures themselves
> contain pointers which reference even more data.

Well, we can look at DWARF to get the function signatures and types,
librarize 'perf probe', and insert probes in the syscalls we want
decoded.

That is for the cases where we don't have a tracepoint or where adding a new
tracepoint is not an option.

And this all with what we have in the kernel right now.

Also, for 'perf trace', look at my perf/core branch, where we have more
syscall arg beautifiers and the machinery that is being put in place to
allow that.

Longer term we could have something like DTrace's CTF: a more
compact, type-only ELF section that always goes with the kernel, like the
CFI we have in binaries these days.
 
- Arnaldo

----- End forwarded message -----


* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-17 19:06 [RFC] Full syscall argument decode in "perf trace" Arnaldo Carvalho de Melo
@ 2013-09-18 11:35 ` Denys Vlasenko
  2013-09-18 12:46   ` David Ahern
  2013-09-18 14:33   ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 9+ messages in thread
From: Denys Vlasenko @ 2013-09-18 11:35 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Tom Zanussi, Steven Rostedt, Ingo Molnar, Jiri Olsa,
	Masami Hiramatsu, Oleg Nesterov, linux-kernel, Denys Vlasenko

On 09/17/2013 09:06 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
>> I'm trying to figure out how to extend "perf trace".
>  
>> Currently, it shows syscall names and arguments, and only them.
>> Meaning that syscalls such as open(2) are shown as:
>  
>>     open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
>  
>> The problem is, of course, that user wants to see the filename
>> per se, not the address of its first byte.
>  
>> To improve that, we need to fetch the pointed-to data.
>> There are two approaches to this: extending
>> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
>> or selectively stopping the traced process when it reaches the tracepoint.
> 
> We don't want to stop the process at all, this is one of the major
> advantages of 'perf trace' over 'strace'.

This is a worthy goal. strace is so slow exactly because it stops
the traced process so often; strace developers do want to avoid
as many of these stops as possible.

I'm not sure that "not stopping ever" is achievable, though.
There are cases where stopping is necessary.

For example, after a clone() call, depending on the tracer's needs,
there may be operations which must be done on the new child
before it is allowed to run.

strace used to use hideous, unsafe workarounds to catch children,
until ptrace was augmented with features which made children stop
immediately.

Do you think you can work around that? I just don't see how.

> Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
> tracepoint to vfs_getname to use that.

I know that this is the way to fetch syscall args without stopping,
yes.

The problem: ~100 more tracepoints need to be added merely to get
to the point where strace already is, wrt quality of syscall decoding.
strace has nearly 300 separate custom syscall formatting functions,
some of them quite complex.

If we add the syscall-stopping feature (which, as I said above,
will be necessary anyway IMO), then syscall decoding can be as good
as strace's *already*. Then more tracepoints can be added gradually
to make it faster.

I am thinking about going in this direction.

Therefore my question should be restated as:

Would perf developers accept the "syscall pausing" feature,
or would it be rejected?

-- 
vda



* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-18 11:35 ` Denys Vlasenko
@ 2013-09-18 12:46   ` David Ahern
  2013-09-18 13:35     ` Ingo Molnar
  2013-09-18 14:33   ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 9+ messages in thread
From: David Ahern @ 2013-09-18 12:46 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Arnaldo Carvalho de Melo, Tom Zanussi, Steven Rostedt,
	Ingo Molnar, Jiri Olsa, Masami Hiramatsu, Oleg Nesterov,
	linux-kernel, Denys Vlasenko

On 9/18/13 5:35 AM, Denys Vlasenko wrote:
> Therefore my question should be restated as:
>
> Would perf developers accept the "syscall pausing" feature,
> or it won't be accepted?

I have been using perf-trace a lot lately specifically because it is 
effectively a 'passive' observer of the task (e.g., time-sensitive tasks 
can be traced with perf but not with strace).

Also, your solution would not work if the raw_syscall events are written 
to a file for later analysis, whereas using tracepoints to collect this 
information would.

David


* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-18 12:46   ` David Ahern
@ 2013-09-18 13:35     ` Ingo Molnar
  0 siblings, 0 replies; 9+ messages in thread
From: Ingo Molnar @ 2013-09-18 13:35 UTC (permalink / raw)
  To: David Ahern
  Cc: Denys Vlasenko, Arnaldo Carvalho de Melo, Tom Zanussi,
	Steven Rostedt, Ingo Molnar, Jiri Olsa, Masami Hiramatsu,
	Oleg Nesterov, linux-kernel, Denys Vlasenko


* David Ahern <dsahern@gmail.com> wrote:

> On 9/18/13 5:35 AM, Denys Vlasenko wrote:
> >Therefore my question should be restated as:
> >
> >Would perf developers accept the "syscall pausing" feature,
> >or it won't be accepted?
> 
> I have been using perf-trace a lot lately specifically because it is 
> effectively a 'passive' observer of the task (e.g., time-sensitive tasks 
> can be traced with perf but not with strace).

Yes, this is not just an important but a primary design goal for all perf 
utilities: if the tracing buffers used are large enough then it should be 
a nearly zero-overhead observer that preserves all previous patterns of 
behavior as much as possible.

Thanks,

	Ingo


* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-18 11:35 ` Denys Vlasenko
  2013-09-18 12:46   ` David Ahern
@ 2013-09-18 14:33   ` Arnaldo Carvalho de Melo
  2013-09-26  7:41     ` Denys Vlasenko
  1 sibling, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-09-18 14:33 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tom Zanussi, Steven Rostedt, Ingo Molnar, Jiri Olsa,
	Masami Hiramatsu, Oleg Nesterov, linux-kernel, Denys Vlasenko

Em Wed, Sep 18, 2013 at 01:35:13PM +0200, Denys Vlasenko escreveu:
> On 09/17/2013 09:06 PM, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
> >> I'm trying to figure out how to extend "perf trace".
> >  
> >> Currently, it shows syscall names and arguments, and only them.
> >> Meaning that syscalls such as open(2) are shown as:
> >  
> >>     open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
> >  
> >> The problem is, of course, that user wants to see the filename
> >> per se, not the address of its first byte.
> >  
> >> To improve that, we need to fetch the pointed-to data.
> >> There are two approaches to this: extending
> >> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
> >> or selectively stopping the traced process when it reaches the tracepoint.
> > 
> > We don't want to stop the process at all, this is one of the major
> > advantages of 'perf trace' over 'strace'.
> 
> This is a worthy goal. strace is so slow exactly because it stops
> traced process so often. strace developers do want to avoid
> as many of these stops as possible.
> 
> I'm not sure that "not stopping ever" is achievable, though.
> There are cases where stopping is necessary.

Can't we first try to achieve what is possible with existing
infrastructure, so that out of the combo of 'perf trace' and
'strace' we have something that is better than plain 'strace'?

> For example, after clone() call, depending on the tracer needs,
> there may be operations which must be done on the new child
> before it is allowed to run.
> 
> strace used to use hideous, unsafe workarounds to catch children,
> until ptrace was augmented with features which made children stop
> immediately.
> 
> Do you think you can work around that? I just don't see how.

I haven't even thought about it 8-)
 
> > Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
> > tracepoint to vfs_getname to use that.
> 
> I know that this is the way how to fetch syscall args without stopping,
> yes.
> 
> The problem: ~100 more tracepoints need to be added merely to get
> to the point where strace already is, wrt quality of syscall decoding.
> strace has nearly 300 separate custom syscall formatting functions,
> some of them quite complex.
> 
> If we need to add syscall stopping feature (which, as I said above,
> will be necessary anyway IMO), then syscall decoding can be as good
> as strace *already*. Then, gradually more tracepoints are added
> to make it faster.
> 
> I am thinking about going into this direction.
> 
> Therefore my question should be restated as:
> 
> Would perf developers accept the "syscall pausing" feature,
> or it won't be accepted?

Do you have some patch for us to try?

- Arnaldo


* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-18 14:33   ` Arnaldo Carvalho de Melo
@ 2013-09-26  7:41     ` Denys Vlasenko
  2013-09-30 11:33       ` Denys Vlasenko
  0 siblings, 1 reply; 9+ messages in thread
From: Denys Vlasenko @ 2013-09-26  7:41 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Denys Vlasenko, Tom Zanussi, Steven Rostedt, Ingo Molnar,
	Jiri Olsa, Masami Hiramatsu, Oleg Nesterov,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1686 bytes --]

On Wed, Sep 18, 2013 at 4:33 PM, Arnaldo Carvalho de Melo
<acme@redhat.com> wrote:
>> > Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
>> > tracepoint to vfs_getname to use that.
>>
>> I know that this is the way how to fetch syscall args without stopping,
>> yes.
>>
>> The problem: ~100 more tracepoints need to be added merely to get
>> to the point where strace already is, wrt quality of syscall decoding.
>> strace has nearly 300 separate custom syscall formatting functions,
>> some of them quite complex.
>>
>> If we need to add syscall stopping feature (which, as I said above,
>> will be necessary anyway IMO), then syscall decoding can be as good
>> as strace *already*. Then, gradually more tracepoints are added
>> to make it faster.
>>
>> I am thinking about going into this direction.
>>
>> Therefore my question should be restated as:
>>
>> Would perf developers accept the "syscall pausing" feature,
>> or it won't be accepted?
>
> Do you have some patch for us to try?

I have a patch which is a bit strace-specific: it sidesteps
the question of synchronization between the traced process
and its tracer by using ptrace's existing method of reporting stops.

This works for strace and is very easy to implement.
Naturally, other tracers (e.g. "perf trace") wouldn't
want to start using ptrace! Synchronization needs
to be done in some other way, not as a ptrace stop.

For one, the stopping flag needs to be a counter, so that
more than one tracer can use this feature concurrently.

But anyway, I am attaching it.

It adds a new flag, attr.sysexit_stop, which makes the process stop
at the next syscall exit when this tracepoint overflows.

-- 
vda

[-- Attachment #2: perf_trace_stop_RFC.diff --]
[-- Type: application/octet-stream, Size: 1255 bytes --]

diff -urp linux-3.10.11-100.fc18.x86_64.ORG/include/uapi/linux/perf_event.h linux-3.10.11-100.fc18.x86_64.clean2/include/uapi/linux/perf_event.h
--- linux-3.10.11-100.fc18.x86_64.ORG/include/uapi/linux/perf_event.h	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64.clean2/include/uapi/linux/perf_event.h	2013-09-25 23:41:31.265576897 +0200
@@ -273,7 +273,9 @@ struct perf_event_attr {
 				exclude_callchain_kernel : 1, /* exclude kernel callchains */
 				exclude_callchain_user   : 1, /* exclude user callchains */
 
-				__reserved_1   : 41;
+				sysexit_stop   : 1,
+
+				__reserved_1   : 40;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/kernel/events/core.c linux-3.10.11-100.fc18.x86_64.clean2/kernel/events/core.c
--- linux-3.10.11-100.fc18.x86_64.ORG/kernel/events/core.c	2013-09-23 12:03:25.719253908 +0200
+++ linux-3.10.11-100.fc18.x86_64.clean2/kernel/events/core.c	2013-09-25 23:43:00.376621584 +0200
@@ -5026,6 +5026,10 @@ static int __perf_event_overflow(struct
 		irq_work_queue(&event->pending);
 	}
 
+	if (!in_interrupt() && event->attr.sysexit_stop && current->ptrace) {
+		set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
+	}
+
 	return ret;
 }
 


* Re: [RFC] Full syscall argument decode in "perf trace"
  2013-09-26  7:41     ` Denys Vlasenko
@ 2013-09-30 11:33       ` Denys Vlasenko
  0 siblings, 0 replies; 9+ messages in thread
From: Denys Vlasenko @ 2013-09-30 11:33 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Denys Vlasenko, Tom Zanussi, Steven Rostedt, Ingo Molnar,
	Jiri Olsa, Masami Hiramatsu, Oleg Nesterov,
	Linux Kernel Mailing List, Jiri Moskovcak

[-- Attachment #1: Type: text/plain, Size: 3388 bytes --]

On Thu, Sep 26, 2013 at 9:41 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Wed, Sep 18, 2013 at 4:33 PM, Arnaldo Carvalho de Melo
> <acme@redhat.com> wrote:
>>> The problem: ~100 more tracepoints need to be added merely to get
>>> to the point where strace already is, wrt quality of syscall decoding.
>>> strace has nearly 300 separate custom syscall formatting functions,
>>> some of them quite complex.
>>>
>>> If we need to add syscall stopping feature (which, as I said above,
>>> will be necessary anyway IMO), then syscall decoding can be as good
>>> as strace *already*. Then, gradually more tracepoints are added
>>> to make it faster.
>>>
>>> I am thinking about going into this direction.
>>>
>>> Therefore my question should be restated as:
>>>
>>> Would perf developers accept the "syscall pausing" feature,
>>> or it won't be accepted?
>>
>> Do you have some patch for us to try?
>
> I have a patch which is a bit strace specific: it sidesteps
> the question of the synchronization between traced process
> and its tracer by using ptrace's existing method of reporting stops.
>
> This works for strace, and is very easy to implement.
> Naturally, other tracers (e.g. "perf trace") wouldn't
> want to start using ptrace! Synchronization needs
> to be done in some other way, not as a ptrace stop.
>
> For one, the stopping flag needs to be a counter, so that
> more than one tracer can use this feature concurrently.
>
> But anyway, I am attaching it.
>
> It adds a new flag, attr.sysexit_stop, which makes process stop
> at next syscall exit when this tracepoint overflows.

Here is the next iteration of the work in progress.

I added syscall masks.
This necessitated propagating a pointer to the struct pt_regs
holding the userspace registers from the sys_{enter,exit}
tracepoints down to the overflow handling functions, in order to get the syscall#.
(Yes, I discovered that the pt_regs which was already there wasn't
the *userspace* one.)

The patch is tested: I have a modified version of strace
which decodes all syscalls properly and which avoids stopping
on all syscall entries, and on a selected few syscall exits too.

As I see it, the next thing to tackle is the stopping method.
(The current patch still uses my old ptrace-specific hack).

How about the following: add a per-task "pause counter".
If it is <= 0, the task is not paused. If it is > 0, the task is paused.

When an attached perf fd causes the task to pause, the counter
is incremented, a marker is written into the perf buffer,
and the task goes to sleep.

When the tracer process sees the marker, it commands the traced
process to "unpause", which decrements the counter.

Why this way?
* this allows the traced process to be paused by several tracers
at once.
* this does not need heavy-weight notifications to be sent
to tracers (unlike my current hack, which invokes the
waitpid notification machinery, the source of much of strace's
slowness).
* it might work even if the counter increment is reordered
relative to the perf marker write: if the tracer sees the marker,
it can "unpause" - decrement the counter and make it go to -1.
The task is not paused (the rule is "<= 0", not "== 0").
Then the kernel increments the counter, it's 0 now,
and the task is still not paused. (I'm not sure whether
this property is useful, but if it is, we have it - good :)

The downside is that we'd need one new field in the task struct.

Does this look sensible to you?

[-- Attachment #2: perf_trace_stop_RFC_v2.diff --]
[-- Type: application/octet-stream, Size: 25871 bytes --]

diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/alpha/kernel/perf_event.c linux-3.10.11-100.fc18.x86_64/arch/alpha/kernel/perf_event.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/alpha/kernel/perf_event.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/alpha/kernel/perf_event.c	2013-09-30 11:19:04.290849329 +0200
@@ -850,7 +850,7 @@ static void alpha_perf_event_irq_handler
 	perf_sample_data_init(&data, 0, hwc->last_period);
 
 	if (alpha_perf_event_set_period(event, hwc, idx)) {
-		if (perf_event_overflow(event, &data, regs)) {
+		if (perf_event_overflow(event, &data, regs, NULL)) {
 			/* Interrupts coming too quickly; "throttle" the
 			 * counter, i.e., disable it for a little while.
 			 */
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/arm/kernel/perf_event_v6.c linux-3.10.11-100.fc18.x86_64/arch/arm/kernel/perf_event_v6.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/arm/kernel/perf_event_v6.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/arm/kernel/perf_event_v6.c	2013-09-30 11:19:04.291849332 +0200
@@ -514,7 +514,7 @@ armv6pmu_handle_irq(int irq_num,
 		if (!armpmu_event_set_period(event))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			cpu_pmu->disable(event);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/arm/kernel/perf_event_v7.c linux-3.10.11-100.fc18.x86_64/arch/arm/kernel/perf_event_v7.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/arm/kernel/perf_event_v7.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/arm/kernel/perf_event_v7.c	2013-09-30 11:19:04.293849338 +0200
@@ -1074,7 +1074,7 @@ static irqreturn_t armv7pmu_handle_irq(i
 		if (!armpmu_event_set_period(event))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			cpu_pmu->disable(event);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/arm/kernel/perf_event_xscale.c linux-3.10.11-100.fc18.x86_64/arch/arm/kernel/perf_event_xscale.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/arm/kernel/perf_event_xscale.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/arm/kernel/perf_event_xscale.c	2013-09-30 11:19:04.294849341 +0200
@@ -265,7 +265,7 @@ xscale1pmu_handle_irq(int irq_num, void
 		if (!armpmu_event_set_period(event))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			cpu_pmu->disable(event);
 	}
 
@@ -606,7 +606,7 @@ xscale2pmu_handle_irq(int irq_num, void
 		if (!armpmu_event_set_period(event))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			cpu_pmu->disable(event);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/arm64/kernel/perf_event.c linux-3.10.11-100.fc18.x86_64/arch/arm64/kernel/perf_event.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/arm64/kernel/perf_event.c	2013-09-23 12:03:25.604253957 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/arm64/kernel/perf_event.c	2013-09-30 11:19:04.295849344 +0200
@@ -1063,7 +1063,7 @@ static irqreturn_t armv8pmu_handle_irq(i
 		if (!armpmu_event_set_period(event, hwc, idx))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			cpu_pmu->disable(hwc, idx);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/metag/kernel/perf/perf_event.c linux-3.10.11-100.fc18.x86_64/arch/metag/kernel/perf/perf_event.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/metag/kernel/perf/perf_event.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/metag/kernel/perf/perf_event.c	2013-09-30 11:19:04.296849347 +0200
@@ -789,7 +789,7 @@ static irqreturn_t metag_pmu_counter_ove
 	 * completed. Note the counter value may have been modified while it was
 	 * inactive to set it up ready for the next interrupt.
 	 */
-	if (!perf_event_overflow(event, &sampledata, regs)) {
+	if (!perf_event_overflow(event, &sampledata, regs, NULL)) {
 		__global_lock2(flags);
 		counter = (counter & 0xff000000) |
 			  (metag_in32(PERF_COUNT(idx)) & 0x00ffffff);
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/mips/kernel/perf_event_mipsxx.c linux-3.10.11-100.fc18.x86_64/arch/mips/kernel/perf_event_mipsxx.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/mips/kernel/perf_event_mipsxx.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/mips/kernel/perf_event_mipsxx.c	2013-09-30 11:19:04.297849351 +0200
@@ -746,7 +746,7 @@ static void handle_associated_event(stru
 	if (!mipspmu_event_set_period(event, hwc, idx))
 		return;
 
-	if (perf_event_overflow(event, data, regs))
+	if (perf_event_overflow(event, data, regs, NULL))
 		mipsxx_pmu_disable_event(idx);
 }
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/powerpc/perf/core-book3s.c linux-3.10.11-100.fc18.x86_64/arch/powerpc/perf/core-book3s.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/powerpc/perf/core-book3s.c	2013-09-23 12:03:25.610253955 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/powerpc/perf/core-book3s.c	2013-09-30 11:19:04.297849351 +0200
@@ -1639,7 +1639,7 @@ static void record_and_restart(struct pe
 			data.br_stack = &cpuhw->bhrb_stack;
 		}
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			power_pmu_stop(event, 0);
 	}
 }
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/powerpc/perf/core-fsl-emb.c linux-3.10.11-100.fc18.x86_64/arch/powerpc/perf/core-fsl-emb.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/powerpc/perf/core-fsl-emb.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/powerpc/perf/core-fsl-emb.c	2013-09-30 11:19:04.297849351 +0200
@@ -615,7 +615,7 @@ static void record_and_restart(struct pe
 
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			fsl_emb_pmu_stop(event, 0);
 	}
 }
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/sparc/kernel/perf_event.c linux-3.10.11-100.fc18.x86_64/arch/sparc/kernel/perf_event.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/sparc/kernel/perf_event.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/sparc/kernel/perf_event.c	2013-09-30 11:19:04.297849351 +0200
@@ -1633,7 +1633,7 @@ static int __kprobes perf_event_nmi_hand
 		if (!sparc_perf_event_set_period(event, hwc, idx))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			sparc_pmu_stop(event, 0);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_amd_ibs.c linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_amd_ibs.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_amd_ibs.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_amd_ibs.c	2013-09-30 11:19:04.298849354 +0200
@@ -580,7 +580,7 @@ static int perf_ibs_handle_irq(struct pe
 		data.raw = &raw;
 	}
 
-	throttle = perf_event_overflow(event, &data, &regs);
+	throttle = perf_event_overflow(event, &data, &regs, NULL);
 out:
 	if (throttle)
 		perf_ibs_disable_event(perf_ibs, hwc, *config);
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event.c linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event.c	2013-09-30 11:19:04.298849354 +0200
@@ -1225,7 +1225,7 @@ int x86_pmu_handle_irq(struct pt_regs *r
 		if (!x86_perf_event_set_period(event))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			x86_pmu_stop(event, 0);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_intel.c linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_intel.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_intel.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_intel.c	2013-09-30 11:19:04.298849354 +0200
@@ -1222,7 +1222,7 @@ again:
 		if (has_branch_stack(event))
 			data.br_stack = &cpuc->lbr_stack;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			x86_pmu_stop(event, 0);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_intel_ds.c linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_intel_ds.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_intel_ds.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_intel_ds.c	2013-09-30 11:19:04.298849354 +0200
@@ -761,7 +761,7 @@ static void __intel_pmu_pebs_event(struc
 	if (has_branch_stack(event))
 		data.br_stack = &cpuc->lbr_stack;
 
-	if (perf_event_overflow(event, &data, &regs))
+	if (perf_event_overflow(event, &data, &regs, NULL))
 		x86_pmu_stop(event, 0);
 }
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_knc.c linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_knc.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_knc.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_knc.c	2013-09-30 11:19:04.299849357 +0200
@@ -251,7 +251,7 @@ again:
 
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			x86_pmu_stop(event, 0);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_p4.c linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_p4.c
--- linux-3.10.11-100.fc18.x86_64.ORG/arch/x86/kernel/cpu/perf_event_p4.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/arch/x86/kernel/cpu/perf_event_p4.c	2013-09-30 11:19:04.299849357 +0200
@@ -1037,7 +1037,7 @@ static int p4_pmu_handle_irq(struct pt_r
 			continue;
 
 
-		if (perf_event_overflow(event, &data, regs))
+		if (perf_event_overflow(event, &data, regs, NULL))
 			x86_pmu_stop(event, 0);
 	}
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/include/linux/ftrace_event.h linux-3.10.11-100.fc18.x86_64/include/linux/ftrace_event.h
--- linux-3.10.11-100.fc18.x86_64.ORG/include/linux/ftrace_event.h	2013-09-23 12:03:25.714253910 +0200
+++ linux-3.10.11-100.fc18.x86_64/include/linux/ftrace_event.h	2013-09-30 11:19:04.299849357 +0200
@@ -376,10 +376,10 @@ extern void *perf_trace_buf_prepare(int
 
 static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
-		       u64 count, struct pt_regs *regs, void *head,
-		       struct task_struct *task)
+		       u64 count, struct pt_regs *regs, struct pt_regs *user_regs,
+		       void *head, struct task_struct *task)
 {
-	perf_tp_event(addr, count, raw_data, size, regs, head, rctx, task);
+	perf_tp_event(addr, count, raw_data, size, regs, user_regs, head, rctx, task);
 }
 #endif
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/include/linux/perf_event.h linux-3.10.11-100.fc18.x86_64/include/linux/perf_event.h
--- linux-3.10.11-100.fc18.x86_64.ORG/include/linux/perf_event.h	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/include/linux/perf_event.h	2013-09-30 11:19:04.299849357 +0200
@@ -602,7 +602,8 @@ extern void perf_prepare_sample(struct p
 
 extern int perf_event_overflow(struct perf_event *event,
 				 struct perf_sample_data *data,
-				 struct pt_regs *regs);
+				 struct pt_regs *regs,
+				 struct pt_regs *user_regs);
 
 static inline bool is_sampling_event(struct perf_event *event)
 {
@@ -717,7 +718,7 @@ static inline bool perf_paranoid_kernel(
 
 extern void perf_event_init(void);
 extern void perf_tp_event(u64 addr, u64 count, void *record,
-			  int entry_size, struct pt_regs *regs,
+			  int entry_size, struct pt_regs *regs, struct pt_regs *user_regs,
 			  struct hlist_head *head, int rctx,
 			  struct task_struct *task);
 extern void perf_bp_event(struct perf_event *event, void *data);
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/include/trace/events/syscalls.h linux-3.10.11-100.fc18.x86_64/include/trace/events/syscalls.h
--- linux-3.10.11-100.fc18.x86_64.ORG/include/trace/events/syscalls.h	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/include/trace/events/syscalls.h	2013-09-30 12:05:15.658006437 +0200
@@ -30,6 +30,7 @@ TRACE_EVENT_FN(sys_enter,
 	TP_fast_assign(
 		__entry->id	= id;
 		syscall_get_arguments(current, regs, 0, 6, __entry->args);
+		user_regs = regs;
 	),
 
 	TP_printk("NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)",
@@ -56,6 +57,7 @@ TRACE_EVENT_FN(sys_exit,
 	TP_fast_assign(
 		__entry->id	= syscall_get_nr(current, regs);
 		__entry->ret	= ret;
+		user_regs = regs;
 	),
 
 	TP_printk("NR %ld = %ld",
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/include/trace/ftrace.h linux-3.10.11-100.fc18.x86_64/include/trace/ftrace.h
--- linux-3.10.11-100.fc18.x86_64.ORG/include/trace/ftrace.h	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/include/trace/ftrace.h	2013-09-30 12:10:59.065011590 +0200
@@ -519,6 +519,8 @@ ftrace_raw_event_##call(void *__data, pr
 	struct ftrace_raw_##call *entry;				\
 	struct ring_buffer *buffer;					\
 	unsigned long irq_flags;					\
+	/* dummy. "assign" macro param might need it to exist: */	\
+	struct pt_regs __maybe_unused *user_regs;			\
 	int __data_size;						\
 	int pc;								\
 									\
@@ -652,6 +654,8 @@ perf_trace_##call(void *__data, proto)
 	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
 	struct ftrace_raw_##call *entry;				\
 	struct pt_regs __regs;						\
+	/* "assign" macro parameter might overwrite it: */		\
+	struct pt_regs *user_regs = NULL;				\
 	u64 __addr = 0, __count = 1;					\
 	struct task_struct *__task = NULL;				\
 	struct hlist_head *head;					\
@@ -681,7 +685,7 @@ perf_trace_##call(void *__data, proto)
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
-		__count, &__regs, head, __task);			\
+		__count, &__regs, user_regs, head, __task);		\
 }
 
 /*
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/include/uapi/linux/perf_event.h linux-3.10.11-100.fc18.x86_64/include/uapi/linux/perf_event.h
--- linux-3.10.11-100.fc18.x86_64.ORG/include/uapi/linux/perf_event.h	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/include/uapi/linux/perf_event.h	2013-09-30 11:19:04.300849360 +0200
@@ -273,7 +273,10 @@ struct perf_event_attr {
 				exclude_callchain_kernel : 1, /* exclude kernel callchains */
 				exclude_callchain_user   : 1, /* exclude user callchains */
 
-				__reserved_1   : 41;
+				sysenter_stop  : 1,
+				sysexit_stop   : 1,
+
+				__reserved_1   : 39;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -304,6 +307,15 @@ struct perf_event_attr {
 
 	/* Align to u64. */
 	__u32	__reserved_2;
+
+	/*
+	 * If sys{enter,exit}_stop should ignore some syscalls,
+	 * these bitmasks specify which to ignore. Otherwise set to 0/NULL.
+	 */
+	unsigned	sysenter_mask_len;
+	unsigned	sysexit_mask_len;
+	unsigned long	*sysenter_mask_ptr;
+	unsigned long	*sysexit_mask_ptr;
 };
 
 #define perf_flags(attr)	(*(&(attr)->read_format + 1))
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/kernel/events/core.c linux-3.10.11-100.fc18.x86_64/kernel/events/core.c
--- linux-3.10.11-100.fc18.x86_64.ORG/kernel/events/core.c	2013-09-23 12:03:25.719253908 +0200
+++ linux-3.10.11-100.fc18.x86_64/kernel/events/core.c	2013-09-30 12:11:21.929011933 +0200
@@ -43,6 +43,7 @@
 #include "internal.h"
 
 #include <asm/irq_regs.h>
+#include <asm/syscall.h>
 
 struct remote_function_call {
 	struct task_struct	*p;
@@ -2933,6 +2934,7 @@ static void free_event_rcu(struct rcu_he
 	if (event->ns)
 		put_pid_ns(event->ns);
 	perf_event_free_filter(event);
+	kfree(event->attr.sysenter_mask_ptr);
 	kfree(event);
 }
 
@@ -4964,7 +4966,8 @@ static void perf_log_throttle(struct per
 
 static int __perf_event_overflow(struct perf_event *event,
 				   int throttle, struct perf_sample_data *data,
-				   struct pt_regs *regs)
+				   struct pt_regs *regs,
+				   struct pt_regs *user_regs)
 {
 	int events = atomic_read(&event->event_limit);
 	struct hw_perf_event *hwc = &event->hw;
@@ -5026,14 +5029,35 @@ static int __perf_event_overflow(struct
 		irq_work_queue(&event->pending);
 	}
 
+	if (!in_interrupt() && event->attr.sysexit_stop && current->ptrace && user_regs) {
+		if (event->attr.sysexit_mask_len != 0) {
+			int bits;
+			int scno;
+
+			scno = syscall_get_nr(current, user_regs);
+			if (scno < 0)
+				goto stop;
+			bits = event->attr.sysexit_mask_len * 8;
+			if (scno >= bits)
+				goto stop;
+			if (!test_bit(scno, event->attr.sysexit_mask_ptr))
+				goto stop;
+			goto skip;
+		}
+ stop:
+		set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
+ skip: ;
+	}
+
 	return ret;
 }
 
 int perf_event_overflow(struct perf_event *event,
 			  struct perf_sample_data *data,
-			  struct pt_regs *regs)
+			  struct pt_regs *regs,
+			  struct pt_regs *user_regs)
 {
-	return __perf_event_overflow(event, 1, data, regs);
+	return __perf_event_overflow(event, 1, data, regs, user_regs);
 }
 
 /*
@@ -5083,7 +5107,8 @@ again:
 
 static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 				    struct perf_sample_data *data,
-				    struct pt_regs *regs)
+				    struct pt_regs *regs,
+				    struct pt_regs *user_regs)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	int throttle = 0;
@@ -5096,7 +5121,7 @@ static void perf_swevent_overflow(struct
 
 	for (; overflow; overflow--) {
 		if (__perf_event_overflow(event, throttle,
-					    data, regs)) {
+					    data, regs, user_regs)) {
 			/*
 			 * We inhibit the overflow from happening when
 			 * hwc->interrupts == MAX_INTERRUPTS.
@@ -5109,7 +5134,8 @@ static void perf_swevent_overflow(struct
 
 static void perf_swevent_event(struct perf_event *event, u64 nr,
 			       struct perf_sample_data *data,
-			       struct pt_regs *regs)
+			       struct pt_regs *regs,
+			       struct pt_regs *user_regs)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
@@ -5123,17 +5149,17 @@ static void perf_swevent_event(struct pe
 
 	if ((event->attr.sample_type & PERF_SAMPLE_PERIOD) && !event->attr.freq) {
 		data->period = nr;
-		return perf_swevent_overflow(event, 1, data, regs);
+		return perf_swevent_overflow(event, 1, data, regs, user_regs);
 	} else
 		data->period = event->hw.last_period;
 
 	if (nr == 1 && hwc->sample_period == 1 && !event->attr.freq)
-		return perf_swevent_overflow(event, 1, data, regs);
+		return perf_swevent_overflow(event, 1, data, regs, user_regs);
 
 	if (local64_add_negative(nr, &hwc->period_left))
 		return;
 
-	perf_swevent_overflow(event, 0, data, regs);
+	perf_swevent_overflow(event, 0, data, regs, user_regs);
 }
 
 static int perf_exclude_event(struct perf_event *event,
@@ -5223,7 +5249,8 @@ find_swevent_head(struct swevent_htable
 static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
 				    u64 nr,
 				    struct perf_sample_data *data,
-				    struct pt_regs *regs)
+				    struct pt_regs *regs,
+				    struct pt_regs *user_regs)
 {
 	struct swevent_htable *swhash = &__get_cpu_var(swevent_htable);
 	struct perf_event *event;
@@ -5236,7 +5263,7 @@ static void do_perf_sw_event(enum perf_t
 
 	hlist_for_each_entry_rcu(event, head, hlist_entry) {
 		if (perf_swevent_match(event, type, event_id, data, regs))
-			perf_swevent_event(event, nr, data, regs);
+			perf_swevent_event(event, nr, data, regs, user_regs);
 	}
 end:
 	rcu_read_unlock();
@@ -5269,7 +5296,7 @@ void __perf_sw_event(u32 event_id, u64 n
 
 	perf_sample_data_init(&data, addr, 0);
 
-	do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, &data, regs);
+	do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, &data, regs, NULL);
 
 	perf_swevent_put_recursion_context(rctx);
 	preempt_enable_notrace();
@@ -5514,7 +5541,8 @@ static int perf_tp_event_match(struct pe
 }
 
 void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
-		   struct pt_regs *regs, struct hlist_head *head, int rctx,
+		   struct pt_regs *regs, struct pt_regs *user_regs,
+		   struct hlist_head *head, int rctx,
 		   struct task_struct *task)
 {
 	struct perf_sample_data data;
@@ -5530,7 +5558,7 @@ void perf_tp_event(u64 addr, u64 count,
 
 	hlist_for_each_entry_rcu(event, head, hlist_entry) {
 		if (perf_tp_event_match(event, &data, regs))
-			perf_swevent_event(event, count, &data, regs);
+			perf_swevent_event(event, count, &data, regs, user_regs);
 	}
 
 	/*
@@ -5552,7 +5580,7 @@ void perf_tp_event(u64 addr, u64 count,
 			if (event->attr.config != entry->type)
 				continue;
 			if (perf_tp_event_match(event, &data, regs))
-				perf_swevent_event(event, count, &data, regs);
+				perf_swevent_event(event, count, &data, regs, user_regs);
 		}
 unlock:
 		rcu_read_unlock();
@@ -5656,7 +5684,7 @@ void perf_bp_event(struct perf_event *bp
 	perf_sample_data_init(&sample, bp->attr.bp_addr, 0);
 
 	if (!bp->hw.state && !perf_exclude_event(bp, regs))
-		perf_swevent_event(bp, 1, &sample, regs);
+		perf_swevent_event(bp, 1, &sample, regs, NULL);
 }
 #endif
 
@@ -5684,7 +5712,7 @@ static enum hrtimer_restart perf_swevent
 
 	if (regs && !perf_exclude_event(event, regs)) {
 		if (!(event->attr.exclude_idle && is_idle_task(current)))
-			if (__perf_event_overflow(event, 1, &data, regs))
+			if (__perf_event_overflow(event, 1, &data, regs, NULL))
 				ret = HRTIMER_NORESTART;
 	}
 
@@ -6469,6 +6497,34 @@ static int perf_copy_attr(struct perf_ev
 			ret = -EINVAL;
 	}
 
+	if ((attr->sysenter_mask_len | attr->sysexit_mask_len) & (sizeof(long)-1))
+		return -EINVAL;
+	size = attr->sysenter_mask_len + attr->sysexit_mask_len;
+	if (size > PAGE_SIZE)
+		return -EINVAL;
+	if (size != 0) {
+		unsigned long *kp = kzalloc(size, GFP_KERNEL);
+		if (!kp)
+			return -ENOMEM;
+
+		ret = copy_from_user(kp, (void __user *)attr->sysenter_mask_ptr, attr->sysenter_mask_len);
+		attr->sysenter_mask_ptr = kp;
+		if (!ret) {
+			kp = (void *)kp + attr->sysenter_mask_len;
+			ret = copy_from_user(kp, (void __user *)attr->sysexit_mask_ptr, attr->sysexit_mask_len);
+			attr->sysexit_mask_ptr = kp;
+		}
+		if (ret) {
+			/* copy_from_user() returns bytes left uncopied, not an errno */
+			ret = -EFAULT;
+			kfree(attr->sysenter_mask_ptr);
+			goto out;
+		}
+	} else {
+		attr->sysenter_mask_ptr = NULL;
+		attr->sysexit_mask_ptr = NULL;
+	}
+
 out:
 	return ret;
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_event_perf.c linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_event_perf.c
--- linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_event_perf.c	2013-07-01 00:13:29.000000000 +0200
+++ linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_event_perf.c	2013-09-30 11:19:04.301849363 +0200
@@ -282,7 +282,7 @@ perf_ftrace_function_call(unsigned long
 
 	head = this_cpu_ptr(event_function.perf_events);
 	perf_trace_buf_submit(entry, ENTRY_SIZE, rctx, 0,
-			      1, &regs, head, NULL);
+			      1, &regs, NULL, head, NULL);
 
 #undef ENTRY_SIZE
 }
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_kprobe.c linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_kprobe.c
--- linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_kprobe.c	2013-09-23 12:03:25.726253905 +0200
+++ linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_kprobe.c	2013-09-30 11:19:04.301849363 +0200
@@ -1193,7 +1193,7 @@ kprobe_perf_func(struct trace_probe *tp,
 
 	head = this_cpu_ptr(call->perf_events);
 	perf_trace_buf_submit(entry, size, rctx,
-					entry->ip, 1, regs, head, NULL);
+					entry->ip, 1, regs, NULL, head, NULL);
 }
 
 /* Kretprobe profile handler */
@@ -1225,7 +1225,7 @@ kretprobe_perf_func(struct trace_probe *
 
 	head = this_cpu_ptr(call->perf_events);
 	perf_trace_buf_submit(entry, size, rctx,
-					entry->ret_ip, 1, regs, head, NULL);
+					entry->ret_ip, 1, regs, NULL, head, NULL);
 }
 #endif	/* CONFIG_PERF_EVENTS */
 
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_syscalls.c linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_syscalls.c
--- linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_syscalls.c	2013-09-23 12:03:25.726253905 +0200
+++ linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_syscalls.c	2013-09-30 11:19:04.301849363 +0200
@@ -585,7 +585,7 @@ static void perf_syscall_enter(void *ign
 			       (unsigned long *)&rec->args);
 
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
-	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head, NULL);
+	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, regs, head, NULL);
 }
 
 static int perf_sysenter_enable(struct ftrace_event_call *call)
@@ -663,7 +663,7 @@ static void perf_syscall_exit(void *igno
 	rec->ret = syscall_get_return_value(current, regs);
 
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
-	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head, NULL);
+	perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, regs, head, NULL);
 }
 
 static int perf_sysexit_enable(struct ftrace_event_call *call)
diff -urp linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_uprobe.c linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_uprobe.c
--- linux-3.10.11-100.fc18.x86_64.ORG/kernel/trace/trace_uprobe.c	2013-09-23 12:03:25.727253904 +0200
+++ linux-3.10.11-100.fc18.x86_64/kernel/trace/trace_uprobe.c	2013-09-30 11:19:04.301849363 +0200
@@ -862,7 +862,7 @@ static void uprobe_perf_print(struct tra
 	for (i = 0; i < tu->nr_args; i++)
 		call_fetch(&tu->args[i].fetch, regs, data + tu->args[i].offset);
 
-	perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+	perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, NULL, head, NULL);
  out:
 	preempt_enable();
 }



Thread overview: 9+ messages
2013-09-17 19:06 [RFC] Full syscall argument decode in "perf trace" Arnaldo Carvalho de Melo
2013-09-18 11:35 ` Denys Vlasenko
2013-09-18 12:46   ` David Ahern
2013-09-18 13:35     ` Ingo Molnar
2013-09-18 14:33   ` Arnaldo Carvalho de Melo
2013-09-26  7:41     ` Denys Vlasenko
2013-09-30 11:33       ` Denys Vlasenko
  -- strict thread matches above, loose matches on Subject: below --
2013-09-17 15:10 Denys Vlasenko
2013-09-17 17:52 ` Arnaldo Carvalho de Melo
