linux-kernel.vger.kernel.org archive mirror
* Re: [RFC] Full syscall argument decode in "perf trace"
@ 2013-09-17 19:06 Arnaldo Carvalho de Melo
  2013-09-18 11:35 ` Denys Vlasenko
  0 siblings, 1 reply; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-09-17 19:06 UTC
  To: Denys Vlasenko
  Cc: Tom Zanussi, Steven Rostedt, Ingo Molnar, Jiri Olsa,
	Masami Hiramatsu, Oleg Nesterov, linux-kernel

On Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko wrote:
> I'm trying to figure out how to extend "perf trace".
 
> Currently, it shows syscall names and arguments, and only them.
> Meaning that syscalls such as open(2) are shown as:
 
>     open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
 
> The problem is, of course, that the user wants to see the filename
> per se, not the address of its first byte.
 
> To improve that, we need to fetch the pointed-to data.
> There are two approaches to this: extending
> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
> or selectively stopping the traced process when it reaches the tracepoint.

We don't want to stop the process at all; this is one of the major
advantages of 'perf trace' over 'strace'.

Look at the tmp.perf/trace2 branch in my git repo; tglx and Ingo added a
tracepoint to vfs_getname that we can use for that.
 
> The first solution is attractive performance-wise, but requires a lot
> of new code: *ALL* syscalls will need to know which arguments are pointers,
> how large their pointed-to data structures are, and (remember
> readv and friends!) some of the pointed-to structures themselves
> contain pointers which reference even more data.

Well, we can look at DWARF to get the function signatures and types,
librarize 'perf probe', and insert probes in the syscalls we want
decoded.

That is for the cases where we don't have a tracepoint or when adding a new
tracepoint is not an option.

And all of this works with what we have in the kernel right now.

Also, for 'perf trace', look at my perf/core branch, where we have more
syscall arg beautifiers and the machinery that is being put in place to
support them.
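
To illustrate what a syscall arg beautifier does, here is a minimal
sketch in C that renders open(2) flags symbolically, so that "flags: 0"
would read as O_RDONLY. The names beautify_open_flags and append_flag
are made up for this example, and the code is not taken from the perf
sources; it only covers a handful of flag bits.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Append NAME to buf, preceded by '|' unless buf is still empty. */
static void append_flag(char *buf, size_t size, const char *name)
{
	size_t len = strlen(buf);

	snprintf(buf + len, size - len, "%s%s", len ? "|" : "", name);
}

/*
 * Hypothetical beautifier: format open(2) flags as "O_WRONLY|O_CREAT".
 * buf must have room for at least one byte (size > 0).
 */
char *beautify_open_flags(int flags, char *buf, size_t size)
{
	static const struct { int bit; const char *name; } bits[] = {
		{ O_CREAT, "O_CREAT" },   { O_EXCL, "O_EXCL" },
		{ O_TRUNC, "O_TRUNC" },   { O_APPEND, "O_APPEND" },
		{ O_NONBLOCK, "O_NONBLOCK" },
	};
	char tmp[32];
	size_t i;

	buf[0] = '\0';

	/* The access mode is a small enumeration, not a bit mask. */
	switch (flags & O_ACCMODE) {
	case O_RDONLY: append_flag(buf, size, "O_RDONLY"); break;
	case O_WRONLY: append_flag(buf, size, "O_WRONLY"); break;
	case O_RDWR:   append_flag(buf, size, "O_RDWR");   break;
	default:       append_flag(buf, size, "O_ACCMODE?"); break;
	}
	flags &= ~O_ACCMODE;

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
		if (flags & bits[i].bit) {
			append_flag(buf, size, bits[i].name);
			flags &= ~bits[i].bit;
		}
	}

	/* Anything this sketch does not know about is shown in hex. */
	if (flags) {
		snprintf(tmp, sizeof(tmp), "%#x", (unsigned int)flags);
		append_flag(buf, size, tmp);
	}
	return buf;
}
```

A beautifier of this kind needs only the integer argument, so it works
without fetching any data from the traced process.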

Longer term we could have something like dtrace's CTF to have a more
compact, type-only ELF section that always goes with the kernel, like we
have CFI in binaries these days.
 
- Arnaldo

----- End forwarded message -----

* [RFC] Full syscall argument decode in "perf trace"
@ 2013-09-17 15:10 Denys Vlasenko
  2013-09-17 17:52 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Denys Vlasenko @ 2013-09-17 15:10 UTC
  To: Arnaldo Carvalho de Melo, Tom Zanussi, Steven Rostedt,
	Ingo Molnar, Jiri Olsa, Masami Hiramatsu, Oleg Nesterov,
	linux-kernel

Hi,

I'm trying to figure out how to extend "perf trace".

Currently, it shows syscall names and arguments, and only them.
Meaning that syscalls such as open(2) are shown as:

    open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3

The problem is, of course, that the user wants to see the filename
per se, not the address of its first byte.

To improve that, we need to fetch the pointed-to data.
There are two approaches to this: extending
"raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
or selectively stopping the traced process when it reaches the tracepoint.


The first solution is attractive performance-wise, but requires a lot
of new code: *ALL* syscalls will need to know which arguments are pointers,
how large their pointed-to data structures are, and (remember
readv and friends!) some of the pointed-to structures themselves
contain pointers which reference even more data.
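
To make the readv case concrete, here is a small sketch in C of the
second level of indirection. It assumes the struct iovec array has
already been copied out of the traced task and that the readv(2) return
value is known; it then works out how many bytes behind each iov_base
actually hold data and would need a second fetch. The function name
readv_fetch_plan is invented for this example.

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * readv(2) fills the buffers in array order, each one completely before
 * moving on to the next. Given the byte count it returned, store in
 * lens[i] how many bytes of iov[i].iov_base contain data.
 * Returns the number of segments that need a second-level fetch.
 */
int readv_fetch_plan(const struct iovec *iov, int iovcnt,
		     size_t ret, size_t *lens)
{
	int i, used = 0;

	for (i = 0; i < iovcnt; i++) {
		size_t n = iov[i].iov_len < ret ? iov[i].iov_len : ret;

		lens[i] = n;
		ret -= n;
		if (n)
			used++;
	}
	return used;
}
```

Note that this can only run at syscall exit, since the amount of useful
data depends on the return value.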

If we want to go this way, do we want to encode all this knowledge in the
kernel? If yes, how? If no, in what form would userspace (perf trace) tell
the tracepoint which syscalls' arguments to copy to the trace buffer?


The second solution is to pause the traced process, let "perf trace" fetch
its data (e.g. via process_vm_readv(2)), and unpause it.
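
As a rough sketch of the fetch step, the following C function copies a
NUL-terminated string (such as the open(2) filename) out of another
process with process_vm_readv(2). The name fetch_remote_string is an
assumption for this example; the caller needs ptrace-level permission
over the target, and the target must not change the string while it is
being read, which is what the pausing is for.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy a NUL-terminated string located at address `remote` in process
 * `pid` into buf. Returns the string length (truncated to size - 1),
 * or -1 with errno set on failure.
 */
ssize_t fetch_remote_string(pid_t pid, unsigned long remote,
			    char *buf, size_t size)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t done = 0;

	if (size == 0) {
		errno = EINVAL;
		return -1;
	}

	while (done + 1 < size) {
		/* Never cross a page boundary: the next page may be unmapped. */
		size_t chunk = page - (remote + done) % page;
		struct iovec local, rem;
		ssize_t n;
		char *nul;

		if (chunk > size - 1 - done)
			chunk = size - 1 - done;

		local.iov_base = buf + done;
		local.iov_len = chunk;
		rem.iov_base = (void *)(remote + done);
		rem.iov_len = chunk;

		n = process_vm_readv(pid, &local, 1, &rem, 1, 0);
		if (n <= 0)
			return -1;

		nul = memchr(buf + done, '\0', (size_t)n);
		if (nul)
			return nul - buf;
		done += (size_t)n;
	}

	buf[done] = '\0';	/* string longer than the buffer: truncate */
	return (ssize_t)done;
}
```

For the open(2) line shown earlier, the tracer would call this with the
traced pid and the raw "filename" value as the remote address.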

The dead-simple approach ("pause on every sys_{enter,exit}") would be
no faster than strace. To make any sense, as a minimum the pausing needs
to be conditional: there is no need to stop on syscalls which do not
have indirect data (e.g. close(2), dup2(2)...).

Optimizing further, we can choose a few typical syscalls such as [f]stat(2),
write(2), and apply solution #1 ("dump data to trace buffer and don't pause")
to them.
For example, fstat(fd, &statbuf) does not need to stop on sys_enter at all,
and only needs to copy a fixed number of bytes of statbuf to the trace
buffer on exit, which avoids the need to pause.
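
The conditional pausing described above could be driven by a per-syscall
policy table. Here is a minimal sketch in C; the enum, the table and the
function names are all invented for illustration, and the entries for
open and readv are my own assumptions rather than something decided in
this proposal.

```c
#include <stddef.h>
#include <string.h>

enum fetch_policy {
	FETCH_NONE,	 /* no indirect data, e.g. close(2), dup2(2) */
	FETCH_IN_KERNEL, /* tracepoint copies the data to the trace buffer */
	FETCH_PAUSE,	 /* pause the task and let the tracer read its memory */
};

struct syscall_policy {
	const char *name;
	enum fetch_policy on_enter;
	enum fetch_policy on_exit;
};

static const struct syscall_policy policies[] = {
	{ "close", FETCH_NONE,	    FETCH_NONE },
	{ "dup2",  FETCH_NONE,	    FETCH_NONE },
	{ "fstat", FETCH_NONE,	    FETCH_IN_KERNEL }, /* statbuf on exit */
	{ "write", FETCH_IN_KERNEL, FETCH_NONE },
	{ "open",  FETCH_PAUSE,	    FETCH_NONE },      /* assumption */
	{ "readv", FETCH_NONE,	    FETCH_PAUSE },     /* assumption */
};

/* Unknown syscalls get the conservative answer: pause. */
enum fetch_policy lookup_policy(const char *name, int on_exit)
{
	size_t i;

	for (i = 0; i < sizeof(policies) / sizeof(policies[0]); i++)
		if (strcmp(policies[i].name, name) == 0)
			return on_exit ? policies[i].on_exit
				       : policies[i].on_enter;
	return FETCH_PAUSE;
}

int must_pause(const char *name, int on_exit)
{
	return lookup_policy(name, on_exit) == FETCH_PAUSE;
}
```

A real table would be indexed by syscall number rather than by name;
names are used here only to keep the sketch portable across
architectures.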

If we want to go this way, how do you guys think this should be implemented?

IIUC tracepoints weren't meant to be able to influence execution, so
"pause the current process when the tracepoint is triggered" would be a
new feature. Does it look acceptable?
How should it be implemented? Something like an ad-hoc extension field in
struct perf_event_attr to enable it?
Specifically, a new field or flag could enable this:
perf_event_open -> perf_event_alloc(... overflow_handler_which_conditionally_stops_current ...)
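
For reference, this is roughly how a tracepoint event is described to
perf_event_open(2) today, as a sketch in C. The helper name
init_syscall_tracepoint_attr is invented; the tracepoint id is assumed
to have been read from the tracepoint's "id" file in tracefs (for
example events/raw_syscalls/sys_enter/id). The proposed pause-on-hit
flag does not exist, so it appears only as a comment.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Fill in attr for sampling every hit of one tracepoint. */
void init_syscall_tracepoint_attr(struct perf_event_attr *attr,
				  unsigned long long tracepoint_id)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_TRACEPOINT;
	attr->config = tracepoint_id;

	/* Raw tracepoint payload plus who hit it and when. */
	attr->sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
			    PERF_SAMPLE_RAW;
	attr->sample_period = 1;	/* record every hit */
	attr->wakeup_events = 1;	/* wake the reader on each sample */
	attr->disabled = 1;		/* enable explicitly later */

	/*
	 * The extension discussed above would add one more bit or field
	 * here asking the kernel to stop the current task on a hit.
	 * No such field exists; this comment only marks where it would go.
	 */
}
```

The filled-in structure is then passed to the perf_event_open syscall
together with the pid and cpu to monitor.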

What exactly should the "pausing" be? In ancient times, strace
chose to simply use SIGSTOP for similar needs, and it ended up interfering
with tracing real SIGSTOPs. I guess we don't want to repeat that. Then,
how? More specifically: when "perf trace" reads the trace buffer and sees
"process FOO paused in sys_exit from readv", how should it kick
process FOO to unpause it?

**end of brain dump**

Comments? Suggestions?


Thread overview: 9+ messages
2013-09-17 19:06 [RFC] Full syscall argument decode in "perf trace" Arnaldo Carvalho de Melo
2013-09-18 11:35 ` Denys Vlasenko
2013-09-18 12:46   ` David Ahern
2013-09-18 13:35     ` Ingo Molnar
2013-09-18 14:33   ` Arnaldo Carvalho de Melo
2013-09-26  7:41     ` Denys Vlasenko
2013-09-30 11:33       ` Denys Vlasenko
  -- strict thread matches above, loose matches on Subject: below --
2013-09-17 15:10 Denys Vlasenko
2013-09-17 17:52 ` Arnaldo Carvalho de Melo
