* [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-17 8:20 Andrey Vagin
2015-02-17 8:53 ` Arnd Bergmann
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Andrey Vagin @ 2015-02-17 8:20 UTC
To: linux-kernel
Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
Pavel Emelyanov, Roger Luethi, Andrey Vagin
Here is a preview version. It provides a restricted set of functionality.
I would like to collect feedback about this idea.

Currently we use the proc file system, where all information is
presented in text files, which is convenient for humans. But if we need
to get information about processes from code (e.g. in C), procfs
doesn't look so cool.

From code we would prefer to get information in a binary format and to be
able to specify which information is required and for which tasks. Here
is a new interface with all these features, called task_diag.
In addition, it is much faster than procfs.

task_diag is based on netlink sockets and looks like socket-diag, which
is used to get information about sockets.
A request is described by the task_diag_pid structure:

struct task_diag_pid {
	__u64	show_flags;	/* specify which information are required */
	__u64	dump_stratagy;	/* specify a group of processes */

	__u32	pid;
};
A response is a set of netlink messages. Each message describes one task.
All task properties are divided into groups. A message contains the
TASK_DIAG_MSG group, plus other groups if they have been requested in
show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
response will contain the TASK_DIAG_CRED group, which is described by the
task_diag_creds structure.
struct task_diag_msg {
	__u32	tgid;
	__u32	pid;
	__u32	ppid;
	__u32	tpid;
	__u32	sid;
	__u32	pgid;
	__u8	state;
	char	comm[TASK_DIAG_COMM_LEN];
};
Another good feature of task_diag is the ability to request information
for several processes at once. Currently there are two strategies:
TASK_DIAG_DUMP_ALL - get information for all tasks
TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
task
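Putting the request structure and the strategies together, filling in a
request might look like the sketch below. The numeric values of
TASK_DIAG_SHOW_CRED and the TASK_DIAG_DUMP_* constants are placeholders
for illustration (the real ones are defined in include/uapi/linux/taskdiag.h
in the patch set and may differ), and make_children_request() is a
hypothetical helper, not part of the patches.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder values for illustration only; the real definitions are in
 * include/uapi/linux/taskdiag.h from the patch set and may differ. */
#define TASK_DIAG_SHOW_CRED      (1ULL << 0)
#define TASK_DIAG_DUMP_ALL       0
#define TASK_DIAG_DUMP_CHILDREN  1

/* Same field order as the request structure in this mail, with
 * fixed-width types standing in for __u64/__u32. */
struct task_diag_pid {
	uint64_t show_flags;     /* which groups to include in the response */
	uint64_t dump_stratagy;  /* which set of tasks to report (spelled as in the patch) */
	uint32_t pid;            /* task the strategy refers to */
};

/* Hypothetical helper: build a request for the children of 'pid',
 * optionally asking for the credentials group as well. */
static struct task_diag_pid make_children_request(uint32_t pid, int want_creds)
{
	struct task_diag_pid req;

	memset(&req, 0, sizeof(req));
	req.dump_stratagy = TASK_DIAG_DUMP_CHILDREN;
	req.pid = pid;
	if (want_creds)
		req.show_flags |= TASK_DIAG_SHOW_CRED;
	return req;
}
```

The caller would then place this structure in the payload of a netlink
message; that part is left out here.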
task_diag is much faster than the proc file system. We don't need to
create a new file descriptor for each task. We only need to send a request
and get a response. This allows getting information for several tasks in one
request-response iteration.
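For contrast, the procfs path looks roughly like the sketch below: for
every task, open /proc/<pid>/stat, read it, close it, and parse the text.
parse_stat_ppid() and read_ppid_from_proc() are hypothetical helpers
written for this illustration; they are not taken from ps or from the
patch set.

```c
#include <stdio.h>
#include <string.h>

/* Parse the ppid out of one /proc/<pid>/stat line. The comm field is
 * wrapped in parentheses and may itself contain spaces or ')', so scan
 * from the last ')' instead of splitting on whitespace.
 * Returns 0 on success, -1 on malformed input. */
static int parse_stat_ppid(const char *line, int *ppid)
{
	const char *p = strrchr(line, ')');
	char state;

	if (!p)
		return -1;
	/* After ')' come the state character and then the ppid. */
	if (sscanf(p + 1, " %c %d", &state, ppid) != 2)
		return -1;
	return 0;
}

/* One open/read/close cycle per task: this per-task cost is what the
 * binary interface avoids. */
static int read_ppid_from_proc(int pid, int *ppid)
{
	char path[64], buf[1024];
	FILE *f;
	int ret = -1;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_stat_ppid(buf, ppid);
	fclose(f);
	return ret;
}
```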
I have compared the performance of procfs and task_diag for the
"ps ax -o pid,ppid" command.
The test machine runs 10348 processes.
$ ps ax -o pid,ppid | wc -l
10348
$ time ps ax -o pid,ppid > /dev/null
real 0m1.073s
user 0m0.086s
sys 0m0.903s
$ time ./task_diag_all > /dev/null
real 0m0.037s
user 0m0.004s
sys 0m0.020s
And here are statistics on the syscalls made by each command.
$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
20,713 syscalls:sys_exit_open
20,710 syscalls:sys_exit_close
20,708 syscalls:sys_exit_read
10,348 syscalls:sys_exit_newstat
31 syscalls:sys_exit_write
$ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
114 syscalls:sys_exit_recvfrom
49 syscalls:sys_exit_write
8 syscalls:sys_exit_mmap
4 syscalls:sys_exit_mprotect
3 syscalls:sys_exit_newfstat
You can find the test program from this experiment in the last patch.

The idea of this functionality was suggested by Pavel Emelyanov
(xemul@), when he found that operations on /proc form a significant
part of checkpointing time.

Ten years ago there was an attempt to add a netlink interface for accessing
/proc information:
http://lwn.net/Articles/99600/
Signed-off-by: Andrey Vagin <avagin@openvz.org>
git repo: https://github.com/avagin/linux-task-diag
Andrey Vagin (7):
[RFC] kernel: add a netlink interface to get information about tasks
kernel: move next_tgid from fs/proc
task-diag: add ability to get information about all tasks
task-diag: add a new group to get process credentials
kernel: add ability to iterate children of a specified task
task_diag: add ability to dump children
selftest: check the task_diag functinonality
fs/proc/array.c | 58 +---
fs/proc/base.c | 43 ---
include/linux/proc_fs.h | 13 +
include/uapi/linux/taskdiag.h | 89 ++++++
init/Kconfig | 12 +
kernel/Makefile | 1 +
kernel/pid.c | 94 ++++++
kernel/taskdiag.c | 343 +++++++++++++++++++++
tools/testing/selftests/task_diag/Makefile | 16 +
tools/testing/selftests/task_diag/task_diag.c | 59 ++++
tools/testing/selftests/task_diag/task_diag_all.c | 82 +++++
tools/testing/selftests/task_diag/task_diag_comm.c | 195 ++++++++++++
tools/testing/selftests/task_diag/task_diag_comm.h | 47 +++
tools/testing/selftests/task_diag/taskdiag.h | 1 +
14 files changed, 967 insertions(+), 86 deletions(-)
create mode 100644 include/uapi/linux/taskdiag.h
create mode 100644 kernel/taskdiag.c
create mode 100644 tools/testing/selftests/task_diag/Makefile
create mode 100644 tools/testing/selftests/task_diag/task_diag.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_all.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.c
create mode 100644 tools/testing/selftests/task_diag/task_diag_comm.h
create mode 120000 tools/testing/selftests/task_diag/taskdiag.h
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Roger Luethi <rl@hellgate.ch>
--
2.1.0
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 8:20 Andrey Vagin
@ 2015-02-17 8:53 ` Arnd Bergmann
2015-02-17 21:33 ` Andrew Vagin
2015-02-17 16:09 ` David Ahern
2015-02-17 19:05 ` Andy Lutomirski
2 siblings, 1 reply; 19+ messages in thread
From: Arnd Bergmann @ 2015-02-17 8:53 UTC
To: Andrey Vagin
Cc: linux-kernel, linux-api, Oleg Nesterov, Andrew Morton,
Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
>
> A request is described by the task_diag_pid structure:
>
> struct task_diag_pid {
> __u64 show_flags; /* specify which information are required */
> __u64 dump_stratagy; /* specify a group of processes */
>
> __u32 pid;
> };
Can you explain how the interface relates to the 'taskstats' genetlink
API? Did you consider extending that interface to provide the
information you need instead of basing it on socket-diag?
Arnd
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 8:20 Andrey Vagin
2015-02-17 8:53 ` Arnd Bergmann
@ 2015-02-17 16:09 ` David Ahern
2015-02-17 20:32 ` Andrew Vagin
2015-02-17 19:05 ` Andy Lutomirski
2 siblings, 1 reply; 19+ messages in thread
From: David Ahern @ 2015-02-17 16:09 UTC
To: Andrey Vagin, linux-kernel
Cc: linux-api, Oleg Nesterov, Andrew Morton, Cyrill Gorcunov,
Pavel Emelyanov, Roger Luethi
On 2/17/15 1:20 AM, Andrey Vagin wrote:
> And here are statistics about syscalls which were called by each
> command.
> $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> 20,713 syscalls:sys_exit_open
> 20,710 syscalls:sys_exit_close
> 20,708 syscalls:sys_exit_read
> 10,348 syscalls:sys_exit_newstat
> 31 syscalls:sys_exit_write
>
> $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> 114 syscalls:sys_exit_recvfrom
> 49 syscalls:sys_exit_write
> 8 syscalls:sys_exit_mmap
> 4 syscalls:sys_exit_mprotect
> 3 syscalls:sys_exit_newfstat
'perf trace -s' gives the summary with stats.
e.g., perf trace -s -- ps ax -o pid,ppid
 ps (23850), 3117 events, 99.3%, 0.000 msec

 syscall            calls       min       avg       max   stddev
                             (msec)    (msec)    (msec)      (%)
 --------------- -------- --------- --------- --------- ------
 read                 353     0.000     0.010     0.035    3.14%
 write                166     0.006     0.012     0.045    3.03%
 open                 365     0.002     0.005     0.178   11.29%
 close                354     0.001     0.002     0.024    3.57%
 stat                 170     0.002     0.007     0.662   52.99%
 fstat                 19     0.002     0.003     0.003    2.31%
 lseek                  2     0.003     0.003     0.003    6.49%
 mmap                  50     0.004     0.006     0.013    3.40%
 ...
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 8:20 Andrey Vagin
2015-02-17 8:53 ` Arnd Bergmann
2015-02-17 16:09 ` David Ahern
@ 2015-02-17 19:05 ` Andy Lutomirski
2015-02-18 14:27 ` Andrew Vagin
2 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2015-02-17 19:05 UTC
To: Andrey Vagin
Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
Andrew Morton, Linux API, linux-kernel@vger.kernel.org
On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
>
> Here is a preview version. It provides restricted set of functionality.
> I would like to collect feedback about this idea.
>
> Currently we use the proc file system, where all information are
> presented in text files, what is convenient for humans. But if we need
> to get information about processes from code (e.g. in C), the procfs
> doesn't look so cool.
>
> From code we would prefer to get information in binary format and to be
> able to specify which information and for which tasks are required. Here
> is a new interface with all these features, which is called task_diag.
> In addition it's much faster than procfs.
>
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
>
> A request is described by the task_diag_pid structure:
>
> struct task_diag_pid {
> __u64 show_flags; /* specify which information are required */
> __u64 dump_stratagy; /* specify a group of processes */
>
> __u32 pid;
> };
>
> A respone is a set of netlink messages. Each message describes one task.
> All task properties are divided on groups. A message contains the
> TASK_DIAG_MSG group and other groups if they have been requested in
> show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> response will contain the TASK_DIAG_CRED group which is described by the
> task_diag_creds structure.
>
> struct task_diag_msg {
> __u32 tgid;
> __u32 pid;
> __u32 ppid;
> __u32 tpid;
> __u32 sid;
> __u32 pgid;
> __u8 state;
> char comm[TASK_DIAG_COMM_LEN];
> };
>
> Another good feature of task_diag is an ability to request information
> for a few processes. Currently here are two stratgies
> TASK_DIAG_DUMP_ALL - get information for all tasks
> TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> tasks
>
> The task diag is much faster than the proc file system. We don't need to
> create a new file descriptor for each task. We need to send a request
> and get a response. It allows to get information for a few task in one
> request-response iteration.
>
> I have compared performance of procfs and task-diag for the
> "ps ax -o pid,ppid" command.
>
> A test stand contains 10348 processes.
> $ ps ax -o pid,ppid | wc -l
> 10348
>
> $ time ps ax -o pid,ppid > /dev/null
>
> real 0m1.073s
> user 0m0.086s
> sys 0m0.903s
>
> $ time ./task_diag_all > /dev/null
>
> real 0m0.037s
> user 0m0.004s
> sys 0m0.020s
>
> And here are statistics about syscalls which were called by each
> command.
> $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> 20,713 syscalls:sys_exit_open
> 20,710 syscalls:sys_exit_close
> 20,708 syscalls:sys_exit_read
> 10,348 syscalls:sys_exit_newstat
> 31 syscalls:sys_exit_write
>
> $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> 114 syscalls:sys_exit_recvfrom
> 49 syscalls:sys_exit_write
> 8 syscalls:sys_exit_mmap
> 4 syscalls:sys_exit_mprotect
> 3 syscalls:sys_exit_newfstat
>
> You can find the test program from this experiment in the last patch.
>
> The idea of this functionality was suggested by Pavel Emelyanov
> (xemul@), when he found that operations with /proc forms a significant
> part of a checkpointing time.
>
> Ten years ago here was attempt to add a netlink interface to access to /proc
> information:
> http://lwn.net/Articles/99600/
I don't suppose this could use real syscalls instead of netlink. If
nothing else, netlink seems to conflate pid and net namespaces.
Also, using an asynchronous interface (send, poll?, recv) for
something that's inherently synchronous (ask the kernel a local
question) seems awkward to me.
--Andy
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 16:09 ` David Ahern
@ 2015-02-17 20:32 ` Andrew Vagin
0 siblings, 0 replies; 19+ messages in thread
From: Andrew Vagin @ 2015-02-17 20:32 UTC
To: David Ahern
Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Tue, Feb 17, 2015 at 09:09:47AM -0700, David Ahern wrote:
> On 2/17/15 1:20 AM, Andrey Vagin wrote:
> >And here are statistics about syscalls which were called by each
> >command.
> >$ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 20,713 syscalls:sys_exit_open
> > 20,710 syscalls:sys_exit_close
> > 20,708 syscalls:sys_exit_read
> > 10,348 syscalls:sys_exit_newstat
> > 31 syscalls:sys_exit_write
> >
> >$ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 114 syscalls:sys_exit_recvfrom
> > 49 syscalls:sys_exit_write
> > 8 syscalls:sys_exit_mmap
> > 4 syscalls:sys_exit_mprotect
> > 3 syscalls:sys_exit_newfstat
>
> 'perf trace -s' gives the summary with stats.
> e.g., perf trace -s -- ps ax -o pid,ppid
Thank you for this command, I haven't used it before.
 ps (21301), 145271 events, 100.0%, 0.000 msec

 syscall            calls       min       avg       max   stddev
                             (msec)    (msec)    (msec)      (%)
 --------------- -------- --------- --------- --------- ------
 read               20717     0.000     0.020     1.631    0.64%
 write                  1     0.019     0.019     0.019    0.00%
 open               20722     0.025     0.035     3.624    0.93%
 close              20719     0.006     0.009     1.059    0.95%
 stat               10352     0.015     0.025     1.748    0.95%
 fstat                 12     0.010     0.012     0.020    6.17%
 lseek                  2     0.011     0.012     0.012    3.08%
 mmap                  30     0.012     0.034     0.094    9.35%
 mprotect              17     0.034     0.045     0.067    4.86%
 munmap                 3     0.028     0.058     0.108   44.12%
 brk                    4     0.011     0.015     0.019   11.24%
 rt_sigaction          25     0.011     0.011     0.014    1.27%
 rt_sigprocmask         1     0.012     0.012     0.012    0.00%
 ioctl                  4     0.010     0.012     0.014    6.94%
 access                 1     0.034     0.034     0.034    0.00%
 execve                 6     0.000     0.496     2.794   92.58%
 uname                  1     0.015     0.015     0.015    0.00%
 getdents              12     0.019     0.691     1.158   13.04%
 getrlimit              1     0.012     0.012     0.012    0.00%
 geteuid                1     0.012     0.012     0.012    0.00%
 arch_prctl             1     0.013     0.013     0.013    0.00%
 futex                  1     0.020     0.020     0.020    0.00%
 set_tid_address        1     0.012     0.012     0.012    0.00%
 openat                 1     0.030     0.030     0.030    0.00%
 set_robust_list        1     0.011     0.011     0.011    0.00%

 task_diag_all (21304), 569 events, 98.6%, 0.000 msec

 syscall            calls       min       avg       max   stddev
                             (msec)    (msec)    (msec)      (%)
 --------------- -------- --------- --------- --------- ------
 read                   2     0.000     0.045     0.090  100.00%
 write                 77     0.010     0.013     0.083    7.93%
 open                   2     0.031     0.038     0.045   19.64%
 close                  3     0.010     0.014     0.017   13.43%
 fstat                  3     0.011     0.011     0.012    3.79%
 mmap                   8     0.013     0.027     0.049   16.72%
 mprotect               4     0.034     0.043     0.052    8.86%
 munmap                 1     0.031     0.031     0.031    0.00%
 brk                    1     0.014     0.014     0.014    0.00%
 ioctl                  1     0.010     0.010     0.010    0.00%
 access                 1     0.030     0.030     0.030    0.00%
 getpid                 1     0.011     0.011     0.011    0.00%
 socket                 1     0.045     0.045     0.045    0.00%
 sendto                 2     0.091     0.104     0.117   12.63%
 recvfrom             175     0.026     0.093     0.141    1.10%
 bind                   1     0.014     0.014     0.014    0.00%
 execve                 1     0.000     0.000     0.000    0.00%
 arch_prctl             1     0.011     0.011     0.011    0.00%
>
> ps (23850), 3117 events, 99.3%, 0.000 msec
>
> syscall calls min avg max stddev
> (msec) (msec) (msec) (%)
> --------------- -------- --------- --------- --------- ------
> read 353 0.000 0.010 0.035 3.14%
> write 166 0.006 0.012 0.045 3.03%
> open 365 0.002 0.005 0.178 11.29%
> close 354 0.001 0.002 0.024 3.57%
> stat 170 0.002 0.007 0.662 52.99%
> fstat 19 0.002 0.003 0.003 2.31%
> lseek 2 0.003 0.003 0.003 6.49%
> mmap 50 0.004 0.006 0.013 3.40%
> ...
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 8:53 ` Arnd Bergmann
@ 2015-02-17 21:33 ` Andrew Vagin
2015-02-18 11:06 ` Arnd Bergmann
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Vagin @ 2015-02-17 21:33 UTC
To: Arnd Bergmann
Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> > __u64 show_flags; /* specify which information are required */
> > __u64 dump_stratagy; /* specify a group of processes */
> >
> > __u32 pid;
> > };
>
> Can you explain how the interface relates to the 'taskstats' genetlink
> API? Did you consider extending that interface to provide the
> information you need instead of basing on the socket-diag?
It isn't based on socket-diag; it just looks like socket-diag.

Currently task_diag registers a new genl family, but we could use the
taskstats family and add task_diag commands to it.
Thanks,
Andrew
>
> Arnd
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 21:33 ` Andrew Vagin
@ 2015-02-18 11:06 ` Arnd Bergmann
2015-02-18 12:42 ` Andrew Vagin
0 siblings, 1 reply; 19+ messages in thread
From: Arnd Bergmann @ 2015-02-18 11:06 UTC
To: Andrew Vagin
Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > >
> > > A request is described by the task_diag_pid structure:
> > >
> > > struct task_diag_pid {
> > > __u64 show_flags; /* specify which information are required */
> > > __u64 dump_stratagy; /* specify a group of processes */
> > >
> > > __u32 pid;
> > > };
> >
> > Can you explain how the interface relates to the 'taskstats' genetlink
> > API? Did you consider extending that interface to provide the
> > information you need instead of basing on the socket-diag?
>
> It isn't based on the socket-diag, it looks like socket-diag.
>
> Current task_diag registers a new genl family, but we can use the taskstats
> family and add task_diag commands to it.
What I meant was more along the lines of making it look like taskstats
by adding new fields to 'struct taskstats' for what you want to return.
I don't know if that is possible or a good idea for the information
you want to get out of the kernel, but it seems like a more natural
interface, as it already has some of the same data (comm, gid, pid,
ppid, ...).
Arnd
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-18 11:06 ` Arnd Bergmann
@ 2015-02-18 12:42 ` Andrew Vagin
2015-02-18 14:46 ` Arnd Bergmann
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Vagin @ 2015-02-18 12:42 UTC
To: Arnd Bergmann
Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > >
> > > > A request is described by the task_diag_pid structure:
> > > >
> > > > struct task_diag_pid {
> > > > __u64 show_flags; /* specify which information are required */
> > > > __u64 dump_stratagy; /* specify a group of processes */
> > > >
> > > > __u32 pid;
> > > > };
> > >
> > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > API? Did you consider extending that interface to provide the
> > > information you need instead of basing on the socket-diag?
> >
> > It isn't based on the socket-diag, it looks like socket-diag.
> >
> > Current task_diag registers a new genl family, but we can use the taskstats
> > family and add task_diag commands to it.
>
> What I meant was more along the lines of making it look like taskstats
> by adding new fields to 'struct taskstat' for what you want return.
> I don't know if that is possible or a good idea for the information
> you want to get out of the kernel, but it seems like a more natural
> interface, as it already has some of the same data (comm, gid, pid,
> ppid, ...).
Now I see what you mean. task_diag has a more flexible and universal
interface than taskstats. A taskstats response contains only a
taskstats structure. A task_diag response can contain several types of
properties. Each type is described by its own structure.

Currently there are only two groups of parameters: task_diag_msg and
task_diag_creds.

task_diag_msg contains a few basic parameters.
task_diag_creds contains credentials.

I'm going to add other groups to describe all kinds of task properties
which are currently presented in procfs (e.g. /proc/pid/maps,
/proc/pid/fding/*, /proc/pid/status, etc).

One of the features of task_diag is the ability to choose which
information is required. This minimizes the response size and the time
required to fill the response.
struct task_diag_msg {
	__u32	tgid;
	__u32	pid;
	__u32	ppid;
	__u32	tpid;
	__u32	sid;
	__u32	pgid;
	__u8	state;
	char	comm[TASK_DIAG_COMM_LEN];
};

struct task_diag_creds {
	struct task_diag_caps cap_inheritable;
	struct task_diag_caps cap_permitted;
	struct task_diag_caps cap_effective;
	struct task_diag_caps cap_bset;

	__u32	uid;
	__u32	euid;
	__u32	suid;
	__u32	fsuid;
	__u32	gid;
	__u32	egid;
	__u32	sgid;
	__u32	fsgid;
};
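On the receiving side, picking one group out of a response could be
sketched as follows. This assumes each group arrives as one
type-length-value attribute with the same framing as struct nlattr; that
is an assumption based on the description above, not verified against the
patches. GROUP_MSG, GROUP_CRED, find_group() and put_group() are
hypothetical names, and the header is mirrored locally so the sketch
builds without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct nlattr and NLA_ALIGN() from <linux/netlink.h>. */
struct attr_hdr {
	uint16_t len;   /* header + payload, excluding padding */
	uint16_t type;  /* group identifier */
};
#define ATTR_ALIGN(n)  (((size_t)(n) + 3) & ~(size_t)3)
#define ATTR_HDRLEN    ATTR_ALIGN(sizeof(struct attr_hdr))

/* Placeholder group identifiers; the real ones would come from taskdiag.h. */
enum { GROUP_MSG = 1, GROUP_CRED = 2 };

/* Find the payload of the first attribute of the given type in
 * buf[0..buflen). Returns NULL if the group is absent or the buffer is
 * malformed; on success *payload_len is the payload size. */
static const void *find_group(const unsigned char *buf, size_t buflen,
			      uint16_t type, size_t *payload_len)
{
	size_t off = 0;

	while (off <= buflen && buflen - off >= ATTR_HDRLEN) {
		struct attr_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < ATTR_HDRLEN || h.len > buflen - off)
			return NULL;  /* truncated or corrupt attribute */
		if (h.type == type) {
			*payload_len = h.len - ATTR_HDRLEN;
			return buf + off + ATTR_HDRLEN;
		}
		off += ATTR_ALIGN(h.len);  /* skip payload and padding */
	}
	return NULL;
}

/* Helper for building a sample buffer: append one attribute at 'off'
 * and return the offset where the next one would start. */
static size_t put_group(unsigned char *buf, size_t off, uint16_t type,
			const void *data, size_t len)
{
	struct attr_hdr h = { (uint16_t)(ATTR_HDRLEN + len), type };

	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + ATTR_HDRLEN, data, len);
	return off + ATTR_ALIGN(ATTR_HDRLEN + len);
}
```

A client that did not request credentials would simply get NULL back for
GROUP_CRED and carry on, which is the flexibility a single fixed
structure does not offer.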
Thanks,
Andrew
>
> Arnd
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-17 19:05 ` Andy Lutomirski
@ 2015-02-18 14:27 ` Andrew Vagin
2015-02-19 1:18 ` Andy Lutomirski
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Vagin @ 2015-02-18 14:27 UTC
To: Andy Lutomirski
Cc: Andrey Vagin, Pavel Emelyanov, Roger Luethi, Oleg Nesterov,
Cyrill Gorcunov, Andrew Morton, Linux API,
linux-kernel@vger.kernel.org
On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> >
> > Here is a preview version. It provides restricted set of functionality.
> > I would like to collect feedback about this idea.
> >
> > Currently we use the proc file system, where all information are
> > presented in text files, what is convenient for humans. But if we need
> > to get information about processes from code (e.g. in C), the procfs
> > doesn't look so cool.
> >
> > From code we would prefer to get information in binary format and to be
> > able to specify which information and for which tasks are required. Here
> > is a new interface with all these features, which is called task_diag.
> > In addition it's much faster than procfs.
> >
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> > __u64 show_flags; /* specify which information are required */
> > __u64 dump_stratagy; /* specify a group of processes */
> >
> > __u32 pid;
> > };
> >
> > A respone is a set of netlink messages. Each message describes one task.
> > All task properties are divided on groups. A message contains the
> > TASK_DIAG_MSG group and other groups if they have been requested in
> > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > response will contain the TASK_DIAG_CRED group which is described by the
> > task_diag_creds structure.
> >
> > struct task_diag_msg {
> > __u32 tgid;
> > __u32 pid;
> > __u32 ppid;
> > __u32 tpid;
> > __u32 sid;
> > __u32 pgid;
> > __u8 state;
> > char comm[TASK_DIAG_COMM_LEN];
> > };
> >
> > Another good feature of task_diag is an ability to request information
> > for a few processes. Currently here are two stratgies
> > TASK_DIAG_DUMP_ALL - get information for all tasks
> > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > tasks
> >
> > The task diag is much faster than the proc file system. We don't need to
> > create a new file descriptor for each task. We need to send a request
> > and get a response. It allows to get information for a few task in one
> > request-response iteration.
> >
> > I have compared performance of procfs and task-diag for the
> > "ps ax -o pid,ppid" command.
> >
> > A test stand contains 10348 processes.
> > $ ps ax -o pid,ppid | wc -l
> > 10348
> >
> > $ time ps ax -o pid,ppid > /dev/null
> >
> > real 0m1.073s
> > user 0m0.086s
> > sys 0m0.903s
> >
> > $ time ./task_diag_all > /dev/null
> >
> > real 0m0.037s
> > user 0m0.004s
> > sys 0m0.020s
> >
> > And here are statistics about syscalls which were called by each
> > command.
> > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 20,713 syscalls:sys_exit_open
> > 20,710 syscalls:sys_exit_close
> > 20,708 syscalls:sys_exit_read
> > 10,348 syscalls:sys_exit_newstat
> > 31 syscalls:sys_exit_write
> >
> > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > 114 syscalls:sys_exit_recvfrom
> > 49 syscalls:sys_exit_write
> > 8 syscalls:sys_exit_mmap
> > 4 syscalls:sys_exit_mprotect
> > 3 syscalls:sys_exit_newfstat
> >
> > You can find the test program from this experiment in the last patch.
> >
> > The idea of this functionality was suggested by Pavel Emelyanov
> > (xemul@), when he found that operations with /proc forms a significant
> > part of a checkpointing time.
> >
> > Ten years ago here was attempt to add a netlink interface to access to /proc
> > information:
> > http://lwn.net/Articles/99600/
>
> I don't suppose this could use real syscalls instead of netlink. If
> nothing else, netlink seems to conflate pid and net namespaces.
What do you mean by "conflate pid and net namespaces"?
>
> Also, using an asynchronous interface (send, poll?, recv) for
> something that's inherently synchronous (as the kernel a local
> question) seems awkward to me.
Actually all requests are handled synchronously. We call sendmsg to send
a request and it is handled in this syscall.
 2)               |  netlink_sendmsg() {
 2)               |    netlink_unicast() {
 2)               |      taskdiag_doit() {
 2)   2.153 us    |        task_diag_fill();
 2)               |        netlink_unicast() {
 2)   0.185 us    |          netlink_attachskb();
 2)   0.291 us    |          __netlink_sendskb();
 2)   2.452 us    |        }
 2) + 33.625 us   |      }
 2) + 54.611 us   |    }
 2) + 76.370 us   |  }
 2)               |  netlink_recvmsg() {
 2)   1.178 us    |    skb_recv_datagram();
 2) + 46.953 us   |  }
If we request information for a group of tasks (NLM_F_DUMP), the first
portion of data is filled in from within the sendmsg syscall. Then, when
we read it, the kernel fills in the next portion.
 3)               |  netlink_sendmsg() {
 3)               |    __netlink_dump_start() {
 3)               |      netlink_dump() {
 3)               |        taskdiag_dumpid() {
 3)   0.685 us    |          task_diag_fill();
 ...
 3)   0.224 us    |          task_diag_fill();
 3) + 74.028 us   |        }
 3) + 88.757 us   |      }
 3) + 89.296 us   |    }
 3) + 98.705 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)               |      taskdiag_dumpid() {
 3)   0.594 us    |        task_diag_fill();
 ...
 3)   0.242 us    |        task_diag_fill();
 3) + 60.634 us   |      }
 3) + 72.803 us   |    }
 3) + 88.005 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)   2.403 us    |      taskdiag_dumpid();
 3) + 26.236 us   |    }
 3) + 40.522 us   |  }
 0) + 20.407 us   |  netlink_recvmsg();
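From user space, a dump therefore arrives as a sequence of buffers, each
holding several messages, until a message of type NLMSG_DONE ends it. A
minimal sketch of walking one received buffer with the standard macros
from <linux/netlink.h> follows; count_records() and put_msg() are
hypothetical helpers written for this illustration, and the surrounding
socket()/sendmsg()/recv() calls are omitted.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <string.h>

/* Count the task records in one received buffer. Sets *done when the
 * end-of-dump marker (NLMSG_DONE) is seen, so the caller knows whether
 * another recv() is needed. Any NLMSG_ERROR is treated as a failure
 * here, which is a simplification. */
static int count_records(void *buf, int len, int *done)
{
	struct nlmsghdr *nlh;
	int n = 0;

	*done = 0;
	for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE) {
			*done = 1;
			break;
		}
		if (nlh->nlmsg_type == NLMSG_ERROR)
			return -1;
		n++;  /* NLMSG_DATA(nlh) would point at one task's payload */
	}
	return n;
}

/* Helper for building a sample buffer: append an empty message of the
 * given type at offset 'off' and return the next offset. */
static int put_msg(char *buf, int off, unsigned short type)
{
	struct nlmsghdr h;

	memset(&h, 0, sizeof(h));
	h.nlmsg_len = NLMSG_LENGTH(0);
	h.nlmsg_type = type;
	h.nlmsg_flags = NLM_F_MULTI;
	memcpy(buf + off, &h, sizeof(h));
	return off + NLMSG_SPACE(0);
}
```

A real client would call recv() in a loop and stop once count_records()
reports done, which matches the repeated netlink_recvmsg() calls in the
trace.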
netlink is really good for this type of task. It allows creating an
extensible interface which can be easily customized for different needs.
I don't think that we would want to create another similar interface
just to be independent of the network subsystem.
Thanks,
Andrew
>
> --Andy
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-18 12:42 ` Andrew Vagin
@ 2015-02-18 14:46 ` Arnd Bergmann
2015-02-19 14:04 ` Andrew Vagin
0 siblings, 1 reply; 19+ messages in thread
From: Arnd Bergmann @ 2015-02-18 14:46 UTC
To: Andrew Vagin
Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > is used to get information about sockets.
> > > > >
> > > > > A request is described by the task_diag_pid structure:
> > > > >
> > > > > struct task_diag_pid {
> > > > > __u64 show_flags; /* specify which information are required */
> > > > > __u64 dump_stratagy; /* specify a group of processes */
> > > > >
> > > > > __u32 pid;
> > > > > };
> > > >
> > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > API? Did you consider extending that interface to provide the
> > > > information you need instead of basing on the socket-diag?
> > >
> > > It isn't based on the socket-diag, it looks like socket-diag.
> > >
> > > Current task_diag registers a new genl family, but we can use the taskstats
> > > family and add task_diag commands to it.
> >
> > What I meant was more along the lines of making it look like taskstats
> > by adding new fields to 'struct taskstat' for what you want return.
> > I don't know if that is possible or a good idea for the information
> > you want to get out of the kernel, but it seems like a more natural
> > interface, as it already has some of the same data (comm, gid, pid,
> > ppid, ...).
>
> Now I see what you mean. task_diag has more flexible and universal
> interface than taskstat. A response of taskstat only contains a
> taskstats structure. A response of taskdiag can contains a few types of
> properties. Each type is described by its own structure.
Right, so the question is whether that flexibility is actually required
here. Independent of which design you personally prefer, what are the
downsides of extending the existing but less flexible interface?
If it's good enough, that would seem to provide a more consistent
API, which in turn helps users understand the interface and use it
correctly.
> Curently here are only two groups of parameters: task_diag_msg and
> task_diag_creds.
>
> task_diag_msg contains a few basic parameters.
> task_diag_creds contains credentials.
>
> I'm going to add other groups to describe all kind of task properties
> which currently are presented in procfs (e.g. /proc/pid/maps,
> /proc/pid/fding/*, /proc/pid/status, etc).
>
> One of features of task_diag is an ability to choose which information
> are required. This allows to minimize a response size and a time, which
> is requred to fill this response.
I realize that you are trying to optimize for performance, but it
would be nice to quantify this if you want to argue for requiring
a split interface.
> struct task_diag_msg {
> __u32 tgid;
> __u32 pid;
> __u32 ppid;
> __u32 tpid;
> __u32 sid;
> __u32 pgid;
> __u8 state;
> char comm[TASK_DIAG_COMM_LEN];
> };
I guess this part would be a very natural extension to the
existing taskstats structure, and we should only add a new
one here if there are extremely good reasons for it.
> struct task_diag_creds {
> struct task_diag_caps cap_inheritable;
> struct task_diag_caps cap_permitted;
> struct task_diag_caps cap_effective;
> struct task_diag_caps cap_bset;
>
> __u32 uid;
> __u32 euid;
> __u32 suid;
> __u32 fsuid;
> __u32 gid;
> __u32 egid;
> __u32 sgid;
> __u32 fsgid;
> };
while this part could well be kept separate so you can query
it individually from the rest of taskstats, but through a
related interface.
Arnd
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-18 14:27 ` Andrew Vagin
@ 2015-02-19 1:18 ` Andy Lutomirski
2015-02-19 21:39 ` Andrew Vagin
0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2015-02-19 1:18 UTC (permalink / raw)
To: Andrew Vagin
Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
linux-kernel@vger.kernel.org, Andrew Morton, Linux API,
Andrey Vagin
On Feb 18, 2015 6:27 AM, "Andrew Vagin" <avagin@parallels.com> wrote:
>
> On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> > >
> > > Here is a preview version. It provides restricted set of functionality.
> > > I would like to collect feedback about this idea.
> > >
> > > Currently we use the proc file system, where all information are
> > > presented in text files, what is convenient for humans. But if we need
> > > to get information about processes from code (e.g. in C), the procfs
> > > doesn't look so cool.
> > >
> > > From code we would prefer to get information in binary format and to be
> > > able to specify which information and for which tasks are required. Here
> > > is a new interface with all these features, which is called task_diag.
> > > In addition it's much faster than procfs.
> > >
> > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > is used to get information about sockets.
> > >
> > > A request is described by the task_diag_pid structure:
> > >
> > > struct task_diag_pid {
> > > __u64 show_flags; /* specify which information are required */
> > > __u64 dump_stratagy; /* specify a group of processes */
> > >
> > > __u32 pid;
> > > };
> > >
> > > A respone is a set of netlink messages. Each message describes one task.
> > > All task properties are divided on groups. A message contains the
> > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > response will contain the TASK_DIAG_CRED group which is described by the
> > > task_diag_creds structure.
> > >
> > > struct task_diag_msg {
> > > __u32 tgid;
> > > __u32 pid;
> > > __u32 ppid;
> > > __u32 tpid;
> > > __u32 sid;
> > > __u32 pgid;
> > > __u8 state;
> > > char comm[TASK_DIAG_COMM_LEN];
> > > };
> > >
> > > Another good feature of task_diag is an ability to request information
> > > for a few processes. Currently here are two stratgies
> > > TASK_DIAG_DUMP_ALL - get information for all tasks
> > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > > tasks
> > >
> > > The task diag is much faster than the proc file system. We don't need to
> > > create a new file descriptor for each task. We need to send a request
> > > and get a response. It allows to get information for a few task in one
> > > request-response iteration.
> > >
> > > I have compared performance of procfs and task-diag for the
> > > "ps ax -o pid,ppid" command.
> > >
> > > A test stand contains 10348 processes.
> > > $ ps ax -o pid,ppid | wc -l
> > > 10348
> > >
> > > $ time ps ax -o pid,ppid > /dev/null
> > >
> > > real 0m1.073s
> > > user 0m0.086s
> > > sys 0m0.903s
> > >
> > > $ time ./task_diag_all > /dev/null
> > >
> > > real 0m0.037s
> > > user 0m0.004s
> > > sys 0m0.020s
> > >
> > > And here are statistics about syscalls which were called by each
> > > command.
> > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > 20,713 syscalls:sys_exit_open
> > > 20,710 syscalls:sys_exit_close
> > > 20,708 syscalls:sys_exit_read
> > > 10,348 syscalls:sys_exit_newstat
> > > 31 syscalls:sys_exit_write
> > >
> > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > 114 syscalls:sys_exit_recvfrom
> > > 49 syscalls:sys_exit_write
> > > 8 syscalls:sys_exit_mmap
> > > 4 syscalls:sys_exit_mprotect
> > > 3 syscalls:sys_exit_newfstat
> > >
> > > You can find the test program from this experiment in the last patch.
> > >
> > > The idea of this functionality was suggested by Pavel Emelyanov
> > > (xemul@), when he found that operations with /proc forms a significant
> > > part of a checkpointing time.
> > >
> > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > information:
> > > http://lwn.net/Articles/99600/
> >
> > I don't suppose this could use real syscalls instead of netlink. If
> > nothing else, netlink seems to conflate pid and net namespaces.
>
> What do you mean by "conflate pid and net namespaces"?
A netlink socket is bound to a network namespace, but you should be
returning data specific to a pid namespace.
On a related note, how does this interact with hidepid? More
generally, what privileges are you requiring to obtain what data?
>
> >
> > Also, using an asynchronous interface (send, poll?, recv) for
> > something that's inherently synchronous (as the kernel a local
> > question) seems awkward to me.
>
> Actually all requests are handled synchronously. We call sendmsg to send
> a request and it is handled in this syscall.
> 2) | netlink_sendmsg() {
> 2) | netlink_unicast() {
> 2) | taskdiag_doit() {
> 2) 2.153 us | task_diag_fill();
> 2) | netlink_unicast() {
> 2) 0.185 us | netlink_attachskb();
> 2) 0.291 us | __netlink_sendskb();
> 2) 2.452 us | }
> 2) + 33.625 us | }
> 2) + 54.611 us | }
> 2) + 76.370 us | }
> 2) | netlink_recvmsg() {
> 2) 1.178 us | skb_recv_datagram();
> 2) + 46.953 us | }
>
> If we request information for a group of tasks (NLM_F_DUMP), a first
> portion of data is filled from the sendmsg syscall. And then when we read
> it, the kernel fills the next portion.
>
> 3) | netlink_sendmsg() {
> 3) | __netlink_dump_start() {
> 3) | netlink_dump() {
> 3) | taskdiag_dumpid() {
> 3) 0.685 us | task_diag_fill();
> ...
> 3) 0.224 us | task_diag_fill();
> 3) + 74.028 us | }
> 3) + 88.757 us | }
> 3) + 89.296 us | }
> 3) + 98.705 us | }
> 3) | netlink_recvmsg() {
> 3) | netlink_dump() {
> 3) | taskdiag_dumpid() {
> 3) 0.594 us | task_diag_fill();
> ...
> 3) 0.242 us | task_diag_fill();
> 3) + 60.634 us | }
> 3) + 72.803 us | }
> 3) + 88.005 us | }
> 3) | netlink_recvmsg() {
> 3) | netlink_dump() {
> 3) 2.403 us | taskdiag_dumpid();
> 3) + 26.236 us | }
> 3) + 40.522 us | }
> 0) + 20.407 us | netlink_recvmsg();
>
>
> netlink is really good for this type of tasks. It allows to create an
> extendable interface which can be easy customized for different needs.
>
> I don't think that we would want to create another similar interface
> just to be independent from network subsystem.
I guess this is a bit streamy in that you ask one question and get
multiple answers.
>
> Thanks,
> Andrew
>
> >
> > --Andy
* [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-19 12:50 Pavel Odintsov
0 siblings, 0 replies; 19+ messages in thread
From: Pavel Odintsov @ 2015-02-19 12:50 UTC (permalink / raw)
To: linux-kernel
Hello, folks!
These patches are very useful; they would make my tasks simpler and
faster.
In my day-to-day work I deal with Linux servers running an enormous
number of processes (~25 000 per server). These servers host several
hundred Linux containers.
If I want to analyze processor load or network load, or check something
else, I use top/atop/htop/netstat. But they work very slowly and consume
a significant amount of CPU time parsing many thousands of text
files in /proc (like /proc/tcp, /proc/udp, /proc/status,
/proc/$pid/status).
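To make the per-process cost concrete, here is a minimal sketch (not taken
from any of the tools named above) of the scan that such tools typically
start with. The helper names is_pid_name and count_proc_tasks are made up
for this illustration.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Return 1 if a /proc entry name consists only of digits, i.e. names a task. */
int is_pid_name(const char *name)
{
	if (name == NULL || *name == '\0')
		return 0;
	for (; *name != '\0'; name++)
		if (!isdigit((unsigned char)*name))
			return 0;
	return 1;
}

/*
 * Count the numeric directories under /proc. A monitoring tool would then
 * open(), read() and close() one or more text files for every such entry,
 * which is where the time goes with tens of thousands of processes.
 * Returns -1 if /proc cannot be opened.
 */
long count_proc_tasks(void)
{
	DIR *dir = opendir("/proc");
	struct dirent *de;
	long n = 0;

	if (dir == NULL)
		return -1;
	while ((de = readdir(dir)) != NULL)
		if (is_pid_name(de->d_name))
			n++;
	closedir(dir);
	return n;
}
```

Each numeric entry found this way costs at least three more system calls
before a single field can be parsed, so the total grows linearly with the
number of tasks.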
Some time ago I worked on a malware detection toolkit for Linux,
Antidoto (https://github.com/FastVPSEestiOu/Antidoto), which uses the
/proc filesystem very heavily. To detect malware I need to check every
descriptor and every socket and get complete information about all
processes on the system.
But with the current text-file-based architecture of /proc I can't
achieve a suitable speed for my toolkit.
For example, here is the time needed to process all network
connections on a server with 20244 processes with
linux_network_activity_tracker.pl
(https://github.com/FastVPSEestiOu/Antidoto/blob/master/linux_network_activity_tracker.pl):
real 1m26.637s
user 0m23.945s
sys 0m43.978s
As you can see, this time is huge, even though I use the latest CPUs
from Intel (Xeon 2697v3).
I have multiple ideas about complete realtime Linux server monitoring,
but without the ability to pull information from the Linux kernel faster
I can't realize them.
--
Sincerely yours, Pavel Odintsov
* [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
@ 2015-02-19 13:00 Pavel Odintsov
2015-02-27 20:43 ` Arnaldo Carvalho de Melo
0 siblings, 1 reply; 19+ messages in thread
From: Pavel Odintsov @ 2015-02-19 13:00 UTC (permalink / raw)
To: linux-kernel
Hello!
In addition to my previous post, I want to mention another issue related
to slow /proc reads, this time in the perf toolkit. On my server with
25 000 processes, perf top needs about 15 minutes to load completely.
https://bugzilla.kernel.org/show_bug.cgi?id=86991
--
Sincerely yours, Pavel Odintsov
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-18 14:46 ` Arnd Bergmann
@ 2015-02-19 14:04 ` Andrew Vagin
0 siblings, 0 replies; 19+ messages in thread
From: Andrew Vagin @ 2015-02-19 14:04 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Andrey Vagin, linux-kernel, linux-api, Oleg Nesterov,
Andrew Morton, Cyrill Gorcunov, Pavel Emelyanov, Roger Luethi
On Wed, Feb 18, 2015 at 03:46:31PM +0100, Arnd Bergmann wrote:
> On Wednesday 18 February 2015 15:42:11 Andrew Vagin wrote:
> > On Wed, Feb 18, 2015 at 12:06:40PM +0100, Arnd Bergmann wrote:
> > > On Wednesday 18 February 2015 00:33:13 Andrew Vagin wrote:
> > > > On Tue, Feb 17, 2015 at 09:53:09AM +0100, Arnd Bergmann wrote:
> > > > > On Tuesday 17 February 2015 11:20:19 Andrey Vagin wrote:
> > > > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > > > is used to get information about sockets.
> > > > > >
> > > > > > A request is described by the task_diag_pid structure:
> > > > > >
> > > > > > struct task_diag_pid {
> > > > > > __u64 show_flags; /* specify which information are required */
> > > > > > __u64 dump_stratagy; /* specify a group of processes */
> > > > > >
> > > > > > __u32 pid;
> > > > > > };
> > > > >
> > > > > Can you explain how the interface relates to the 'taskstats' genetlink
> > > > > API? Did you consider extending that interface to provide the
> > > > > information you need instead of basing on the socket-diag?
> > > >
> > > > It isn't based on the socket-diag, it looks like socket-diag.
> > > >
> > > > Current task_diag registers a new genl family, but we can use the taskstats
> > > > family and add task_diag commands to it.
> > >
> > > What I meant was more along the lines of making it look like taskstats
> > > by adding new fields to 'struct taskstat' for what you want return.
> > > I don't know if that is possible or a good idea for the information
> > > you want to get out of the kernel, but it seems like a more natural
> > > interface, as it already has some of the same data (comm, gid, pid,
> > > ppid, ...).
> >
> > Now I see what you mean. task_diag has a more flexible and universal
> > interface than taskstats. A taskstats response contains only a
> > taskstats structure. A task_diag response can contain several types of
> > properties, each described by its own structure.
>
> Right, so the question is whether that flexibility is actually required
> here. Independent of which design you personally prefer, what are the
> downsides of extending the existing but less flexible interface?
I have looked at taskstat once again.
The format of response messages for taskstats and task_diag is the same.
It's a netlink message with a set of nested attributes. New attributes
can be added without breaking backward compatibility.
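As an illustration of why nested attributes extend cleanly, here is a
minimal sketch of a type-length-value walk that skips attribute types it
does not know. The struct attr_hdr layout is a simplified stand-in modelled
on the kernel's struct nlattr, and attr_find/attr_put are hypothetical
helper names, not part of the posted patches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct nlattr: len covers header plus payload. */
struct attr_hdr {
	uint16_t len;
	uint16_t type;
};

#define ATTR_HDRLEN   (sizeof(struct attr_hdr))
#define ATTR_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)	/* 4-byte alignment */

/*
 * Return the payload of the first attribute of type 'want', or NULL.
 * Attributes of any other type, including ones added by a newer kernel,
 * are simply stepped over.
 */
const void *attr_find(const void *buf, size_t buflen, uint16_t want,
		      size_t *payload_len)
{
	const unsigned char *p = buf;

	while (buflen >= ATTR_HDRLEN) {
		struct attr_hdr h;
		size_t step;

		memcpy(&h, p, sizeof(h));
		if (h.len < ATTR_HDRLEN || h.len > buflen)
			return NULL;	/* malformed or truncated attribute */
		if (h.type == want) {
			if (payload_len != NULL)
				*payload_len = h.len - ATTR_HDRLEN;
			return p + ATTR_HDRLEN;
		}
		step = ATTR_ALIGN(h.len);
		if (step > buflen)
			break;		/* last attribute, no trailing padding */
		p += step;
		buflen -= step;
	}
	return NULL;
}

/* Append one attribute at offset 'off'; the caller guarantees enough space. */
size_t attr_put(unsigned char *buf, size_t off, uint16_t type,
		const void *data, uint16_t dlen)
{
	struct attr_hdr h = { (uint16_t)(ATTR_HDRLEN + dlen), type };

	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + ATTR_HDRLEN, data, dlen);
	return off + ATTR_ALIGN(h.len);
}
```

An old reader that only asks for the types it knows keeps working when a
message starts carrying additional attribute types.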
The request can be extended so that it specifies which information is
required and for which tasks.
These two features significantly improve performance, because
in this case we don't need to make a system call for each task.
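A minimal sketch of such a request, using the task_diag_pid layout from
the cover letter. The numeric values of TASK_DIAG_SHOW_CRED and of the two
dump strategies are placeholders chosen for this example, and
task_diag_make_request is a hypothetical helper; the real values and any
helpers are defined by the patches.

```c
#include <stdint.h>
#include <string.h>

/* Request layout from the cover letter (__u64/__u32 spelled as stdint types). */
struct task_diag_pid {
	uint64_t show_flags;	/* which groups of information are wanted */
	uint64_t dump_stratagy;	/* which tasks; spelling as in the posting */
	uint32_t pid;
};

/* Placeholder values for illustration only, NOT the values from the patches. */
#define TASK_DIAG_SHOW_CRED     (1ULL << 0)
#define TASK_DIAG_DUMP_ALL      0
#define TASK_DIAG_DUMP_CHILDREN 1

/* Build a request that selects both the information and the set of tasks. */
struct task_diag_pid task_diag_make_request(uint64_t show_flags,
					    uint64_t strategy, uint32_t pid)
{
	struct task_diag_pid req;

	memset(&req, 0, sizeof(req));	/* also clears structure padding */
	req.show_flags = show_flags;
	req.dump_stratagy = strategy;
	req.pid = pid;
	return req;
}
```

One such request, sent once, replaces a per-task loop: for example
task_diag_make_request(TASK_DIAG_SHOW_CRED, TASK_DIAG_DUMP_CHILDREN, 1)
would ask for credentials of all children of pid 1.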
I have done a few experiments to back this up.
task_proc_all reads /proc/pid/stat for each task.
$ time ./task_proc_all > /dev/null
real 0m1.528s
user 0m0.016s
sys 0m1.341s
task_diag uses task_diag and requests information for each task
separately.
$ time ./task_diag > /dev/null
real 0m1.166s
user 0m0.024s
sys 0m1.127s
task_diag_all uses task_diag and requests information for all tasks in
one request.
$ time ./task_diag_all > /dev/null
real 0m0.077s
user 0m0.018s
sys 0m0.053s
So you can see that the ability to request information for a group of
tasks makes the interface much more efficient.
The summary of this message is that we can use the interface of
taskstats with some extensions.
Arnd, thank you for your opinion and suggestions.
>
> If it's good enough, that would seem to provide a more consistent
> API, which in turn helps users understand the interface and use it
> correctly.
>
> > Currently there are only two groups of parameters: task_diag_msg and
> > task_diag_creds.
> >
> > task_diag_msg contains a few basic parameters.
> > task_diag_creds contains credentials.
> >
> > I'm going to add other groups to describe all kinds of task properties
> > which are currently presented in procfs (e.g. /proc/pid/maps,
> > /proc/pid/fding/*, /proc/pid/status, etc).
> >
> > One of the features of task_diag is the ability to choose which
> > information is required. This minimizes the response size and the time
> > required to fill the response.
>
> I realize that you are trying to optimize for performance, but it
> would be nice to quantify this if you want to argue for requiring
> a split interface.
>
> > struct task_diag_msg {
> > __u32 tgid;
> > __u32 pid;
> > __u32 ppid;
> > __u32 tpid;
> > __u32 sid;
> > __u32 pgid;
> > __u8 state;
> > char comm[TASK_DIAG_COMM_LEN];
> > };
>
> I guess this part would be a very natural extension to the
> existing taskstats structure, and we should only add a new
> one here if there are extremely good reasons for it.
The task_diag_msg structure contains properties which are used more
frequently than the statistics from the taskstats structure.
The size of the task_diag_msg structure is 44 bytes; the size of the
taskstats structure is 328 bytes. If we have more data, we need to make
more system calls. So I have done one more experiment to see how this
affects performance:
If we use the task_diag_msg structure:
$ time ./task_diag_all > /dev/null
real 0m0.077s
user 0m0.018s
sys 0m0.053s
If we use the taskstats structure:
$ time ./task_diag_all > /dev/null
real 0m0.117s
user 0m0.029s
sys 0m0.085s
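The relation between record size and the number of system calls can be
sketched as follows. This is a rough model only: TASK_DIAG_COMM_LEN = 16 is
an assumption (it reproduces the 44-byte size quoted above on common ABIs),
the receive buffer size is a free parameter, netlink header overhead is
ignored, and the helper names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define TASK_DIAG_COMM_LEN 16	/* assumption, not taken from the patches */

struct task_diag_msg {
	uint32_t tgid;
	uint32_t pid;
	uint32_t ppid;
	uint32_t tpid;
	uint32_t sid;
	uint32_t pgid;
	uint8_t  state;
	char     comm[TASK_DIAG_COMM_LEN];
};

/* How many records of rec_size bytes fit into one receive buffer. */
size_t records_per_buffer(size_t buf_size, size_t rec_size)
{
	return rec_size ? buf_size / rec_size : 0;
}

/* Receive calls needed for ntasks records (ceiling division). */
size_t recv_calls(size_t ntasks, size_t buf_size, size_t rec_size)
{
	size_t per = records_per_buffer(buf_size, rec_size);

	return per ? (ntasks + per - 1) / per : 0;
}
```

Under these assumptions, with a hypothetical 16384-byte buffer and the
10348 tasks of the first experiment, 44-byte records need 28 receive calls
and 328-byte records need 212, which points in the same direction as the
timings above.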
Thanks,
Andrew
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-19 1:18 ` Andy Lutomirski
@ 2015-02-19 21:39 ` Andrew Vagin
2015-02-20 20:33 ` Andy Lutomirski
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Vagin @ 2015-02-19 21:39 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
linux-kernel@vger.kernel.org, Andrew Morton, Linux API,
Andrey Vagin
On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
> On Feb 18, 2015 6:27 AM, "Andrew Vagin" <avagin@parallels.com> wrote:
> >
> > On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> > > >
> > > > Here is a preview version. It provides restricted set of functionality.
> > > > I would like to collect feedback about this idea.
> > > >
> > > > Currently we use the proc file system, where all information are
> > > > presented in text files, what is convenient for humans. But if we need
> > > > to get information about processes from code (e.g. in C), the procfs
> > > > doesn't look so cool.
> > > >
> > > > From code we would prefer to get information in binary format and to be
> > > > able to specify which information and for which tasks are required. Here
> > > > is a new interface with all these features, which is called task_diag.
> > > > In addition it's much faster than procfs.
> > > >
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > >
> > > > A request is described by the task_diag_pid structure:
> > > >
> > > > struct task_diag_pid {
> > > > __u64 show_flags; /* specify which information are required */
> > > > __u64 dump_stratagy; /* specify a group of processes */
> > > >
> > > > __u32 pid;
> > > > };
> > > >
> > > > A respone is a set of netlink messages. Each message describes one task.
> > > > All task properties are divided on groups. A message contains the
> > > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > > response will contain the TASK_DIAG_CRED group which is described by the
> > > > task_diag_creds structure.
> > > >
> > > > struct task_diag_msg {
> > > > __u32 tgid;
> > > > __u32 pid;
> > > > __u32 ppid;
> > > > __u32 tpid;
> > > > __u32 sid;
> > > > __u32 pgid;
> > > > __u8 state;
> > > > char comm[TASK_DIAG_COMM_LEN];
> > > > };
> > > >
> > > > Another good feature of task_diag is an ability to request information
> > > > for a few processes. Currently here are two stratgies
> > > > TASK_DIAG_DUMP_ALL - get information for all tasks
> > > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > > > tasks
> > > >
> > > > The task diag is much faster than the proc file system. We don't need to
> > > > create a new file descriptor for each task. We need to send a request
> > > > and get a response. It allows to get information for a few task in one
> > > > request-response iteration.
> > > >
> > > > I have compared performance of procfs and task-diag for the
> > > > "ps ax -o pid,ppid" command.
> > > >
> > > > A test stand contains 10348 processes.
> > > > $ ps ax -o pid,ppid | wc -l
> > > > 10348
> > > >
> > > > $ time ps ax -o pid,ppid > /dev/null
> > > >
> > > > real 0m1.073s
> > > > user 0m0.086s
> > > > sys 0m0.903s
> > > >
> > > > $ time ./task_diag_all > /dev/null
> > > >
> > > > real 0m0.037s
> > > > user 0m0.004s
> > > > sys 0m0.020s
> > > >
> > > > And here are statistics about syscalls which were called by each
> > > > command.
> > > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > > 20,713 syscalls:sys_exit_open
> > > > 20,710 syscalls:sys_exit_close
> > > > 20,708 syscalls:sys_exit_read
> > > > 10,348 syscalls:sys_exit_newstat
> > > > 31 syscalls:sys_exit_write
> > > >
> > > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > > 114 syscalls:sys_exit_recvfrom
> > > > 49 syscalls:sys_exit_write
> > > > 8 syscalls:sys_exit_mmap
> > > > 4 syscalls:sys_exit_mprotect
> > > > 3 syscalls:sys_exit_newfstat
> > > >
> > > > You can find the test program from this experiment in the last patch.
> > > >
> > > > The idea of this functionality was suggested by Pavel Emelyanov
> > > > (xemul@), when he found that operations with /proc forms a significant
> > > > part of a checkpointing time.
> > > >
> > > > Ten years ago here was attempt to add a netlink interface to access to /proc
> > > > information:
> > > > http://lwn.net/Articles/99600/
> > >
> > > I don't suppose this could use real syscalls instead of netlink. If
> > > nothing else, netlink seems to conflate pid and net namespaces.
> >
> > What do you mean by "conflate pid and net namespaces"?
>
> A netlink socket is bound to a network namespace, but you should be
> returning data specific to a pid namespace.
That is a good question. When we mount a procfs instance, the current
pidns is saved in the superblock. If we then read data from
this procfs from another pidns, we will see pids from the pidns where
this procfs was mounted.
$ unshare -p -- bash -c '(bash)'
$ cat /proc/self/status | grep ^Pid:
Pid: 15770
$ echo $$
1
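The same observation can be made from C. The sketch below is illustrative
only; status_pid and proc_self_status_pid are made-up helper names. It
extracts the "Pid:" line from the status text so that the result can be
compared with getpid().

```c
#include <stdio.h>
#include <string.h>

/* Extract the value of the "Pid:" line from status text; -1 if absent. */
long status_pid(const char *text)
{
	const char *line = text;

	while (line != NULL && *line != '\0') {
		long pid;

		/* Only matches at the start of a line, so "PPid:" and
		 * "TracerPid:" are not mistaken for "Pid:". */
		if (sscanf(line, "Pid: %ld", &pid) == 1)
			return pid;
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return -1;
}

/* Read /proc/self/status and return the pid as that procfs instance sees it. */
long proc_self_status_pid(void)
{
	char buf[4096];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (f == NULL)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return status_pid(buf);
}
```

If the procfs instance was mounted in a different pid namespace than the
caller's, proc_self_status_pid() and getpid() disagree, just like the shell
session above.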
The situation with socket_diag is similar. A socket_diag socket is bound
to a network namespace. If we open a socket_diag socket and then change
the network namespace, it will still return information about the initial
netns.
In this version I always use the current pid namespace.
But to be consistent with other kernel logic, a diag socket has to be
linked with the pidns where it was created.
>
> On a related note, how does this interact with hidepid? More
Currently it always works like procfs with hidepid = 2 (the highest level
of security).
> generally, what privileges are you requiring to obtain what data?
It dumps information only if ptrace_may_access(tsk, PTRACE_MODE_READ)
returns true.
>
> >
> > >
> > > Also, using an asynchronous interface (send, poll?, recv) for
> > > something that's inherently synchronous (as the kernel a local
> > > question) seems awkward to me.
> >
> > Actually all requests are handled synchronously. We call sendmsg to send
> > a request and it is handled in this syscall.
> > 2) | netlink_sendmsg() {
> > 2) | netlink_unicast() {
> > 2) | taskdiag_doit() {
> > 2) 2.153 us | task_diag_fill();
> > 2) | netlink_unicast() {
> > 2) 0.185 us | netlink_attachskb();
> > 2) 0.291 us | __netlink_sendskb();
> > 2) 2.452 us | }
> > 2) + 33.625 us | }
> > 2) + 54.611 us | }
> > 2) + 76.370 us | }
> > 2) | netlink_recvmsg() {
> > 2) 1.178 us | skb_recv_datagram();
> > 2) + 46.953 us | }
> >
> > If we request information for a group of tasks (NLM_F_DUMP), a first
> > portion of data is filled from the sendmsg syscall. And then when we read
> > it, the kernel fills the next portion.
> >
> > 3) | netlink_sendmsg() {
> > 3) | __netlink_dump_start() {
> > 3) | netlink_dump() {
> > 3) | taskdiag_dumpid() {
> > 3) 0.685 us | task_diag_fill();
> > ...
> > 3) 0.224 us | task_diag_fill();
> > 3) + 74.028 us | }
> > 3) + 88.757 us | }
> > 3) + 89.296 us | }
> > 3) + 98.705 us | }
> > 3) | netlink_recvmsg() {
> > 3) | netlink_dump() {
> > 3) | taskdiag_dumpid() {
> > 3) 0.594 us | task_diag_fill();
> > ...
> > 3) 0.242 us | task_diag_fill();
> > 3) + 60.634 us | }
> > 3) + 72.803 us | }
> > 3) + 88.005 us | }
> > 3) | netlink_recvmsg() {
> > 3) | netlink_dump() {
> > 3) 2.403 us | taskdiag_dumpid();
> > 3) + 26.236 us | }
> > 3) + 40.522 us | }
> > 0) + 20.407 us | netlink_recvmsg();
> >
> >
> > netlink is really good for this type of tasks. It allows to create an
> > extendable interface which can be easy customized for different needs.
> >
> > I don't think that we would want to create another similar interface
> > just to be independent from network subsystem.
>
> I guess this is a bit streamy in that you ask one question and get
> multiple answers.
It's like seq_file in procfs. The kernel allocates a buffer, fills
it, copies it to userspace, fills it again, and so on.
And we can read data from the file in portions.
Actually, here is one more analogy. When we open a file in procfs,
we send a request to the kernel, and the file path is the request body in
this case. But in the case of procfs we can't construct requests; we only
have a set of predefined requests.
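From userspace, reading "in portions" looks the same for a seq_file and
for a dump over a socket: keep reading into a fixed buffer until the
producer has nothing left. A minimal sketch, with the hypothetical helper
name drain_in_portions and an arbitrary 4096-byte buffer:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Drain fd using reads of at most 'portion' bytes. Returns the total
 * number of bytes read, or -1 on error, and stores the number of
 * non-empty reads in *portions when it is not NULL.
 */
ssize_t drain_in_portions(int fd, size_t portion, size_t *portions)
{
	char buf[4096];
	ssize_t total = 0;
	size_t count = 0;

	if (portion == 0 || portion > sizeof(buf))
		portion = sizeof(buf);
	for (;;) {
		ssize_t n = read(fd, buf, portion);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break;	/* end of stream: nothing more to fill */
		total += n;
		count++;
	}
	if (portions != NULL)
		*portions = count;
	return total;
}
```

The caller never needs to know the total size in advance; each read
triggers the next refill on the kernel side.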
>
> >
> > Thanks,
> > Andrew
> >
> > >
> > > --Andy
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-19 21:39 ` Andrew Vagin
@ 2015-02-20 20:33 ` Andy Lutomirski
0 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2015-02-20 20:33 UTC (permalink / raw)
To: Andrew Vagin
Cc: Pavel Emelyanov, Roger Luethi, Oleg Nesterov, Cyrill Gorcunov,
linux-kernel@vger.kernel.org, Andrew Morton, Linux API,
Andrey Vagin
On Thu, Feb 19, 2015 at 1:39 PM, Andrew Vagin <avagin@parallels.com> wrote:
> On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
>> > > I don't suppose this could use real syscalls instead of netlink. If
>> > > nothing else, netlink seems to conflate pid and net namespaces.
>> >
>> > What do you mean by "conflate pid and net namespaces"?
>>
>> A netlink socket is bound to a network namespace, but you should be
>> returning data specific to a pid namespace.
>
> Here is a good question. When we mount a procfs instance, the current
> pidns is saved on a superblock. Then if we read data from
> this procfs from another pidns, we will see pid-s from the pidns where
> this procfs has been mounted.
>
> $ unshare -p -- bash -c '(bash)'
> $ cat /proc/self/status | grep ^Pid:
> Pid: 15770
> $ echo $$
> 1
>
> A similar situation with socket_diag. A socket_diag socket is bound to a
> network namespace. If we open a socket_diag socket and change a network
> namespace, it will return infromation about the initial netns.
>
> In this version I always use a current pid namespace.
> But to be consistant with other kernel logic, a socket diag has to be
> linked with a pidns where it has been created.
>
Attaching a pidns to every freshly created netlink socket seems odd,
but I don't see a better solution that still uses netlink.
>>
>> On a related note, how does this interact with hidepid? More
>
> Currently it always work as procfs with hidepid = 2 (highest level of
> security).
>
>> generally, what privileges are you requiring to obtain what data?
>
> It dumps information only if ptrace_may_access(tsk, PTRACE_MODE_READ) returns true
Sounds good to me.
>
>>
>> >
>> > >
>> > > Also, using an asynchronous interface (send, poll?, recv) for
>> > > something that's inherently synchronous (as the kernel a local
>> > > question) seems awkward to me.
>> >
>> > Actually all requests are handled synchronously. We call sendmsg to send
>> > a request and it is handled in this syscall.
>> > 2) | netlink_sendmsg() {
>> > 2) | netlink_unicast() {
>> > 2) | taskdiag_doit() {
>> > 2) 2.153 us | task_diag_fill();
>> > 2) | netlink_unicast() {
>> > 2) 0.185 us | netlink_attachskb();
>> > 2) 0.291 us | __netlink_sendskb();
>> > 2) 2.452 us | }
>> > 2) + 33.625 us | }
>> > 2) + 54.611 us | }
>> > 2) + 76.370 us | }
>> > 2) | netlink_recvmsg() {
>> > 2) 1.178 us | skb_recv_datagram();
>> > 2) + 46.953 us | }
>> >
>> > If we request information for a group of tasks (NLM_F_DUMP), a first
>> > portion of data is filled from the sendmsg syscall. And then when we read
>> > it, the kernel fills the next portion.
>> >
>> > 3) | netlink_sendmsg() {
>> > 3) | __netlink_dump_start() {
>> > 3) | netlink_dump() {
>> > 3) | taskdiag_dumpid() {
>> > 3) 0.685 us | task_diag_fill();
>> > ...
>> > 3) 0.224 us | task_diag_fill();
>> > 3) + 74.028 us | }
>> > 3) + 88.757 us | }
>> > 3) + 89.296 us | }
>> > 3) + 98.705 us | }
>> > 3) | netlink_recvmsg() {
>> > 3) | netlink_dump() {
>> > 3) | taskdiag_dumpid() {
>> > 3) 0.594 us | task_diag_fill();
>> > ...
>> > 3) 0.242 us | task_diag_fill();
>> > 3) + 60.634 us | }
>> > 3) + 72.803 us | }
>> > 3) + 88.005 us | }
>> > 3) | netlink_recvmsg() {
>> > 3) | netlink_dump() {
>> > 3) 2.403 us | taskdiag_dumpid();
>> > 3) + 26.236 us | }
>> > 3) + 40.522 us | }
>> > 0) + 20.407 us | netlink_recvmsg();
>> >
>> >
>> > netlink is really good for this type of tasks. It allows to create an
>> > extendable interface which can be easy customized for different needs.
>> >
>> > I don't think that we would want to create another similar interface
>> > just to be independent from network subsystem.
>>
>> I guess this is a bit streamy in that you ask one question and get
>> multiple answers.
>
> It's like seq_file in procfs. The kernel allocates a buffer then fills
> it, copies it into userspace, fills it again, ... repeats these actions.
> And we can read data from file by portions.
>
> Actually here is one more analogy. When we open a file in procfs,
> we sends a request to the kernel and a file path is a request body in
> this case. But in case of procfs, we can't construct requests, we only
> have a set of predefined requests.
Fair enough. Procfs is also a bit absurd and only makes sense because
it's compatible with lots of tools. In a totally sane world, I would
argue that you should issue one syscall asking questions about a bit
and you should get answers immediately.
--Andy
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-19 13:00 [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes Pavel Odintsov
@ 2015-02-27 20:43 ` Arnaldo Carvalho de Melo
2015-02-27 20:54 ` David Ahern
0 siblings, 1 reply; 19+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-02-27 20:43 UTC (permalink / raw)
To: Pavel Odintsov; +Cc: linux-kernel, David Ahern
Em Thu, Feb 19, 2015 at 05:00:12PM +0400, Pavel Odintsov escreveu:
> Hello!
>
> In addition to my post I want to mention another issue related with
> slow /proc read in perf toolkit. On my server with 25 000 processes I
> need about ~15 minutes for loading perf top toolkit completely.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=86991
Right, one way, in the 'perf top' case, would be to defer getting
thread information until we need it, i.e. when we get a sample for a
pid that we have no struct thread associated with.
We would speed up 'perf top' startup but would introduce jitter down the
line, and would be up for races, but hey, we already are, using /proc
:-/
But that would not work for 'perf record', as we need to generate those
records in advance, since we don't do any processing of samples...
Yeah, for preexisting threads we do have a problem since day one, what
we use is just what can be done with existing stuff.
I saw that there were some more messages in this thread, it's just that I
haven't found them in my mailbox when David Ahern pointed this
discussion out to me :-\
From the subject line, there is a patchkit, but I couldn't find it... Can
you resend it to me or point me to some URL where I can get it?
- Arnaldo
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-27 20:43 ` Arnaldo Carvalho de Melo
@ 2015-02-27 20:54 ` David Ahern
2015-02-27 21:50 ` Arnaldo Carvalho de Melo
0 siblings, 1 reply; 19+ messages in thread
From: David Ahern @ 2015-02-27 20:54 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Pavel Odintsov; +Cc: linux-kernel
On 2/27/15 1:43 PM, Arnaldo Carvalho de Melo wrote:
> From the subject line, there is patchkit, but I couldn't find it... Can
> you resend it to me or point me to some url where I can get it?
https://lkml.org/lkml/2015/2/17/64
* Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
2015-02-27 20:54 ` David Ahern
@ 2015-02-27 21:50 ` Arnaldo Carvalho de Melo
0 siblings, 0 replies; 19+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-02-27 21:50 UTC (permalink / raw)
To: David Ahern; +Cc: Pavel Odintsov, Andrew Vagin, linux-kernel
Em Fri, Feb 27, 2015 at 01:54:03PM -0700, David Ahern escreveu:
> On 2/27/15 1:43 PM, Arnaldo Carvalho de Melo wrote:
>
> > From the subject line, there is patchkit, but I couldn't find it... Can
> >you resend it to me or point me to some url where I can get it?
>
> https://lkml.org/lkml/2015/2/17/64
Yeah, I eventually found it, this would be great for perf:
Another good feature of task_diag is an ability to request information
for a few processes. Currently there are two strategies:
TASK_DIAG_DUMP_ALL - get information for all tasks
TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
task
I.e. 'perf record -a' would use that TASK_DIAG_DUMP_ALL to synthesize
PERF_RECORD_{FORK,COMM} events, we would still need some way to generate
the PERF_RECORD_MMAP entries tho.
- Arnaldo
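As a rough sketch of the synthesis step mentioned above: each task_diag_msg in a TASK_DIAG_DUMP_ALL response could be turned into one FORK-like and one COMM-like record. The struct task_diag_msg layout below is copied from the cover letter. Everything else is an assumption made for illustration: the value of TASK_DIAG_COMM_LEN, the reading of tgid as the process id and pid as the thread id, the simplified `synth_fork`/`synth_comm` structs (stand-ins for the real PERF_RECORD_FORK/COMM payloads, not perf's definitions), and the helper `synthesize_from_task()`. The netlink request and receive path is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TASK_DIAG_COMM_LEN 16 /* assumed value; not given in the cover letter */

/* Layout copied from the cover letter, with <stdint.h> types. */
struct task_diag_msg {
	uint32_t tgid;
	uint32_t pid;
	uint32_t ppid;
	uint32_t tpid;
	uint32_t sid;
	uint32_t pgid;
	uint8_t  state;
	char     comm[TASK_DIAG_COMM_LEN];
};

/* Simplified, hypothetical stand-ins for the perf record payloads. */
struct synth_fork {
	uint32_t pid, ppid;
	uint32_t tid, ptid;
};

struct synth_comm {
	uint32_t pid, tid;
	char comm[TASK_DIAG_COMM_LEN];
};

/* Fill one FORK-like and one COMM-like record from a single task message. */
static void synthesize_from_task(const struct task_diag_msg *msg,
				 struct synth_fork *fork_ev,
				 struct synth_comm *comm_ev)
{
	assert(msg && fork_ev && comm_ev);

	/* Assumption: tgid is the process id, pid is the thread id. */
	fork_ev->pid = msg->tgid;
	fork_ev->tid = msg->pid;
	fork_ev->ppid = msg->ppid;
	/* Assumption: no separate parent thread id, so reuse ppid. */
	fork_ev->ptid = msg->ppid;

	comm_ev->pid = msg->tgid;
	comm_ev->tid = msg->pid;
	memcpy(comm_ev->comm, msg->comm, TASK_DIAG_COMM_LEN);
	comm_ev->comm[TASK_DIAG_COMM_LEN - 1] = '\0'; /* always terminated */
}
```

Nothing here covers PERF_RECORD_MMAP: the task_diag_msg group carries no mapping information, so those entries would still have to come from somewhere else.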