linux-api.vger.kernel.org archive mirror
From: Andrew Vagin <avagin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
To: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>,
	Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>,
	David Ahern <dsahern-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	Cyrill Gorcunov
	<gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>,
	Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>,
	Roger Luethi <rl-7uj+XXdSDtwfv37vnLkPlQ@public.gmane.org>,
	Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>,
	Arnaldo Carvalho de Melo
	<acme-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Pavel Odintsov
	<pavel.odintsov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: [PATCH 0/24] kernel: add a netlink interface to get information about processes (v2)
Date: Tue, 24 Nov 2015 18:18:12 +0300
Message-ID: <20151124151811.GA16393@odin.com>
In-Reply-To: <1436172445-6979-1-git-send-email-avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

Hello Everybody,

Sorry for the long delay. I wanted to resurrect this thread.

Andy suggested creating a new syscall instead of using the netlink
interface:
> Would it make more sense to have a new syscall instead?  You could
> even still use nlattr formatting for the syscall results.

I tried to implement it to see what it would look like. Here is my
version:
https://github.com/avagin/linux-task-diag/blob/task_diag_syscall/kernel/task_diag.c#L665
I could not come up with a better interface for it than using netlink
messages as arguments. I know it looks weird.

I can't say that I understand why a new system call is better
than a netlink socket, so I tried to solve the problems which
were raised about the netlink interface.

The major question was how to support pid and user namespaces in task_diag.
I think I found a good and logical solution.

As for pidns, we can use scm credentials, which are attached to each
socket message. They contain the requestor’s pid, and we can get a pid
namespace from it. This gives us a nice feature: a pid namespace can be
specified without entering it. To do that, a user needs to specify
any process from this pidns in an scm message.
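
A minimal userspace sketch of the scm idea above, assuming the request is
sent with sendmsg() and carries a standard SCM_CREDENTIALS control message.
The helper name attach_pid_creds() is made up for illustration and is not
part of the patches. Per unix(7), the kernel validates these credentials,
and naming a PID other than the sender's own normally requires
CAP_SYS_ADMIN; this message does not say whether task_diag would change
that.

```c
#define _GNU_SOURCE            /* struct ucred and SCM_CREDENTIALS */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Fill `msg` so that it carries `payload` plus an SCM_CREDENTIALS control
 * message naming `pid`. `ctrl` must hold at least
 * CMSG_SPACE(sizeof(struct ucred)) bytes.
 * Returns 0 on success, -1 if the control buffer is too small.
 * (Hypothetical helper, not taken from the task_diag patches.)
 */
int attach_pid_creds(struct msghdr *msg, struct iovec *iov,
                     void *payload, size_t payload_len,
                     void *ctrl, size_t ctrl_len,
                     pid_t pid, uid_t uid, gid_t gid)
{
    struct ucred creds = { .pid = pid, .uid = uid, .gid = gid };
    struct cmsghdr *cmsg;

    if (ctrl_len < CMSG_SPACE(sizeof(creds)))
        return -1;

    memset(msg, 0, sizeof(*msg));
    memset(ctrl, 0, ctrl_len);

    /* The regular payload (e.g. a task_diag request) goes into the iovec. */
    iov->iov_base = payload;
    iov->iov_len = payload_len;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;

    /* The credentials travel as ancillary data next to the payload. */
    msg->msg_control = ctrl;
    msg->msg_controllen = CMSG_SPACE(sizeof(creds));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_CREDENTIALS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(creds));
    memcpy(CMSG_DATA(cmsg), &creds, sizeof(creds));
    return 0;
}
```

The filled msghdr would then be passed to sendmsg() on the socket; the pid
in the control message is what would select the pid namespace.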

As for credentials, we can get them from file->f_cred. In this case we
are able to create a socket and then drop privileges of the current
process, and the socket will keep working as before. This is the common
behaviour for file descriptors.

As before, I am inclined to use the netlink interface for task_diag:

* Netlink is designed for this type of workload. It allows extending
  the interface while preserving backward compatibility. It allows
  generating packets with different sets of parameters.
* If we use a file descriptor, we can create it and then drop
  capabilities of the current process. It's a good feature which will be
  unavailable if we decide to create a system call.
* task_stat is a bad example, because a few problems were not solved in
  it.

I’m going to send the next version of the task_diag patches in a few
days. Any comments are welcome.

Here is the git repo with the current version:
https://github.com/avagin/linux-task-diag/commits/master

Thanks,
Andrew

On Mon, Jul 06, 2015 at 11:47:01AM +0300, Andrey Vagin wrote:
> Currently we use the proc file system, where all information is
> presented in text files, which is convenient for humans.  But if we need
> to get information about processes from code (e.g. in C), procfs is
> much less convenient.
> 
> From code we would prefer to get information in binary format and to be
> able to specify which information is required and for which tasks. Here
> is a new interface with all these features, which is called task_diag.
> In addition it's much faster than procfs.
> 
> task_diag is based on netlink sockets and looks like socket-diag, which
> is used to get information about sockets.
> 
> A request is described by the task_diag_pid structure:
> 
> struct task_diag_pid {
>        __u64   show_flags;	/* specify which information is required */
>        __u64   dump_stratagy;   /* specify a group of processes */
> 
>        __u32   pid;
> };
> 
> dump_stratagy specifies a group of processes:
> /* per-process strategies */
> TASK_DIAG_DUMP_CHILDREN	- all children
> TASK_DIAG_DUMP_THREAD	- all threads
> TASK_DIAG_DUMP_ONE	- one process
> /* system wide strategies (the pid field is ignored) */
> TASK_DIAG_DUMP_ALL	  - all processes
> TASK_DIAG_DUMP_ALL_THREAD - all threads
> 
> show_flags specifies which information is required.
> If we set the TASK_DIAG_SHOW_BASE flag, the response message will
> contain the TASK_DIAG_BASE attribute which is described by the
> task_diag_base structure.
> 
> struct task_diag_base {
> 	__u32	tgid;
> 	__u32	pid;
> 	__u32	ppid;
> 	__u32	tpid;
> 	__u32	sid;
> 	__u32	pgid;
> 	__u8	state;
> 	char	comm[TASK_DIAG_COMM_LEN];
> };
> 
> In future, it can be extended by optional attributes. The request
> describes which task properties are required and for which processes.
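
As an illustration of the request format described above, here is a rough
sketch that packs a task_diag_pid request into a netlink message. The
struct is copied from the quoted text; the numeric values of
TASK_DIAG_SHOW_BASE, TASK_DIAG_DUMP_CHILDREN and TASK_DIAG_MSG_TYPE are
placeholders chosen for this example, and build_children_request() is a
hypothetical helper. It also assumes the request sits directly after the
netlink header and that the standard NLM_F_DUMP flag is used for dumps; the
actual patches may require an additional generic netlink header.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Request layout as given in the quoted cover letter. */
struct task_diag_pid {
    uint64_t show_flags;     /* which information is required */
    uint64_t dump_stratagy;  /* which group of processes */
    uint32_t pid;
};

/* Placeholder values: the real ones are defined by the patch series. */
#define TASK_DIAG_SHOW_BASE     (1ULL << 0)
#define TASK_DIAG_DUMP_CHILDREN 1
#define TASK_DIAG_MSG_TYPE      0x20

/*
 * Write a "dump all children of pid, base info only" request into buf.
 * buf should be suitably aligned for struct nlmsghdr.
 * Returns the number of bytes to send, or 0 if buf is too small.
 */
size_t build_children_request(void *buf, size_t buflen,
                              uint32_t pid, uint32_t seq)
{
    struct nlmsghdr *nlh = buf;
    struct task_diag_pid req = {
        .show_flags = TASK_DIAG_SHOW_BASE,
        .dump_stratagy = TASK_DIAG_DUMP_CHILDREN,
        .pid = pid,
    };
    size_t len = NLMSG_LENGTH(sizeof(req));

    if (buflen < NLMSG_ALIGN(len))
        return 0;

    memset(buf, 0, NLMSG_ALIGN(len));
    nlh->nlmsg_len = len;
    nlh->nlmsg_type = TASK_DIAG_MSG_TYPE;
    nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    nlh->nlmsg_seq = seq;
    memcpy(NLMSG_DATA(nlh), &req, sizeof(req));
    return NLMSG_ALIGN(len);
}
```

The returned length is what a caller would hand to send() or sendto() on
the netlink socket.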
> 
> A response can be divided into a few netlink packets if the NETLINK_DUMP
> has been set in a request. Each task is described by a message. Each
> message contains the TASK_DIAG_PID attribute and the optional attributes
> which have been requested (show_flags). A message can be divided into a
> few parts if it doesn’t fit into the current netlink packet. In this case,
> the first message in the next packet contains the same PID and the
> attributes which didn’t fit into the previous message.
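
The splitting rule above means a receiver has to remember the last PID it
saw across packets. The sketch below shows one way to do that, counting
tasks while treating consecutive messages with the same PID as parts of one
task. It assumes attributes start directly after the netlink header and
that TASK_DIAG_PID carries a 32-bit pid; the attribute number 1 and the
names task_counter and feed_packet() are made up for this example.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define TASK_DIAG_PID 1 /* placeholder attribute type number */

/* State kept between packets so a split task is not counted twice. */
struct task_counter {
    uint32_t last_pid;
    int have_last;
    int tasks;
};

/* Return the first attribute of `type` in a payload, or NULL. */
static const struct nlattr *find_attr(const void *payload, int len,
                                      uint16_t type)
{
    const char *p = payload;

    while (len >= NLA_HDRLEN) {
        const struct nlattr *nla = (const struct nlattr *)p;
        int step;

        if (nla->nla_len < NLA_HDRLEN || nla->nla_len > len)
            break;                       /* malformed attribute */
        if ((nla->nla_type & NLA_TYPE_MASK) == type)
            return nla;
        step = NLA_ALIGN(nla->nla_len);
        p += step;
        len -= step;
    }
    return NULL;
}

/* Process one received packet, which may hold several messages. */
void feed_packet(struct task_counter *tc, void *buf, int len)
{
    struct nlmsghdr *nlh;

    for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        const struct nlattr *nla;
        uint32_t pid;

        if (nlh->nlmsg_type == NLMSG_DONE ||
            nlh->nlmsg_type == NLMSG_ERROR)
            break;

        nla = find_attr(NLMSG_DATA(nlh),
                        (int)(nlh->nlmsg_len - NLMSG_HDRLEN),
                        TASK_DIAG_PID);
        if (!nla || nla->nla_len < NLA_HDRLEN + sizeof(pid))
            continue;                    /* no usable PID attribute */

        memcpy(&pid, (const char *)nla + NLA_HDRLEN, sizeof(pid));
        /* Same PID as the previous message: continuation, not a new task. */
        if (!tc->have_last || pid != tc->last_pid)
            tc->tasks++;
        tc->last_pid = pid;
        tc->have_last = 1;
    }
}
```

A caller would zero-initialise one task_counter per dump and call
feed_packet() for every buffer returned by recv() until NLMSG_DONE arrives.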
> 
> task_diag is much faster than the proc file system. We don't need to
> create a new file descriptor for each task. We only need to send a request
> and get a response. This allows getting information for several tasks in
> one request-response iteration.
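
For contrast, the per-task procfs path mentioned above looks roughly like
the sketch below: one open, one read and one close for every task, followed
by text parsing. The function names are invented for this example, and this
is not the task_proc_all test program from the series. The parser relies on
the documented /proc/PID/stat layout, "pid (comm) state ppid ...", and looks
for the last ')' because comm may itself contain spaces and parentheses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract pid and ppid from one /proc/PID/stat line.
 * Returns 0 on success, -1 if the line does not look like a stat line.
 */
int parse_stat_line(const char *line, int *pid, int *ppid)
{
    const char *rparen = strrchr(line, ')');
    char state;

    if (!rparen || sscanf(line, "%d", pid) != 1)
        return -1;
    /* After the closing ')': state character, then the parent pid. */
    if (sscanf(rparen + 1, " %c %d", &state, ppid) != 2)
        return -1;
    return 0;
}

/* One open/read/close cycle per task, as procfs requires. */
int read_ppid(int task_pid, int *ppid)
{
    char path[64], line[1024];
    int pid;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", task_pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_stat_line(line, &pid, ppid);
}
```

Repeating read_ppid() for every entry under /proc is what produces the tens
of thousands of open, read and close calls in the procfs trace later in
this message.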
> 
> As for security, task_diag always works as procfs with hidepid = 2 (highest
> level of security).
> 
> I have compared performance of procfs and task-diag for the
> "ps ax -o pid,ppid" command.
> 
> The test machine runs 30108 processes.
> $ ps ax -o pid,ppid | wc -l
> 30108
> 
> $ time ps ax -o pid,ppid > /dev/null
> 
> real	0m0.836s
> user	0m0.238s
> sys	0m0.583s
> 
> Read /proc/PID/stat for each task
> $ time ./task_proc_all > /dev/null
> 
> real	0m0.258s
> user	0m0.019s
> sys	0m0.232s
> 
> $ time ./task_diag_all > /dev/null
> 
> real	0m0.052s
> user	0m0.013s
> sys	0m0.036s
> 
> And here are statistics on syscalls which were called by each
> command.
> 
> $ perf trace -s -o log -- ./task_proc_all > /dev/null
> 
>  Summary of events:
> 
>  task_proc_all (30781), 180785 events, 100.0%, 0.000 msec
> 
>    syscall            calls      min       avg       max      stddev
>                                (msec)    (msec)    (msec)        (%)
>    --------------- -------- --------- --------- ---------     ------
>    read               30111     0.000     0.013     0.107      0.21%
>    write                  1     0.008     0.008     0.008      0.00%
>    open               30111     0.007     0.012     0.145      0.24%
>    close              30112     0.004     0.011     0.110      0.20%
>    fstat                  3     0.009     0.013     0.016     16.15%
>    mmap                   8     0.011     0.020     0.027     11.24%
>    mprotect               4     0.019     0.023     0.028      8.33%
>    munmap                 1     0.026     0.026     0.026      0.00%
>    brk                    8     0.007     0.015     0.024     11.94%
>    ioctl                  1     0.007     0.007     0.007      0.00%
>    access                 1     0.019     0.019     0.019      0.00%
>    execve                 1     0.000     0.000     0.000      0.00%
>    getdents              29     0.008     1.010     2.215      8.88%
>    arch_prctl             1     0.016     0.016     0.016      0.00%
>    openat                 1     0.021     0.021     0.021      0.00%
> 
> 
> $ perf trace -s -o log -- ./task_diag_all > /dev/null
>  Summary of events:
> 
>  task_diag_all (30762), 717 events, 98.9%, 0.000 msec
> 
>    syscall            calls      min       avg       max      stddev
>                                (msec)    (msec)    (msec)        (%)
>    --------------- -------- --------- --------- ---------     ------
>    read                   2     0.000     0.008     0.016    100.00%
>    write                197     0.008     0.019     0.041      3.00%
>    open                   2     0.023     0.029     0.036     22.45%
>    close                  3     0.010     0.012     0.014     11.34%
>    fstat                  3     0.012     0.044     0.106     70.52%
>    mmap                   8     0.014     0.031     0.054     18.88%
>    mprotect               4     0.016     0.023     0.027     10.93%
>    munmap                 1     0.022     0.022     0.022      0.00%
>    brk                    1     0.040     0.040     0.040      0.00%
>    ioctl                  1     0.011     0.011     0.011      0.00%
>    access                 1     0.032     0.032     0.032      0.00%
>    getpid                 1     0.012     0.012     0.012      0.00%
>    socket                 1     0.032     0.032     0.032      0.00%
>    sendto                 2     0.032     0.095     0.157     65.77%
>    recvfrom             129     0.009     0.235     0.418      2.45%
>    bind                   1     0.018     0.018     0.018      0.00%
>    execve                 1     0.000     0.000     0.000      0.00%
>    arch_prctl             1     0.012     0.012     0.012      0.00%
> 
> You can find the test program from this experiment in tools/test/selftest/taskdiag.
> 
> The idea of this functionality was suggested by Pavel Emelyanov
> (xemul@), when he found that operations with /proc form a significant
> part of checkpointing time.
> 
> Ten years ago there was an attempt to add a netlink interface for accessing
> /proc information:
> http://lwn.net/Articles/99600/
> 
> git repo: https://github.com/avagin/linux-task-diag
> 
> Changes from the first version:
> 
> David Ahern implemented all required functionality to use task_diag in
> perf.
> 
> Below you can find his results showing how it affects performance.
> > Using the fork test command:
> >    10,000 processes; 10k proc with 5 threads = 50,000 tasks
> >    reading /proc: 11.3 sec
> >    task_diag:      2.2 sec
> >
> > @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
> >
> > 128 instances of specjbb, 80,000+ tasks:
> >     reading /proc: 32.1 sec
> >     task_diag:      3.9 sec
> >
> > So overall much snappier startup times.
> 
> Many thanks to David Ahern for the help with improving task_diag.
> 
> Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> Cc: Cyrill Gorcunov <gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> Cc: Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Roger Luethi <rl-7uj+XXdSDtwfv37vnLkPlQ@public.gmane.org>
> Cc: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org>
> Cc: Arnaldo Carvalho de Melo <acme-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: David Ahern <dsahern-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Cc: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> Cc: Pavel Odintsov <pavel.odintsov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> --
> 2.1.0
> 