public inbox for linux-kernel@vger.kernel.org
From: Andrew Vagin <avagin@parallels.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Andrey Vagin <avagin@openvz.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Roger Luethi <rl@hellgate.ch>, Oleg Nesterov <oleg@redhat.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux API <linux-api@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
Date: Wed, 18 Feb 2015 17:27:19 +0300
Message-ID: <20150218142718.GA30542@paralelels.com>
In-Reply-To: <CALCETrWyQpr-x=No4mK_95gSANL-_fTr3qC7WjT_5TyFQb_rGw@mail.gmail.com>

On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@openvz.org> wrote:
> >
> > Here is a preview version. It provides a restricted set of functionality.
> > I would like to collect feedback about this idea.
> >
> > Currently we use the proc file system, where all information is
> > presented in text files, which is convenient for humans.  But if we need
> > to get information about processes from code (e.g. in C), procfs
> > doesn't look so cool.
> >
> > From code we would prefer to get information in binary format and to be
> > able to specify which information is required and for which tasks. Here
> > is a new interface with all these features, which is called task_diag.
> > In addition, it's much faster than procfs.
> >
> > task_diag is based on netlink sockets and looks like socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> >        __u64   show_flags;      /* specify which information is required */
> >        __u64   dump_stratagy;   /* specify a group of processes */
> >
> >        __u32   pid;
> > };
> >
> > A response is a set of netlink messages. Each message describes one task.
> > All task properties are divided into groups. A message contains the
> > TASK_DIAG_MSG group and other groups if they have been requested in
> > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
> > response will contain the TASK_DIAG_CRED group, which is described by the
> > task_diag_creds structure.
> >
> > struct task_diag_msg {
> >         __u32   tgid;
> >         __u32   pid;
> >         __u32   ppid;
> >         __u32   tpid;
> >         __u32   sid;
> >         __u32   pgid;
> >         __u8    state;
> >         char    comm[TASK_DIAG_COMM_LEN];
> > };
> >
> > Another good feature of task_diag is the ability to request information
> > for several processes at once. Currently there are two strategies:
> > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> >                           task
> >
> > task_diag is much faster than the proc file system. We don't need to
> > create a new file descriptor for each task. We only need to send a
> > request and get a response. This makes it possible to get information
> > for several tasks in one request-response iteration.
> >
> > I have compared performance of procfs and task-diag for the
> > "ps ax -o pid,ppid" command.
> >
> > A test stand contains 10348 processes.
> > $ ps ax -o pid,ppid | wc -l
> > 10348
> >
> > $ time ps ax -o pid,ppid > /dev/null
> >
> > real    0m1.073s
> > user    0m0.086s
> > sys     0m0.903s
> >
> > $ time ./task_diag_all > /dev/null
> >
> > real    0m0.037s
> > user    0m0.004s
> > sys     0m0.020s
> >
> > And here are statistics about syscalls which were called by each
> > command.
> > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> >             20,713      syscalls:sys_exit_open
> >             20,710      syscalls:sys_exit_close
> >             20,708      syscalls:sys_exit_read
> >             10,348      syscalls:sys_exit_newstat
> >                 31      syscalls:sys_exit_write
> >
> > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> >                114      syscalls:sys_exit_recvfrom
> >                 49      syscalls:sys_exit_write
> >                  8      syscalls:sys_exit_mmap
> >                  4      syscalls:sys_exit_mprotect
> >                  3      syscalls:sys_exit_newfstat
> >
> > You can find the test program from this experiment in the last patch.
> >
> > The idea of this functionality was suggested by Pavel Emelyanov
> > (xemul@), when he found that operations on /proc form a significant
> > part of checkpointing time.
> >
> > Ten years ago there was an attempt to add a netlink interface for
> > accessing /proc information:
> > http://lwn.net/Articles/99600/
> 
> I don't suppose this could use real syscalls instead of netlink.  If
> nothing else, netlink seems to conflate pid and net namespaces.

What do you mean by "conflate pid and net namespaces"?

> 
> Also, using an asynchronous interface (send, poll?, recv) for
> something that's inherently synchronous (ask the kernel a local
> question) seems awkward to me.

Actually, all requests are handled synchronously. We call sendmsg() to
send a request, and the request is handled within that syscall:
 2)               |  netlink_sendmsg() {
 2)               |    netlink_unicast() {
 2)               |      taskdiag_doit() {
 2)   2.153 us    |        task_diag_fill();
 2)               |        netlink_unicast() {
 2)   0.185 us    |          netlink_attachskb();
 2)   0.291 us    |          __netlink_sendskb();
 2)   2.452 us    |        }
 2) + 33.625 us   |      }
 2) + 54.611 us   |    }
 2) + 76.370 us   |  }
 2)               |  netlink_recvmsg() {
 2)   1.178 us    |    skb_recv_datagram();
 2) + 46.953 us   |  }
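
For illustration, a request for this kind of exchange might be packed as
in the sketch below. struct task_diag_pid is copied from the cover
letter; the EX_* constants, the message type and the plain nlmsghdr
framing are assumptions made for the example and are not taken from the
patches, which may frame the request differently.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Request layout as quoted in the cover letter (field spelling kept). */
struct task_diag_pid {
	__u64 show_flags;    /* which groups of information are wanted */
	__u64 dump_stratagy; /* which group of processes to dump */
	__u32 pid;
};

/* Hypothetical values, for illustration only; not taken from the patches. */
#define EX_TASK_DIAG_MSG_TYPE      0x20
#define EX_TASK_DIAG_DUMP_CHILDREN 1ULL

/*
 * Pack one request into buf. Returns nlmsg_len, or 0 if buf is too small.
 * buf must be suitably aligned for struct nlmsghdr.
 */
static size_t build_task_diag_request(void *buf, size_t buflen, __u32 pid,
				      __u64 show_flags, int dump_children)
{
	struct nlmsghdr *nlh = buf;
	struct task_diag_pid req;
	size_t len = NLMSG_LENGTH(sizeof(req));

	assert(buf != NULL);
	if (buflen < len)
		return 0;

	memset(&req, 0, sizeof(req));
	req.show_flags = show_flags;
	req.dump_stratagy = dump_children ? EX_TASK_DIAG_DUMP_CHILDREN : 0;
	req.pid = pid;

	memset(nlh, 0, sizeof(*nlh));
	nlh->nlmsg_len = len;
	nlh->nlmsg_type = EX_TASK_DIAG_MSG_TYPE;
	/* NLM_F_DUMP asks the kernel for a multi-message dump. */
	nlh->nlmsg_flags = NLM_F_REQUEST | (dump_children ? NLM_F_DUMP : 0);
	memcpy(NLMSG_DATA(nlh), &req, sizeof(req));
	return len;
}
```

The returned length is what would be handed to send() or sendmsg() on
the netlink socket; the reply is then read back with recv().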

If we request information for a group of tasks (NLM_F_DUMP), the first
portion of data is filled in from the sendmsg syscall. Then, each time
we read a portion, the kernel fills in the next one.

 3)               |  netlink_sendmsg() {
 3)               |    __netlink_dump_start() {
 3)               |      netlink_dump() {
 3)               |        taskdiag_dumpid() {
 3)   0.685 us    |          task_diag_fill();
...
 3)   0.224 us    |          task_diag_fill();
 3) + 74.028 us   |        }
 3) + 88.757 us   |      }
 3) + 89.296 us   |    }
 3) + 98.705 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)               |      taskdiag_dumpid() {
 3)   0.594 us    |        task_diag_fill();
...
 3)   0.242 us    |        task_diag_fill();
 3) + 60.634 us   |      }
 3) + 72.803 us   |    }
 3) + 88.005 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)   2.403 us    |      taskdiag_dumpid();
 3) + 26.236 us   |    }
 3) + 40.522 us   |  }
 0) + 20.407 us   |  netlink_recvmsg();
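
On the user-space side, each recv() of such a dump returns a batch of
messages, and the reader keeps calling recv() until it sees NLMSG_DONE.
A minimal sketch of walking one batch with the standard netlink macros
follows. The function name is made up for the example, and any message
that is not NLMSG_DONE or NLMSG_ERROR is simply counted as one task
record, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Walk one batch returned by recv().
 * Returns the number of payload messages seen, or -1 on NLMSG_ERROR.
 * Sets *done to 1 when NLMSG_DONE terminates the dump.
 */
static int walk_dump_batch(void *buf, int len, int *done)
{
	struct nlmsghdr *nlh;
	int count = 0;

	assert(buf != NULL && done != NULL);
	for (nlh = buf; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE) {
			*done = 1;	/* no more batches will follow */
			break;
		}
		if (nlh->nlmsg_type == NLMSG_ERROR)
			return -1;
		count++;		/* one task record (assumed) */
	}
	return count;
}
```

A caller would loop on recv() and pass each batch to this function,
stopping once *done is set or an error is returned.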


Netlink is really good for this kind of task. It allows us to create an
extensible interface which can easily be customized for different needs.

I don't think that we would want to create another similar interface
just to be independent of the network subsystem.
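
To illustrate the extensibility point: if the groups in a response are
carried as netlink attributes (an assumption for this sketch), a reader
can pick out the groups it knows and skip the rest, so new groups can be
added later without breaking old readers. The helper name below is
hypothetical; struct nlattr and the NLA_* macros come from
<linux/netlink.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Find the first attribute of the given type in a flat attribute stream.
 * Returns a pointer to its payload and stores the payload size in
 * *payload_len, or returns NULL if the type is not present.
 * attrs must be suitably aligned for struct nlattr.
 */
static const void *find_group(const void *attrs, size_t len,
			      unsigned short type, size_t *payload_len)
{
	const char *p = attrs;

	assert(attrs != NULL && payload_len != NULL);
	while (len >= (size_t)NLA_HDRLEN) {
		const struct nlattr *nla = (const struct nlattr *)p;
		size_t step;

		if (nla->nla_len < NLA_HDRLEN || nla->nla_len > len)
			break;		/* malformed or truncated */
		if ((nla->nla_type & NLA_TYPE_MASK) == type) {
			*payload_len = nla->nla_len - NLA_HDRLEN;
			return p + NLA_HDRLEN;
		}
		/* Unknown or unwanted group: skip it, including padding. */
		step = NLA_ALIGN(nla->nla_len);
		if (step > len)
			break;
		p += step;
		len -= step;
	}
	return NULL;
}
```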

Thanks,
Andrew

> 
> --Andy
