From: Andrew Vagin <avagin-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
To: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
Cc: Andrey Vagin <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>,
	Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>,
	Roger Luethi <rl-7uj+XXdSDtwfv37vnLkPlQ@public.gmane.org>,
	Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Cyrill Gorcunov
	<gorcunov-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes
Date: Wed, 18 Feb 2015 17:27:19 +0300
Message-ID: <20150218142718.GA30542@paralelels.com>
In-Reply-To: <CALCETrWyQpr-x=No4mK_95gSANL-_fTr3qC7WjT_5TyFQb_rGw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> wrote:
> >
> > Here is a preview version. It provides a restricted set of functionality.
> > I would like to collect feedback about this idea.
> >
> > Currently we use the proc file system, where all information is
> > presented in text files, which is convenient for humans.  But when we need
> > to get information about processes from code (e.g. in C), procfs
> > is far less convenient.
> >
> > From code we would prefer to get information in a binary format and to be
> > able to specify which information is required and for which tasks. Here
> > is a new interface with all these features, called task_diag.
> > In addition, it is much faster than procfs.
> >
> > task_diag is based on netlink sockets and is similar to socket-diag, which
> > is used to get information about sockets.
> >
> > A request is described by the task_diag_pid structure:
> >
> > struct task_diag_pid {
> >        __u64   show_flags;      /* specify which information is required */
> >        __u64   dump_stratagy;   /* specify a group of processes */
> >
> >        __u32   pid;
> > };
> >
> > A response is a set of netlink messages. Each message describes one task.
> > Task properties are divided into groups. A message always contains the
> > TASK_DIAG_MSG group, plus any other groups requested in
> > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, the
> > response will contain the TASK_DIAG_CRED group, which is described by the
> > task_diag_creds structure.
> >
> > struct task_diag_msg {
> >         __u32   tgid;
> >         __u32   pid;
> >         __u32   ppid;
> >         __u32   tpid;
> >         __u32   sid;
> >         __u32   pgid;
> >         __u8    state;
> >         char    comm[TASK_DIAG_COMM_LEN];
> > };
> >
> > Another useful feature of task_diag is the ability to request information
> > for several processes at once. Currently there are two strategies:
> > TASK_DIAG_DUMP_ALL      - get information for all tasks
> > TASK_DIAG_DUMP_CHILDREN - get information for the children of a specified
> >                           task
> >
> > task_diag is much faster than the proc file system. We don't need to
> > open a new file descriptor for each task; we just send a request
> > and read the response. This lets us get information about many tasks
> > in one request-response iteration.
> >
> > I have compared the performance of procfs and task_diag for the
> > "ps ax -o pid,ppid" command.
> >
> > The test system has 10348 processes:
> > $ ps ax -o pid,ppid | wc -l
> > 10348
> >
> > $ time ps ax -o pid,ppid > /dev/null
> >
> > real    0m1.073s
> > user    0m0.086s
> > sys     0m0.903s
> >
> > $ time ./task_diag_all > /dev/null
> >
> > real    0m0.037s
> > user    0m0.004s
> > sys     0m0.020s
> >
> > And here are statistics for the syscalls made by each
> > command.
> > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid  2>&1 | grep syscalls | sort -n -r | head -n 5
> >             20,713      syscalls:sys_exit_open
> >             20,710      syscalls:sys_exit_close
> >             20,708      syscalls:sys_exit_read
> >             10,348      syscalls:sys_exit_newstat
> >                 31      syscalls:sys_exit_write
> >
> > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all  2>&1 | grep syscalls | sort -n -r | head -n 5
> >                114      syscalls:sys_exit_recvfrom
> >                 49      syscalls:sys_exit_write
> >                  8      syscalls:sys_exit_mmap
> >                  4      syscalls:sys_exit_mprotect
> >                  3      syscalls:sys_exit_newfstat
> >
> > You can find the test program used in this experiment in the last patch.
> >
> > The idea of this functionality was suggested by Pavel Emelyanov
> > (xemul@), when he found that operations on /proc form a significant
> > part of checkpointing time.
> >
> > Ten years ago there was an attempt to add a netlink interface for
> > accessing /proc information:
> > http://lwn.net/Articles/99600/
> 
> I don't suppose this could use real syscalls instead of netlink.  If
> nothing else, netlink seems to conflate pid and net namespaces.

What do you mean by "conflate pid and net namespaces"?

> 
> Also, using an asynchronous interface (send, poll?, recv) for
> something that's inherently synchronous (asking the kernel a local
> question) seems awkward to me.

Actually, all requests are handled synchronously. We call sendmsg to send
a request, and it is handled within that syscall:
 2)               |  netlink_sendmsg() {
 2)               |    netlink_unicast() {
 2)               |      taskdiag_doit() {
 2)   2.153 us    |        task_diag_fill();
 2)               |        netlink_unicast() {
 2)   0.185 us    |          netlink_attachskb();
 2)   0.291 us    |          __netlink_sendskb();
 2)   2.452 us    |        }
 2) + 33.625 us   |      }
 2) + 54.611 us   |    }
 2) + 76.370 us   |  }
 2)               |  netlink_recvmsg() {
 2)   1.178 us    |    skb_recv_datagram();
 2) + 46.953 us   |  }

If we request information for a group of tasks (NLM_F_DUMP), the first
portion of data is filled in during the sendmsg syscall. Then, each time
we read, the kernel fills in the next portion:

 3)               |  netlink_sendmsg() {
 3)               |    __netlink_dump_start() {
 3)               |      netlink_dump() {
 3)               |        taskdiag_dumpid() {
 3)   0.685 us    |          task_diag_fill();
...
 3)   0.224 us    |          task_diag_fill();
 3) + 74.028 us   |        }
 3) + 88.757 us   |      }
 3) + 89.296 us   |    }
 3) + 98.705 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)               |      taskdiag_dumpid() {
 3)   0.594 us    |        task_diag_fill();
...
 3)   0.242 us    |        task_diag_fill();
 3) + 60.634 us   |      }
 3) + 72.803 us   |    }
 3) + 88.005 us   |  }
 3)               |  netlink_recvmsg() {
 3)               |    netlink_dump() {
 3)   2.403 us    |      taskdiag_dumpid();
 3) + 26.236 us   |    }
 3) + 40.522 us   |  }
 0) + 20.407 us   |  netlink_recvmsg();


netlink is a good fit for this type of task. It allows us to create an
extensible interface which can easily be customized for different needs.

I don't think we would want to create another similar interface
just to be independent of the network subsystem.

Thanks,
Andrew

> 
> --Andy
