Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting

Linux Container Development

* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found] <1285249681.1837.28.camel@holzheu-laptop>
@ 2010-09-23 20:11 ` Andrew Morton
       [not found]   ` <20100923131136.356075f4.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
                     ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Andrew Morton @ 2010-09-23 20:11 UTC (permalink / raw)
  To: holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: Shailabh Nagar, Thomas Gleixner, Peter Zijlstra, Oleg Nesterov,
	Heiko Carstens, Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, Balbir Singh,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Martin Schwidefsky, Ingo Molnar, Suresh Siddha

On Thu, 23 Sep 2010 15:48:01 +0200
Michael Holzheu <holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> Currently tools like "top" gather the task information by reading procfs
> files. This has several disadvantages:
> 
> * It is very CPU intensive, because a lot of system calls (readdir, open,
>   read, close) are necessary.
> * No real task snapshot can be provided, because while the procfs files are
>   read the system continues running.
> * The procfs times granularity is restricted to jiffies.
> 
> In parallel to procfs there exists the taskstats binary interface that uses
> netlink sockets as transport mechanism to deliver task information to
> user space. There exists a taskstats command "TASKSTATS_CMD_ATTR_PID"
> to get task information for a given PID. This command can already be used for
> tools like top, but has also several disadvantages:
> 
> * You first have to find out which PIDs are available in the system. Currently
>   we have to use procfs again to do this.
> * For each task two system calls have to be issued (First send the command and
>   then receive the reply).
> * No snapshot mechanism is available.
> 
> GOALS OF THIS PATCH SET
> -----------------------
> The intention of this patch set is to provide better support for tools like
> top. The goal is to:
> 
> * provide a task snapshot mechanism where we can get a consistent view of
>   all running tasks.
> * provide a transport mechanism that does not require a lot of system calls
>   and that allows implementing low CPU overhead task monitoring.
> * provide microsecond CPU time granularity.

This is a big change!  If this is done right then we're heading in the
direction of deprecating the longstanding way in which userspace
observes the state of Linux processes and we're recommending that the
whole world migrate to taskstats.  I think?

If so, much chin-scratching will be needed, coordination with
util-linux people, etc.

We'd need to think about the implications of taskstats versioning.  It
_is_ a versioned interface, so people can't just go and toss random new
stuff in there at will - it's not like adding a new procfs file, or
adding a new line to an existing one.  I don't know if that's likely to
be a significant problem.

I worry that there's a dependency on CONFIG_NET?  If so then that's a
big problem because in N years time, 99% of the world will be using
taskstats, but a few embedded losers will be stuck using (and having to
support) the old tools.


> FIRST RESULTS
> -------------
> Together with this kernel patch set also user space code for a new top
> utility (ptop) is provided that exploits the new kernel infrastructure. See
> patch 10 for more details.
> 
> TEST1: System with many sleeping tasks
> 
>   for ((i=0; i < 1000; i++))
>   do
>          sleep 1000000 &
>   done
> 
>   # ptop_new_proc
> 
>              VVVV
>   pid   user  sys  ste  total  Name
>   (#)    (%)  (%)  (%)    (%)  (str)
>   541   0.37 2.39 0.10   2.87  top
>   3743  0.03 0.05 0.00   0.07  ptop_new_proc
>              ^^^^
> 
> Compared to the old top command that has to scan more than 1000 proc
> directories the new ptop consumes much less CPU time (0.05% system time
> on my s390 system).

How many CPUs does that system have?

What's the `top' update period?  One second?

So we're saying that a `top -d 1' consumes 2.4% of this
mystery-number-of-CPUs machine?  That's quite a lot.

> PATCHSET OVERVIEW
> -----------------
> The code is not final and still has a few TODOs. But it is good enough for a
> first round of review. The following kernel patches are provided:
> 
> [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
>      more easily.
> [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
>      filling the taskstats.
> [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
>      tasks.
> [05] Add procfs interface for taskstats commands. This allows getting a complete
>      and consistent snapshot with all tasks using two system calls (ioctl and
>      read). Transferring a snapshot of all running tasks is not possible using
>      the existing netlink interface, because there we have the socket buffer
>      size as restricting factor.

So this is a binary interface which uses an ioctl.  People don't like
ioctls.  Could we have triggered it with a write() instead?

Does this have the potential to save us from the CONFIG_NET=n problem?

> [06] Add TGID to taskstats.
> [07] Add steal time per task accounting.
> [08] Add cumulative CPU time (user, system and steal) to taskstats.

These didn't update the taskstats version number.  Should they have?

> [09] Fix exit CPU time accounting.
> 
> [10] Besides the kernel patches, user space code is also provided that
>      exploits the new kernel infrastructure. The user space code provides the
>      following:
>      1. A proposal for a taskstats user space library:
>         1.1 Based on netlink (requires libnl-devel-1.1-5)
>         1.2 Based on the new /proc/taskstats interface (see [05])
>      2. A proposal for a task snapshot library based on taskstats library (1.1)

ooh, excellent.  A standardised userspace access library.

>      3. A new tool "ptop" (precise top) that uses the libraries

Talk to me about namespaces, please.  A lot of the new code involves
PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
namespace.  Does everything Just Work?  When userspace sends a PID to
the kernel, that PID is assumed to be within the sending process's PID
namespace?  If so, then please spell it all out in the changelogs.  If
not then that is a problem!

If I can only observe processes in my PID namespace then is that a
problem?  Should I be allowed to observe another PID namespace's
processes?  I assume so, because I might be root.  If so, how is that
to be done?


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]   ` <20100923131136.356075f4.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2010-09-23 22:11     ` Matt Helsley
  2010-09-24  9:10     ` Michael Holzheu
  1 sibling, 0 replies; 10+ messages in thread
From: Matt Helsley @ 2010-09-23 22:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shailabh Nagar, linux-s390-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Suresh Siddha,
	Martin Schwidefsky, Ingo Molnar,
	holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Thomas Gleixner,
	Balbir Singh

On Thu, Sep 23, 2010 at 01:11:36PM -0700, Andrew Morton wrote:
> On Thu, 23 Sep 2010 15:48:01 +0200
> Michael Holzheu <holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> 
> > Currently tools like "top" gather the task information by reading procfs
> > files. This has several disadvantages:
> > 

<snip>

> >      3. A new tool "ptop" (precise top) that uses the libraries
> 
> Talk to me about namespaces, please.  A lot of the new code involves
> PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> namespace.  Does everything Just Work?  When userspace sends a PID to
> the kernel, that PID is assumed to be within the sending process's PID
> namespace?  If so, then please spell it all out in the changelogs.  If
> not then that is a problem!

Good point.

The pid ought to be valid in the _receiving_ task's pid namespace. That
can be difficult or impossible if we're talking about netlink broadcasts.
In this regard process events connector is an example of what not to do.

> If I can only observe processes in my PID namespace then is that a
> problem?  Should I be allowed to observe another PID namespace's
> processes?  I assume so, because I might be root.  If so, how is that
> to be done?

I don't think even "root" can see/use pids outside its namespace (without
Eric's setns patches). If you want to see all the tasks then rely on root
being able to do stuff in the initial pid namespace. If you really want
to use/know pids in the child pid namespaces then setns is also a
nice solution.

Cheers,
	-Matt Helsley


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]   ` <20100923131136.356075f4.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2010-09-23 22:11     ` Matt Helsley
@ 2010-09-24  9:10     ` Michael Holzheu
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Holzheu @ 2010-09-24  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shailabh Nagar, linux-s390-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Martin Schwidefsky,
	Ingo Molnar, Balbir Singh, Thomas Gleixner, Suresh Siddha

Hello Andrew,

On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > GOALS OF THIS PATCH SET
> > -----------------------
> > The intention of this patch set is to provide better support for tools like
> > top. The goal is to:
> > 
> > * provide a task snapshot mechanism where we can get a consistent view of
> >   all running tasks.
> > * provide a transport mechanism that does not require a lot of system calls
> >   and that allows implementing low CPU overhead task monitoring.
> > * provide microsecond CPU time granularity.
> 
> This is a big change!  If this is done right then we're heading in the
> direction of deprecating the longstanding way in which userspace
> observes the state of Linux processes and we're recommending that the
> whole world migrate to taskstats.  I think?

Or it can be used as an alternative. Since procfs has its drawbacks (e.g.
performance), an alternative could be helpful.

And the taskstats interface with the TASKSTATS_CMD_ATTR_PID command
already exists and can be used. So we already have a second mechanism,
besides procfs, to query task accounting information.
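
As a rough illustration of what the existing per-PID query looks like on the
wire, here is a minimal sketch that only builds the generic netlink request.
The function name ts_build_pid_request() is made up for this example; the
constants come from <linux/taskstats.h>. Resolving the "TASKSTATS" family id
and the socket handling are assumed to happen elsewhere and are not shown.

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <linux/taskstats.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Build a TASKSTATS_CMD_GET request for a single PID into buf.
 * family_id is the generic netlink family id of "TASKSTATS"; looking it
 * up (CTRL_CMD_GETFAMILY) is not shown here. buf must be suitably
 * aligned for struct nlmsghdr.
 * Returns the message length, or 0 if buf is too small.
 * (Hypothetical helper, not part of any library.)
 */
size_t ts_build_pid_request(void *buf, size_t buflen, uint16_t family_id,
                            uint32_t seq, uint32_t pid)
{
        size_t attr_len = NLA_HDRLEN + sizeof(pid);
        size_t len = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);
        struct nlmsghdr *nlh = buf;
        struct genlmsghdr *genl;
        struct nlattr *nla;

        if (buflen < len)
                return 0;
        memset(buf, 0, len);

        /* netlink header */
        nlh->nlmsg_len = len;
        nlh->nlmsg_type = family_id;
        nlh->nlmsg_flags = NLM_F_REQUEST;
        nlh->nlmsg_seq = seq;

        /* generic netlink header */
        genl = NLMSG_DATA(nlh);
        genl->cmd = TASKSTATS_CMD_GET;
        genl->version = TASKSTATS_GENL_VERSION;

        /* one attribute carrying the PID */
        nla = (struct nlattr *)((char *)genl + GENL_HDRLEN);
        nla->nla_type = TASKSTATS_CMD_ATTR_PID;
        nla->nla_len = attr_len;
        memcpy((char *)nla + NLA_HDRLEN, &pid, sizeof(pid));

        return len;
}
```

A caller would send this buffer on a NETLINK_GENERIC socket and then receive
the reply, which is where the two system calls per task come from.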

> 
> If so, much chin-scratching will be needed, coordination with
> util-linux people, etc.

I agree.

> We'd need to think about the implications of taskstats versioning.  It
> _is_ a versioned interface, so people can't just go and toss random new
> stuff in there at will - it's not like adding a new procfs file, or
> adding a new line to an existing one.  I don't know if that's likely to
> be a significant problem.

I already thought about that problem. Another problem is that depending
on the kernel config options, some taskstats fields may not be
initialized, e.g. those that depend on CONFIG_TASK_DELAY_ACCT or
CONFIG_TASK_XACCT. Currently there is no good interface for userspace to
query which fields are valid.

Regarding the taskstats versions, I described a possible solution in the
userspace tarball in the README.libtaskstats file:

The "struct taskstats" structure contains accounting information for one
Linux task. This structure is defined in "/usr/include/linux/taskstats.h".
With new kernel versions new fields can be added to that structure.
In that case the kernel taskstats version number defined with the macro
TASKSTATS_VERSION will be increased.

The taskstats library distinguishes between two taskstats versions:
* Kernel taskstats version (KV)
* Program compile taskstats version (CV)

Depending on the taskstats version CV that is used for compiling the program,
these version numbers can differ:
* KV > CV:
  The libtaskstats library only copies the CV taskstats fields and the fields
  that belong to version > CV will be ignored.
* KV < CV:
  The libtaskstats library only copies the version KV fields and the fields
  that belong to version > KV remain uninitialized.

If a program wants to support multiple taskstats versions, it can call the
ts_version() function and process fields according to that version
number.

Example:

  if (ts_version() < 7) {
         fprintf(stderr, "Error: kernel taskstats version too low\n");
         exit(1);
  }
  if (ts_version() >= 7)
         print_attrs_v7();
  if (ts_version() >= 8)
         print_attrs_v8();

In this example the program has to be compiled with a taskstats.h header file
that has at least version 8.
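
The KV/CV copy rule above could be sketched roughly as follows. This is not
the libtaskstats code; ts_copy_compat() is a made-up name, and the sketch
assumes that new fields are only ever appended to the end of struct taskstats,
so that a shorter payload is a prefix of a longer one. Unlike the README text,
it zeroes the fields the kernel did not provide instead of leaving them
uninitialized.

```c
#include <linux/taskstats.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a taskstats payload received from the kernel (KV layout) into the
 * struct the program was compiled against (CV layout).
 *
 * KV > CV: payload is longer, the extra trailing fields are ignored.
 * KV < CV: payload is shorter, the missing trailing fields are zeroed
 *          (a deliberate deviation from "remain uninitialized").
 *
 * Returns the number of bytes copied. Hypothetical helper.
 */
size_t ts_copy_compat(struct taskstats *dst, const void *payload,
                      size_t payload_len)
{
        size_t n = payload_len < sizeof(*dst) ? payload_len : sizeof(*dst);

        memset(dst, 0, sizeof(*dst));
        memcpy(dst, payload, n);
        return n;
}
```

The caller can still compare dst->version against the version it was compiled
for to decide which fields are meaningful.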

> I worry that there's a dependency on CONFIG_NET?  If so then that's a
> big problem because in N years time, 99% of the world will be using
> taskstats, but a few embedded losers will be stuck using (and having to
> support) the old tools.

Sure, but if we could add the /proc/taskstats approach, this dependency
would not be there.

> 
> > FIRST RESULTS
> > -------------
> > Together with this kernel patch set also user space code for a new top
> > utility (ptop) is provided that exploits the new kernel infrastructure. See
> > patch 10 for more details.
> > 
> > TEST1: System with many sleeping tasks
> > 
> >   for ((i=0; i < 1000; i++))
> >   do
> >          sleep 1000000 &
> >   done
> > 
> >   # ptop_new_proc
> > 
> >              VVVV
> >   pid   user  sys  ste  total  Name
> >   (#)    (%)  (%)  (%)    (%)  (str)
> >   541   0.37 2.39 0.10   2.87  top
> >   3743  0.03 0.05 0.00   0.07  ptop_new_proc
> >              ^^^^
> > 
> > Compared to the old top command that has to scan more than 1000 proc
> > directories the new ptop consumes much less CPU time (0.05% system time
> > on my s390 system).
> 
> How many CPUs does that system have?

The system is a virtual machine and has three CPUs.

> What's the `top' update period?  One second?

The update period is two seconds.

> So we're saying that a `top -d 1' consumes 2.4% of this
> mystery-number-of-CPUs machine?  That's quite a lot.

When I run that testcase on my laptop, 2 CPUs (Intel Core 2 - 2.33GHz),
I get about 1-2% system time for top.

> > PATCHSET OVERVIEW
> > -----------------
> > The code is not final and still has a few TODOs. But it is good enough for a
> > first round of review. The following kernel patches are provided:
> > 
> > [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> > [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
> >      more easily.
> > [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
> >      filling the taskstats.
> > [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
> >      tasks.
> > [05] Add procfs interface for taskstats commands. This allows getting a complete
> >      and consistent snapshot with all tasks using two system calls (ioctl and
> >      read). Transferring a snapshot of all running tasks is not possible using
> >      the existing netlink interface, because there we have the socket buffer
> >      size as restricting factor.
> 
> So this is a binary interface which uses an ioctl.  People don't like
> ioctls.  Could we have triggered it with a write() instead?

The current idea is the following:

1. Open /proc/taskstats
2. Set the requested command (e.g. TASKSTATS_CMD_ATTR_PIDS) using
   an ioctl. For the TASKSTATS_CMD_ATTR_PIDS ioctl the following
   structure is sent:

   struct taskstats_cmd_pids {
        __u64   time_ns;
        __u32   pid;
        __u32   cnt;
   };

3. After the command is defined, a read() executes the command and
   returns the result in the user's read buffer.

We could replace step 2 with a write() that transfers the command.
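
From userspace, the write() variant might look something like the sketch
below. It is purely illustrative: /proc/taskstats is only proposed in this
patch set, the struct layout is copied from step 2 above, and ts_snapshot()
is a made-up name. The meaning of the command fields is left to the caller.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Layout taken from the proposal above; not a released kernel ABI. */
struct taskstats_cmd_pids {
        uint64_t time_ns;
        uint32_t pid;
        uint32_t cnt;
};

/*
 * Hypothetical write()-then-read() snapshot request.
 * path would be "/proc/taskstats" if the proposed interface existed.
 * Returns the number of bytes read into buf, or -1 on error.
 */
ssize_t ts_snapshot(const char *path, const struct taskstats_cmd_pids *cmd,
                    void *buf, size_t buflen)
{
        ssize_t n;
        int fd = open(path, O_RDWR);

        if (fd < 0)
                return -1;

        /* Replaces the ioctl in step 2: hand the command to the kernel. */
        if (write(fd, cmd, sizeof(*cmd)) != (ssize_t)sizeof(*cmd)) {
                close(fd);
                return -1;
        }

        /* Step 3: read() executes the command and returns the result. */
        n = read(fd, buf, buflen);
        close(fd);
        return n;
}
```

A monitoring tool would presumably keep the file descriptor open and repeat
only the write() and read() for each snapshot.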

> Does this have the potential to save us from the CONFIG_NET=n problem?

Yes

> > [06] Add TGID to taskstats.
> > [07] Add steal time per task accounting.
> > [08] Add cumulative CPU time (user, system and steal) to taskstats.
> 
> These didn't update the taskstats version number.  Should they have?

Patch 04/10 updates the taskstats version number from 7 to 8.
I didn't want to update the version number with each patch.

> > [09] Fix exit CPU time accounting.
> > 
> > [10] Besides the kernel patches, user space code is also provided that
> >      exploits the new kernel infrastructure. The user space code provides the
> >      following:
> >      1. A proposal for a taskstats user space library:
> >         1.1 Based on netlink (requires libnl-devel-1.1-5)
> >         1.2 Based on the new /proc/taskstats interface (see [05])
> >      2. A proposal for a task snapshot library based on taskstats library (1.1)
> 
> ooh, excellent.  A standardised userspace access library.

Yes, at least a proposal for that.

> >      3. A new tool "ptop" (precise top) that uses the libraries
> 
> Talk to me about namespaces, please.  A lot of the new code involves
> PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> namespace.  Does everything Just Work?  When userspace sends a PID to
> the kernel, that PID is assumed to be within the sending process's PID
> namespace?  If so, then please spell it all out in the changelogs.  If
> not then that is a problem!

To be honest, I have not tested that. I assumed that the current
taskstats code does this correctly. E.g. it uses find_task_by_vpid() for
TASKSTATS_CMD_ATTR_PID and this function uses
"current->nsproxy->pid_ns". So I would assume that we get only tasks
from the caller's namespace. The new TASKSTATS_CMD_ATTR_PIDS command
also uses only functions with "current->nsproxy->pid_ns".

> If I can only observe processes in my PID namespace then is that a
> problem?  Should I be allowed to observe another PID namespace's
> processes?  I assume so, because I might be root.  If so, how is that
> to be done?

Good question. Probably I have to learn a bit more about the PID
namespace implementation. Are PIDs over all namespaces unique?

Michael


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]     ` <20100923221139.GI23839-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-09-24 12:39       ` Michael Holzheu
  2010-09-25 18:19       ` Serge E. Hallyn
  1 sibling, 0 replies; 10+ messages in thread
From: Michael Holzheu @ 2010-09-24 12:39 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Shailabh Nagar, linux-s390-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Suresh Siddha,
	Thomas Gleixner, Martin Schwidefsky, Andrew Morton, Ingo Molnar,
	Balbir Singh

Hello Matt,

On Thu, 2010-09-23 at 15:11 -0700, Matt Helsley wrote:
> > Talk to me about namespaces, please.  A lot of the new code involves
> > PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> > namespace.  Does everything Just Work?  When userspace sends a PID to
> > the kernel, that PID is assumed to be within the sending process's PID
> > namespace?  If so, then please spell it all out in the changelogs.  If
> > not then that is a problem!
> 
> Good point.
> 
> The pid ought to be valid in the _receiving_ task's pid namespace. That
> can be difficult or impossible if we're talking about netlink broadcasts.
> In this regard process events connector is an example of what not to do.

I think that the netlink taskstats commands are executed in the context
of the calling process (at least my printk shows me that). The command
collects the process data using "current->nsproxy->pid_ns" and creates a
netlink reply. So everything should be fine here. Shouldn't it?

Hmmm, but for exit events, this might be broken in taskstats. To me the
code looks as if every exiting task, regardless of its namespace, is
reported as an event via taskstats_exit(). Maybe I am missing something...

Michael


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]   ` <1285319415.2179.116.camel@holzheu-laptop>
@ 2010-09-24 18:50     ` Andrew Morton
       [not found]       ` <20100924115002.fcb4385a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
       [not found]       ` <1285579127.2116.62.camel@holzheu-laptop>
  2010-09-27 10:49     ` Balbir Singh
  1 sibling, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2010-09-24 18:50 UTC (permalink / raw)
  To: holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: Shailabh Nagar, Thomas Gleixner, Peter Zijlstra, Oleg Nesterov,
	Heiko Carstens, Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, Balbir Singh,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Martin Schwidefsky, Ingo Molnar, Suresh Siddha

On Fri, 24 Sep 2010 11:10:15 +0200
Michael Holzheu <holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> Hello Andrew,
> 
> On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > > GOALS OF THIS PATCH SET
> > > -----------------------
> > > The intention of this patch set is to provide better support for tools like
> > > top. The goal is to:
> > > 
> > > * provide a task snapshot mechanism where we can get a consistent view of
> > >   all running tasks.
> > > * provide a transport mechanism that does not require a lot of system calls
> > >   and that allows implementing low CPU overhead task monitoring.
> > > * provide microsecond CPU time granularity.
> > 
> > This is a big change!  If this is done right then we're heading in the
> > direction of deprecating the longstanding way in which userspace
> > observes the state of Linux processes and we're recommending that the
> > whole world migrate to taskstats.  I think?
> 
> Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> performance) an alternative could be helpful. 

And it can be harmful.  More kernel code to maintain and test, more
userspace code to develop, maintain, etc.  Less user testing than if
there was a single interface.

> 
> > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > big problem because in N years time, 99% of the world will be using
> > taskstats, but a few embedded losers will be stuck using (and having to
> > support) the old tools.
> 
> Sure, but if we could add the /proc/taskstats approach, this dependency
> would not be there.

So why do we need to present the same info over netlink?

If the info is available via procfs then userspace code should use that
and not netlink, because that userspace code would also be applicable
to CONFIG_NET=n systems.

> 
> > Does this have the potential to save us from the CONFIG_NET=n problem?
> 
> Yes

Let's say that when it's all tested ;)

> Are PIDs over all namespaces unique?

Nope.  The same pid can be present in different namespaces at the same
time.
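
To make that concrete, here is a toy model (not kernel code) in which a task
is identified by a namespace id plus the pid it has inside that namespace.
All names are invented for the example, and the model ignores the fact that
a task in a child namespace also has a pid in each ancestor namespace.

```c
#include <stddef.h>

/* Toy model: a task as seen from exactly one PID namespace. */
struct toy_task {
        int ns_id;         /* which namespace this entry belongs to */
        int vpid;          /* pid of the task inside that namespace */
        const char *comm;  /* task name, for illustration only */
};

/*
 * Look up a task by pid as seen from namespace ns_id.
 * The same vpid in a different namespace is a different task.
 */
const struct toy_task *toy_find_by_vpid(const struct toy_task *tasks,
                                        size_t n, int ns_id, int vpid)
{
        for (size_t i = 0; i < n; i++)
                if (tasks[i].ns_id == ns_id && tasks[i].vpid == vpid)
                        return &tasks[i];
        return NULL;
}
```

So a bare pid number only means something together with the namespace it is
interpreted in, which is why the lookup side matters for taskstats.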


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]     ` <20100923221139.GI23839-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-09-24 12:39       ` Michael Holzheu
@ 2010-09-25 18:19       ` Serge E. Hallyn
  1 sibling, 0 replies; 10+ messages in thread
From: Serge E. Hallyn @ 2010-09-25 18:19 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Venkatesh Pallipadi, linux-s390-u79uwXL29TY76Z2rM5mHXA,
	Peter Zijlstra, Shailabh Nagar, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Balbir Singh,
	Thomas Gleixner, Martin Schwidefsky, Andrew Morton,
	holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Ingo Molnar,
	Suresh Siddha

Quoting Matt Helsley (matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> I don't think even "root" can see/use pids outside its namespace (without

Just to be clear on this, you're right in what you say, but if a task in a child
pidns still has access to the /proc mount of the parent pidns, then it can see
the pids in there, and get information from them, i.e. /proc/pid/maps.  So
in that sense, some people could misinterpret "see/use pids" and think you
weren't right.

-serge


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]       ` <20100924115002.fcb4385a.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2010-09-27  9:18         ` Michael Holzheu
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Holzheu @ 2010-09-27  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shailabh Nagar, linux-s390-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Martin Schwidefsky,
	Ingo Molnar, Balbir Singh, Thomas Gleixner, Suresh Siddha

Hello Andrew,

On Fri, 2010-09-24 at 11:50 -0700, Andrew Morton wrote:
> > > This is a big change!  If this is done right then we're heading in the
> > > direction of deprecating the longstanding way in which userspace
> > > observes the state of Linux processes and we're recommending that the
> > > whole world migrate to taskstats.  I think?
> > 
> > Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> > performance) an alternative could be helpful. 
> 
> And it can be harmful.  More kernel code to maintain and test, more
> userspace code to develop, maintain, etc.  Less user testing than if
> there was a single interface.

Sure, the value has to be big enough to justify the effort.

But as I said, with taskstats and procfs we already have two interfaces
for getting task information. Currently procfs has information that you
can't find in taskstats. Conversely, the taskstats structure has very
useful information that you can't get under proc, e.g. the task delay
times, IO accounting, etc. So currently tools have to use both interfaces
to get all information, which is not optimal.

> > 
> > > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > > big problem because in N years time, 99% of the world will be using
> > > taskstats, but a few embedded losers will be stuck using (and having to
> > > support) the old tools.
> > 
> > Sure, but if we could add the /proc/taskstats approach, this dependency
> > would not be there.
> 
> So why do we need to present the same info over netlink?

Good point. It is not really necessary. I started development using the
netlink code, so I first added the new command there. I also thought it
would be a good idea to provide all netlink commands over the procfs
interface to be consistent.

> If the info is available via procfs then userspace code should use that
> and not netlink, because that userspace code would also be applicable
> to CONFIG_NET=n systems.
> 
> > 
> > > Does this have the potential to save us from the CONFIG_NET=n problem?
> > 
> > Yes
> 
> Let's say that when it's all tested ;)

That was more a theoretical statement :-)

I probably still have to ensure that the kernel config option
dependencies are set up correctly.

Michael


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]   ` <1285319415.2179.116.camel@holzheu-laptop>
  2010-09-24 18:50     ` Andrew Morton
@ 2010-09-27 10:49     ` Balbir Singh
  1 sibling, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2010-09-27 10:49 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Shailabh Nagar, linux-s390-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Gleixner,
	Martin Schwidefsky, Andrew Morton, Ingo Molnar, Suresh Siddha

* Michael Holzheu <holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> [2010-09-24 11:10:15]:

> Hello Andrew,
> 
> On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > > GOALS OF THIS PATCH SET
> > > -----------------------
> > > The intention of this patch set is to provide better support for tools like
> > > top. The goal is to:
> > > 
> > > * provide a task snapshot mechanism where we can get a consistent view of
> > >   all running tasks.
> > > * provide a transport mechanism that does not require a lot of system calls
> > >   and that allows implementing low CPU overhead task monitoring.
> > > * provide microsecond CPU time granularity.
> > 
> > This is a big change!  If this is done right then we're heading in the
> > direction of deprecating the longstanding way in which userspace
> > observes the state of Linux processes and we're recommending that the
> > whole world migrate to taskstats.  I think?
>

Wouldn't I love that :)
 
> Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> performance) an alternative could be helpful. 
> 
> And the taskstats interface with the TASKSTATS_CMD_ATTR_PID command
> already exists and can be used. So we already have a second mechanism to
> query tasks accounting information besides of procfs.
> 

Yes, an alternative for simple data extraction without having to write
network code to extract it.

> > 
> > If so, much chin-scratching will be needed, coordination with
> > util-linux people, etc.
> 
> I agree.
> 
> > We'd need to think about the implications of taskstats versioning.  It
> > _is_ a versioned interface, so people can't just go and toss random new
> > stuff in there at will - it's not like adding a new procfs file, or
> > adding a new line to an existing one.  I don't know if that's likely to
> > be a significant problem.
> 
> I already thought about that problem. Another problem is that depending
> on the kernel config options, some taskstats fields may be not
> initialized. E.g. CONFIG_TASK_DELAY_ACCT or CONFIG_TASK_XACCT. Currently
> there does not exist a good interface to userspace to query which fields
> are valid.
> 
> Regarding the taskstats versions  I described a possible solution in the
> userspace tarball in the README.libtaskstats file:
> 
> The "struct taskstats" structure contains accounting information for one
> Linux task. This structure is defined in "/usr/include/linux/taskstats.h".
> With new kernel versions new fields can be added to that structure.
> In that case the kernel taskstats version number defined with the macro
> TASKSTATS_VERSION will be increased.
>
> The taskstats library distinguishes between two taskstats versions:
> * Kernel taskstats version (KV)
> * Program compile taskstats version (CV)
> 
> Depending on the taskstats version CV that is used for compiling the program,
> this version numbers can be different:
> * KV > CV:
>   The libtaskstats library only copies the CV taskstats fields and the fields
>   that belong to version > CV will be ignored.
> * KV < CV:
>   The libtaskstats library only copies the version KV fields and the fields
>   that belong to version > KV remain uninitialized.
> 
> If a program wants to support multiple taskstats versions, it can call the
> ts_version() function and process fields according to that version
> number.
> 
> Example:
> 
>   if (ts_version() < 7) {
>          fprintf(stderr, "Error: kernel taskstats version too low\n");
>          exit(1);
>   }
>   if (ts_version() >= 7)
>          print_attrs_v7();
>   if (ts_version() >= 8)
>          print_attrs_v8();
> 
> In this example the program has to be compiled with a taskstats.h header file
> that has at least version 8.

Fair enough
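
For illustration only (this is not code from the patch set or the
tarball): a minimal sketch of the KV/CV rule quoted above, assuming a
made-up two-field structure "struct ts_demo" instead of the real
"struct taskstats". All names here are hypothetical. Unlike the README
text, the sketch zeroes fields beyond min(KV, CV) rather than leaving
them uninitialized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct taskstats: one field per version. */
struct ts_demo {
	unsigned short version;
	unsigned long long v7_field;	/* present since (made-up) version 7 */
	unsigned long long v8_field;	/* added in (made-up) version 8 */
};

/* Number of leading bytes that are valid for a given version. */
static size_t ts_demo_size(unsigned int version)
{
	if (version >= 8)
		return sizeof(struct ts_demo);
	return offsetof(struct ts_demo, v8_field);
}

/*
 * Copy only the fields known to both the kernel (kv) and the program
 * (cv). Fields beyond min(kv, cv) are zeroed here, which is an
 * assumption of this sketch. Returns the version actually copied.
 */
static unsigned int ts_demo_copy(struct ts_demo *dst, const void *src,
				 unsigned int kv, unsigned int cv)
{
	unsigned int v = kv < cv ? kv : cv;

	assert(dst != NULL && src != NULL);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, ts_demo_size(v));
	return v;
}
```

A caller would then branch on the returned version, much like the
ts_version() example in the README.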

> 
> > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > big problem because in N years time, 99% of the world will be using
> > taskstats, but a few embedded losers will be stuck using (and having to
> > support) the old tools.
> 
> Sure, but if we could add the /proc/taskstats approach, this dependency
> would not be there.
> 
> > 
> > > FIRST RESULTS
> > > -------------
> > > Together with this kernel patch set also user space code for a new top
> > > utility (ptop) is provided that exploits the new kernel infrastructure. See
> > > patch 10 for more details.
> > > 
> > > TEST1: System with many sleeping tasks
> > > 
> > >   for ((i=0; i < 1000; i++))
> > >   do
> > >          sleep 1000000 &
> > >   done
> > > 
> > >   # ptop_new_proc
> > > 
> > >              VVVV
> > >   pid   user  sys  ste  total  Name
> > >   (#)    (%)  (%)  (%)    (%)  (str)
> > >   541   0.37 2.39 0.10   2.87  top
> > >   3743  0.03 0.05 0.00   0.07  ptop_new_proc
> > >              ^^^^
> > > 
> > > Compared to the old top command that has to scan more than 1000 proc
> > > directories the new ptop consumes much less CPU time (0.05% system time
> > > on my s390 system).
> > 
> > How many CPUs does that system have?
> 
> The system is a virtual machine and has three CPUs.
> 
> > What's the `top' update period?  One second?
> 
> The update period is two seconds.
> 
> > So we're saying that a `top -d 1' consumes 2.4% of this
> > mystery-number-of-CPUs machine?  That's quite a lot.
> 
> When I run that testcase on my laptop, 2 CPUs (Intel Core 2 - 2.33GHz),
> I get about 1-2% system time for top.
> 
> > > PATCHSET OVERVIEW
> > > -----------------
> > > The code is not final and still has a few TODOs. But it is good enough for a
> > > first round of review. The following kernel patches are provided:
> > > 
> > > [01] Prepare-0: Use real microsecond granularity for taskstats CPU times.
> > > [02] Prepare-1: Restructure taskstats.c in order to be able to add new commands
> > >      more easily.
> > > [03] Prepare-2: Separate the finding of a task_struct by PID or TGID from
> > >      filling the taskstats.
> > > [04] Add new command "TASKSTATS_CMD_ATTR_PIDS" to get a snapshot of multiple
> > >      tasks.
> > > [05] Add procfs interface for taskstats commands. This allows getting a complete
> > >      and consistent snapshot with all tasks using two system calls (ioctl and
> > >      read). Transferring a snapshot of all running tasks is not possible using
> > >      the existing netlink interface, because there we have the socket buffer
> > >      size as restricting factor.
> > 
> > So this is a binary interface which uses an ioctl.  People don't like
> > ioctls.  Could we have triggered it with a write() instead?
> 
> The current idea is the following:
> 
> 1. Open /proc/taskstats
> 2. Set the requested command (e.g. TASKSTATS_CMD_ATTR_PIDS) using
>    an ioctl. For the TASKSTATS_CMD_ATTR_PIDS ioctl the following
>    structure is sent:
> 
>    struct taskstats_cmd_pids {
>         __u64   time_ns;
>         __u32   pid;
>         __u32   cnt;
>    };
> 
> 3. After the command is defined, with a read() the command is executed
>    and the result is returned to the user's read buffer.
> 
> We could replace step 2 with a write() that transfers the command.
>

I don't like ioctls either, write sounds interesting.
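
A rough sketch of what the write()-based variant might look like from
user space, purely to make the idea concrete. It assumes that writing
the structure quoted above selects the command and that the following
read() returns the snapshot. The /proc/taskstats file is only proposed
in this series, and the "_demo" structure and function name are
made up for this sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Same layout as the quoted taskstats_cmd_pids; redefined so the
 * sketch does not depend on the patched kernel header. */
struct taskstats_cmd_pids_demo {
	uint64_t time_ns;
	uint32_t pid;
	uint32_t cnt;
};

/*
 * Open the file, write the command, read the result into buf.
 * Returns the number of bytes read, or -1 on any error.
 */
static ssize_t ts_snapshot_via_write(const char *path,
				     const struct taskstats_cmd_pids_demo *cmd,
				     void *buf, size_t len)
{
	ssize_t n;
	int fd;

	assert(path != NULL && cmd != NULL && buf != NULL);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;
	/* Step 2 of the proposal, with write() instead of ioctl(). */
	n = write(fd, cmd, sizeof(*cmd));
	if (n != (ssize_t)sizeof(*cmd)) {
		close(fd);
		return -1;
	}
	/* Step 3: the read() executes the command and returns the data. */
	n = read(fd, buf, len);
	close(fd);
	return n;
}
```

That is still two system calls per snapshot, the same count as the
ioctl variant.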
 
> > Does this have the potential to save us from the CONFIG_NET=n problem?
> 
> Yes
> 
> > > [06] Add TGID to taskstats.
> > > [07] Add steal time per task accounting.
> > > [08] Add cumulative CPU time (user, system and steal) to taskstats.
> > 
> > These didn't update the taskstats version number.  Should they have?
> 
> Patch 04/10 updates the taskstats version number from 7 to 8.
> I didn't want to update the version number with each patch.
> 
> > > [09] Fix exit CPU time accounting.
> > > 
> > > [10] Besides the kernel patches, user space code is also provided that
> > >      exploits the new kernel infrastructure. The user space code provides the
> > >      following:
> > >      1. A proposal for a taskstats user space library:
> > >         1.1 Based on netlink (requires libnl-devel-1.1-5)
> > >         1.2 Based on the new /proc/taskstats interface (see [05])
> > >      2. A proposal for a task snapshot library based on taskstats library (1.1)
> > 
> > ooh, excellent.  A standardised userspace access library.
> 
> Yes, at least a proposal for that.
> 
> > >      3. A new tool "ptop" (precise top) that uses the libraries
> > 
> > Talk to me about namespaces, please.  A lot of the new code involves
> > PIDs, but PIDs are not system-wide unique.  A PID is relative to a PID
> > namespace.  Does everything Just Work?  When userspace sends a PID to
> > the kernel, that PID is assumed to be within the sending process's PID
> > namespace?  If so, then please spell it all out in the changelogs.  If
> > not then that is a problem!
> 
> To be honest, I have not tested that. I assumed that the current
> taskstats code does this correctly. E.g. it uses find_task_by_vpid() for
> TASKSTATS_CMD_ATTR_PID and this function uses
> "current->nsproxy->pid_ns". So I would assume that we get only tasks
> from the caller's namespace. The new TASKSTATS_CMD_ATTR_PIDS command
> also uses only functions with "current->nsproxy->pid_ns".
> 
> > If I can only observe processes in my PID namespace then is that a
> > problem?  Should I be allowed to observe another PID namespace's
> > processes?  I assume so, because I might be root.  If so, how is that
> > to be done?
> 
> Good question. Probably I have to learn a bit more about the PID
> namespace implementation. Are PIDs unique across all namespaces?
> 
>

I think the namespaces are OK, we might peep into namespaces nested
within the current one, but that is legal today. 
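
To make the nesting point concrete, here is a small user-space model
(not kernel code; all "demo_" names are made up) of how a task carries
one PID number per namespace level, from the initial namespace down to
its own. A viewer sees a task only if the viewer's namespace is the
task's namespace or one of its ancestors, and otherwise gets 0.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_LEVEL 4

struct demo_pid_ns {
	int level;	/* 0 = initial namespace */
};

struct demo_pid {
	int level;	/* level of the namespace the task lives in */
	struct {
		int nr;				/* PID value at this level */
		const struct demo_pid_ns *ns;	/* namespace at this level */
	} numbers[DEMO_MAX_LEVEL];
};

/* PID of the task as seen from 'ns', or 0 if it is not visible there. */
static int demo_pid_nr_ns(const struct demo_pid *pid,
			  const struct demo_pid_ns *ns)
{
	assert(pid != NULL && ns != NULL);
	if (ns->level > pid->level)
		return 0;	/* viewer is nested deeper than the task */
	if (pid->numbers[ns->level].ns != ns)
		return 0;	/* sibling namespace, not an ancestor */
	return pid->numbers[ns->level].nr;
}
```

So the same number can occur in several namespaces, while each
(namespace, number) pair identifies at most one task.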

-- 
	Three Cheers,
	Balbir


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]       ` <1285579127.2116.62.camel@holzheu-laptop>
@ 2010-09-27 20:02         ` Andrew Morton
       [not found]         ` <20100927130256.5d9a3db8.akpm@linux-foundation.org>
  1 sibling, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2010-09-27 20:02 UTC (permalink / raw)
  To: holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: Shailabh Nagar,
	Thomas-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Zijlstra,
	Heiko Carstens, Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, Balbir Singh, Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Oleg-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Martin Schwidefsky, Ingo Molnar,
	Peter-FOgKQjlUJ6BQetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Gleixner,
	Suresh Siddha

On Mon, 27 Sep 2010 11:18:47 +0200
Michael Holzheu <holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> Hello Andrew,
> 
> On Fri, 2010-09-24 at 11:50 -0700, Andrew Morton wrote:
> > > > This is a big change!  If this is done right then we're heading in the
> > > > direction of deprecating the longstanding way in which userspace
> > > > observes the state of Linux processes and we're recommending that the
> > > > whole world migrate to taskstats.  I think?
> > > 
> > > Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> > > performance) an alternative could be helpful. 
> > 
> > And it can be harmful.  More kernel code to maintain and test, more
> > userspace code to develop, maintain, etc.  Less user testing than if
> > there was a single interface.
> 
> Sure, the value has to be big enough to justify the effort.
> 
> But as I said, with taskstats and procfs we already have two interfaces
> for getting task information.

That doesn't mean it was the right thing to do!  For the reasons I
outline above, it can be the wrong thing to do and strengthening one of
the alternatives worsens the problem.

> Currently in procfs there is information
> that you can't find in taskstats. But also the other way round in the
> taskstats structure there is very useful information that you can't get
> under proc. E.g. the task delay times, IO accounting, etc.

Sounds like a big screwup ;)

Look at it this way: if you were going to sit down and start to design
a new operating system from scratch, would you design the task status
reporting system as it currently stands in Linux?  Don't think so!

> So currently
> tools have to use both interfaces to get all information, which is not
> optimal.
> 
> > > 
> > > > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > > > big problem because in N years time, 99% of the world will be using
> > > > taskstats, but a few embedded losers will be stuck using (and having to
> > > > support) the old tools.
> > > 
> > > Sure, but if we could add the /proc/taskstats approach, this dependency
> > > would not be there.
> > 
> > So why do we need to present the same info over netlink?
> 
> Good point. It is not really necessary. I started development using the
> netlink code. Therefore I first added the new command in the netlink
> code. I also thought, it would be a good idea to provide all netlink
> commands over the procfs interface to be consistent.

Maybe we should have delivered taskstats over procfs from day one.


* Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
       [not found]           ` <20100927130256.5d9a3db8.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2010-09-28  8:17             ` Balbir Singh
  0 siblings, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2010-09-28  8:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shailabh Nagar, linux-s390-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	Venkatesh Pallipadi, John stultz,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Heiko Carstens, Oleg Nesterov,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Martin Schwidefsky,
	Ingo Molnar, holzheu-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	Thomas Gleixner, Suresh Siddha

* Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> [2010-09-27 13:02:56]:

> > Good point. It is not really necessary. I started development using the
> > netlink code. Therefore I first added the new command in the netlink
> > code. I also thought, it would be a good idea to provide all netlink
> > commands over the procfs interface to be consistent.
> 
> Maybe we should have delivered taskstats over procfs from day one.
>

The intention was to provide taskstats over a scalable backend to deal
with a large amount of data, including exit notifications. We provided
some information like blkio delay data via /proc, but not the whole structure.

-- 
	Three Cheers,
	Balbir
