[RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
@ 2010-09-09 13:05 Stephane Eranian
  2010-09-21  9:38 ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Stephane Eranian @ 2010-09-09 13:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, paulus, davem, fweisbec, perfmon2-devel, eranian,
	eranian, robert.richter, acme

This series of patches adds per-container (cgroup) filtering capability
to per-cpu monitoring. In other words, we can monitor all threads belonging
to a specific cgroup and running on a specific CPU. 

This is useful to measure what is going on inside a cgroup. Something that
cannot easily and cheaply be achieved with either per-thread or per-cpu mode.
Cgroups can span multiple CPUs. CPUs can be shared between cgroups. Cgroups
can have lots of threads. Threads can come and go during a measurement.

To measure per-cgroup today requires using per-thread mode and attaching to
all the current threads inside a cgroup and tracking new threads. That would
require scanning of /proc/PID, which is subject to race conditions, and
creating an event for each thread, each event requiring kernel memory.

The approach taken by this patch is to leverage the per-cpu mode by simply
adding a filtering capability on context switch only when necessary. That
way the amount of kernel memory used remains bound by the number of CPUs.
We also do not have to scan /proc. We are only interested in cgroup level
counts, not per-thread.

The cgroup to monitor is designated by passing a file descriptor opened
on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
option must be activated by setting perf_event_attr.cgroup=1 and passing
a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
two ABI extensions.

The patch also includes changes to the perf tool to make use of cgroup
filtering. Both perf stat and perf record have been extended to support
cgroup via a new -G option. The cgroup is specified per event:

$ perf stat -B -a -e cycles:u,cycles:u,cycles:u -G test1,,test2 -- sleep 1
 Performance counter stats for 'sleep 1':

      2,368,667,414  cycles                   test1
      2,369,661,459  cycles                  
      <not counted>  cycles                   test2

        1.001856890  seconds time elapsed

Here, we measure cycles in 3 different cgroups. When a cgroup is omitted,
the "root" cgroup is used, i.e., all threads executing on the monitored
CPUs are measured.

In the second version, time tracking has been updated. In cgroup mode,
time_enabled tracks the time during which the cgroup was active, i.e., threads
from the cgroup executed on the monitored CPU. The meaning of time_running
is unchanged. In non-cgroup mode, time_enabled stills tracks wall-clock time
for per-cpu events. Here is an example:

In one shell I do:
$ echo $$ >/cgroup/test1/perf_events.perf
$ taskset -c 1 noploop 600
In another shell I do:
$ taskset -c 1 noploop 600

Both noploops are competing on CPU1 (part of cgroup test1)

$ perf stat -B -a -e cycles:u,cycles:u,cycles:u -G test1,,test2 -- sleep 1
 Performance counter stats for 'sleep 1':

      1,190,595,954  cycles                   test1
      2,372,471,023  cycles                  
      <not counted>  cycles                   test2

        1.001845567  seconds time elapsed

The second count reflects activity across all CPUs and cgroups.
The first count reflects was happened inside cgroup test1. As shown,
the noploop running inside test1, only got half the CPU time.

In the third version, we have dropped dependency on NR_CPUS
in favor of dynamic allocation with alloc_percpu(). We have
also renamed get_event_time() to something more explicit:
perf_event_time(). We cleaned the code so it compiles with
CONFIG_CGROUPS disabled. We have also fixed a bug in the
perf tool sampling module builtin-record.c

PATCH 0/2: introduction
PATCH 1/2: kernel changes
PATCH 2/2: perf tool changes

Signed-off-by: Stephane Eranian <eranian@google.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-09 13:05 [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3) Stephane Eranian
@ 2010-09-21  9:38 ` Peter Zijlstra
  2010-09-21  9:43   ` Peter Zijlstra
  2010-09-22  4:23   ` Balbir Singh
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-21  9:38 UTC (permalink / raw)
  To: eranian
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Thu, 2010-09-09 at 15:05 +0200, Stephane Eranian wrote:
> The cgroup to monitor is designated by passing a file descriptor opened
> on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
> option must be activated by setting perf_event_attr.cgroup=1 and passing
> a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
> two ABI extensions. 

> +++ b/include/linux/perf_event.h
> @@ -215,8 +215,9 @@ struct perf_event_attr {
>                                  */
>                                 precise_ip     :  2, /* skid constraint       */
>                                 mmap_data      :  1, /* non-exec mmap data    */
> +                               cgroup         :  1, /* cgroup aggregation    */
>  
> -                               __reserved_1   : 46;
> +                               __reserved_1   : 45;
>  
>         union {
>                 __u32           wakeup_events;    /* wakeup every n events */
> @@ -226,6 +227,8 @@ struct perf_event_attr {
>         __u32                   bp_type;
>         __u64                   bp_addr;
>         __u64                   bp_len;
> +
> +       int                     cgroup_fd;
>  };
>  
>  /*

I'm not sure I like this much.. so we attach to {pid,cpu}, for nodes we
can use cpu_to_node(cpu), which would suggest to use
cgroup_of_task(pid), except that a task can be part of multiple cgroups,
so its not unique.

One thing we could do is pass this cgroup identifier in the pid field
and use PERF_FLAG_CGROUP or something. Currently the syscall signature
uses pid_t, but I think we can safely change that to int.

You create a special new file in the cgroup stuff, I'm not sure about
that either, but its not something I feel too strongly about, why
wouldn't a fd of any file or even directory of that cgroup work? Do the
cgroup people have an opinion?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21  9:38 ` Peter Zijlstra
@ 2010-09-21  9:43   ` Peter Zijlstra
  2010-09-21 11:48     ` Stephane Eranian
  2010-09-22  4:23   ` Balbir Singh
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-21  9:43 UTC (permalink / raw)
  To: eranian
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, 2010-09-21 at 11:38 +0200, Peter Zijlstra wrote:
> On Thu, 2010-09-09 at 15:05 +0200, Stephane Eranian wrote:
> > The cgroup to monitor is designated by passing a file descriptor opened
> > on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
> > option must be activated by setting perf_event_attr.cgroup=1 and passing
> > a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
> > two ABI extensions. 
> 
> > +++ b/include/linux/perf_event.h
> > @@ -215,8 +215,9 @@ struct perf_event_attr {
> >                                  */
> >                                 precise_ip     :  2, /* skid constraint       */
> >                                 mmap_data      :  1, /* non-exec mmap data    */
> > +                               cgroup         :  1, /* cgroup aggregation    */
> >  
> > -                               __reserved_1   : 46;
> > +                               __reserved_1   : 45;
> >  
> >         union {
> >                 __u32           wakeup_events;    /* wakeup every n events */
> > @@ -226,6 +227,8 @@ struct perf_event_attr {
> >         __u32                   bp_type;
> >         __u64                   bp_addr;
> >         __u64                   bp_len;
> > +
> > +       int                     cgroup_fd;
> >  };
> >  
> >  /*
> 
> I'm not sure I like this much.. so we attach to {pid,cpu}, for nodes we
> can use cpu_to_node(cpu), which would suggest to use
> cgroup_of_task(pid), except that a task can be part of multiple cgroups,
> so its not unique.
> 
> One thing we could do is pass this cgroup identifier in the pid field
> and use PERF_FLAG_CGROUP or something. Currently the syscall signature
> uses pid_t, but I think we can safely change that to int.
> 
> You create a special new file in the cgroup stuff, I'm not sure about
> that either, but its not something I feel too strongly about, why
> wouldn't a fd of any file or even directory of that cgroup work? Do the
> cgroup people have an opinion?

Ahh, I just read more of the patch, and you create a full perf cgroup,
in which case cgroup_of_task(pid) will work, simply pick the perf
cgroup's tasks.

No need to actually create that file, open it and pass fds around, just
pick a task from that cgroup and attach to the cgroup through that.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21  9:43   ` Peter Zijlstra
@ 2010-09-21 11:48     ` Stephane Eranian
  2010-09-21 12:42       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Stephane Eranian @ 2010-09-21 11:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

Peter,

On Tue, Sep 21, 2010 at 11:43 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-09-21 at 11:38 +0200, Peter Zijlstra wrote:
>> On Thu, 2010-09-09 at 15:05 +0200, Stephane Eranian wrote:
>> > The cgroup to monitor is designated by passing a file descriptor opened
>> > on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
>> > option must be activated by setting perf_event_attr.cgroup=1 and passing
>> > a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
>> > two ABI extensions.
>>
>> > +++ b/include/linux/perf_event.h
>> > @@ -215,8 +215,9 @@ struct perf_event_attr {
>> >                                  */
>> >                                 precise_ip     :  2, /* skid constraint       */
>> >                                 mmap_data      :  1, /* non-exec mmap data    */
>> > +                               cgroup         :  1, /* cgroup aggregation    */
>> >
>> > -                               __reserved_1   : 46;
>> > +                               __reserved_1   : 45;
>> >
>> >         union {
>> >                 __u32           wakeup_events;    /* wakeup every n events */
>> > @@ -226,6 +227,8 @@ struct perf_event_attr {
>> >         __u32                   bp_type;
>> >         __u64                   bp_addr;
>> >         __u64                   bp_len;
>> > +
>> > +       int                     cgroup_fd;
>> >  };
>> >
>> >  /*
>>
>> I'm not sure I like this much.. so we attach to {pid,cpu}, for nodes we
>> can use cpu_to_node(cpu), which would suggest to use
>> cgroup_of_task(pid), except that a task can be part of multiple cgroups,
>> so its not unique.
>>
>> One thing we could do is pass this cgroup identifier in the pid field
>> and use PERF_FLAG_CGROUP or something. Currently the syscall signature
>> uses pid_t, but I think we can safely change that to int.
>>
>> You create a special new file in the cgroup stuff, I'm not sure about
>> that either, but its not something I feel too strongly about, why
>> wouldn't a fd of any file or even directory of that cgroup work? Do the
>> cgroup people have an opinion?
>
> Ahh, I just read more of the patch, and you create a full perf cgroup,
> in which case cgroup_of_task(pid) will work, simply pick the perf
> cgroup's tasks.
>
> No need to actually create that file, open it and pass fds around, just
> pick a task from that cgroup and attach to the cgroup through that.
>
If I understand, you are proposing that we use the pid argument to the
syscall to designate the cgroup. A task belongs to only one cgroup at
a time. Thus with the pid you can identify a cgroup. No need for an
entry in cgroup_fs and therefore no need for cgroup_fd in perf_event_attr.
We would still need a flag somewhere to indicate that we don't want
per-thread mode but per-cpu per-cgroup. It could be a new field in the
bitfield in perf_event_attr. In fact, I already have such a field.

The main issue I see with this is that it relies on having at least one
task in the cgroup when you start the measurement. That is certainly
not always the case.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 11:48     ` Stephane Eranian
@ 2010-09-21 12:42       ` Peter Zijlstra
  2010-09-21 13:38         ` Stephane Eranian
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-21 12:42 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, 2010-09-21 at 13:48 +0200, Stephane Eranian wrote:
> The main issue I see with this is that it relies on having at least one
> task in the cgroup when you start the measurement. That is certainly
> not always the case. 

Hmm, indeed. One thing we can do about that is move perf into the
cgroup, create the counter (disabled) using self to identify the cgroup,
move perf back to where it came from, and enable the counter.

Its just that I prefer to keep the attach state in the syscall arguments
and not the perf_event_attr struct.

cgroups are task objects, so it makes sense to use the task attach to
indicate the cgroup. The empty cgroup case is indeed a tad unfortunate.
Not having to open more files and pass fds around was also a nice
benefit.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 12:42       ` Peter Zijlstra
@ 2010-09-21 13:38         ` Stephane Eranian
  2010-09-21 14:03           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Stephane Eranian @ 2010-09-21 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, Sep 21, 2010 at 2:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-09-21 at 13:48 +0200, Stephane Eranian wrote:
>> The main issue I see with this is that it relies on having at least one
>> task in the cgroup when you start the measurement. That is certainly
>> not always the case.
>
> Hmm, indeed. One thing we can do about that is move perf into the
> cgroup, create the counter (disabled) using self to identify the cgroup,
> move perf back to where it came from, and enable the counter.
>
Yes, that's another possibility. I wonder if there are any non-obvious
difficulties with this approach. Is it as simple as:
   FILE *fp;
   fp = fopen("/dev/cgroup/test/tasks", "w");
   fprintf(fp, "%d", gettid());
   close(fp):

> Its just that I prefer to keep the attach state in the syscall arguments
> and not the perf_event_attr struct.
>
I understand that.

> cgroups are task objects, so it makes sense to use the task attach to
> indicate the cgroup. The empty cgroup case is indeed a tad unfortunate.
> Not having to open more files and pass fds around was also a nice
> benefit.
>
Sure.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 13:38         ` Stephane Eranian
@ 2010-09-21 14:03           ` Peter Zijlstra
  2010-09-21 16:17             ` Stephane Eranian
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-21 14:03 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, 2010-09-21 at 15:38 +0200, Stephane Eranian wrote:
> > Hmm, indeed. One thing we can do about that is move perf into the
> > cgroup, create the counter (disabled) using self to identify the cgroup,
> > move perf back to where it came from, and enable the counter.
> >
> Yes, that's another possibility. I wonder if there are any non-obvious
> difficulties with this approach.

Yes, there is, but I think we can fix it. The problem with moving perf
itself around is that perf is not a fully dormant process and can thus
interact with the cgroup state.

If we were to fork a child that's simply sitting idle in waitpid() (or
any other blocking syscall) we can move that around cgroup without
affecting the cgroup itself.

>  Is it as simple as:
>    FILE *fp;
>    fp = fopen("/dev/cgroup/test/tasks", "w");
>    fprintf(fp, "%d", gettid());
>    close(fp): 

Except I've never in my life mounted a cgroup filesystem in /dev/ :-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 14:03           ` Peter Zijlstra
@ 2010-09-21 16:17             ` Stephane Eranian
  2010-09-21 16:27               ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Stephane Eranian @ 2010-09-21 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, Sep 21, 2010 at 4:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-09-21 at 15:38 +0200, Stephane Eranian wrote:
>> > Hmm, indeed. One thing we can do about that is move perf into the
>> > cgroup, create the counter (disabled) using self to identify the cgroup,
>> > move perf back to where it came from, and enable the counter.
>> >
>> Yes, that's another possibility. I wonder if there are any non-obvious
>> difficulties with this approach.
>
> Yes, there is, but I think we can fix it. The problem with moving perf
> itself around is that perf is not a fully dormant process and can thus
> interact with the cgroup state.
>
I was thinking about memory accounting for instance.

> If we were to fork a child that's simply sitting idle in waitpid() (or
> any other blocking syscall) we can move that around cgroup without
> affecting the cgroup itself.

But then things get a bit more complicated because the perf_event_open()
has to be done in that child. File descriptors created in child processes
and not shared with their parent. You'd have to pass file descriptors around.
That seems overly complicated.

I wonder if overloading pid to be a file descriptor open to a cgroup_fs file
would not be easier in the end.


>>  Is it as simple as:
>>    FILE *fp;
>>    fp = fopen("/dev/cgroup/test/tasks", "w");
>>    fprintf(fp, "%d", gettid());
>>    close(fp):
>
> Except I've never in my life mounted a cgroup filesystem in /dev/ :-)
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 16:17             ` Stephane Eranian
@ 2010-09-21 16:27               ` Peter Zijlstra
  2010-09-21 16:33                 ` Stephane Eranian
  2010-09-22  4:34                 ` Balbir Singh
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-21 16:27 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, 2010-09-21 at 18:17 +0200, Stephane Eranian wrote:
> On Tue, Sep 21, 2010 at 4:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-09-21 at 15:38 +0200, Stephane Eranian wrote:
> >> > Hmm, indeed. One thing we can do about that is move perf into the
> >> > cgroup, create the counter (disabled) using self to identify the cgroup,
> >> > move perf back to where it came from, and enable the counter.
> >> >
> >> Yes, that's another possibility. I wonder if there are any non-obvious
> >> difficulties with this approach.
> >
> > Yes, there is, but I think we can fix it. The problem with moving perf
> > itself around is that perf is not a fully dormant process and can thus
> > interact with the cgroup state.
> >
> I was thinking about memory accounting for instance.

I think the memory controller only accounts things when the process
actually touches something. A process that never wakes will never touch
anything.

> > If we were to fork a child that's simply sitting idle in waitpid() (or
> > any other blocking syscall) we can move that around cgroup without
> > affecting the cgroup itself.
> 
> But then things get a bit more complicated because the perf_event_open()
> has to be done in that child. File descriptors created in child processes
> and not shared with their parent. You'd have to pass file descriptors around.
> That seems overly complicated.

Uhm, no the trick is that the child remains absolutely dormant and
therefore doesn't accrue any accounting, all you need is a known task in
the cgroup, the parent can then specify the child pid to identify the
group.

Once you've opened the counter, you can move the kid out and kill it.
Note that moving it out of the cgroup before killing it ensure it never
wakes up inside that cgroup.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 16:27               ` Peter Zijlstra
@ 2010-09-21 16:33                 ` Stephane Eranian
  2010-09-22  4:34                 ` Balbir Singh
  1 sibling, 0 replies; 18+ messages in thread
From: Stephane Eranian @ 2010-09-21 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, paulus, davem, fweisbec, perfmon2-devel,
	eranian, robert.richter, acme, Paul Menage, Li Zefan,
	Balbir Singh

On Tue, Sep 21, 2010 at 6:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-09-21 at 18:17 +0200, Stephane Eranian wrote:
>> On Tue, Sep 21, 2010 at 4:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2010-09-21 at 15:38 +0200, Stephane Eranian wrote:
>> >> > Hmm, indeed. One thing we can do about that is move perf into the
>> >> > cgroup, create the counter (disabled) using self to identify the cgroup,
>> >> > move perf back to where it came from, and enable the counter.
>> >> >
>> >> Yes, that's another possibility. I wonder if there are any non-obvious
>> >> difficulties with this approach.
>> >
>> > Yes, there is, but I think we can fix it. The problem with moving perf
>> > itself around is that perf is not a fully dormant process and can thus
>> > interact with the cgroup state.
>> >
>> I was thinking about memory accounting for instance.
>
> I think the memory controller only accounts things when the process
> actually touches something. A process that never wakes will never touch
> anything.
>
>> > If we were to fork a child that's simply sitting idle in waitpid() (or
>> > any other blocking syscall) we can move that around cgroup without
>> > affecting the cgroup itself.
>>
>> But then things get a bit more complicated because the perf_event_open()
>> has to be done in that child. File descriptors created in child processes
>> and not shared with their parent. You'd have to pass file descriptors around.
>> That seems overly complicated.
>
> Uhm, no the trick is that the child remains absolutely dormant and
> therefore doesn't accrue any accounting, all you need is a known task in
> the cgroup, the parent can then specify the child pid to identify the
> group.
>
> Once you've opened the counter, you can move the kid out and kill it.
> Note that moving it out of the cgroup before killing it ensure it never
> wakes up inside that cgroup.
>
>
Ok, I see. I got confused by the 'self' pid. The parent moves the dormant
child into the cgroup and uses its pid in perf_event_open(). The existing
logic inside the tool remains unmodified.

I will experiment with this approach.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21 16:27               ` Peter Zijlstra
  2010-09-21 16:33                 ` Stephane Eranian
@ 2010-09-22  4:34                 ` Balbir Singh
  2010-09-22  7:25                   ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2010-09-22  4:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

* Peter Zijlstra <peterz@infradead.org> [2010-09-21 18:27:27]:

> On Tue, 2010-09-21 at 18:17 +0200, Stephane Eranian wrote:
> > On Tue, Sep 21, 2010 at 4:03 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, 2010-09-21 at 15:38 +0200, Stephane Eranian wrote:
> > >> > Hmm, indeed. One thing we can do about that is move perf into the
> > >> > cgroup, create the counter (disabled) using self to identify the cgroup,
> > >> > move perf back to where it came from, and enable the counter.
> > >> >
> > >> Yes, that's another possibility. I wonder if there are any non-obvious
> > >> difficulties with this approach.
> > >
> > > Yes, there is, but I think we can fix it. The problem with moving perf
> > > itself around is that perf is not a fully dormant process and can thus
> > > interact with the cgroup state.
> > >
> > I was thinking about memory accounting for instance.
> 
> I think the memory controller only accounts things when the process
> actually touches something. A process that never wakes will never touch
> anything.

That understanding is correct, but the whole approach sounds more
complex due to several subsystems involved, the expectation is that
we'll move perf to all the correct cgroups for each subsystem.

> 
> > > If we were to fork a child that's simply sitting idle in waitpid() (or
> > > any other blocking syscall) we can move that around cgroup without
> > > affecting the cgroup itself.
> > 
> > But then things get a bit more complicated because the perf_event_open()
> > has to be done in that child. File descriptors created in child processes
> > and not shared with their parent. You'd have to pass file descriptors around.
> > That seems overly complicated.
> 
> Uhm, no the trick is that the child remains absolutely dormant and
> therefore doesn't accrue any accounting, all you need is a known task in
> the cgroup, the parent can then specify the child pid to identify the
> group.
> 
> Once you've opened the counter, you can move the kid out and kill it.
> Note that moving it out of the cgroup before killing it ensure it never
> wakes up inside that cgroup.

What the benefits of this complexity, not chaning perf_event_attr?

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-22  4:34                 ` Balbir Singh
@ 2010-09-22  7:25                   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-22  7:25 UTC (permalink / raw)
  To: balbir
  Cc: Stephane Eranian, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

On Wed, 2010-09-22 at 10:04 +0530, Balbir Singh wrote:

> That understanding is correct, but the whole approach sounds more
> complex due to several subsystems involved, the expectation is that
> we'll move perf to all the correct cgroups for each subsystem.

Well, we'll only move a completely dormant task to a cgroup and out
again.

> > > > If we were to fork a child that's simply sitting idle in waitpid() (or
> > > > any other blocking syscall) we can move that around cgroup without
> > > > affecting the cgroup itself.
> > > 
> > > But then things get a bit more complicated because the perf_event_open()
> > > has to be done in that child. File descriptors created in child processes
> > > and not shared with their parent. You'd have to pass file descriptors around.
> > > That seems overly complicated.
> > 
> > Uhm, no the trick is that the child remains absolutely dormant and
> > therefore doesn't accrue any accounting, all you need is a known task in
> > the cgroup, the parent can then specify the child pid to identify the
> > group.
> > 
> > Once you've opened the counter, you can move the kid out and kill it.
> > Note that moving it out of the cgroup before killing it ensure it never
> > wakes up inside that cgroup.
> 
> What the benefits of this complexity, not chaning perf_event_attr?

Yes, attach information should not be in _attr. And you avoid the hassle
of creating a special file and passing fds to it around.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-21  9:38 ` Peter Zijlstra
  2010-09-21  9:43   ` Peter Zijlstra
@ 2010-09-22  4:23   ` Balbir Singh
  2010-09-22  7:27     ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2010-09-22  4:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

* Peter Zijlstra <peterz@infradead.org> [2010-09-21 11:38:19]:

> On Thu, 2010-09-09 at 15:05 +0200, Stephane Eranian wrote:
> > The cgroup to monitor is designated by passing a file descriptor opened
> > on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
> > option must be activated by setting perf_event_attr.cgroup=1 and passing
> > a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
> > two ABI extensions. 
> 
> > +++ b/include/linux/perf_event.h
> > @@ -215,8 +215,9 @@ struct perf_event_attr {
> >                                  */
> >                                 precise_ip     :  2, /* skid constraint       */
> >                                 mmap_data      :  1, /* non-exec mmap data    */
> > +                               cgroup         :  1, /* cgroup aggregation    */
> >  
> > -                               __reserved_1   : 46;
> > +                               __reserved_1   : 45;
> >  
> >         union {
> >                 __u32           wakeup_events;    /* wakeup every n events */
> > @@ -226,6 +227,8 @@ struct perf_event_attr {
> >         __u32                   bp_type;
> >         __u64                   bp_addr;
> >         __u64                   bp_len;
> > +
> > +       int                     cgroup_fd;
> >  };
> >  
> >  /*
> 
> I'm not sure I like this much.. so we attach to {pid,cpu}, for nodes we
> can use cpu_to_node(cpu), which would suggest to use
> cgroup_of_task(pid), except that a task can be part of multiple cgroups,
> so its not unique.

Yes, a task can belong to multiple subsystems, hence multiple cgroups.
Ideally we'd want to use pid + subsystem

> 
> One thing we could do is pass this cgroup identifier in the pid field
> and use PERF_FLAG_CGROUP or something. Currently the syscall signature
> uses pid_t, but I think we can safely change that to int.

Or union it and overload the field to contain either pid_t or fd of the cgroup

> 
> You create a special new file in the cgroup stuff, I'm not sure about
> that either, but its not something I feel too strongly about, why
> wouldn't a fd of any file or even directory of that cgroup work? Do the
> cgroup people have an opinion?

No strong opinions either way at my end.


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-22  4:23   ` Balbir Singh
@ 2010-09-22  7:27     ` Peter Zijlstra
  2010-09-22  9:18       ` Balbir Singh
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-22  7:27 UTC (permalink / raw)
  To: balbir
  Cc: eranian, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

On Wed, 2010-09-22 at 09:53 +0530, Balbir Singh wrote:
> Yes, a task can belong to multiple subsystems, hence multiple cgroups.
> Ideally we'd want to use pid + subsystem

Apparently we create a perf subsystem, and we only care about that. So
pid will uniquely identify a cgroup, since for each subsystem a task can
only belong to one cgroup.

> > One thing we could do is pass this cgroup identifier in the pid field
> > and use PERF_FLAG_CGROUP or something. Currently the syscall signature
> > uses pid_t, but I think we can safely change that to int.
> 
> Or union it and overload the field to contain either pid_t or fd of the cgroup

Its not a field, its an argument.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-22  7:27     ` Peter Zijlstra
@ 2010-09-22  9:18       ` Balbir Singh
  2010-09-22 10:26         ` Stephane Eranian
  0 siblings, 1 reply; 18+ messages in thread
From: Balbir Singh @ 2010-09-22  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

* Peter Zijlstra <peterz@infradead.org> [2010-09-22 09:27:59]:

> On Wed, 2010-09-22 at 09:53 +0530, Balbir Singh wrote:
> > Yes, a task can belong to multiple subsystems, hence multiple cgroups.
> > Ideally we'd want to use pid + subsystem
> 
> Apparently we create a perf subsystem, and we only care about that. So
> pid will uniquely identify a cgroup, since for each subsystem a task can
> only belong to one cgroup.

Hmm.. I misread the intention to mean we care about monitoring all
data and aggregate it for each cgroup.

> 
> > > One thing we could do is pass this cgroup identifier in the pid field
> > > and use PERF_FLAG_CGROUP or something. Currently the syscall signature
> > > uses pid_t, but I think we can safely change that to int.
> > 
> > Or union it and overload the field to contain either pid_t or fd of the cgroup
> 
> Its not a field, its an argument.

Thanks!

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-22  9:18       ` Balbir Singh
@ 2010-09-22 10:26         ` Stephane Eranian
  2010-09-25  9:51           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Stephane Eranian @ 2010-09-22 10:26 UTC (permalink / raw)
  To: balbir
  Cc: Peter Zijlstra, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

Hi,

Ok, early testing shows that this seems to be working fine with the
pid approach.
Of course it is less convenient than just opening a file descriptor in
cgroup_fs.
There is more bookkeeping involved, incl. cleanup the child on exit.

The other thing is related to how to indicate we want cgroup and not per-thread.
For now, my patch is using a new attr.cgroup bit. The alternative is to use a
bit in the flags parameter to the syscall.

On Wed, Sep 22, 2010 at 11:18 AM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
> * Peter Zijlstra <peterz@infradead.org> [2010-09-22 09:27:59]:
>
>> On Wed, 2010-09-22 at 09:53 +0530, Balbir Singh wrote:
>> > Yes, a task can belong to multiple subsystems, hence multiple cgroups.
>> > Ideally we'd want to use pid + subsystem
>>
>> Apparently we create a perf subsystem, and we only care about that. So
>> pid will uniquely identify a cgroup, since for each subsystem a task can
>> only belong to one cgroup.
>
> Hmm.. I misread the intention to mean we care about monitoring all
> data and aggregate it for each cgroup.
>
>>
>> > > One thing we could do is pass this cgroup identifier in the pid field
>> > > and use PERF_FLAG_CGROUP or something. Currently the syscall signature
>> > > uses pid_t, but I think we can safely change that to int.
>> >
>> > Or union it and overload the field to contain either pid_t or fd of the cgroup
>>
>> Its not a field, its an argument.
>
> Thanks!
>
> --
>        Three Cheers,
>        Balbir
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-22 10:26         ` Stephane Eranian
@ 2010-09-25  9:51           ` Peter Zijlstra
  2010-09-28  9:23             ` Stephane Eranian
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-09-25  9:51 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: balbir, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

On Wed, 2010-09-22 at 12:26 +0200, Stephane Eranian wrote:

> Ok, early testing shows that this seems to be working fine with the
> pid approach. Of course it is less convenient than just opening a file
> descriptor in cgroup_fs. There is more bookkeeping involved, incl.
> cleanup the child on exit.

Right, so another approach, similar to what you initially proposed,
would be to have this cgroup file, but instead of simply opening and
passing the fd around, have it provide a cgroup id and pass that around
as an integer (or pid_t) in the task argument to the SYSCALL (see the
flag bit below).

You can use the regular lib/idr.c bits to generate ids and maintain an
id->cgroup map.

> The other thing is related to how to indicate we want cgroup and not per-thread.
> For now, my patch is using a new attr.cgroup bit. The alternative is to use a
> bit in the flags parameter to the syscall. 

Right, I prefer a bit in the syscall flags argument and somehow using
the task argument.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3)
  2010-09-25  9:51           ` Peter Zijlstra
@ 2010-09-28  9:23             ` Stephane Eranian
  0 siblings, 0 replies; 18+ messages in thread
From: Stephane Eranian @ 2010-09-28  9:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: balbir, linux-kernel, mingo, paulus, davem, fweisbec,
	perfmon2-devel, eranian, robert.richter, acme, Paul Menage,
	Li Zefan

On Sat, Sep 25, 2010 at 11:51 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-09-22 at 12:26 +0200, Stephane Eranian wrote:
>
>> Ok, early testing shows that this seems to be working fine with the
>> pid approach. Of course it is less convenient than just opening a file
>> descriptor in cgroup_fs. There is more bookkeeping involved, incl.
>> cleanup the child on exit.
>
> Right, so another approach, similar to what you initially proposed,
> would be to have this cgroup file, but instead of simply opening and
> passing the fd around, have it provide a cgroup id and pass that around
> as an integer (or pid_t) in the task argument to the SYSCALL (see the
> flag bit below).
>
> You can use the regular lib/idr.c bits to generate ids and maintain an
> id->cgroup map.
>
What's the advantage of that compared to just passing the fd in the
pid argument?
You can close the fd right after returning from perf_event_open().

>> The other thing is related to how to indicate we want cgroup and not per-thread.
>> For now, my patch is using a new attr.cgroup bit. The alternative is to use a
>> bit in the flags parameter to the syscall.
>
> Right, I prefer a bit in the syscall flags argument and somehow using
> the task argument.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-09-28  9:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-09-09 13:05 [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring (v3) Stephane Eranian
2010-09-21  9:38 ` Peter Zijlstra
2010-09-21  9:43   ` Peter Zijlstra
2010-09-21 11:48     ` Stephane Eranian
2010-09-21 12:42       ` Peter Zijlstra
2010-09-21 13:38         ` Stephane Eranian
2010-09-21 14:03           ` Peter Zijlstra
2010-09-21 16:17             ` Stephane Eranian
2010-09-21 16:27               ` Peter Zijlstra
2010-09-21 16:33                 ` Stephane Eranian
2010-09-22  4:34                 ` Balbir Singh
2010-09-22  7:25                   ` Peter Zijlstra
2010-09-22  4:23   ` Balbir Singh
2010-09-22  7:27     ` Peter Zijlstra
2010-09-22  9:18       ` Balbir Singh
2010-09-22 10:26         ` Stephane Eranian
2010-09-25  9:51           ` Peter Zijlstra
2010-09-28  9:23             ` Stephane Eranian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox