* Perf event operation with hotplug cpus and cgroups
From: William Cohen @ 2015-03-20 19:10 UTC
To: a.p.zijlstra, paulus, Don Domingo, Arnaldo Carvalho de Melo, LKML
The current perf event interface avoids complexity in the kernel by
making user space responsible for opening a file descriptor for each
cpu to monitor performance events. However, there are two use cases
where this approach has issues: system-wide measurement with hotplug
cpus and monitoring of cgroups.
hotplug cpus
CPU hotplug can dynamically change the number of cpus that are active
on the system. If "perf stat -a ..." is started with some of the
processors offline, and additional processors are put online after
perf has started, no data is gathered from those newly onlined
processors.
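For concreteness, here is a rough sketch of the per-cpu setup a tool
like "perf stat -a" has to do today; the sysconf()-based cpu walk and
the missing error handling are simplifications for illustration, not
perf's actual code:

/*
 * One perf_event_open() fd per cpu that is online *at startup*.
 * A cpu onlined later has no fd, so its events go unmeasured.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);  /* cpus online right now */
        int *fds = calloc(ncpus, sizeof(*fds));
        struct perf_event_attr attr;
        long cpu;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;

        for (cpu = 0; cpu < ncpus; cpu++)
                /* pid == -1, cpu >= 0: count everything running on this cpu */
                fds[cpu] = perf_event_open(&attr, -1, cpu, -1, 0);

        /* ... workload runs; a cpu onlined after this point is never opened ... */

        for (cpu = 0; cpu < ncpus; cpu++) {
                long long count = 0;

                read(fds[cpu], &count, sizeof(count));
                printf("cpu %ld: %lld instructions\n", cpu, count);
        }
        free(fds);
        return 0;
}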
cgroup monitoring
The cgroup monitoring is built on the perf event per cpu monitoring.
If the cgroup is not pinned to a particular set of processors, then
system-wide monitoring for that cgroup needs to be done, and a perf
event open is needed for every cpu in the system. The issue with this
approach is that, if the cgroups are used for virtual machine guests
where each cgroup is allocated a single processor, the number of
cgroups is proportional to the number of processors in the machine.
The number of files that need to be opened to monitor the cgroups on
the system is then O(cpus^2). For a large system with 80 cpus that
would be 6400 files, much larger than the default ulimit settings,
and there is a huge number of syscalls to read out the information.
If one limits the number of files opened for performance monitoring
by pinning cgroups to particular processors, any change in the
pinning of cgroups to processors will make the measurement incorrect.
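As a sketch of what the current interface forces (hypothetical helper,
error handling omitted), one event has to be opened per (cgroup, cpu)
pair, so with one cgroup per processor the number of opens grows as
cpus * cpus:

/*
 * Per-cgroup monitoring with the current interface: the cgroup is
 * named by an fd on its cgroupfs directory (PERF_FLAG_PID_CGROUP),
 * and an event must be opened on every cpu for every cgroup.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <unistd.h>

void open_cgroup_events(struct perf_event_attr *attr,
                        const char **cgroup_paths, int ncgroups,
                        int ncpus, int *fds /* ncgroups * ncpus entries */)
{
        int g, cpu;

        for (g = 0; g < ncgroups; g++) {
                int cgrp_fd = open(cgroup_paths[g], O_RDONLY);

                for (cpu = 0; cpu < ncpus; cpu++)
                        /* cgroup events must name a cpu; cpu == -1 is not allowed */
                        fds[g * ncpus + cpu] =
                                syscall(__NR_perf_event_open, attr, cgrp_fd,
                                        cpu, -1, PERF_FLAG_PID_CGROUP);
        }
}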
Given the issues with these use cases, is user space setting up the
counters for each cpu in the system the best solution? Would it be
better to allow system-wide data collection to be selected with one
perf event open with pid==-1 and cpu==-1? Is setup of per cpu
monitoring and aggregation of the counters across processors too
difficult to do in the kernel?
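To make the question concrete, the call in mind would look like the
following; note that today the kernel rejects pid == -1 together with
cpu == -1 with EINVAL, so this is the currently invalid combination
shown only for illustration:

/*
 * The hypothetical "one fd for the whole system" open.  The question
 * is whether this could instead mean "count on all cpus, including
 * ones onlined later, aggregated into one counter".
 */
struct perf_event_attr attr = {
        .size   = sizeof(attr),
        .type   = PERF_TYPE_HARDWARE,
        .config = PERF_COUNT_HW_INSTRUCTIONS,
};
int fd = syscall(__NR_perf_event_open, &attr, /*pid*/ -1, /*cpu*/ -1,
                 /*group_fd*/ -1, /*flags*/ 0);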
-Will
* Re: Perf event operation with hotplug cpus and cgroups
From: Peter Zijlstra @ 2015-03-20 19:22 UTC
To: William Cohen; +Cc: paulus, Don Domingo, Arnaldo Carvalho de Melo, LKML
On Fri, Mar 20, 2015 at 03:10:39PM -0400, William Cohen wrote:
> cgroup monitoring
>
> The cgroup monitoring is built on the perf event per cpu monitoring.
> If the cgroup is not pinned to a particular set of processors, then
> system-wide monitoring for that cgroup needs to be done, and a perf
> event open is needed for every cpu in the system.
> The issue with this
> approach is that, if the cgroups are used for virtual machine guests
> where each cgroup is allocated a single processor, the number of
> cgroups is proportional to the number of processors in the machine.
> The number of files that need to be opened to monitor the cgroups on
> the system is then O(cpus^2).
That's what you get for doing silly things like that, isn't it? Why would
you create a cgroup per vcpu and then measure that cgroup if you're
interested in the whole virtual machine?
Just measure the parent cgroup of the vcpu cgroups if you're really only
interested in the virtual machine crap thing.
> Given the issues with these use cases, is user space setting up the
> counters for each cpu in the system the best solution? Would it be
> better to allow system-wide data collection to be selected with one
> perf event open with pid==-1 and cpu==-1? Is setup of per cpu
> monitoring and aggregation of the counters across processors too
> difficult to do in the kernel?
Not hard at all, but useless, you need a fd per cpu in order to get your
data out. Remember that the ring buffers are strictly per cpu.
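For the sampling case the usual pattern is a separate mmap per per-cpu
fd, one metadata page plus 2^n data pages; roughly (sketch only):

/*
 * Every per-cpu event fd is mmap'd separately.  There is no single
 * buffer that system-wide samples could be read from.
 */
#include <sys/mman.h>
#include <unistd.h>

void *map_ring_buffer(int cpu_event_fd, int data_pages /* power of two */)
{
        size_t len = (size_t)(data_pages + 1) * sysconf(_SC_PAGESIZE);

        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    cpu_event_fd, 0);
}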
* Re: Perf event operation with hotplug cpus and cgroups
From: William Cohen @ 2015-03-20 19:41 UTC
To: Peter Zijlstra; +Cc: paulus, Don Domingo, Arnaldo Carvalho de Melo, LKML
On 03/20/2015 03:22 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2015 at 03:10:39PM -0400, William Cohen wrote:
>> cgroup monitoring
>>
>> The cgroup monitoring is built on the perf event per cpu monitoring.
>> If the cgroup is not pinned to a particular set of processors, then
>> system-wide monitoring for that cgroup needs to be done, and a perf
>> event open is needed for every cpu in the system.
>
>> The issue with this
>> approach is that, if the cgroups are used for virtual machine guests
>> where each cgroup is allocated a single processor, the number of
>> cgroups is proportional to the number of processors in the machine.
>> The number of files that need to be opened to monitor the cgroups on
>> the system is then O(cpus^2).
>
> That's what you get for doing silly things like that, isn't it? Why would
> you create a cgroup per vcpu and then measure that cgroup if you're
> interested in the whole virtual machine?
Hi Peter,
There isn't any desire to aggregate the different cgroups' data together. The desired grouping is measurements per cgroup, kind of like the pid scoping for perf, but for a cgroup. It is just that, the way perf event measurement works for cgroups, the measurements need to be taken system-wide.
> Just measure the parent cgroup of the vcpu cgroups if you're really only
> interested in the virtual machine crap thing.
>
>> Given the issues with these use cases, is user space setting up the
>> counters for each cpu in the system the best solution? Would it be
>> better to allow system-wide data collection to be selected with one
>> perf event open with pid==-1 and cpu==-1? Is setup of per cpu
>> monitoring and aggregation of the counters across processors too
>> difficult to do in the kernel?
>
> Not hard at all, but useless, you need a fd per cpu in order to get your
> data out. Remember that the ring buffers are strictly per cpu.
>
Are the ring buffers needed just for sampling, or are they also needed for "perf stat" type information?
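For the counting case I have in mind a plain read() of the counter
value from the event fd, with no mmap at all, e.g. (sketch only):

/*
 * "perf stat" style counting: no ring buffer, just read() the 64-bit
 * count (plus whatever read_format requests) from the event fd.
 */
#include <stdint.h>
#include <unistd.h>

int read_count(int event_fd, uint64_t *count)
{
        ssize_t n = read(event_fd, count, sizeof(*count));

        return n == (ssize_t)sizeof(*count) ? 0 : -1;
}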
-Will
* Re: Perf event operation with hotplug cpus and cgroups
From: Peter Zijlstra @ 2015-03-20 20:20 UTC
To: William Cohen; +Cc: paulus, Don Domingo, Arnaldo Carvalho de Melo, LKML
On Fri, Mar 20, 2015 at 03:41:54PM -0400, William Cohen wrote:
>
> There isn't any desire to aggregate the different cgroups' data
> together. The desired grouping is measurements per cgroup, kind of
> like the pid scoping for perf, but for a cgroup. It is just that,
> the way perf event measurement works for cgroups, the measurements
> need to be taken system-wide.
Still doesn't make any sense; if you want to monitor just the vcpu,
attach to the one task already.
Without the vcpu per cgroup thing you'll never end up with O(n^2). You
get cgroups * cpus, which is what it is.
Your specific complaint was about this weird setup where you place
nr_cpus tasks in nr_cpus cgroups and then end up with O(n^2) fds.
Also, this isn't perf specific; cgroups _are_ system wide, so obviously
it needs system-wide measurement.
> > Just measure the parent cgroup of the vcpu cgroups if you're really only
> > interested in the virtual machine crap thing.
> >
> >> Given the issues with these use cases, is user space setting up the
> >> counters for each cpu in the system the best solution? Would it be
> >> better to allow system-wide data collection to be selected with one
> >> perf event open with pid==-1 and cpu==-1? Is setup of per cpu
> >> monitoring and aggregation of the counters across processors too
> >> difficult to do in the kernel?
> >
> > Not hard at all, but useless, you need a fd per cpu in order to get your
> > data out. Remember that the ring buffers are strictly per cpu.
> >
>
> Are the ring buffers needed just for sampling, or are they also
> needed for "perf stat" type information?
No, counting could do this; but even there I'd worry about scalability.
We'd need to fold the value into the 'global' counter on every cgroup
switch; now imagine all 80 cpus context switching at high rates between
cgroups.
Also we'd need to somehow manage multiple events with a single fd,
that's complexity we really do not need.
When we started out with perf we had such global constructs and we had
to quickly kill them for much smaller systems than this 80 cpu machine
you talk about.
* Re: Perf event operation with hotplug cpus and cgroups
From: William Cohen @ 2015-03-23 16:02 UTC
To: Peter Zijlstra; +Cc: paulus, Arnaldo Carvalho de Melo, LKML
On 03/20/2015 04:20 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2015 at 03:41:54PM -0400, William Cohen wrote:
>>
>> There isn't any desire to aggregate the different cgroups' data
>> together. The desired grouping is measurements per cgroup, kind of
>> like the pid scoping for perf, but for a cgroup. It is just that,
>> the way perf event measurement works for cgroups, the measurements
>> need to be taken system-wide.
>
> Still doesn't make any sense; if you want to monitor just the vcpu,
> attach to the one task already.
>
> Without the vcpu per cgroup thing you'll never end up with O(n^2). You
> get cgroups * cpus, which is what it is.
>
> Your specific complaint was about this weird setup where you place
> nr_cpus tasks in nr_cpus cgroups and then end up with O(n^2) fds.
>
> Also, this isn't perf specific; cgroups _are_ system wide, so obviously
> it needs system-wide measurement.
Hi Peter,
Monitoring OpenShift gears is likely to encounter this situation where cgroups >= cpus. Each OpenShift gear is a collection of processes running in a cgroup that is not pinned to a particular processor. A gear is typically limited to a fraction of a processor's time, so there are multiple gears per processor.
http://docs.openshift.org/origin-m4/oo_administration_guide.html#managing-gear-capacity
>
>>> Just measure the parent cgroup of the vcpu cgroups if you're really only
>>> interested in the virtual machine crap thing.
>>>
>>>> Given the issues with these use cases, is user space setting up the
>>>> counters for each cpu in the system the best solution? Would it be
>>>> better to allow system-wide data collection to be selected with one
>>>> perf event open with pid==-1 and cpu==-1? Is setup of per cpu
>>>> monitoring and aggregation of the counters across processors too
>>>> difficult to do in the kernel?
>>>
>>> Not hard at all, but useless, you need a fd per cpu in order to get your
>>> data out. Remember that the ring buffers are strictly per cpu.
>>>
>>
>> Are the ring buffers needed just for sampling, or are they also
>> needed for "perf stat" type information?
>
> No, counting could do this; but even there I'd worry about scalability.
> We'd need to fold the value into the 'global' counter on every cgroup
> switch; now imagine all 80 cpus context switching at high rates between
> cgroups.
>
> Also we'd need to somehow manage multiple events with a single fd,
> that's complexity we really do not need.
>
> When we started out with perf we had such global constructs and we had
> to quickly kill them for much smaller systems than this 80 cpu machine
> you talk about.
>
No question that doing frequent updates of global data structures kills performance. What about having the system-wide information accumulated on a per cpu basis, and making the read-out the slow operation that gathers the information from all the processors, so that the context switches are not slowed down?
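Roughly the split I am imagining, as illustrative pseudo-kernel code
with invented names, not a proposed patch:

/*
 * Hot path (cgroup switch) only touches a per-cpu counter; the
 * cross-cpu sum is paid by the (rare) reader.
 */
#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>

DEFINE_PER_CPU(u64, cgrp_count);

static inline void account_on_cgroup_switch(u64 delta)
{
        this_cpu_add(cgrp_count, delta);        /* no shared cacheline touched */
}

static u64 read_total_count(void)
{
        u64 sum = 0;
        int cpu;

        for_each_online_cpu(cpu)                /* slow path: reader gathers */
                sum += per_cpu(cgrp_count, cpu);
        return sum;
}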
What are the other complexities of managing multiple cpu performance events with a single fd? Allocating and freeing the underlying data structures on each of the processors? Starting and stopping the measurements?
-Will