Date: Thu, 1 Oct 2009 12:28:37 +0200
From: Ingo Molnar
To: "K.Prasad"
Cc: Arjan van de Ven, "Frank Ch. Eigler", peterz@infradead.org,
	linux-kernel@vger.kernel.org, Frederic Weisbecker
Subject: Re: [RFC PATCH] perf_core: provide a kernel-internal interface to get to performance counters
Message-ID: <20091001102837.GA31794@elte.hu>
References: <20090925122556.2f8bd939@infradead.org>
	<20090926183246.GA4141@in.ibm.com>
	<20090926204848.0b2b48d2@infradead.org>
	<20091001072518.GA1502@elte.hu>
	<20091001081616.GA3636@in.ibm.com>
	<20091001085330.GC15345@elte.hu>
	<20091001100109.GB3636@in.ibm.com>
In-Reply-To: <20091001100109.GB3636@in.ibm.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
List-ID: linux-kernel@vger.kernel.org


* K.Prasad wrote:

> On Thu, Oct 01, 2009 at 10:53:30AM +0200, Ingo Molnar wrote:
> >
> > * K.Prasad wrote:
> >
> > > On Thu, Oct 01, 2009 at 09:25:18AM +0200, Ingo Molnar wrote:
> > > >
> > > > * Arjan van de Ven wrote:
> > > >
> > > > > On Sun, 27 Sep 2009 00:02:46 +0530
> > > > > "K.Prasad" wrote:
> > > > >
> > > > > > On Sat, Sep 26, 2009 at 12:03:28PM -0400,
> > > > > > Frank Ch. Eigler wrote:
> > > > > >
> > > > > > > For what it's worth, this sort of thing also looks useful from
> > > > > > > systemtap's point of view.
> > > > > >
> > > > > > Wouldn't SystemTap be another user that desires support for
> > > > > > multiple/all CPU perf-counters (apart from hw-breakpoints as a
> > > > > > potential user)? As Arjan pointed out, perf's present design would
> > > > > > support only a per-CPU or per-task counter; not both.
> > > > >
> > > > > I'm sorry but I think I am missing your point. "all cpu counters"
> > > > > would be one small helper wrapper away, a helper I'm sure the
> > > > > SystemTap people are happy to submit as part of their patch series
> > > > > when they submit SystemTap to the kernel.
> > > >
> > > > Yes, and Frederic wrote that wrapper already for the hw-breakpoints
> > > > patches. It's a non-issue and does not affect the design - we can always
> > > > gang up an array of per-cpu perf events; it's a straightforward use of
> > > > the existing design.
> > >
> > > Such a design (iteratively invoking a per-CPU perf event for all
> > > desired CPUs) isn't without issues, some of which are noted here
> > > (apart from http://lkml.org/lkml/2009/9/14/298):
> > >
> > > - It breaks the abstraction that a user of the exported interfaces would
> > >   enjoy w.r.t. having all-CPU (or a cpumask of CPUs) breakpoints.
> >
> > CPU offlining/onlining support would be interesting to add.
> >
> > > - (Un)Availability of debug registers on every requested CPU is not
> > >   known until the request for that CPU fails. A failed request should be
> > >   followed by a rollback of the partially successful requests.
> >
> > Yes.
> >
> > > - Any breakpoint exceptions generated due to partially successful
> > >   requests (before a failed request is encountered) must be treated as
> > >   'stray' and be ignored (by the end-user? or the wrapper code?).
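[ The "one small helper wrapper" plus the rollback-on-partial-failure
  discussed above can be sketched as follows. This is a userspace
  simulation, not kernel code: NR_CPUS, MAX_SLOTS, reserve_on_cpu()
  and register_wide() are invented stand-ins for the real per-CPU
  debug-register accounting and registration calls. ]

```c
/*
 * Userspace sketch (invented names): register a per-CPU event on every
 * CPU, and roll the partially successful requests back as soon as any
 * single CPU turns out to have no free slot.
 */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define MAX_SLOTS 4             /* x86 has four debug address registers */

static int slots_used[NR_CPUS];

/* Simulated per-CPU registration: fails once a CPU is out of slots. */
static bool reserve_on_cpu(int cpu)
{
        if (slots_used[cpu] >= MAX_SLOTS)
                return false;
        slots_used[cpu]++;
        return true;
}

static void release_on_cpu(int cpu)
{
        slots_used[cpu]--;
}

/* All-CPU registration: undo every prior success on the first failure. */
static bool register_wide(void)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!reserve_on_cpu(cpu)) {
                        while (--cpu >= 0)
                                release_on_cpu(cpu);
                        return false;
                }
        }
        return true;
}
```

[ A failure on CPU n leaves CPUs 0..n-1 exactly as they were, so no
  breakpoint survives a partially failed request to later generate the
  'stray' exceptions mentioned above. ]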
> >
> > Such inatomicity is inherent in using more than one CPU and a disjoint
> > set of hw-breakpoints. If the calling code cares, then callbacks
> > triggering while the registration has not yet returned can be ignored.
>
> It can be prevented through book-keeping for the debug registers, using
> a 'greedy' approach that writes values onto the physical registers only
> when it is known that sufficient slots are available on all desired
> CPUs (as done by the register_kernel_hw_breakpoint() code in -tip now).
>
> > > - Any CPUs that become online eventually have to be trapped and
> > >   populated with the appropriate debug register value (not something
> > >   that the end-user of breakpoints should be bothered with).
> > >
> > > - Modifying the characteristics of a kernel breakpoint (including the
> > >   set of valid CPUs) will be equally painful.
> > >
> > > - Races between the requests (also leading to temporary failure of
> > >   all-CPU requests) present an unclear picture about free debug
> > >   registers (making it difficult to predict the need for a retry).
> > >
> > > So we either have a perf event infrastructure that is cognisant of
> > > many/all CPU counters, or make perf a user of the hw-breakpoints
> > > layer, which already handles such requests in a deft manner (through
> > > appropriate book-keeping).
> >
> > Given that these are all still in the add-on category not affecting the
> > design, while the problems solved by perf events are definitely in the
> > non-trivial category, i'd suggest you extend perf events with a 'system
> > wide' event abstraction, which:
> >
> >  - Enumerates such registered events (via a list)
> >
> >  - Adds a CPU hotplug handler (which clones those events over to a new
> >    CPU and directs it back to the ring-buffer of the existing event(s)
> >    [if any])
> >
> >  - Plus a state field that allows the filtering out of stray/premature
> >    events.
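[ The three bullets of the suggested 'system wide' event could look
  roughly like this. All names here (wide_event, wide_event_cpu_online()
  and so on) are hypothetical, and this is a userspace sketch - the real
  thing would hang off perf's event list and CPU hotplug notifiers. ]

```c
/*
 * Userspace sketch (hypothetical names): a per-event record that would
 * sit on a global list, a hotplug hook that clones the event onto a
 * newly onlined CPU, and a state field used to drop stray/premature
 * events.
 */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

enum wide_state { WIDE_INACTIVE, WIDE_ACTIVE };

struct wide_event {
        enum wide_state state;          /* filters premature callbacks */
        bool installed[NR_CPUS];        /* which CPUs carry a clone */
        int hits;                       /* samples accepted into the buffer */
};

/* Hotplug handler: clone the wide event onto a CPU that came online. */
static void wide_event_cpu_online(struct wide_event *ev, int cpu)
{
        ev->installed[cpu] = true;
}

/* Overflow callback: ignore anything that fires before activation. */
static void wide_event_overflow(struct wide_event *ev, int cpu)
{
        if (ev->state != WIDE_ACTIVE || !ev->installed[cpu])
                return;                 /* stray/premature: drop it */
        ev->hits++;
}
```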
> >
> With some book-keeping (as stated before) in place, stray exceptions
> due to premature events would be prevented, since only successful
> requests are written onto the debug registers. There would be no need
> for a rollback by the end-user either.
>
> But I'm not sure whether such book-keeping variables/data-structures
> will find uses in other hw/sw events in perf apart from breakpoints
> (it depends on whether there's a need to support multiple instances of
> a hw/sw perf counter for a given CPU). If so, the existing
> synchronisation mechanism (spin-locks over hw_breakpoint_lock) must be
> extended to the other perf events (post integration).

yes - i think the counter array currently used by 'perf top' could be
changed to use that new event type. Also, 'perf record --all' could use
it. SysProf (a system-wide profiler) would also be a potential user of
it. So yes, there would be use for this well beyond hw-breakpoints.

	Ingo
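[ The 'greedy' book-keeping K.Prasad describes above, reduced to a
  compilable toy with invented names (the actual code is the
  register_kernel_hw_breakpoint() path in -tip): free slots are checked
  on every CPU first, and the physical registers are only touched once
  the whole request is known to fit, so neither stray exceptions nor a
  rollback can arise. ]

```c
/*
 * Userspace sketch (invented names). In the kernel, both passes would
 * run under hw_breakpoint_lock, as mentioned above.
 */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define MAX_SLOTS 4

static int dr_reserved[NR_CPUS];        /* book-kept reservations */
static int dr_installed[NR_CPUS];       /* values "written" to hardware */

static bool register_wide_greedy(void)
{
        int cpu;

        /* Pass 1: pure book-keeping; nothing touches the "hardware". */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (dr_reserved[cpu] >= MAX_SLOTS)
                        return false;   /* fail with zero side effects */

        /* Pass 2: every CPU is known to have room; commit everywhere. */
        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                dr_reserved[cpu]++;
                dr_installed[cpu]++;
        }
        return true;
}
```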