Date: Thu, 20 May 2010 11:31:31 +0200
From: Ingo Molnar
To: Steven Rostedt
Cc: LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
    Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
    Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, Masami Hiramatsu, Lin Ming,
    Cyrill Gorcunov, Mike Galbraith, Paul Mackerras,
    Hitoshi Mitake, Robert Richter
Subject: [RFD] Future tracing/instrumentation directions
Message-ID: <20100520093131.GA30929@elte.hu>
In-Reply-To: <1274291514.26328.930.camel@gandalf.stny.rr.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Steven Rostedt wrote:

> More than a year and a half ago (September 2008), at
> Linux Plumbers, we had a meeting with several kernel
> developers to come up with a unified ring buffer. A
> generic ring buffer in the kernel that any subsystem
> could use. After coming up with a set of requirements, I
> worked on implementing it. One of the requirements was
> to start off simple and work to become a more complete
> buffering system.
>
> [...]

The thing is, in tracing land and more broadly in
instrumentation land we have _much_ more earthly problems
these days:

 - Let's face it, performance of the ring-buffer sucks, in
   a big way. I've recently benchmarked it and it takes
   hundreds of instructions to trace a single event.
   Puh-lease ...

 - It has grown a lot of slack. It's complex and hard to
   read.

 - Over the last year or so the majority of bleeding-edge
   tracing developers have gradually migrated over to perf
   and 'perf trace' / 'perf probe' in particular. As far as
   the past two merge windows go they are out-developing
   the old ftrace UIs by a ratio of 4:1.

   So this angle is becoming a prime thing to improve and
   users and developers are hurting from the ftrace/perf
   duality.

 - [ While it's still a long way off, if this trend
     continues we eventually might even be able to get rid
     of the /debug/tracing/ temporary debug API and get rid
     of the ugly in-kernel pretty-printing bits. This is
     good: it may make Andrew very happy for a change ;-)

     The main detail here to be careful of is that lots of
     people are fond of the simplicity of the
     /debug/tracing/ debug UI, so when we replace it we
     want to do it by keeping that simple workflow (or best
     by making it even simpler). I have a few ideas how to
     do this.

     There's also the detail that in some cases we want to
     print events in the kernel in a human readable way:
     for example EDAC/MCE and other critical events,
     trace-on-oops, etc. This too can be solved.
   ]

Regarding performance and complexity, which is our main
worry atm, fortunately there's work going on in that
direction - please see PeterZ's recent string of patches
on lkml:

  4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
  a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
  ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
  fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t
  6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg()

And it may sound harsh but at this stage i'm personally
not at all interested in big design talk.

This isn't rocket science, we have developers and users
and we know what they are doing and we know what we need
to do: we need to improve our crap and we need to reduce
complexity. Less is more.

So i'd like to see iterative, useful action first, and i
am somewhat sceptical about yet another grand tracing
design trying to match 100 requirements.

Steve, Mathieu, if you are interested please help out
Peter with the performance and simplification work. The
last thing we need is yet another replace-everything
event.

If we really want to create a new ring-buffer abstraction
i'd suggest we start with Peter's, it has a quite sane
design and stayed simple and flexible - if anything, it
could be factored out a bit.

Here are more bits of what i see as the 'action' going
forward, in no particular order:

1) Push the /debug/tracing/events/ event description into
   sysfs, as per this thread on lkml:

     [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs

     http://groups.google.com/group/linux.kernel/msg/ab9aa075016c639e

   I.e. one more step towards integrating ftrace into perf.

2) Use 1) to unify the perf events and the ftrace
   ring-buffer. This, as things stand, is best done by
   factoring out Peter's ring-buffer in
   kernel/perf_event.c. It's properly abstracted and it is
   _far_ simpler than kernel/tracing/ring_buffer.c, which
   has become a monstrosity.
   (but i'm open to other simplifications as well)

3) Add the function-tracer and function-graph tracer as
   events and integrate them into perf. This will
   live-test the efficiency of the unification and bring
   over the last big ftrace plugin to perf.

4) Gradually convert/port/migrate all the remaining
   plugins over as well. We need to do this very gently
   because there are users - but stop piling new
   functionality on to the old ftrace side. This usually
   involves:

    - Conversion of an explicit tracing callback to
      TRACE_EVENT (for example in the case of mmiotrace),
      while keeping all tool functionality.

    - Migrate any 'special' ftrace feature to perf
      capabilities so that it's available via the syscall
      interface as well. (for example 'latency maximum
      tracking' is something that we probably want to do
      with kernel-side help - we probably don't want to
      implement it via tracing everything all the time and
      finding the maximum based on terabytes of data.)

   (And there are other complications here too, but you
   get the idea.)

All in all, i think we can reuse more than 50% of all
current ftrace code (possibly up to 70-80%) - and we are
already reusing bits of it.

Thanks,

	Ingo
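
[ Illustration for the 'latency maximum tracking' point in
  item 4) above. This is a minimal userspace sketch of the
  idea only, not code from ftrace or perf: the struct and
  function names (max_latency, max_latency_init,
  max_latency_update) are hypothetical. It assumes
  timestamps in nanoseconds with end_ns >= start_ns. The
  point it tries to show is that keeping a running maximum
  at the event site needs constant state, whereas logging
  every event and searching for the maximum afterwards
  needs storage proportional to the number of events. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not an existing kernel or perf API. */
struct max_latency {
	uint64_t max_ns;	/* largest latency observed so far */
	uint64_t count;		/* number of samples observed */
};

/* Reset the tracker to its empty state. */
static void max_latency_init(struct max_latency *m)
{
	assert(m != NULL);
	m->max_ns = 0;
	m->count = 0;
}

/*
 * Record one start/end pair without storing the event itself.
 * Assumes end_ns >= start_ns.
 * Returns 1 if this sample set a new maximum, 0 otherwise.
 */
static int max_latency_update(struct max_latency *m,
			      uint64_t start_ns, uint64_t end_ns)
{
	uint64_t delta;

	assert(m != NULL);
	delta = end_ns - start_ns;
	m->count++;

	if (delta > m->max_ns) {
		m->max_ns = delta;
		return 1;
	}
	return 0;
}
```

[ A caller would run max_latency_init() once and then
  max_latency_update() at the end of each measured section.
  A return value of 1 marks the moment at which a tracer
  could choose to save the details of the new worst case,
  so only that one case needs to be kept. ]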