Date: Wed, 20 Oct 2010 17:35:04 +0200
From: Frederic Weisbecker
To: "Frank Ch. Eigler"
Cc: LKML, Peter Zijlstra, Arnaldo Carvalho de Melo, Paul Mackerras, Stephane Eranian, Cyrill Gorcunov, Tom Zanussi, Masami Hiramatsu, Steven Rostedt, Robert Richter
Subject: Re: [RFC] perf: Dwarf cfi based user callchains
Message-ID: <20101020153500.GA5387@nowhere>
References: <1286946421-32202-1-git-send-regression-fweisbec@gmail.com>

On Wed, Oct 13, 2010 at 11:13:27AM -0400, Frank Ch. Eigler wrote:
>
> Frederic Weisbecker writes:
>
> > [...]
> > This brings dwarf cfi based callchain for userspace apps that don't have
> > frame pointers.
>
> Interesting approach!
>
> > [...]
> > - it's slow. A first improvement to make it faster is to support binary
> >   search from .eh_frame_hdr.
>
> In systemtap land, we did find a dramatic improvement from this too.

Yeah, a linear walk of .eh_frame is not good for anything other than testing and debugging (see the lookup-table sketch below).

> Have you measured the cost of transcribing of potentially large chunks
> of the user stacks? We did not seriously evaluate this path, since we
> encounter megabyte+ stacks in larger userspace apps, and copying THAT
> out seemed absurd.

Actually that's quite affordable. And you don't need to dump the whole stack of a process on every sample; that would indeed be absurd. A small chunk, starting from the stack pointer, is enough.

What is interesting is that you can play with several different dump sizes: the larger the dump, the deeper you'll be able to unwind, but the higher the profiling overhead.
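To make the per-sample dump concrete, here is a hedged sketch of the kernel side. The function name and its shape are made up for illustration; this is not the code from the patch:

/*
 * Illustrative sketch: copy a small, bounded chunk of the user
 * stack, starting at the sampled user stack pointer.
 */
#include <linux/ptrace.h>
#include <linux/uaccess.h>

static int sample_user_stack_chunk(struct pt_regs *regs,
				   void *buf, size_t size)
{
	const void __user *sp =
		(const void __user *)user_stack_pointer(regs);

	/*
	 * Copy at most 'size' bytes from the top of the user stack.
	 * The real code runs in atomic (often NMI) context, so it
	 * has to use a non-faulting copy; a plain copy_from_user()
	 * that can sleep would not do. It should also tolerate a
	 * partial copy near the end of the stack mapping.
	 */
	if (__copy_from_user_inatomic(buf, sp, size))
		return -EFAULT;

	return 0;
}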
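And back on the .eh_frame_hdr point: the section exists precisely to allow this, since it carries a table of (initial location, FDE address) pairs sorted by address. A userspace sketch, assuming the common DW_EH_PE_datarel | DW_EH_PE_sdata4 table encoding; a real unwinder must decode the table_enc byte from the header instead of hardcoding it:

#include <stdint.h>
#include <stddef.h>

/*
 * With the sdata4/datarel encoding, each table entry is a pair of
 * signed 32-bit offsets relative to the start of .eh_frame_hdr.
 */
struct table_entry {
	int32_t initial_loc;	/* function start, datarel encoded */
	int32_t fde_offset;	/* FDE address, datarel encoded */
};

static const void *find_fde(const uint8_t *hdr, uintptr_t ip)
{
	/*
	 * Header layout: version, eh_frame_ptr_enc, fde_count_enc,
	 * table_enc (1 byte each), then eh_frame_ptr and fde_count
	 * (4 bytes each under the encodings assumed above), then
	 * the table itself, sorted by initial location.
	 */
	uint32_t fde_count = *(const uint32_t *)(hdr + 8);
	const struct table_entry *table =
		(const struct table_entry *)(hdr + 12);
	size_t lo = 0, hi = fde_count;

	if (!fde_count || ip < (uintptr_t)(hdr + table[0].initial_loc))
		return NULL;

	/* Find the last entry whose initial location is <= ip. */
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if ((uintptr_t)(hdr + table[mid].initial_loc) <= ip)
			lo = mid;
		else
			hi = mid;
	}

	/* Caller still checks ip against the FDE's address range. */
	return hdr + table[lo].fde_offset;
}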
As for the overhead, I can't measure significant throughput issues with hackbench, for example:

Normal run:

$ time ./hackbench 10
Time: 3.415

real	0m3.506s
user	0m0.257s
sys	0m6.519s

With perf record (the default is the cycles counter at a 1000 Hz sample frequency):

$ time ./perf record ./hackbench 10
Time: 3.584
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.381 MB perf.data (~16654 samples) ]

real	0m3.748s
user	0m0.028s
sys	0m0.022s

With perf record + frame pointer based callchain capture:

$ time ./perf record -g fp ./hackbench 10
Time: 3.666
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 1.426 MB perf.data (~62281 samples) ]

real	0m3.834s
user	0m0.033s
sys	0m0.046s

With perf record + 4096 bytes of stack dump in every sample:

$ time ./perf record -g dwarf,4096 ./hackbench 10
Time: 3.697
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 5.156 MB perf.data (~225251 samples) ]

real	0m3.931s
user	0m0.026s
sys	0m0.135s

With perf record + 20000 bytes of stack dump in every sample:

$ time ./perf record -g dwarf,20000 ./hackbench 10
Time: 3.847
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 13.349 MB perf.data (~583219 samples) ]

real	0m4.559s
user	0m0.028s
sys	0m0.329s

So there are no big differences. Maybe hackbench is not a good example to highlight the possible impact, though, and some tuning with higher sample frequencies would make the difference more visible. Perhaps the real impact is on the amount of data to record: the data file obviously tends to grow quickly.
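A back-of-the-envelope worst case at the default frequency bounds that growth: 1000 samples/sec * 20000 bytes = 20 MB/sec of raw stack data per sampled CPU, before counting the other sample fields. So the dump size really is the knob to watch for perf.data growth.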