From: Ingo Molnar <mingo@elte.hu>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	linux-kernel@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
	paulus@samba.org, acme@redhat.com, efault@gmx.de,
	npiggin@suse.de, tglx@linutronix.de,
	linux-tip-commits@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
	torvalds@linux-foundation.org,
	Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: [tip:perfcounters/core] x86: Add NMI types for kmap_atomic
Date: Mon, 15 Jun 2009 17:52:28 +0200
Message-ID: <20090615155228.GA19529@elte.hu>
In-Reply-To: <1245080486.6800.561.camel@laptop>


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > I've not been following the background to this,
> 
> We need/want to do a user-space stack walk from IRQ/NMI context. 
> The NMI bit means we cannot simply use __copy_from_user_inatomic() 
> since that will still fault (albeit not page), and the fault 
> return path invokes IRET which will terminate the NMI context.

The goal is to let 'perf' (see tools/perf/) produce non-flat 
categorizations, like the sample output in the (pending) commit 
attached further below. Here's the kind of output it allows:

 $ perf record -g -m 512 -f -- make -j32 kernel
 $ perf report -s s --syscalls | grep '\[k\]' | grep -v nmi

     4.14%  [k] do_page_fault
     1.20%  [k] sys_write
     1.10%  [k] sys_open
     0.63%  [k] sys_exit_group
     0.48%  [k] smp_apic_timer_interrupt
     0.37%  [k] sys_read
     0.37%  [k] sys_execve
     0.20%  [k] sys_mmap
     0.18%  [k] sys_close
     0.14%  [k] sys_munmap
     0.13%  [k] sys_poll

Note that Oprofile uses the same method of 
__copy_from_user_inatomic() in arch/x86/oprofile/backtrace.c, but i 
believe that code is broken - i doubt the call-chain support for 
user-space stacks ever worked in oprofile - with perfcounters i can 
make this method crash under load. (We re-enter the NMI, which due 
to IST executes over the exact same, still pending NMI frame. 
Kaboom.)

I saw you being involved with the Oprofile code 3 years ago:

| commit c34d1b4d165c67b966bca4aba026443d7ff161eb
| Author: Hugh Dickins <hugh@veritas.com>
| Date:   Sat Oct 29 18:16:32 2005 -0700
|
|    [PATCH] mm: kill check_user_page_readable

That method of __copy_from_user_inatomic(), while elegant, is subtly 
wrong in an NMI context. We really must avoid taking faults there.

	Ingo

------------>
From 3dfabc74c65904c9e6cf952391312d16ea772ef5 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 15 Jun 2009 11:24:38 +0200
Subject: [PATCH] perf report: Add per system call overhead histogram

Take advantage of call-graph perfcounter sampling/recording to
display a non-trivial histogram: the true, collapsed/summarized
cost measurement, on a per system call total overhead basis:

 aldebaran:~/linux/linux/tools/perf> ./perf record -g -a -f ~/hackbench 10
 aldebaran:~/linux/linux/tools/perf> ./perf report -s symbol --syscalls | head -10
 #
 # (3536 samples)
 #
 # Overhead  Symbol
 # ........  ......
 #
     40.75%  [k] sys_write
     40.21%  [k] sys_read
      4.44%  [k] do_nmi
 ...

This is done by accounting each (reliable) call-chain that chains back
to a given system call to that system call function.

[ So in the above example we can see that hackbench spends about 40% of
  its total time somewhere in sys_write() and 40% somewhere in
  sys_read(), the rest of the time is spent in user-space. The time
  is not spent in sys_write() _itself_ but in one of its many child
  functions. ]

Or, a recording of a kernel build (source files are already in the page-cache):

 $ perf record -g -m 512 -f -- make -j32 kernel
 $ perf report -s s --syscalls | grep '\[k\]' | grep -v nmi

     4.14%  [k] do_page_fault
     1.20%  [k] sys_write
     1.10%  [k] sys_open
     0.63%  [k] sys_exit_group
     0.48%  [k] smp_apic_timer_interrupt
     0.37%  [k] sys_read
     0.37%  [k] sys_execve
     0.20%  [k] sys_mmap
     0.18%  [k] sys_close
     0.14%  [k] sys_munmap
     0.13%  [k] sys_poll
     0.09%  [k] sys_newstat
     0.07%  [k] sys_clone
     0.06%  [k] sys_newfstat
     0.05%  [k] sys_access
     0.05%  [k] schedule

This shows the true total cost of each syscall variant that gets used
during a kernel build. The profile reveals that pagefaults are
the costliest, followed by read()/write().

An interesting detail: timer interrupts cost 0.5% - or 0.5 seconds
per 100 seconds of kernel build-time. (this was done with HZ=1000)

The summary is done in 'perf report', i.e. in the post-processing
stage - so once we have a good call-graph recording, this type of
non-trivial high-level analysis becomes possible.
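
As a rough illustration of the accounting described above, here is a
minimal standalone sketch of the collapse step. The struct and function
names (callchain_sketch, collapse_to_syscall) are hypothetical and are
not part of the patch. It assumes the call-chain lists the innermost
frame first, that the first 'kernel' entries are kernel addresses, and
that the last kernel entry is the syscall entry stub, so the one before
it is the sys_*() function. The upper-bound check against 'nr' is an
extra safety assumption that the patch itself does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the recorded call-chain. */
struct callchain_sketch {
	uint64_t	nr;	/* total number of entries in ips[] */
	uint64_t	kernel;	/* leading entries that are kernel addresses */
	const uint64_t	*ips;	/* assumed order: innermost frame first */
};

/*
 * Return the address a sample should be accounted to when collapsing
 * to the syscall level: the all-but-last kernel entry of the chain.
 * If the chain has fewer than two kernel entries, the sampled ip is
 * returned unchanged.
 */
static uint64_t
collapse_to_syscall(const struct callchain_sketch *chain, uint64_t ip)
{
	assert(chain != NULL);

	/* 'kernel <= nr' is an added sanity check, not in the patch. */
	if (chain->kernel >= 2 && chain->kernel <= chain->nr)
		return chain->ips[chain->kernel - 2];

	return ip;
}
```

With a chain of { child function, sys_write, entry stub, user frame }
and kernel == 3, the sample would be accounted to the sys_write entry
rather than to the child function it was actually taken in.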

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 tools/perf/builtin-report.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index aebba56..1e2f5dd 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -40,6 +40,7 @@ static int		dump_trace = 0;
 
 static int		verbose;
 static int		full_paths;
+static int		collapse_syscalls;
 
 static unsigned long	page_size;
 static unsigned long	mmap_window = 32;
@@ -983,6 +984,15 @@ process_overflow_event(event_t *event, unsigned long offset, unsigned long head)
 			for (i = 0; i < chain->nr; i++)
 				dprintf("..... %2d: %p\n", i, (void *)chain->ips[i]);
 		}
+		if (collapse_syscalls) {
+			/*
+			 * Find the all-but-last kernel entry
+			 * amongst the call-chains - to get
+			 * to the level of system calls:
+			 */
+			if (chain->kernel >= 2)
+				ip = chain->ips[chain->kernel-2];
+		}
 	}
 
 	dprintf(" ... thread: %s:%d\n", thread->comm, thread->pid);
@@ -1343,6 +1353,8 @@ static const struct option options[] = {
 		   "sort by key(s): pid, comm, dso, symbol. Default: pid,symbol"),
 	OPT_BOOLEAN('P', "full-paths", &full_paths,
 		    "Don't shorten the pathnames taking into account the cwd"),
+	OPT_BOOLEAN('S', "syscalls", &collapse_syscalls,
+		    "show per syscall summary overhead, using call graph"),
 	OPT_END()
 };
 
