From: Ingo Molnar <mingo@elte.hu>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>,
linux-kernel@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
paulus@samba.org, acme@redhat.com, efault@gmx.de,
npiggin@suse.de, tglx@linutronix.de,
linux-tip-commits@vger.kernel.org, mathieu.desnoyers@polymtl.ca,
torvalds@linux-foundation.org,
Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: [tip:perfcounters/core] x86: Add NMI types for kmap_atomic
Date: Mon, 15 Jun 2009 17:52:28 +0200
Message-ID: <20090615155228.GA19529@elte.hu>
In-Reply-To: <1245080486.6800.561.camel@laptop>
* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > I've not been following the background to this,
>
> We need/want to do a user-space stack walk from IRQ/NMI context.
> The NMI bit means we cannot simply use __copy_from_user_inatomic()
> since that will still fault (albeit not page), and the fault
> return path invokes IRET which will terminate the NMI context.
The goal is to allow 'perf' (see tools/perf/) to produce non-flat
categorizations like the sample output in the (pending) commit
(attached further below). Here's the kind of output it allows:
$ perf record -g -m 512 -f -- make -j32 kernel
$ perf report -s s --syscalls | grep '\[k\]' | grep -v nmi
4.14% [k] do_page_fault
1.20% [k] sys_write
1.10% [k] sys_open
0.63% [k] sys_exit_group
0.48% [k] smp_apic_timer_interrupt
0.37% [k] sys_read
0.37% [k] sys_execve
0.20% [k] sys_mmap
0.18% [k] sys_close
0.14% [k] sys_munmap
0.13% [k] sys_poll
Note that Oprofile uses the same method of __copy_from_user_inatomic() in
arch/x86/oprofile/backtrace.c, but i believe that code is broken - i
doubt the call-chain support for user-space stacks ever worked in
oprofile - with perfcounters i can make this method crash under
load. (we re-enter the NMI which due to IST executes over the exact
same, still pending NMI frame. Kaboom.)
I saw you being involved with the Oprofile code 3 years ago:
| commit c34d1b4d165c67b966bca4aba026443d7ff161eb
| Author: Hugh Dickins <hugh@veritas.com>
| Date: Sat Oct 29 18:16:32 2005 -0700
|
| [PATCH] mm: kill check_user_page_readable
That method of __copy_from_user_inatomic(), while elegant, is subtly
wrong in an NMI context. We really must avoid taking faults there.
Ingo
------------>
From 3dfabc74c65904c9e6cf952391312d16ea772ef5 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 15 Jun 2009 11:24:38 +0200
Subject: [PATCH] perf report: Add per system call overhead histogram
Take advantage of call-graph perfcounter sampling/recording to
display a non-trivial histogram: the true, collapsed/summarized
cost measurement, on a per system call total overhead basis:
aldebaran:~/linux/linux/tools/perf> ./perf record -g -a -f ~/hackbench 10
aldebaran:~/linux/linux/tools/perf> ./perf report -s symbol --syscalls | head -10
#
# (3536 samples)
#
# Overhead Symbol
# ........ ......
#
40.75% [k] sys_write
40.21% [k] sys_read
4.44% [k] do_nmi
...
This is done by accounting each (reliable) call-chain that chains back
to a given system call to that system call function.
[ So in the above example we can see that hackbench spends about 40% of
its total time somewhere in sys_write() and 40% somewhere in
sys_read(); the rest of the time is spent in user-space. The time
is not spent in sys_write() _itself_ but in one of its many child
functions. ]
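A minimal sketch of that accounting rule as a standalone C fragment. The
struct callchain layout and the attribute_sample() helper are hypothetical
names made up for illustration, not the perf data structures; the sketch
also assumes the kernel-side entries occupy the first 'kernel' slots of
ips[]. Only the ips[kernel-2] selection is taken from the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a recorded call-chain. */
struct callchain {
	uint64_t nr;		/* total number of entries in ips[] */
	uint64_t kernel;	/* assumed: leading entries that are kernel-side */
	const uint64_t *ips;	/* return addresses, kernel entries first */
};

/*
 * Pick the address a sample gets accounted to.
 *
 * With collapsing enabled and at least two kernel entries, use the
 * all-but-last kernel entry (the same index the patch uses:
 * chain->kernel - 2). Otherwise keep the sampled ip unchanged.
 * The 'kernel <= nr' check is an extra guard added in this sketch.
 */
static uint64_t attribute_sample(const struct callchain *chain,
				 uint64_t sampled_ip, int collapse_syscalls)
{
	assert(chain != NULL);

	if (collapse_syscalls &&
	    chain->kernel >= 2 && chain->kernel <= chain->nr)
		return chain->ips[chain->kernel - 2];

	return sampled_ip;
}
```

Chains with fewer than two kernel entries keep their sampled ip, so they
stay in the histogram under their original symbol.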
Or, a recording of a kernel build (source files are already in the page-cache):
$ perf record -g -m 512 -f -- make -j32 kernel
$ perf report -s s --syscalls | grep '\[k\]' | grep -v nmi
4.14% [k] do_page_fault
1.20% [k] sys_write
1.10% [k] sys_open
0.63% [k] sys_exit_group
0.48% [k] smp_apic_timer_interrupt
0.37% [k] sys_read
0.37% [k] sys_execve
0.20% [k] sys_mmap
0.18% [k] sys_close
0.14% [k] sys_munmap
0.13% [k] sys_poll
0.09% [k] sys_newstat
0.07% [k] sys_clone
0.06% [k] sys_newfstat
0.05% [k] sys_access
0.05% [k] schedule
This shows the true total cost of each syscall variant that gets used
during a kernel build. The profile reveals that pagefaults are
the costliest, followed by read()/write().
An interesting detail: timer interrupts cost 0.5% - or 0.5 seconds
per 100 seconds of kernel build-time. (this was done with HZ=1000)
The summary is done in 'perf report', i.e. in the post-processing
stage - so once we have a good call-graph recording, this type of
non-trivial high-level analysis becomes possible.
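A rough sketch of such a post-processing summary, under stated
assumptions: struct hist, hist_add() and hist_overhead() are hypothetical
helpers written for this illustration and are not the builtin-report.c
implementation. Each sample is counted under the symbol it was attributed
to, and the overhead column is that count as a percentage of all samples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 64	/* arbitrary table size for this sketch */

struct hist_entry {
	const char *sym;	/* symbol the samples were attributed to */
	unsigned long count;	/* number of samples for that symbol */
};

struct hist {
	struct hist_entry e[MAX_SYMS];
	size_t n;		/* entries in use */
	unsigned long total;	/* all samples seen, tracked or not */
};

/* Account one sample to 'sym'; linear search keeps the sketch short. */
static void hist_add(struct hist *h, const char *sym)
{
	assert(h != NULL && sym != NULL);

	h->total++;
	for (size_t i = 0; i < h->n; i++) {
		if (strcmp(h->e[i].sym, sym) == 0) {
			h->e[i].count++;
			return;
		}
	}
	/* New symbol: add it if there is room, else it only counts in total. */
	if (h->n < MAX_SYMS) {
		h->e[h->n].sym = sym;
		h->e[h->n].count = 1;
		h->n++;
	}
}

/* Overhead of 'sym' as a percentage of all samples (0.0 if unknown). */
static double hist_overhead(const struct hist *h, const char *sym)
{
	assert(h != NULL && sym != NULL);

	if (h->total == 0)
		return 0.0;
	for (size_t i = 0; i < h->n; i++) {
		if (strcmp(h->e[i].sym, sym) == 0)
			return 100.0 * (double)h->e[i].count / (double)h->total;
	}
	return 0.0;
}
```

Because the percentages are computed from recorded samples only, the same
recording can be summarized in different ways without re-running the
workload.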
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
tools/perf/builtin-report.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index aebba56..1e2f5dd 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -40,6 +40,7 @@ static int dump_trace = 0;
 static int		verbose;
 static int		full_paths;
+static int		collapse_syscalls;
 static unsigned long	page_size;
 static unsigned long	mmap_window = 32;
@@ -983,6 +984,15 @@ process_overflow_event(event_t *event, unsigned long offset, unsigned long head)
 			for (i = 0; i < chain->nr; i++)
 				dprintf("..... %2d: %p\n", i, (void *)chain->ips[i]);
 		}
+		if (collapse_syscalls) {
+			/*
+			 * Find the all-but-last kernel entry
+			 * amongst the call-chains - to get
+			 * to the level of system calls:
+			 */
+			if (chain->kernel >= 2)
+				ip = chain->ips[chain->kernel-2];
+		}
 	}
 	dprintf(" ... thread: %s:%d\n", thread->comm, thread->pid);
@@ -1343,6 +1353,8 @@ static const struct option options[] = {
"sort by key(s): pid, comm, dso, symbol. Default: pid,symbol"),
 	OPT_BOOLEAN('P', "full-paths", &full_paths,
 		    "Don't shorten the pathnames taking into account the cwd"),
+	OPT_BOOLEAN('S', "syscalls", &collapse_syscalls,
+		    "show per syscall summary overhead, using call graph"),
 	OPT_END()
 };