From: Ingo Molnar <mingo@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Paul Turner <pjt@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH 00/27] Latest numa/core release, v16
Date: Tue, 20 Nov 2012 10:06:37 +0100
Message-ID: <20121120090637.GA14873@gmail.com>
In-Reply-To: <alpine.DEB.2.00.1211200001420.16449@chino.kir.corp.google.com>
* David Rientjes <rientjes@google.com> wrote:
> On Tue, 20 Nov 2012, Ingo Molnar wrote:
>
> > > This happened to be an Opteron (but not 83xx series), 2.4Ghz.
> >
> > Ok - roughly which family/model from /proc/cpuinfo?
>
> It's close enough, it's 23xx.
Ok - which family/model number in /proc/cpuinfo?
I'm asking because that will matter most to page fault
micro-characteristics: the 23xx series existed in Barcelona
form (family/model of 16/2) and it still exists in its
current Shanghai form.
My guess is Barcelona 16/2?
If that is correct then the closest I can get to your topology
is a 4-socket 32-way Opteron system with 32 GB of RAM - which
seems close enough for testing purposes.
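For reference, the family/model pair asked about above can be pulled out of /proc/cpuinfo text with a few lines. This is only a minimal sketch: the helper name cpu_family_model() and the sample text are made up for illustration, and the sample values simply mirror the 16/2 guess above rather than any real machine.

```python
# Hypothetical helper: extract (family, model) of the first CPU from
# text in the /proc/cpuinfo "key : value" format.
def cpu_family_model(cpuinfo_text):
    family = model = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU blocks
        key = key.strip()
        # Exact key match, so "model name" is not mistaken for "model".
        if key == "cpu family" and family is None:
            family = int(value)
        elif key == "model" and model is None:
            model = int(value)
        if family is not None and model is not None:
            break
    return family, model


# Illustrative sample only; on a real system pass the contents of
# /proc/cpuinfo instead.
SAMPLE = (
    "processor\t: 0\n"
    "vendor_id\t: AuthenticAMD\n"
    "cpu family\t: 16\n"
    "model\t\t: 2\n"
    "model name\t: (illustrative)\n"
)

print(cpu_family_model(SAMPLE))
```

On a live system the same function can be fed open("/proc/cpuinfo").read().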
But checking numa/core on such a system still keeps me
absolutely puzzled, as I get the following with a similar
16-warehouses SPECjbb 2005 test, using java -Xms8192m -Xmx8192m
-Xss256k sizing, THP enabled, 2x 240 seconds runs (I tried to
configure it all very close to yours), using -tip-a07005cbd847:
  kernel       warehouses   transactions/sec
  ---------    ----------   ----------------
  v3.7-rc6:    16           197802
               16           197997
  numa/core:   16           203086
               16           203967
So sadly numa/core is about 2%-3% faster on this 4x4 system too!
:-/
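The relative difference in the table above can be double-checked with a short calculation. This is a sketch using only the four quoted results; averaging the two runs per kernel is an assumption made here for illustration, not necessarily how the percentage above was derived.

```python
# Transactions/sec for the two 240-second runs of each kernel, as quoted above.
mainline = [197802, 197997]
numa_core = [203086, 203967]


def mean(values):
    return sum(values) / len(values)


def relative_gain_percent(baseline, candidate):
    """Percentage by which the candidate's mean exceeds the baseline's mean."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0


gain = relative_gain_percent(mainline, numa_core)
print(f"numa/core vs. v3.7-rc6: {gain:+.1f}%")
```

The result comes out a little under 3%, consistent with the 2%-3% range stated above.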
But I have to say, your SPECjbb score is uncharacteristically
low even for an oddball-topology Barcelona system - which is the
oldest/slowest system I can think of. So there might be more to
this.
To further characterise a "good" SPECjbb run, there's no
page_fault overhead visible in perf top:
Mainline profile:
94.99% perf-1244.map [.] 0x00007f04cd1aa523
2.52% libjvm.so [.] 0x00000000007004a1
0.62% [vdso] [.] 0x0000000000000972
0.31% [kernel] [k] clear_page_c
0.17% [kernel] [k] timekeeping_get_ns.constprop.7
0.11% [kernel] [k] rep_nop
0.09% [kernel] [k] ktime_get
0.08% [kernel] [k] get_cycles
0.06% [kernel] [k] read_tsc
0.05% libc-2.15.so [.] __strcmp_sse2
numa/core profile:
95.66% perf-1201.map [.] 0x00007fe4ad1c8fc7
1.70% libjvm.so [.] 0x0000000000381581
0.59% [vdso] [.] 0x0000000000000607
0.19% [kernel] [k] do_raw_spin_lock
0.11% [kernel] [k] generic_smp_call_function_interrupt
0.11% [kernel] [k] timekeeping_get_ns.constprop.7
0.08% [kernel] [k] ktime_get
0.06% [kernel] [k] get_cycles
0.05% [kernel] [k] __native_flush_tlb
0.05% [kernel] [k] rep_nop
0.04% perf [.] add_hist_entry.isra.9
0.04% [kernel] [k] rcu_check_callbacks
0.04% [kernel] [k] ktime_get_update_offsets
0.04% libc-2.15.so [.] __strcmp_sse2
No page fault overhead (see the page fault rate further below) -
the NUMA scanning overhead shows up only through some mild TLB
flush activity (which I'll fix btw).
[ Stupid question: cpufreq is configured to always-2.4GHz,
right? If you could send me your kernel config (you can do
that privately as well) then I can try to boot it and see. ]
> > > It's perf top -U, the benchmark itself was unchanged so I
> > > didn't think it was interesting to gather the user
> > > symbols. If that would be helpful, let me know!
> >
> > Yeah, regular perf top output would be very helpful to get a
> > general sense of proportion. Thanks!
>
> Ok, here it is:
>
> 91.24% perf-10971.map [.] 0x00007f116a6c6fb8
> 1.19% libjvm.so [.] instanceKlass::oop_push_contents(PSPromotionMa
> 1.04% libjvm.so [.] PSPromotionManager::drain_stacks_depth(bool)
> 0.79% libjvm.so [.] PSPromotionManager::copy_to_survivor_space(oop
> 0.60% libjvm.so [.] PSPromotionManager::claim_or_forward_internal_
> 0.58% [kernel] [k] page_fault
> 0.28% libc-2.3.6.so [.] __gettimeofday
> 0.26% libjvm.so [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned
> 0.22% [kernel] [k] getnstimeofday
> 0.18% libjvm.so [.] CardTableExtension::scavenge_contents_parallel(ObjectS
> 0.15% [kernel] [k] _raw_spin_lock
> 0.12% [kernel] [k] ktime_get_update_offsets
> 0.11% [kernel] [k] ktime_get
> 0.11% [kernel] [k] rcu_check_callbacks
> 0.10% [kernel] [k] generic_smp_call_function_interrupt
> 0.10% [kernel] [k] read_tsc
> 0.10% [kernel] [k] clear_page_c
> 0.10% [kernel] [k] __do_page_fault
> 0.08% [kernel] [k] handle_mm_fault
> 0.08% libjvm.so [.] os::javaTimeMillis()
> 0.08% [kernel] [k] emulate_vsyscall
Oh, finally a clue: you seem to have vsyscall emulation
overhead!
Vsyscall emulation is fundamentally page fault driven - which
might explain why you are seeing page fault overhead. It might
also interact with other sources of faults - such as numa/core's
working set probing ...
Many JVMs try to be smart with the vsyscall. As a test, does the
vsyscall=native boot option change the results/behavior in any
way?
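One way to confirm which vsyscall mode a test kernel was booted with is to look for a vsyscall= parameter on the kernel command line. A minimal sketch, assuming the command line is available as a string (for example read from /proc/cmdline); the helper name vsyscall_mode() and the sample command lines are hypothetical.

```python
# Hypothetical helper: report the vsyscall= boot parameter, if any.
def vsyscall_mode(cmdline):
    """Return the value of the last vsyscall= parameter, or None if absent."""
    mode = None
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            mode = token.split("=", 1)[1]
    return mode


# Illustrative command lines, not taken from any real system.
print(vsyscall_mode("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro vsyscall=native"))
print(vsyscall_mode("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro"))
```

A None result only means the parameter was not given, in which case the kernel's built-in default applies.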
Stupid question, if you apply the patch attached below and if
you do page fault profiling while the run is in steady state:
  perf record -e faults -g -a sleep 10
do you see it often coming from the vsyscall page?
Also, this:
  perf stat -e faults -a --repeat 10 sleep 1
should normally report something like this during SPECjbb steady
state, numa/core:
warmup: 3,895 faults/sec ( +- 12.11% )
steady state: 3,910 faults/sec ( +- 6.72% )
Which is about 250 faults/sec/CPU - i.e. it should be barely
recognizable in profiles - let alone be prominent as in yours.
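The per-CPU figure is just the system-wide fault rate divided by the number of CPUs. A small sketch of that arithmetic, assuming 16 CPUs (the "4x4" reading of the test system); the CPU count is an assumption here, and the result scales accordingly for a different count.

```python
def faults_per_cpu(faults_per_sec, nr_cpus):
    """Average page fault rate per CPU for a system-wide rate."""
    return faults_per_sec / nr_cpus


NR_CPUS = 16  # assumption: 4 sockets x 4 cores

# System-wide rates quoted from the perf stat output above.
for label, rate in (("warmup", 3895), ("steady state", 3910)):
    print(f"{label}: {faults_per_cpu(rate, NR_CPUS):.0f} faults/sec/CPU")
```

Under that assumption both rates land just below the roughly 250 faults/sec/CPU mentioned above.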
Thanks,
Ingo
---
arch/x86/mm/fault.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c
+++ linux/arch/x86/mm/fault.c
@@ -1030,6 +1030,9 @@ __do_page_fault(struct pt_regs *regs, un
 	/* Get the faulting address: */
 	address = read_cr2();
 
+	/* Instrument as early as possible: */
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
 	/*
 	 * Detect and handle instructions that would cause a page fault for
 	 * both a tracked kernel page and a userspace page.
@@ -1107,8 +1110,6 @@ __do_page_fault(struct pt_regs *regs, un
 		}
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	/*
 	 * If we're in an interrupt, have no user context or are running
 	 * in an atomic region then we must not take the fault: