Re: [PATCH 00/27] Latest numa/core release, v16

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Ingo Molnar <mingo@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Paul Turner <pjt@google.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH 00/27] Latest numa/core release, v16
Date: Tue, 20 Nov 2012 10:06:37 +0100	[thread overview]
Message-ID: <20121120090637.GA14873@gmail.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1211200001420.16449@chino.kir.corp.google.com>


* David Rientjes <rientjes@google.com> wrote:

> On Tue, 20 Nov 2012, Ingo Molnar wrote:
> 
> > > This happened to be an Opteron (but not 83xx series), 2.4Ghz.  
> > 
> > Ok - roughly which family/model from /proc/cpuinfo?
> 
> It's close enough, it's 23xx.

Ok - which family/model number in /proc/cpuinfo?

I'm asking because that will matter most to page fault 
micro-characteristics and the 23xx series existed in Barcelona 
form as well (family/model of 16/2) and it still exists in its 
current Shanghai form as well.

My guess is Barcelona 16/2?

If that is correct then the closest I can get to your topology 
is a 4-socket 32-way Opteron system with 32 GB of RAM - which 
seems close enough for testing purposes.

But checking numa/core on such a system still keeps me 
absolutely puzzled, as I get the following with a similar 
16-warehouses SPECjbb 2005 test, using java -Xms8192m -Xmx8192m 
-Xss256k sizing, THP enabled, 2x 240 seconds runs (I tried to 
configure it all very close to yours), using -tip-a07005cbd847:

         kernel             warehouses    transactions/sec
         ---------          ----------
         v3.7-rc6:          16            197802
                            16            197997

         numa/core:         16            203086
                            16            203967

So sadly numa/core is about 2%-3% faster on this 4x4 system too!
:-/

But I have to say, your SPECjbb score is uncharacteristically 
low even for an oddball-topology Barcelona system - which is the 
oldest/slowest system I can think of. So there might be more to 
this.

To further characterise a "good" SPECjbb run, there's no 
page_fault overhead visible in perf top:

Mainline profile:

    94.99%  perf-1244.map     [.] 0x00007f04cd1aa523                 
     2.52%  libjvm.so         [.] 0x00000000007004a1                 
     0.62%  [vdso]            [.] 0x0000000000000972                 
     0.31%  [kernel]          [k] clear_page_c                       
     0.17%  [kernel]          [k] timekeeping_get_ns.constprop.7     
     0.11%  [kernel]          [k] rep_nop                            
     0.09%  [kernel]          [k] ktime_get                          
     0.08%  [kernel]          [k] get_cycles                         
     0.06%  [kernel]          [k] read_tsc                           
     0.05%  libc-2.15.so      [.] __strcmp_sse2                      

numa/core profile:

    95.66%  perf-1201.map     [.] 0x00007fe4ad1c8fc7                 
     1.70%  libjvm.so         [.] 0x0000000000381581                 
     0.59%  [vdso]            [.] 0x0000000000000607                 
     0.19%  [kernel]          [k] do_raw_spin_lock                   
     0.11%  [kernel]          [k] generic_smp_call_function_interrupt
     0.11%  [kernel]          [k] timekeeping_get_ns.constprop.7     
     0.08%  [kernel]          [k] ktime_get                          
     0.06%  [kernel]          [k] get_cycles                         
     0.05%  [kernel]          [k] __native_flush_tlb                 
     0.05%  [kernel]          [k] rep_nop                            
     0.04%  perf              [.] add_hist_entry.isra.9              
     0.04%  [kernel]          [k] rcu_check_callbacks                
     0.04%  [kernel]          [k] ktime_get_update_offsets           
     0.04%  libc-2.15.so      [.] __strcmp_sse2                      

No page fault overhead (see the page fault rate further below) - 
the NUMA scanning overhead shows up only through some mild TLB 
flush activity (which I'll fix btw).

[ Stupid question: cpufreq is configured to always-2.4GHz, 
  right? If you could send me your kernel config (you can do 
  that privately as well) then I can try to boot it and see. ]

> > > It's perf top -U, the benchmark itself was unchanged so I 
> > > didn't think it was interesting to gather the user 
> > > symbols.  If that would be helpful, let me know!
> > 
> > Yeah, regular perf top output would be very helpful to get a 
> > general sense of proportion. Thanks!
> 
> Ok, here it is:
> 
>     91.24%  perf-10971.map    [.] 0x00007f116a6c6fb8                            
>      1.19%  libjvm.so         [.] instanceKlass::oop_push_contents(PSPromotionMa
>      1.04%  libjvm.so         [.] PSPromotionManager::drain_stacks_depth(bool)  
>      0.79%  libjvm.so         [.] PSPromotionManager::copy_to_survivor_space(oop
>      0.60%  libjvm.so         [.] PSPromotionManager::claim_or_forward_internal_
>      0.58%  [kernel]          [k] page_fault                                    
>      0.28%  libc-2.3.6.so     [.] __gettimeofday                                        
>      0.26%  libjvm.so         [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned
>      0.22%  [kernel]          [k] getnstimeofday                                        
>      0.18%  libjvm.so         [.] CardTableExtension::scavenge_contents_parallel(ObjectS
>      0.15%  [kernel]          [k] _raw_spin_lock                                        
>      0.12%  [kernel]          [k] ktime_get_update_offsets                              
>      0.11%  [kernel]          [k] ktime_get                                             
>      0.11%  [kernel]          [k] rcu_check_callbacks                                   
>      0.10%  [kernel]          [k] generic_smp_call_function_interrupt                   
>      0.10%  [kernel]          [k] read_tsc                                              
>      0.10%  [kernel]          [k] clear_page_c                                          
>      0.10%  [kernel]          [k] __do_page_fault                                       
>      0.08%  [kernel]          [k] handle_mm_fault                                       
>      0.08%  libjvm.so         [.] os::javaTimeMillis()                                  
>      0.08%  [kernel]          [k] emulate_vsyscall                                      

Oh, finally a clue: you seem to have vsyscall emulation 
overhead!

Vsyscall emulation is fundamentally page fault driven - which 
might explain why you are seeing page fault overhead. It might 
also interact with other sources of faults - such as numa/core's 
working set probing ...

Many JVMs try to be smart with the vsyscall. As a test, does the 
vsyscall=native boot option change the results/behavior in any 
way?

Stupid question, if you apply the patch attached below and if 
you do page fault profiling while the run is in steady state:

   perf record -e faults -g -a sleep 10

do you see it often coming from the vsyscall page?

Also, this:

   perf stat -e faults -a --repeat 10 sleep 1

should normally report something like this during SPECjbb steady 
state, numa/core:

          warmup:   3,895 faults/sec                ( +- 12.11% )
    steady state:   3,910 faults/sec                ( +-  6.72% )

Which is about 250 faults/sec/CPU - i.e. it should be barely 
recognizable in profiles - let alone be prominent as in yours.

Thanks,

	Ingo

---
 arch/x86/mm/fault.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c
+++ linux/arch/x86/mm/fault.c
@@ -1030,6 +1030,9 @@ __do_page_fault(struct pt_regs *regs, un
 	/* Get the faulting address: */
 	address = read_cr2();
 
+	/* Instrument as early as possible: */
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
 	/*
 	 * Detect and handle instructions that would cause a page fault for
 	 * both a tracked kernel page and a userspace page.
@@ -1107,8 +1110,6 @@ __do_page_fault(struct pt_regs *regs, un
 		}
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	/*
 	 * If we're in an interrupt, have no user context or are running
 	 * in an atomic region then we must not take the fault:

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2012-11-20  9:06 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-19  2:14 [PATCH 00/27] Latest numa/core release, v16 Ingo Molnar
2012-11-19  2:14 ` [PATCH 01/27] mm/generic: Only flush the local TLB in ptep_set_access_flags() Ingo Molnar
2012-11-19  2:14 ` [PATCH 02/27] x86/mm: Only do a local tlb flush " Ingo Molnar
2012-11-19  2:14 ` [PATCH 03/27] x86/mm: Introduce pte_accessible() Ingo Molnar
2012-11-19  2:14 ` [PATCH 04/27] mm: Only flush the TLB when clearing an accessible pte Ingo Molnar
2012-11-19  2:14 ` [PATCH 05/27] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Ingo Molnar
2012-11-19  2:14 ` [PATCH 06/27] mm: Count the number of pages affected in change_protection() Ingo Molnar
2012-11-19  2:14 ` [PATCH 07/27] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users Ingo Molnar
2012-11-19  2:14 ` [PATCH 08/27] sched, numa, mm: Add last_cpu to page flags Ingo Molnar
2012-11-19  2:14 ` [PATCH 09/27] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides Ingo Molnar
2012-11-19  2:14 ` [PATCH 10/27] sched: Make find_busiest_queue() a method Ingo Molnar
2012-11-19  2:14 ` [PATCH 11/27] sched, numa, mm: Describe the NUMA scheduling problem formally Ingo Molnar
2012-11-19  2:14 ` [PATCH 12/27] numa, mm: Support NUMA hinting page faults from gup/gup_fast Ingo Molnar
2012-11-19  2:14 ` [PATCH 13/27] mm/migrate: Introduce migrate_misplaced_page() Ingo Molnar
2012-11-19  2:14 ` [PATCH 14/27] sched, numa, mm, arch: Add variable locality exception Ingo Molnar
2012-11-19  2:14 ` [PATCH 15/27] sched, numa, mm: Add credits for NUMA placement Ingo Molnar
2012-11-19  2:14 ` [PATCH 16/27] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag Ingo Molnar
2012-11-19  2:14 ` [PATCH 17/27] sched, numa, mm: Add the scanning page fault machinery Ingo Molnar
2012-11-19  2:14 ` [PATCH 18/27] sched: Add adaptive NUMA affinity support Ingo Molnar
2012-11-19  2:14 ` [PATCH 19/27] sched: Implement constant, per task Working Set Sampling (WSS) rate Ingo Molnar
2012-11-19  2:14 ` [PATCH 20/27] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges Ingo Molnar
2012-11-19  2:14 ` [PATCH 21/27] sched: Implement slow start for working set sampling Ingo Molnar
2012-11-19  2:14 ` [PATCH 22/27] sched, numa, mm: Interleave shared tasks Ingo Molnar
2012-11-19  2:14 ` [PATCH 23/27] sched: Implement NUMA scanning backoff Ingo Molnar
2012-11-19  2:14 ` [PATCH 24/27] sched: Improve convergence Ingo Molnar
2012-11-19  2:14 ` [PATCH 25/27] sched: Introduce staged average NUMA faults Ingo Molnar
2012-11-19  2:14 ` [PATCH 26/27] sched: Track groups of shared tasks Ingo Molnar
2012-11-19  2:14 ` [PATCH 27/27] sched: Use the best-buddy 'ideal cpu' in balancing decisions Ingo Molnar
2012-11-19 16:29 ` [PATCH 00/27] Latest numa/core release, v16 Mel Gorman
2012-11-19 19:13   ` Ingo Molnar
2012-11-19 21:18     ` Mel Gorman
2012-11-19 22:36       ` Ingo Molnar
2012-11-19 23:00         ` Mel Gorman
2012-11-20  0:41           ` Rik van Riel
2012-11-21 10:58             ` Mel Gorman
2012-11-20  1:02         ` Linus Torvalds
2012-11-20  7:17           ` Ingo Molnar
2012-11-20  7:37             ` David Rientjes
2012-11-20  7:48               ` Ingo Molnar
2012-11-20  8:01               ` Ingo Molnar
2012-11-20  8:11                 ` David Rientjes
2012-11-21 11:14               ` Mel Gorman
2012-11-20 10:20             ` Mel Gorman
2012-11-20 10:47               ` Mel Gorman
2012-11-20 15:29             ` [PATCH] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones Ingo Molnar
2012-11-20 16:09               ` [PATCH, v2] " Ingo Molnar
2012-11-20 16:31                 ` Rik van Riel
2012-11-20 16:52                   ` Ingo Molnar
2012-11-21 12:08                     ` Mel Gorman
2012-11-21  8:12                   ` Ingo Molnar
2012-11-21  2:41                 ` David Rientjes
2012-11-21  9:34                   ` Ingo Molnar
2012-11-21 11:40                 ` Mel Gorman
2012-11-23  1:26                 ` Alex Shi
2012-11-20 17:56               ` numa/core regressions fixed - more testers wanted Ingo Molnar
2012-11-21  1:54                 ` Andrew Theurer
2012-11-21  3:22                   ` Rik van Riel
2012-11-21  4:10                     ` Hugh Dickins
2012-11-21 17:59                       ` Andrew Theurer
2012-11-21 11:52                   ` Mel Gorman
2012-11-21 22:15                     ` Andrew Theurer
2012-11-21  3:33                 ` David Rientjes
2012-11-21  9:38                   ` Ingo Molnar
2012-11-21 11:06                   ` Ingo Molnar
2012-11-21  8:39                 ` Alex Shi
2012-11-22  1:21                   ` Ingo Molnar
2012-11-23 13:31                     ` Ingo Molnar
2012-11-23 15:23                       ` Alex Shi
2012-11-26  2:11                       ` Alex Shi
2012-11-28 14:21                         ` Alex Shi
2012-11-20 10:40         ` [PATCH 00/27] Latest numa/core release, v16 Ingo Molnar
2012-11-20 11:40           ` Mel Gorman
2012-11-21 10:38     ` Mel Gorman
2012-11-21 19:37       ` Andrea Arcangeli
2012-11-21 19:56         ` Mel Gorman
2012-11-19 20:07   ` Ingo Molnar
2012-11-19 21:37     ` Mel Gorman
2012-11-20  0:50   ` David Rientjes
2012-11-20  1:05     ` David Rientjes
2012-11-20  6:00       ` Ingo Molnar
2012-11-20  6:20         ` David Rientjes
2012-11-20  7:44           ` Ingo Molnar
2012-11-20  7:48             ` Paul Turner
2012-11-20  8:20             ` David Rientjes
2012-11-20  9:06               ` Ingo Molnar [this message]
2012-11-20  9:41                 ` [patch] x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it Ingo Molnar
2012-11-20 23:01                   ` Andy Lutomirski
2012-11-21  0:43                   ` David Rientjes
2012-11-20 12:02                 ` [PATCH] x86/mm: Don't flush the TLB on #WP pmd fixups Ingo Molnar
2012-11-20 12:31                   ` Ingo Molnar
2012-11-21 11:47                     ` Mel Gorman
2012-11-21  1:22                   ` David Rientjes
2012-11-21 17:02                 ` [PATCH 00/27] Latest numa/core release, v16 Linus Torvalds
2012-11-21 17:10                   ` Ingo Molnar
2012-11-21 17:20                     ` Ingo Molnar
2012-11-22  4:31                       ` David Rientjes
2012-11-21 17:40                     ` Ingo Molnar
2012-11-21 22:04                     ` Linus Torvalds
2012-11-21 22:46                       ` Ingo Molnar
2012-11-21 17:45                   ` Rik van Riel
2012-11-21 18:04                   ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121120090637.GA14873@gmail.com \
    --to=mingo@kernel.org \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=pjt@google.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).