linux-mm.kvack.org archive mirror
* NUMA? bisected performance regression 3.11->3.12
@ 2013-11-21 22:57 Dave Hansen
  2013-11-22  5:22 ` Johannes Weiner
  2013-11-26 10:32 ` Mel Gorman
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Hansen @ 2013-11-21 22:57 UTC (permalink / raw)
  To: Johannes Weiner, Linus Torvalds
  Cc: Linux-MM, Mel Gorman, Rik van Riel, Kevin Hilman,
	Andrea Arcangeli, Paul Bolle, Zlatko Calusic, Andrew Morton,
	Tim Chen, Andi Kleen

Hey Johannes,

I'm running an open/close microbenchmark from the will-it-scale set:
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/open1.c

I was seeing some weird symptoms on 3.12 vs. 3.11.  The throughput in
that test dropped from 50 million to 35 million.

The profiles show an increase in cpu time in _raw_spin_lock_irq.  The
profiles pointed to slub code that hasn't been touched in quite a while.
I bisected it down to:

81c0a2bb515fd4daae8cab64352877480792b515 is the first bad commit
commit 81c0a2bb515fd4daae8cab64352877480792b515
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed Sep 11 14:20:47 2013 -0700

Which also seems a bit weird, but I've tested with this and its
preceding commit enough times to be fairly sure that I did it right.

__slab_free() and free_one_page() both seem to be spending more time
spinning on their respective spinlocks, even though the throughput went
down and we should have been doing fewer actual allocations/frees.  The
best explanation for this would be that CPUs are going after, and
contending for, remote cachelines more often once this patch is applied.

Any ideas?

It's an 8-socket/160-thread (one NUMA node per socket) system that is not
under memory pressure during the test.  The latencies are also such that
vm.zone_reclaim_mode=0.

Raw perf profiles and .config are in here:
http://www.sr71.net/~dave/intel/201311-wisregress0/

Here's a chunk of the 'perf diff':
>     17.65%   +3.47%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave           
>     13.80%   -0.31%  [kernel.kallsyms]  [k] _raw_spin_lock                   
>      7.21%   -0.51%  [unknown]          [.] 0x00007f7849058640               
>      3.43%   +0.15%  [kernel.kallsyms]  [k] setup_object                     
>      2.99%   -0.31%  [kernel.kallsyms]  [k] file_free_rcu                    
>      2.71%   -0.13%  [kernel.kallsyms]  [k] rcu_process_callbacks            
>      2.26%   -0.09%  [kernel.kallsyms]  [k] get_empty_filp                   
>      2.06%   -0.09%  [kernel.kallsyms]  [k] kmem_cache_alloc                 
>      1.65%   -0.08%  [kernel.kallsyms]  [k] link_path_walk                   
>      1.53%   -0.08%  [kernel.kallsyms]  [k] memset                           
>      1.46%   -0.09%  [kernel.kallsyms]  [k] do_dentry_open                   
>      1.44%   -0.04%  [kernel.kallsyms]  [k] __d_lookup_rcu                   
>      1.27%   -0.04%  [kernel.kallsyms]  [k] do_last                          
>      1.18%   -0.04%  [kernel.kallsyms]  [k] ext4_release_file                
>      1.16%   -0.04%  [kernel.kallsyms]  [k] __call_rcu.constprop.11          

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org


* Re: NUMA? bisected performance regression 3.11->3.12
  2013-11-21 22:57 NUMA? bisected performance regression 3.11->3.12 Dave Hansen
@ 2013-11-22  5:22 ` Johannes Weiner
  2013-11-22  6:18   ` Dave Hansen
  2013-11-26 10:32 ` Mel Gorman
  1 sibling, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2013-11-22  5:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Linux-MM, Mel Gorman, Rik van Riel, Kevin Hilman,
	Andrea Arcangeli, Paul Bolle, Zlatko Calusic, Andrew Morton,
	Tim Chen, Andi Kleen

Hi Dave,

On Thu, Nov 21, 2013 at 02:57:18PM -0800, Dave Hansen wrote:
> Hey Johannes,
> 
> I'm running an open/close microbenchmark from the will-it-scale set:
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/open1.c
> 
> I was seeing some weird symptoms on 3.12 vs. 3.11.  The throughput in
> that test dropped from 50 million to 35 million.
> 
> The profiles show an increase in cpu time in _raw_spin_lock_irq.  The
> profiles pointed to slub code that hasn't been touched in quite a while.
>  I bisected it down to:
> 
> 81c0a2bb515fd4daae8cab64352877480792b515 is the first bad commit
> commit 81c0a2bb515fd4daae8cab64352877480792b515
> Author: Johannes Weiner <hannes@cmpxchg.org>
> Date:   Wed Sep 11 14:20:47 2013 -0700
> 
> Which also seems a bit weird, but I've tested with this and its
> preceding commit enough times to be fairly sure that I did it right.
> 
> __slab_free() and free_one_page() both seem to be spending more time
> spinning on their respective spinlocks, even though the throughput went
> down and we should have been doing fewer actual allocations/frees.  The
> best explanation for this would be that CPUs are going after, and
> contending for, remote cachelines more often once this patch is applied.
> 
> Any ideas?
> 
> It's an 8-socket/160-thread (one NUMA node per socket) system that is not
> under memory pressure during the test.  The latencies are also such that
> vm.zone_reclaim_mode=0.

The change will definitely spread allocations out to all nodes then
and it's plausible that the remote references will hurt kernel object
allocations in a tight loop.  Just to confirm, could you rerun the
test with zone_reclaim_mode enabled to make the allocator stay in the
local zones?

The fairness code was written for reclaimable memory, which is
longer-lived, and the only memory where fairness matters.  I might have
to bypass it for unreclaimable allocations...

> Raw perf profiles and .config are in here:
> http://www.sr71.net/~dave/intel/201311-wisregress0/
> 
> Here's a chunk of the 'perf diff':
> >     17.65%   +3.47%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave           
> >     13.80%   -0.31%  [kernel.kallsyms]  [k] _raw_spin_lock                   
> >      7.21%   -0.51%  [unknown]          [.] 0x00007f7849058640               
> >      3.43%   +0.15%  [kernel.kallsyms]  [k] setup_object                     
> >      2.99%   -0.31%  [kernel.kallsyms]  [k] file_free_rcu                    
> >      2.71%   -0.13%  [kernel.kallsyms]  [k] rcu_process_callbacks            
> >      2.26%   -0.09%  [kernel.kallsyms]  [k] get_empty_filp                   
> >      2.06%   -0.09%  [kernel.kallsyms]  [k] kmem_cache_alloc                 
> >      1.65%   -0.08%  [kernel.kallsyms]  [k] link_path_walk                   
> >      1.53%   -0.08%  [kernel.kallsyms]  [k] memset                           
> >      1.46%   -0.09%  [kernel.kallsyms]  [k] do_dentry_open                   
> >      1.44%   -0.04%  [kernel.kallsyms]  [k] __d_lookup_rcu                   
> >      1.27%   -0.04%  [kernel.kallsyms]  [k] do_last                          
> >      1.18%   -0.04%  [kernel.kallsyms]  [k] ext4_release_file                
> >      1.16%   -0.04%  [kernel.kallsyms]  [k] __call_rcu.constprop.11          

Thanks for the detailed report.



* Re: NUMA? bisected performance regression 3.11->3.12
  2013-11-22  5:22 ` Johannes Weiner
@ 2013-11-22  6:18   ` Dave Hansen
  2013-11-22  6:38     ` Johannes Weiner
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2013-11-22  6:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Linux-MM, Mel Gorman, Rik van Riel, Kevin Hilman,
	Andrea Arcangeli, Paul Bolle, Zlatko Calusic, Andrew Morton,
	Tim Chen, Andi Kleen

On 11/21/2013 09:22 PM, Johannes Weiner wrote:
>> > It's an 8-socket/160-thread (one NUMA node per socket) system that is not
>> > under memory pressure during the test.  The latencies are also such that
>> > vm.zone_reclaim_mode=0.
> The change will definitely spread allocations out to all nodes then
> and it's plausible that the remote references will hurt kernel object
> allocations in a tight loop.  Just to confirm, could you rerun the
> test with zone_reclaim_mode enabled to make the allocator stay in the
> local zones?

Yeah, setting vm.zone_reclaim_mode=1 fixes it pretty instantaneously.

For what it's worth, I'm pretty convinced that the numbers folks put in
the SLIT tables are, at best, horribly inconsistent from system to
system.  At worst, they're utter fabrications not linked at all to the
reality of the actual latencies.



* Re: NUMA? bisected performance regression 3.11->3.12
  2013-11-22  6:18   ` Dave Hansen
@ 2013-11-22  6:38     ` Johannes Weiner
  2013-11-22 16:57       ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2013-11-22  6:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Linux-MM, Mel Gorman, Rik van Riel, Kevin Hilman,
	Andrea Arcangeli, Paul Bolle, Zlatko Calusic, Andrew Morton,
	Tim Chen, Andi Kleen

On Thu, Nov 21, 2013 at 10:18:44PM -0800, Dave Hansen wrote:
> On 11/21/2013 09:22 PM, Johannes Weiner wrote:
> >> > It's an 8-socket/160-thread (one NUMA node per socket) system that is not
> >> > under memory pressure during the test.  The latencies are also such that
> >> > vm.zone_reclaim_mode=0.
> > The change will definitely spread allocations out to all nodes then
> > and it's plausible that the remote references will hurt kernel object
> > allocations in a tight loop.  Just to confirm, could you rerun the
> > test with zone_reclaim_mode enabled to make the allocator stay in the
> > local zones?
> 
> Yeah, setting vm.zone_reclaim_mode=1 fixes it pretty instantaneously.
> 
> For what it's worth, I'm pretty convinced that the numbers folks put in
> the SLIT tables are, at best, horribly inconsistent from system to
> system.  At worst, they're utter fabrications not linked at all to the
> reality of the actual latencies.

You mean the reported distances should probably be bigger on this
particular machine?

But even when correct, zone_reclaim_mode might not be the best
predictor.  Just because it's not yet worth investing direct reclaim
effort to stay local does not mean that remote references are free.

I'm currently running some tests with the below draft to see if this
would still leave us with enough fairness.  Does the patch restore
performance even with zone_reclaim_mode disabled?

---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..c77cead 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1926,7 +1926,8 @@ zonelist_scan:
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & GFP_MOVABLE_MASK)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (zone_reclaim_mode &&



* Re: NUMA? bisected performance regression 3.11->3.12
  2013-11-22  6:38     ` Johannes Weiner
@ 2013-11-22 16:57       ` Dave Hansen
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-11-22 16:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Linux-MM, Mel Gorman, Rik van Riel, Kevin Hilman,
	Andrea Arcangeli, Paul Bolle, Zlatko Calusic, Andrew Morton,
	Tim Chen, Andi Kleen

On 11/21/2013 10:38 PM, Johannes Weiner wrote:
> On Thu, Nov 21, 2013 at 10:18:44PM -0800, Dave Hansen wrote:
>> For what it's worth, I'm pretty convinced that the numbers folks put in
>> the SLIT tables are, at best, horribly inconsistent from system to
>> system.  At worst, they're utter fabrications not linked at all to the
>> reality of the actual latencies.
> 
> You mean the reported distances should probably be bigger on this
> particular machine?

Yeah, or smaller on the others that made us switch zone_reclaim_mode at
the place where we do.

> But even when correct, zone_reclaim_mode might not be the best
> predictor.  Just because it's not yet worth investing direct reclaim
> effort to stay local does not mean that remote references are free.
> 
> I'm currently running some tests with the below draft to see if this
> would still leave us with enough fairness.  Does the patch restore
> performance even with zone_reclaim_mode disabled?

Yeah, that at least works for the one test where it's been causing the
most trouble.



* Re: NUMA? bisected performance regression 3.11->3.12
  2013-11-21 22:57 NUMA? bisected performance regression 3.11->3.12 Dave Hansen
  2013-11-22  5:22 ` Johannes Weiner
@ 2013-11-26 10:32 ` Mel Gorman
  2013-12-06 17:43   ` Dave Hansen
  1 sibling, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2013-11-26 10:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, Linus Torvalds, Linux-MM, Rik van Riel,
	Kevin Hilman, Andrea Arcangeli, Paul Bolle, Zlatko Calusic,
	Andrew Morton, Tim Chen, Andi Kleen, Vlastimil Babka

On Thu, Nov 21, 2013 at 02:57:18PM -0800, Dave Hansen wrote:
> Hey Johannes,
> 
> I'm running an open/close microbenchmark from the will-it-scale set:
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/open1.c
> 
> I was seeing some weird symptoms on 3.12 vs. 3.11.  The throughput in
> that test dropped from 50 million to 35 million.
> 
> The profiles show an increase in cpu time in _raw_spin_lock_irq.  The
> profiles pointed to slub code that hasn't been touched in quite a while.
>  I bisected it down to:
> 

Dave, do you mind retesting this against "[RFC PATCH 0/5] Memory compaction
efficiency improvements" please? I have not finished reviewing the series
yet but patch 3 mentions lower allocation success rates with Johannes'
patch and notes that it is unlikely to be a bug with the patch itself.

-- 
Mel Gorman
SUSE Labs



* Re: NUMA? bisected performance regression 3.11->3.12
  2013-11-26 10:32 ` Mel Gorman
@ 2013-12-06 17:43   ` Dave Hansen
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2013-12-06 17:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Linus Torvalds, Linux-MM, Rik van Riel,
	Kevin Hilman, Andrea Arcangeli, Paul Bolle, Zlatko Calusic,
	Andrew Morton, Tim Chen, Andi Kleen, Vlastimil Babka

On 11/26/2013 02:32 AM, Mel Gorman wrote:
> On Thu, Nov 21, 2013 at 02:57:18PM -0800, Dave Hansen wrote:
>> I'm running an open/close microbenchmark from the will-it-scale set:
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/open1.c
>>
>> I was seeing some weird symptoms on 3.12 vs. 3.11.  The throughput in
>> that test dropped from 50 million to 35 million.
>>
>> The profiles show an increase in cpu time in _raw_spin_lock_irq.  The
>> profiles pointed to slub code that hasn't been touched in quite a while.
>>  I bisected it down to:
> 
> Dave, do you mind retesting this against "[RFC PATCH 0/5] Memory compaction
> efficiency improvements" please? I have not finished reviewing the series
> yet but patch 3 mentions lower allocation success rates with Johannes'
> patch and notes that it is unlikely to be a bug with the patch itself.

Sorry for the delay.  I lost the monster box for a few days...

That series didn't look to have much of an effect.  Before/after numbers
coming out of that open1 test were both ~35M.  If it helped, it was in
the noise.




Thread overview: 7+ messages
2013-11-21 22:57 NUMA? bisected performance regression 3.11->3.12 Dave Hansen
2013-11-22  5:22 ` Johannes Weiner
2013-11-22  6:18   ` Dave Hansen
2013-11-22  6:38     ` Johannes Weiner
2013-11-22 16:57       ` Dave Hansen
2013-11-26 10:32 ` Mel Gorman
2013-12-06 17:43   ` Dave Hansen
