From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: tree rcu: call_rcu scalability problem?
Date: Wed, 2 Sep 2009 08:19:27 -0700
Message-ID: <20090902151927.GA6774@linux.vnet.ibm.com>
In-Reply-To: <20090902122756.GC12251@wotan.suse.de>

On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > Hi Paul,
> > 
> > I'm testing out scalability of some vfs code paths, and I'm seeing
> > a problem with call_rcu. This is a 2s8c opteron system, so nothing
> > crazy.
> > 
> > I'll show you the profile results for 1-8 threads:
> > 
> > 1:
> >  29768 total                                      0.0076
> >  15550 default_idle                              48.5938
> >   1340 __d_lookup                                 3.6413
> >    954 __link_path_walk                           0.2559
> >    816 system_call_after_swapgs                   8.0792
> >    680 kmem_cache_alloc                           1.4167
> >    669 dput                                       1.1946
> >    591 __call_rcu                                 2.0521
> > 
> > 2:
> >  56733 total                                      0.0145
> >  20074 default_idle                              62.7313
> >   3075 __call_rcu                                10.6771
> >   2650 __d_lookup                                 7.2011
> >   2019 dput                                       3.6054
> > 
> > 4:
> >  98889 total                                      0.0253
> >  21759 default_idle                              67.9969
> >  10994 __call_rcu                                38.1736
> >   5185 __d_lookup                                14.0897
> >   4475 dput                                       7.9911

Four threads run on one socket but eight threads run on two sockets,
I take it?

> > 8:
> > 170391 total                                      0.0437
> >  31815 __call_rcu                               110.4688
> >  12958 dput                                      23.1393
> >  10417 __d_lookup                                28.3071
> > 
> > Of course there are other scalability factors involved too, but
> > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > of work from 1-8 threads, or a factor of 6.7 slowdown.
> > 
> > This is with tree RCU.
> 
> It seems like nearly 2/3 of the cost is here:
>         /* Add the callback to our list. */
>         *rdp->nxttail[RCU_NEXT_TAIL] = head; <<<
>         rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

Hmmm...  That certainly is not the first line of code in call_rcu() that
would come to mind...
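
For reference, the quoted lines are the usual tail-pointer append to a
singly linked list.  A minimal user-space sketch, assuming a single tail
pointer rather than the nxttail[] array in struct rcu_data; all demo_*
names are hypothetical and not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head: just the link field. */
struct demo_head {
	struct demo_head *next;
};

/* List with a tail pointer that points at the last ->next slot. */
struct demo_list {
	struct demo_head *head;
	struct demo_head **tail;
};

/* An empty list has its tail pointing at the head pointer itself. */
void demo_list_init(struct demo_list *l)
{
	assert(l != NULL);
	l->head = NULL;
	l->tail = &l->head;
}

/* Append h in O(1): one load of the tail pointer, two stores. */
void demo_enqueue(struct demo_list *l, struct demo_head *h)
{
	assert(l != NULL && h != NULL);
	h->next = NULL;
	*l->tail = h;		/* store through the loaded tail pointer */
	l->tail = &h->next;	/* advance the tail to the new last slot */
}
```

Nothing in the append itself touches memory shared with other CPUs,
provided the list is strictly per-CPU.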

> That is, in loading the pointer to the next tail pointer, if I'm reading
> the profile correctly. Can't see why that should be a problem though...

The usual diagnosis would be false sharing.
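
A rough user-space illustration of what false sharing means for layout,
assuming 64-byte cache lines; the demo_* names and both structures are
hypothetical, not anything from the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64	/* assumed cache-line size in bytes */

/* Two counters written by different CPUs, packed next to each other. */
struct demo_unpadded {
	long cpu0_counter;
	long cpu1_counter;
};

/* Same counters, with padding pushing the second onto the next line. */
struct demo_padded {
	long cpu0_counter;
	char pad[DEMO_CACHE_LINE - sizeof(long)];
	long cpu1_counter;
};

static_assert(sizeof(long) < DEMO_CACHE_LINE, "pad size must be positive");

/* Line-aligned instances so that the layout is deterministic. */
_Alignas(DEMO_CACHE_LINE) struct demo_unpadded demo_u;
_Alignas(DEMO_CACHE_LINE) struct demo_padded demo_p;

/* True if both addresses fall within the same cache line. */
bool demo_same_cache_line(const void *a, const void *b)
{
	return (uintptr_t)a / DEMO_CACHE_LINE ==
	       (uintptr_t)b / DEMO_CACHE_LINE;
}
```

In the unpadded case a write to one counter invalidates the line holding
the other, even though the two CPUs never touch the same variable.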

Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
expect interference from force_quiescent_state(), except that it should
run only every few clock ticks.  So this seems quite unlikely.

Could you please try padding the beginning and end of struct rcu_data
with a few hundred bytes and rerunning?  Just in case there is a shared
per-CPU variable either before or after rcu_data in your memory layout?
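
Something along the following lines is what I have in mind, shown here
as a user-space sketch.  The structure is a hypothetical stand-in with
made-up fields, not the real struct rcu_data, and the 256-byte pad size
is only an assumed value for "a few hundred bytes":

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAD_BYTES 256	/* assumed pad size, several cache lines */

/* Hypothetical stand-in for struct rcu_data; NOT the kernel's layout. */
struct demo_rcu_data {
	char pad_front[DEMO_PAD_BYTES];	/* diagnostic padding, front */
	long completed;
	long gpnum;
	void *nxtlist;
	void **nxttail[4];
	long qlen;
	char pad_back[DEMO_PAD_BYTES];	/* diagnostic padding, back */
};

static_assert(offsetof(struct demo_rcu_data, completed) == DEMO_PAD_BYTES,
	      "front pad must precede the first real field");

/* Bytes between the start of the struct and the first real field. */
size_t demo_front_gap(void)
{
	return offsetof(struct demo_rcu_data, completed);
}

/* Bytes between the end of the last real field and the struct's end. */
size_t demo_back_gap(void)
{
	return sizeof(struct demo_rcu_data) -
	       (offsetof(struct demo_rcu_data, qlen) + sizeof(long));
}
```

With the pads in place, whatever the linker puts before or after the
structure cannot land in the same cache line as any of the real fields.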

							Thanx, Paul

> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
>    697  0.2172 :ffffffff8107dee0:       push   %r12
>    228  0.0710 :ffffffff8107dee2:       push   %rbp
>    133  0.0414 :ffffffff8107dee3:       mov    %rdx,%rbp
>    918  0.2860 :ffffffff8107dee6:       push   %rbx
>    316  0.0985 :ffffffff8107dee7:       mov    %rsi,0x8(%rdi)
>    257  0.0801 :ffffffff8107deeb:       movq   $0x0,(%rdi)
>   1660  0.5172 :ffffffff8107def2:       mfence
>  27730  8.6394 :ffffffff8107def5:       pushfq
>  13153  4.0979 :ffffffff8107def6:       pop    %r12
>    903  0.2813 :ffffffff8107def8:       cli
>   2562  0.7982 :ffffffff8107def9:       mov    %gs:0xde68,%eax
>   1784  0.5558 :ffffffff8107df01:       cltq
>                :ffffffff8107df03:       mov    0x60(%rdx,%rax,8),%rbx
>                :ffffffff8107df08:       pushfq
>   3494  1.0886 :ffffffff8107df09:       pop    %rdx
>    896  0.2792 :ffffffff8107df0a:       cli
>   2655  0.8272 :ffffffff8107df0b:       mov    0xd0(%rbp),%rcx
>   1800  0.5608 :ffffffff8107df12:       cmp    (%rbx),%rcx
>     21  0.0065 :ffffffff8107df15:       je     ffffffff8107df32 <__call_rcu+0x52
>                :ffffffff8107df17:       mov    0x40(%rbx),%rax
>     81  0.0252 :ffffffff8107df1b:       mov    %rcx,(%rbx)
>      3 9.3e-04 :ffffffff8107df1e:       mov    %rax,0x38(%rbx)
>                :ffffffff8107df22:       mov    0x48(%rbx),%rax
>                :ffffffff8107df26:       mov    %rax,0x40(%rbx)
>                :ffffffff8107df2a:       mov    0x50(%rbx),%rax
>                :ffffffff8107df2e:       mov    %rax,0x48(%rbx)
>                :ffffffff8107df32:       push   %rdx
>   1194  0.3720 :ffffffff8107df33:       popfq
>   9518  2.9654 :ffffffff8107df34:       pushfq
>   4179  1.3020 :ffffffff8107df35:       pop    %rdx
>   1277  0.3979 :ffffffff8107df36:       cli
>   2546  0.7932 :ffffffff8107df37:       mov    0xc8(%rbp),%rax
>   1748  0.5446 :ffffffff8107df3e:       cmp    %rax,0x8(%rbx)
>      5  0.0016 :ffffffff8107df42:       je     ffffffff8107df57 <__call_rcu+0x77
>                :ffffffff8107df44:       movb   $0x1,0x19(%rbx)
>      2 6.2e-04 :ffffffff8107df48:       movb   $0x0,0x18(%rbx)
>                :ffffffff8107df4c:       mov    0xc8(%rbp),%rax
>                :ffffffff8107df53:       mov    %rax,0x8(%rbx)
>    921  0.2869 :ffffffff8107df57:       push   %rdx
>    151  0.0470 :ffffffff8107df58:       popfq
> 183507 57.1725 :ffffffff8107df59:       mov    0x50(%rbx),%rax
>    995  0.3100 :ffffffff8107df5d:       mov    %rdi,(%rax)
>      2 6.2e-04 :ffffffff8107df60:       mov    %rdi,0x50(%rbx)
>     18  0.0056 :ffffffff8107df64:       mov    0xd0(%rbp),%rdx
>    940  0.2929 :ffffffff8107df6b:       mov    0xc8(%rbp),%rax
>     15  0.0047 :ffffffff8107df72:       cmp    %rax,%rdx
>      1 3.1e-04 :ffffffff8107df75:       je     ffffffff8107dfb0 <__call_rcu+0xd0
>    787  0.2452 :ffffffff8107df77:       mov    0x58(%rbx),%rax
>     58  0.0181 :ffffffff8107df7b:       inc    %rax
>      2 6.2e-04 :ffffffff8107df7e:       mov    %rax,0x58(%rbx)
>   1679  0.5231 :ffffffff8107df82:       movslq 0x4988fb(%rip),%rdx        # ffff
>     40  0.0125 :ffffffff8107df89:       cmp    %rdx,%rax
>      5  0.0016 :ffffffff8107df8c:       jg     ffffffff8107dfd7 <__call_rcu+0xf7
>    588  0.1832 :ffffffff8107df8e:       mov    0xe0(%rbp),%rdx
>     84  0.0262 :ffffffff8107df95:       mov    0x51f924(%rip),%rax        # ffff
>      5  0.0016 :ffffffff8107df9c:       cmp    %rax,%rdx
>    505  0.1573 :ffffffff8107df9f:       js     ffffffff8107dfc8 <__call_rcu+0xe8
>  17580  5.4771 :ffffffff8107dfa1:       push   %r12
>   1671  0.5206 :ffffffff8107dfa3:       popfq
>  24201  7.5399 :ffffffff8107dfa4:       pop    %rbx
>   1367  0.4259 :ffffffff8107dfa5:       pop    %rbp
>    377  0.1175 :ffffffff8107dfa6:       pop    %r12
>                :ffffffff8107dfa8:       retq
>                :ffffffff8107dfa9:       nopl   0x0(%rax)
>                :ffffffff8107dfb0:       mov    %rbp,%rdi
>                :ffffffff8107dfb3:       callq  ffffffff813be930 <_spin_lock_irqs
>     12  0.0037 :ffffffff8107dfb8:       mov    %rbp,%rdi
>                :ffffffff8107dfbb:       mov    %rax,%rsi
>                :ffffffff8107dfbe:       callq  ffffffff8107d8e0 <rcu_start_gp>
>                :ffffffff8107dfc3:       jmp    ffffffff8107df77 <__call_rcu+0x97
>                :ffffffff8107dfc5:       nopl   (%rax)
>                :ffffffff8107dfc8:       mov    $0x1,%esi
>     10  0.0031 :ffffffff8107dfcd:       mov    %rbp,%rdi
>                :ffffffff8107dfd0:       callq  ffffffff8107dd50 <force_quiescent
>      1 3.1e-04 :ffffffff8107dfd5:       jmp    ffffffff8107dfa1 <__call_rcu+0xc1
>    451  0.1405 :ffffffff8107dfd7:       mov    $0x7fffffffffffffff,%rdx
>    411  0.1280 :ffffffff8107dfe1:       xor    %esi,%esi
>                :ffffffff8107dfe3:       mov    %rbp,%rdi
>                :ffffffff8107dfe6:       mov    %rdx,0x60(%rbx)
>    317  0.0988 :ffffffff8107dfea:       callq  ffffffff8107dd50 <force_quiescent
>   4510  1.4051 :ffffffff8107dfef:       jmp    ffffffff8107dfa1 <__call_rcu+0xc1
>                :ffffffff8107dff1:       nopw   %cs:0x0(%rax,%rax,1)
> 
> 
