* tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-02  9:48 UTC
To: Paul McKenney, Linux Kernel Mailing List

Hi Paul,

I'm testing out the scalability of some vfs code paths, and I'm seeing
a problem with call_rcu. This is a 2s8c Opteron system, so nothing
crazy.

I'll show you the profile results for 1-8 threads:

1:
 29768 total                            0.0076
 15550 default_idle                    48.5938
  1340 __d_lookup                       3.6413
   954 __link_path_walk                 0.2559
   816 system_call_after_swapgs         8.0792
   680 kmem_cache_alloc                 1.4167
   669 dput                             1.1946
   591 __call_rcu                       2.0521

2:
 56733 total                            0.0145
 20074 default_idle                    62.7313
  3075 __call_rcu                      10.6771
  2650 __d_lookup                       7.2011
  2019 dput                             3.6054

4:
 98889 total                            0.0253
 21759 default_idle                    67.9969
 10994 __call_rcu                      38.1736
  5185 __d_lookup                      14.0897
  4475 dput                             7.9911

8:
170391 total                            0.0437
 31815 __call_rcu                     110.4688
 12958 dput                            23.1393
 10417 __d_lookup                      28.3071

Of course there are other scalability factors involved too, but
__call_rcu is taking 54 times more CPU to do 8 times the amount
of work from 1 to 8 threads, or a factor of 6.7 slowdown.

This is with tree RCU:

#
# RCU Subsystem
#
# CONFIG_CLASSIC_RCU is not set
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set

Testing classic RCU showed that its call_rcu seemed to scale better,
only getting up to about 10 000 profile samples at 8 threads.

You'd need my vfs scalability patches to reproduce this exactly, but
the workload is just a close(open()) loop, which RCU-frees a lot of
file structs. I can certainly get more detailed profiles or test
patches for you, though, if you have any ideas.

Thanks,
Nick
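
(For concreteness, the benchmark loop has roughly the shape sketched
below. The file name, iteration count, and error handling are
illustrative stand-ins; the real harness, including CPU binding and
per-process working directories in tmpfs, is part of the vfs
scalability patch set and is not shown.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Assumed: the file already exists in this process's cwd. */
	const char *path = argc > 1 ? argv[1] : "file0";
	long i;

	for (i = 0; i < 10 * 1000 * 1000; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);	/* last reference: fput() frees the
				 * struct file via call_rcu() */
	}
	return 0;
}

Each close() drops the last reference to the struct file, which is
freed through call_rcu(); that is why __call_rcu climbs the profile as
the thread count grows.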
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-02 12:27 UTC
To: Paul McKenney, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> Hi Paul,
>
> I'm testing out the scalability of some vfs code paths, and I'm seeing
> a problem with call_rcu. This is a 2s8c Opteron system, so nothing
> crazy.
>
> I'll show you the profile results for 1-8 threads:
>
> [profiles snipped]
>
> Of course there are other scalability factors involved too, but
> __call_rcu is taking 54 times more CPU to do 8 times the amount
> of work from 1 to 8 threads, or a factor of 6.7 slowdown.
>
> This is with tree RCU.

It seems like nearly 2/3 of the cost is here:

	/* Add the callback to our list. */
	*rdp->nxttail[RCU_NEXT_TAIL] = head;        <<<
	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

In loading the pointer to the next tail pointer, if I'm reading the
profile correctly. I can't see why that should be a problem, though...

ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
   697  0.2172 :ffffffff8107dee0:	push   %r12
   228  0.0710 :ffffffff8107dee2:	push   %rbp
   133  0.0414 :ffffffff8107dee3:	mov    %rdx,%rbp
   918  0.2860 :ffffffff8107dee6:	push   %rbx
   316  0.0985 :ffffffff8107dee7:	mov    %rsi,0x8(%rdi)
   257  0.0801 :ffffffff8107deeb:	movq   $0x0,(%rdi)
  1660  0.5172 :ffffffff8107def2:	mfence
 27730  8.6394 :ffffffff8107def5:	pushfq
 13153  4.0979 :ffffffff8107def6:	pop    %r12
   903  0.2813 :ffffffff8107def8:	cli
  2562  0.7982 :ffffffff8107def9:	mov    %gs:0xde68,%eax
  1784  0.5558 :ffffffff8107df01:	cltq
               :ffffffff8107df03:	mov    0x60(%rdx,%rax,8),%rbx
               :ffffffff8107df08:	pushfq
  3494  1.0886 :ffffffff8107df09:	pop    %rdx
   896  0.2792 :ffffffff8107df0a:	cli
  2655  0.8272 :ffffffff8107df0b:	mov    0xd0(%rbp),%rcx
  1800  0.5608 :ffffffff8107df12:	cmp    (%rbx),%rcx
    21  0.0065 :ffffffff8107df15:	je     ffffffff8107df32 <__call_rcu+0x52>
               :ffffffff8107df17:	mov    0x40(%rbx),%rax
    81  0.0252 :ffffffff8107df1b:	mov    %rcx,(%rbx)
     3 9.3e-04 :ffffffff8107df1e:	mov    %rax,0x38(%rbx)
               :ffffffff8107df22:	mov    0x48(%rbx),%rax
               :ffffffff8107df26:	mov    %rax,0x40(%rbx)
               :ffffffff8107df2a:	mov    0x50(%rbx),%rax
               :ffffffff8107df2e:	mov    %rax,0x48(%rbx)
               :ffffffff8107df32:	push   %rdx
  1194  0.3720 :ffffffff8107df33:	popfq
  9518  2.9654 :ffffffff8107df34:	pushfq
  4179  1.3020 :ffffffff8107df35:	pop    %rdx
  1277  0.3979 :ffffffff8107df36:	cli
  2546  0.7932 :ffffffff8107df37:	mov    0xc8(%rbp),%rax
  1748  0.5446 :ffffffff8107df3e:	cmp    %rax,0x8(%rbx)
     5  0.0016 :ffffffff8107df42:	je     ffffffff8107df57 <__call_rcu+0x77>
               :ffffffff8107df44:	movb   $0x1,0x19(%rbx)
     2 6.2e-04 :ffffffff8107df48:	movb   $0x0,0x18(%rbx)
               :ffffffff8107df4c:	mov    0xc8(%rbp),%rax
               :ffffffff8107df53:	mov    %rax,0x8(%rbx)
   921  0.2869 :ffffffff8107df57:	push   %rdx
   151  0.0470 :ffffffff8107df58:	popfq
183507 57.1725 :ffffffff8107df59:	mov    0x50(%rbx),%rax
   995  0.3100 :ffffffff8107df5d:	mov    %rdi,(%rax)
     2 6.2e-04 :ffffffff8107df60:	mov    %rdi,0x50(%rbx)
    18  0.0056 :ffffffff8107df64:	mov    0xd0(%rbp),%rdx
   940  0.2929 :ffffffff8107df6b:	mov    0xc8(%rbp),%rax
    15  0.0047 :ffffffff8107df72:	cmp    %rax,%rdx
     1 3.1e-04 :ffffffff8107df75:	je     ffffffff8107dfb0 <__call_rcu+0xd0>
   787  0.2452 :ffffffff8107df77:	mov    0x58(%rbx),%rax
    58  0.0181 :ffffffff8107df7b:	inc    %rax
     2 6.2e-04 :ffffffff8107df7e:	mov    %rax,0x58(%rbx)
  1679  0.5231 :ffffffff8107df82:	movslq 0x4988fb(%rip),%rdx        # ffff
    40  0.0125 :ffffffff8107df89:	cmp    %rdx,%rax
     5  0.0016 :ffffffff8107df8c:	jg     ffffffff8107dfd7 <__call_rcu+0xf7>
   588  0.1832 :ffffffff8107df8e:	mov    0xe0(%rbp),%rdx
    84  0.0262 :ffffffff8107df95:	mov    0x51f924(%rip),%rax        # ffff
     5  0.0016 :ffffffff8107df9c:	cmp    %rax,%rdx
   505  0.1573 :ffffffff8107df9f:	js     ffffffff8107dfc8 <__call_rcu+0xe8>
 17580  5.4771 :ffffffff8107dfa1:	push   %r12
  1671  0.5206 :ffffffff8107dfa3:	popfq
 24201  7.5399 :ffffffff8107dfa4:	pop    %rbx
  1367  0.4259 :ffffffff8107dfa5:	pop    %rbp
   377  0.1175 :ffffffff8107dfa6:	pop    %r12
               :ffffffff8107dfa8:	retq
               :ffffffff8107dfa9:	nopl   0x0(%rax)
               :ffffffff8107dfb0:	mov    %rbp,%rdi
               :ffffffff8107dfb3:	callq  ffffffff813be930 <_spin_lock_irqsave>
    12  0.0037 :ffffffff8107dfb8:	mov    %rbp,%rdi
               :ffffffff8107dfbb:	mov    %rax,%rsi
               :ffffffff8107dfbe:	callq  ffffffff8107d8e0 <rcu_start_gp>
               :ffffffff8107dfc3:	jmp    ffffffff8107df77 <__call_rcu+0x97>
               :ffffffff8107dfc5:	nopl   (%rax)
               :ffffffff8107dfc8:	mov    $0x1,%esi
    10  0.0031 :ffffffff8107dfcd:	mov    %rbp,%rdi
               :ffffffff8107dfd0:	callq  ffffffff8107dd50 <force_quiescent_state>
     1 3.1e-04 :ffffffff8107dfd5:	jmp    ffffffff8107dfa1 <__call_rcu+0xc1>
   451  0.1405 :ffffffff8107dfd7:	mov    $0x7fffffffffffffff,%rdx
   411  0.1280 :ffffffff8107dfe1:	xor    %esi,%esi
               :ffffffff8107dfe3:	mov    %rbp,%rdi
               :ffffffff8107dfe6:	mov    %rdx,0x60(%rbx)
   317  0.0988 :ffffffff8107dfea:	callq  ffffffff8107dd50 <force_quiescent_state>
  4510  1.4051 :ffffffff8107dfef:	jmp    ffffffff8107dfa1 <__call_rcu+0xc1>
               :ffffffff8107dff1:	nopw   %cs:0x0(%rax,%rax,1)
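
(To make the data structure under discussion concrete, here is a small
stand-alone sketch of the segmented per-CPU callback list that those
two lines manipulate. The segment names mirror rcutree.h, but the
layout and helper are illustrative only; the real __call_rcu also
advances the other segment tails, which is omitted here.)

#include <stddef.h>
#include <stdio.h>

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

#define RCU_DONE_TAIL		0
#define RCU_WAIT_TAIL		1
#define RCU_NEXT_READY_TAIL	2
#define RCU_NEXT_TAIL		3
#define RCU_NEXT_SIZE		4

struct rcu_data {
	struct rcu_head *nxtlist;		/* head of the callback list */
	struct rcu_head *nxttail[RCU_NEXT_SIZE];/* tails of the four segments */
};

static void enqueue(struct rcu_data *rdp, struct rcu_head *head)
{
	head->next = NULL;
	/* The two lines from the profile: load the current tail pointer,
	 * link the new callback in, then advance the tail. */
	*rdp->nxttail[RCU_NEXT_TAIL] = head;
	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
}

int main(void)
{
	struct rcu_data rdp = { .nxtlist = NULL };
	struct rcu_head a = { 0 }, b = { 0 };
	int i;

	for (i = 0; i < RCU_NEXT_SIZE; i++)
		rdp.nxttail[i] = &rdp.nxtlist;	/* empty: all tails point at the head */

	enqueue(&rdp, &a);
	enqueue(&rdp, &b);
	printf("first callback: %p (a is %p)\n",
	       (void *)rdp.nxtlist, (void *)&a);
	return 0;
}

The enqueue itself is two stores plus one load of the tail pointer;
nothing about it is obviously expensive, which is why the 57%
attribution above looks suspicious.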
* Re: tree rcu: call_rcu scalability problem?
From: Paul E. McKenney @ 2009-09-02 15:19 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > Hi Paul,
> >
> > I'm testing out the scalability of some vfs code paths, and I'm seeing
> > a problem with call_rcu. This is a 2s8c Opteron system, so nothing
> > crazy.
> >
> > I'll show you the profile results for 1-8 threads:
> >
> > [1- and 2-thread profiles snipped]
> >
> > 4:
> >  98889 total                            0.0253
> >  21759 default_idle                    67.9969
> >  10994 __call_rcu                      38.1736
> >   5185 __d_lookup                      14.0897
> >   4475 dput                             7.9911

Four threads run on one socket but 8 threads run on two sockets,
I take it?

> > 8:
> > 170391 total                            0.0437
> >  31815 __call_rcu                     110.4688
> >  12958 dput                            23.1393
> >  10417 __d_lookup                      28.3071
> >
> > Of course there are other scalability factors involved too, but
> > __call_rcu is taking 54 times more CPU to do 8 times the amount
> > of work from 1 to 8 threads, or a factor of 6.7 slowdown.
> >
> > This is with tree RCU.
>
> It seems like nearly 2/3 of the cost is here:
>
>	/* Add the callback to our list. */
>	*rdp->nxttail[RCU_NEXT_TAIL] = head;        <<<
>	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;

Hmmm...  That certainly is not the first line of code in call_rcu()
that would come to mind...

> In loading the pointer to the next tail pointer, if I'm reading the
> profile correctly. I can't see why that should be a problem, though...

The usual diagnosis would be false sharing.

Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
expect interference from force_quiescent_state(), except that it should
run only every few clock ticks.  So this seems quite unlikely.

Could you please try padding the beginning and end of struct rcu_data
with a few hundred bytes and rerunning?  Just in case there is a shared
per-CPU variable either before or after rcu_data in your memory layout?

							Thanx, Paul

> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> [remainder of quoted disassembly snipped; see the previous message]
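
(The padding experiment suggested above can be sketched as follows. The
64-byte line size, the stand-in fields, and the wrapper form are
assumptions for illustration; in the kernel the pads would go directly
inside struct rcu_data.)

#include <stdio.h>

#define CACHE_LINE_SIZE 64	/* assumption: 64-byte lines on this Opteron */

struct rcu_data {		/* minimal stand-in for the kernel structure */
	void *nxtlist;
	void *nxttail[4];
	long  qlen;
};

/* Surround the hot structure with unused storage so that no other
 * per-CPU variable can share a cache line with it. */
struct rcu_data_padded {
	char pad_before[4 * CACHE_LINE_SIZE];
	struct rcu_data rdp;
	char pad_after[4 * CACHE_LINE_SIZE];
};

int main(void)
{
	printf("unpadded: %zu bytes, padded: %zu bytes\n",
	       sizeof(struct rcu_data), sizeof(struct rcu_data_padded));
	return 0;
}

If a neighbouring per-CPU variable shared a line with the tail
pointers, every remote update would bounce that line and make the
nxttail load expensive; that is the false-sharing hypothesis being
tested here.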
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-02 16:24 UTC
To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 02:27:56PM +0200, Nick Piggin wrote:
> > On Wed, Sep 02, 2009 at 11:48:35AM +0200, Nick Piggin wrote:
> > > [profiles snipped]
>
> Four threads run on one socket but 8 threads run on two sockets,
> I take it?

Yes.

> > It seems like nearly 2/3 of the cost is here:
> >
> >	/* Add the callback to our list. */
> >	*rdp->nxttail[RCU_NEXT_TAIL] = head;        <<<
> >	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>
> Hmmm...  That certainly is not the first line of code in call_rcu()
> that would come to mind...

It's weird. I *think* I read the asm right, but oprofile may not be
attributing the cost to the right instruction.

> > In loading the pointer to the next tail pointer, if I'm reading the
> > profile correctly. I can't see why that should be a problem, though...
>
> The usual diagnosis would be false sharing.

Hmm, that's possible, yes.

> Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> expect interference from force_quiescent_state(), except that it should
> run only every few clock ticks.  So this seems quite unlikely.

It's CPU-bound and preempt=y.

The workload is just 8 processes running a loop of close(open("file$i")).
As I said, though, you probably won't be able to reproduce it on a
vanilla kernel.

> Could you please try padding the beginning and end of struct rcu_data
> with a few hundred bytes and rerunning?  Just in case there is a shared
> per-CPU variable either before or after rcu_data in your memory layout?

OK, I'll try that.
* Re: tree rcu: call_rcu scalability problem?
From: Paul E. McKenney @ 2009-09-02 16:37 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 08:19:27AM -0700, Paul E. McKenney wrote:
> > [earlier quoting snipped]
> >
> > Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> > expect interference from force_quiescent_state(), except that it should
> > run only every few clock ticks.  So this seems quite unlikely.
>
> It's CPU-bound and preempt=y.
>
> The workload is just 8 processes running a loop of close(open("file$i")).
> As I said, though, you probably won't be able to reproduce it on a
> vanilla kernel.

OK, so you are executing call_rcu() a -lot-!!!

Could you also please try CONFIG_RCU_TRACE=y, and send me the contents
of the files in the "rcu" subdirectory in debugfs?  Please take a
snapshot of these files, run your test for a fixed time interval
(perhaps ten seconds, but please tell me how long), then take a second
snapshot.

							Thanx, Paul

> > Could you please try padding the beginning and end of struct rcu_data
> > with a few hundred bytes and rerunning?  Just in case there is a shared
> > per-CPU variable either before or after rcu_data in your memory layout?
>
> OK, I'll try that.
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-02 16:45 UTC
To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > [earlier quoting snipped]
> >
> > The workload is just 8 processes running a loop of close(open("file$i")).
> > As I said, though, you probably won't be able to reproduce it on a
> > vanilla kernel.
>
> OK, so you are executing call_rcu() a -lot-!!!

Oh yeah. The combined frequency in the 8-process case is once per
270 ns, so nearly half a million times per second per core.

It's not *slow* by any means, but it is increasing much faster than
other functions in the profile, so I just want to understand what is
happening.

> Could you also please try CONFIG_RCU_TRACE=y, and send me the contents
> of the files in the "rcu" subdirectory in debugfs?  Please take a
> snapshot of these files, run your test for a fixed time interval
> (perhaps ten seconds, but please tell me how long), then take a second
> snapshot.

Will do.
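
(A quick check of that arithmetic; the values are taken from the
message above, and the even split across the 8 cores is an assumption.)

#include <stdio.h>

int main(void)
{
	double period_ns = 270.0;		/* one call_rcu() per 270 ns, combined */
	double combined  = 1e9 / period_ns;	/* ~3.7 million calls/sec machine-wide */
	double per_core  = combined / 8.0;	/* ~463,000 calls/sec on each of 8 cores */

	printf("combined %.3g calls/s, per core %.3g calls/s\n",
	       combined, per_core);
	return 0;
}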
* Re: tree rcu: call_rcu scalability problem?
From: Paul E. McKenney @ 2009-09-02 16:48 UTC
To: Nick Piggin; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 06:45:04PM +0200, Nick Piggin wrote:
> On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> > [earlier quoting snipped]
> >
> > OK, so you are executing call_rcu() a -lot-!!!
>
> Oh yeah. The combined frequency in the 8-process case is once per
> 270 ns, so nearly half a million times per second per core.

Woo-hoo!!!  ;-)

> It's not *slow* by any means, but it is increasing much faster than
> other functions in the profile, so I just want to understand what is
> happening.

Understood and agreed.

> > Could you also please try CONFIG_RCU_TRACE=y, and send me the contents
> > of the files in the "rcu" subdirectory in debugfs?  Please take a
> > snapshot of these files, run your test for a fixed time interval
> > (perhaps ten seconds, but please tell me how long), then take a second
> > snapshot.
>
> Will do.

							Thanx, Paul
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-02 17:50 UTC
To: Paul E. McKenney; +Cc: Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:37:05AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 06:24:51PM +0200, Nick Piggin wrote:
> > > > In loading the pointer to the next tail pointer, if I'm reading the
> > > > profile correctly. I can't see why that should be a problem, though...
> > >
> > > The usual diagnosis would be false sharing.
> >
> > Hmm, that's possible, yes.

OK, padding 64 bytes (the cacheline size) at the start and end of
struct rcu_data does not help. I wonder if the cycles aren't being
attributed to the right instruction?

The interesting thing is that this queueing part seems to be the same
in rcuclassic too, which seems to run faster. I'll try to run it on a
bigger machine and see if the problem becomes more pronounced, but I
might not get around to that tonight.

> > > Hmmm...  What is the workload?  CPU-bound?  If CONFIG_PREEMPT=n, I might
> > > expect interference from force_quiescent_state(), except that it should
> > > run only every few clock ticks.  So this seems quite unlikely.
> >
> > It's CPU-bound and preempt=y.
> >
> > The workload is just 8 processes running a loop of close(open("file$i")).
> > As I said, though, you probably won't be able to reproduce it on a
> > vanilla kernel.
>
> OK, so you are executing call_rcu() a -lot-!!!
>
> Could you also please try CONFIG_RCU_TRACE=y, and send me the contents
> of the files in the "rcu" subdirectory in debugfs?  Please take a
> snapshot of these files, run your test for a fixed time interval
> (perhaps ten seconds, but please tell me how long), then take a second
> snapshot.

Attached, old/* vs new/*.  The interval was 22s.
* Re: tree rcu: call_rcu scalability problem?
From: Peter Zijlstra @ 2009-09-02 19:17 UTC
To: Nick Piggin; +Cc: Paul McKenney, Linux Kernel Mailing List

On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> It seems like nearly 2/3 of the cost is here:
>
>	/* Add the callback to our list. */
>	*rdp->nxttail[RCU_NEXT_TAIL] = head;        <<<
>	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
>
> In loading the pointer to the next tail pointer, if I'm reading the
> profile correctly. I can't see why that should be a problem, though...
>
> ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
>    697  0.2172 :ffffffff8107dee0:	push   %r12
>
>    921  0.2869 :ffffffff8107df57:	push   %rdx
>    151  0.0470 :ffffffff8107df58:	popfq
> 183507 57.1725 :ffffffff8107df59:	mov    0x50(%rbx),%rax
>    995  0.3100 :ffffffff8107df5d:	mov    %rdi,(%rax)

I'd guess popfq to be the expensive op here; skid usually causes the
cost to be attributed a few ops down the line.
* Re: tree rcu: call_rcu scalability problem?
From: Paul E. McKenney @ 2009-09-03  5:14 UTC
To: Peter Zijlstra; +Cc: Nick Piggin, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> > It seems like nearly 2/3 of the cost is here:
> >
> >	/* Add the callback to our list. */
> >	*rdp->nxttail[RCU_NEXT_TAIL] = head;        <<<
> >	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> >
> > In loading the pointer to the next tail pointer, if I'm reading the
> > profile correctly. I can't see why that should be a problem, though...
> >
> > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> >    697  0.2172 :ffffffff8107dee0:	push   %r12
> >
> >    921  0.2869 :ffffffff8107df57:	push   %rdx
> >    151  0.0470 :ffffffff8107df58:	popfq
> > 183507 57.1725 :ffffffff8107df59:	mov    0x50(%rbx),%rax
> >    995  0.3100 :ffffffff8107df5d:	mov    %rdi,(%rax)
>
> I'd guess popfq to be the expensive op here; skid usually causes the
> cost to be attributed a few ops down the line.

I believe that Nick's workload is routinely driving the number of
callbacks queued on a given CPU above 10,000, which would provoke
numerous (and possibly inlined) calls to force_quiescent_state().
Like about 400,000 such calls per second.  Hey, I was naively assuming
that no one would see more than 10,000 callbacks queued on a single CPU
unless there was some sort of major emergency underway, and coded
accordingly.  ;-)

I offer the attached experimental (untested, might not even compile)
patch.

							Thanx, Paul

------------------------------------------------------------------------

From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Wed, 2 Sep 2009 22:01:50 -0700
Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()

The grace-period machinery in __call_rcu() was a failed attempt to avoid
implementing synchronize_rcu_expedited().  But now that this attempt has
failed, try removing the machinery.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   12 ------------
 1 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d2a372f..104de9e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1201,26 +1201,14 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	 */
 	local_irq_save(flags);
 	rdp = rsp->rda[smp_processor_id()];
-	rcu_process_gp_end(rsp, rdp);
-	check_for_new_grace_period(rsp, rdp);
 
 	/* Add the callback to our list. */
 	*rdp->nxttail[RCU_NEXT_TAIL] = head;
 	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
 
-	/* Start a new grace period if one not already started. */
-	if (ACCESS_ONCE(rsp->completed) == ACCESS_ONCE(rsp->gpnum)) {
-		unsigned long nestflag;
-		struct rcu_node *rnp_root = rcu_get_root(rsp);
-
-		spin_lock_irqsave(&rnp_root->lock, nestflag);
-		rcu_start_gp(rsp, nestflag);  /* releases rnp_root->lock. */
-	}
-
 	/* Force the grace period if too many callbacks or too long waiting. */
 	if (unlikely(++rdp->qlen > qhimark)) {
 		rdp->blimit = LONG_MAX;
-		force_quiescent_state(rsp, 0);
 	} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
 		force_quiescent_state(rsp, 1);
 	local_irq_restore(flags);
-- 
1.5.2.5
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-03  7:45 UTC
To: Paul E. McKenney; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > I'd guess popfq to be the expensive op here; skid usually causes the
> > cost to be attributed a few ops down the line.
>
> I believe that Nick's workload is routinely driving the number of
> callbacks queued on a given CPU above 10,000, which would provoke
> numerous (and possibly inlined) calls to force_quiescent_state().
> Like about 400,000 such calls per second.  Hey, I was naively assuming
> that no one would see more than 10,000 callbacks queued on a single CPU
> unless there was some sort of major emergency underway, and coded
> accordingly.  ;-)
>
> I offer the attached experimental (untested, might not even compile)
> patch.

Not only does it compile, but __call_rcu is now taking 1/10th the
cycles, and absolute performance is up nearly 20%. It looks like it is
now better than classic RCU.

I'll collect and post some more detailed numbers and profiles. Do you
want some new rcu trace results too?

Thanks,
Nick
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-03  9:01 UTC
To: Paul E. McKenney; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 10:14:27PM -0700, Paul E. McKenney wrote:
> From 0544d2da54bad95556a320e57658e244cb2ae8c6 Mon Sep 17 00:00:00 2001
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date: Wed, 2 Sep 2009 22:01:50 -0700
> Subject: [PATCH] Remove grace-period machinery from rcutree __call_rcu()
>
> The grace-period machinery in __call_rcu() was a failed attempt to avoid
> implementing synchronize_rcu_expedited().  But now that this attempt has
> failed, try removing the machinery.

OK, the workload is parallel processes performing a close(open()) loop
in a tmpfs filesystem within different cwds (to avoid contention on the
cwd dentry). The kernel is first patched with my vfs scalability
patches, so the comparison is with/without Paul's rcu patch.

The system is a 2s8c Opteron, with processes bound to CPUs (first
within the same socket, then over both sockets as the count increases).

procs   tput-base         tput-rcu
    1    595238 (x1.00)    645161 (x1.00)
    2   1041666 (x1.75)   1136363 (x1.76)
    4   1960784 (x3.29)   2298850 (x3.56)
    8   3636363 (x6.11)   4545454 (x7.05)

Scalability is improved (from 2-8 way it is now actually linear), and
single-thread performance is significantly improved too.

oprofile results collecting clk-unhalted samples show the following
for the __call_rcu symbol:

procs  samples  %       app name  symbol name
tput-base
    1    12153  3.8122  vmlinux   __call_rcu
    2    29253  3.9899  vmlinux   __call_rcu
    4    84503  5.4667  vmlinux   __call_rcu
    8   312816  9.5287  vmlinux   __call_rcu

tput-rcu
    1     8722  2.8770  vmlinux   __call_rcu
    2    17275  2.5804  vmlinux   __call_rcu
    4    33848  2.6015  vmlinux   __call_rcu
    8    67158  2.5561  vmlinux   __call_rcu

Scaling is clearly much better (it is more important to look at
absolute samples, because the percentage depends on other parts of the
kernel too).

Feel free to add any of this to your changelog if you think it's
important.

Thanks,
Nick

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> [remainder of quoted patch snipped]
* Re: tree rcu: call_rcu scalability problem?
From: Paul E. McKenney @ 2009-09-03 13:28 UTC
To: Nick Piggin; +Cc: Peter Zijlstra, Linux Kernel Mailing List

On Thu, Sep 03, 2009 at 11:01:26AM +0200, Nick Piggin wrote:
> OK, the workload is parallel processes performing a close(open()) loop
> in a tmpfs filesystem within different cwds (to avoid contention on the
> cwd dentry). The kernel is first patched with my vfs scalability
> patches, so the comparison is with/without Paul's rcu patch.
>
> procs   tput-base         tput-rcu
>     1    595238 (x1.00)    645161 (x1.00)
>     2   1041666 (x1.75)   1136363 (x1.76)
>     4   1960784 (x3.29)   2298850 (x3.56)
>     8   3636363 (x6.11)   4545454 (x7.05)
>
> Scalability is improved (from 2-8 way it is now actually linear), and
> single-thread performance is significantly improved too.
>
> procs  samples  %       app name  symbol name
> tput-base
>     1    12153  3.8122  vmlinux   __call_rcu
>     2    29253  3.9899  vmlinux   __call_rcu
>     4    84503  5.4667  vmlinux   __call_rcu
>     8   312816  9.5287  vmlinux   __call_rcu
>
> tput-rcu
>     1     8722  2.8770  vmlinux   __call_rcu
>     2    17275  2.5804  vmlinux   __call_rcu
>     4    33848  2.6015  vmlinux   __call_rcu
>     8    67158  2.5561  vmlinux   __call_rcu
>
> Scaling is clearly much better (it is more important to look at
> absolute samples, because the percentage depends on other parts of the
> kernel too).
>
> Feel free to add any of this to your changelog if you think it's
> important.

Very cool!!!

I got a dissenting view from the people trying to get rid of interrupts
in computational workloads.  But I believe that it is possible to split
the difference, getting you almost all of the performance benefits
while still permitting them to turn off the scheduling-clock interrupt.

The reason that I believe it should get you the performance benefits is
that deleting rcu_process_gp_end() and check_for_new_grace_period()
didn't do much for you.  Their overhead is quite small compared to
hammering the system with a full set of IPIs every ten microseconds or
so.  ;-)

So could you please give the following experimental patch a go?  If it
works for you, I will put together a production-ready patch along these
lines.

							Thanx, Paul

------------------------------------------------------------------------

From 57b7f98303a5c5aa50648c71758760006af49bab Mon Sep 17 00:00:00 2001
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Thu, 3 Sep 2009 06:19:45 -0700
Subject: [PATCH] Reduce grace-period-encouragement impact on rcutree
 __call_rcu()

Remove only the emergency force_quiescent_state() from __call_rcu(),
which should get most of the reduction in overhead while still allowing
the tick to be turned off when non-idle, as proposed in
http://lkml.org/lkml/2009/9/1/229, and which reduced interrupts to one
per ten seconds in a CPU-bound computational workload according to
http://lkml.org/lkml/2009/9/3/7.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d2a372f..4c8e0d2 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1220,7 +1220,6 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	/* Force the grace period if too many callbacks or too long waiting. */
 	if (unlikely(++rdp->qlen > qhimark)) {
 		rdp->blimit = LONG_MAX;
-		force_quiescent_state(rsp, 0);
 	} else if ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0)
 		force_quiescent_state(rsp, 1);
 	local_irq_restore(flags);
-- 
1.5.2.5
* Re: tree rcu: call_rcu scalability problem?
From: Nick Piggin @ 2009-09-03  7:14 UTC
To: Peter Zijlstra; +Cc: Paul McKenney, Linux Kernel Mailing List

On Wed, Sep 02, 2009 at 09:17:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 14:27 +0200, Nick Piggin wrote:
> > It seems like nearly 2/3 of the cost is here:
> >
> >	/* Add the callback to our list. */
> >	*rdp->nxttail[RCU_NEXT_TAIL] = head;        <<<
> >	rdp->nxttail[RCU_NEXT_TAIL] = &head->next;
> >
> > In loading the pointer to the next tail pointer, if I'm reading the
> > profile correctly. I can't see why that should be a problem, though...
> >
> > ffffffff8107dee0 <__call_rcu>: /* __call_rcu total: 320971 100.000 */
> >    697  0.2172 :ffffffff8107dee0:	push   %r12
> >
> >    921  0.2869 :ffffffff8107df57:	push   %rdx
> >    151  0.0470 :ffffffff8107df58:	popfq
> > 183507 57.1725 :ffffffff8107df59:	mov    0x50(%rbx),%rax
> >    995  0.3100 :ffffffff8107df5d:	mov    %rdi,(%rax)
>
> I'd guess popfq to be the expensive op here; skid usually causes the
> cost to be attributed a few ops down the line.

Well, that doesn't really explain why it is getting more costly per
unit of work as the core count increases. The popfq should run
completely out of L1 cache (or even from store forwarding), and while
it is expensive, that cost should remain the same regardless of how
many cores are in the system.

But I am getting some strange-looking profiles, so I'm going to run
this on a bigger machine and see if the scalability problem becomes
more pronounced. Maybe I'll even make a microbenchmark to eliminate
other elements of the workload.