From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH 00/46] rcu-walk and dcache scaling
Date: Sun, 12 Dec 2010 06:54:07 -0800
Message-ID: <20101212145407.GA1969@linux.vnet.ibm.com>
In-Reply-To: <20101210203257.GI2125@linux.vnet.ibm.com>
To: Nick Piggin
Cc: Dave Chinner, Nick Piggin, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org

On Fri, Dec 10, 2010 at 12:32:57PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2010 at 06:09:09PM +1100, Nick Piggin wrote:
> > On Wed, Dec 08, 2010 at 03:28:16PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > > > > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner wrote:
> > > > > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > > > > >>
> > > > > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > > > > >>
> > > > > >> Here is a new set of vfs patches for review, not that there was much interest
> > > > > >> last time they were posted. It is structured like:
> > > > > >>
> > > > > >> * preparation patches
> > > > > >> * introduce new locks to take over dcache_lock, then remove it
> > > > > >> * cleaning up and reworking things for new locks
> > > > > >> * rcu-walk path walking
> > > > > >> * start on some fine grained locking steps
> > > > > >
> > > > > > Stress test doing:
> > > > > >
> > > > > >         single thread 50M inode create
> > > > > >         single thread rm -rf
> > > > > >         2-way 50M inode create
> > > > > >         2-way rm -rf
> > > > > >         4-way 50M inode create
> > > > > >         4-way rm -rf
> > > > > >         8-way 50M inode create
> > > > > >         8-way rm -rf
> > > > > >         8-way 250M inode create
> > > > > >         8-way rm -rf
> > > > > >
> > > > > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > > > > with a CPU stuck spinning on here:
> > > > > >
> > > > > > [37372.084012] NMI backtrace for cpu 5
> > > > > > [37372.084012] CPU 5
> > > > > > [37372.084012] Modules linked in:
> > > > > > [37372.084012]
> > > > > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > > > > [37372.084012] RIP: 0010:[]  [] __ticket_spin_lock+0x14/0x20
> > > > > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > > > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > > > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > > > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > > > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > > > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > > > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > > > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > > > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > > > > [37372.084012] Stack:
> > > > > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > > > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > > > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > > > > [37372.084012] Call Trace:
> > > > > > [37372.084012]  [] _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [] shrink_dentry_list+0x47/0x370
> > > > > > [37372.084012]  [] __shrink_dcache_sb+0x14e/0x1e0
> > > > > > [37372.084012]  [] shrink_dcache_parent+0x276/0x2d0
> > > > > > [37372.084012]  [] ? _raw_spin_lock+0xe/0x20
> > > > > > [37372.084012]  [] dentry_unhash+0x42/0x80
> > > > > > [37372.084012]  [] vfs_rmdir+0x68/0x100
> > > > > > [37372.084012]  [] do_rmdir+0x113/0x130
> > > > > > [37372.084012]  [] ? filp_close+0x5d/0x90
> > > > > > [37372.084012]  [] sys_unlinkat+0x35/0x40
> > > > > > [37372.084012]  [] system_call_fastpath+0x16/0x1b
> > > > >
> > > > > OK good, with any luck, that's the same bug.
> > > > >
> > > > > Is this XFS?
> > > >
> > > > Yes.
> > > >
> > > > > Is there any concurrent activity happening on the same dentries?
> > > >
> > > > Not from an application perspective.
> > > >
> > > > > Ie. are the rm -rf threads running on the same directories,
> > > >
> > > > No, each thread operating on a different directory.
> >
> > This is probably fixed by the same patch as the lockdep splat trace.
> >
> >
> > > > > or is there any reclaim happening in the background?
> > > >
> > > > IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> > > > unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> > > > be dentry cache reclaim going as well.
> > >
> > > Turns out that the kswapd peaks are upwards of 50% of a CPU for a
> > > few seconds, then idle for 10-15s. Typical perf top output of kswapd
> > > while it is active during unlinks is:
> > >
> > >    samples  pcnt function                    DSO
> > >    _______ _____ ___________________________ _________________
> > >
> > >   17168.00 10.2% __call_rcu                  [kernel.kallsyms]
> > >   13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
> > >   12917.00  7.6% down_write                  [kernel.kallsyms]
> > >   12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
> > >   10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >    9314.00  5.5% __lookup_tag                [kernel.kallsyms]
> > >    9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
> > >    8694.00  5.1% is_bad_inode                [kernel.kallsyms]
> > >    7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
> > >    6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >    5484.00  3.2% __d_drop                    [kernel.kallsyms]
> > >    5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
> > >    4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
> > >    3556.00  2.1% up_write                    [kernel.kallsyms]
> > >    3206.00  1.9% _cond_resched               [kernel.kallsyms]
> > >    3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
> > >    2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
> > >    2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
> > >    2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
> > >    2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
> > >    2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
> > >    1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
> > >    1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >    1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
> > >    1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
> > >    1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
> > >    1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
> > >    1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
> > >     940.00  0.6% xfs_perag_put               [kernel.kallsyms]
> > >
> > > So there is some dentry cache reclaim going on.
> > >
> > > FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
> > > more CPU time) in the work kswapd is doing during these unlinks, too.
> > > I just had a look at kswapd when an 8-way create is running - it's
> > > running at 50-60% of a cpu for seconds at a time. I caught this while
> > > it was doing pure XFS inode cache reclaim (~10s sample, kswapd
> > > reclaimed ~1M inodes):
> > >
> > >    samples  pcnt function                    DSO
> > >    _______ _____ ___________________________ _________________
> > >
> > >   27171.00  9.0% __call_rcu                  [kernel.kallsyms]
> > >   21491.00  7.1% down_write                  [kernel.kallsyms]
> > >   20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
> > >   20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
> > >   15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
> > >   15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
> > >   14893.00  4.9% is_bad_inode                [kernel.kallsyms]
> > >   14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> > >   14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
> > >   14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
> > >   10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
> > >   10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
> > >    8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
> > >    7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
> > >    7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
> > >    6974.00  2.3% up_write                    [kernel.kallsyms]
> > >    5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
> > >    5634.00  1.9% _cond_resched               [kernel.kallsyms]
> > >
> > > Which is showing a similar amount of RCU overhead as the unlink as above.
> > > And this while it was doing dentry cache reclaim (~10s sample):
> > >
> > >   35921.00 15.7% __d_drop                    [kernel.kallsyms]
> > >   30056.00 13.1% __ticket_spin_trylock       [kernel.kallsyms]
> > >   29066.00 12.7% __ticket_spin_lock          [kernel.kallsyms]
> > >   19043.00  8.3% __call_rcu                  [kernel.kallsyms]
> > >   10098.00  4.4% iput                        [kernel.kallsyms]
> > >    7013.00  3.1% __shrink_dcache_sb          [kernel.kallsyms]
> > >    6774.00  3.0% __percpu_counter_add        [kernel.kallsyms]
> > >    6708.00  2.9% radix_tree_tag_set          [kernel.kallsyms]
> > >    5362.00  2.3% xfs_inactive                [kernel.kallsyms]
> > >    5130.00  2.2% __ticket_spin_unlock        [kernel.kallsyms]
> > >    4884.00  2.1% call_rcu_sched              [kernel.kallsyms]
> > >    4621.00  2.0% dentry_lru_del              [kernel.kallsyms]
> > >    3735.00  1.6% bit_waitqueue               [kernel.kallsyms]
> > >    3727.00  1.6% dentry_iput                 [kernel.kallsyms]
> > >    3473.00  1.5% shrink_icache_memory        [kernel.kallsyms]
> > >    3279.00  1.4% kfree                       [kernel.kallsyms]
> > >    3101.00  1.4% xfs_perag_get               [kernel.kallsyms]
> > >    2516.00  1.1% kmem_cache_free             [kernel.kallsyms]
> > >    2272.00  1.0% shrink_dentry_list          [kernel.kallsyms]
> > >
> > > I've never really seen any significant dentry cache reclaim overhead
> > > in profiles of these workloads before, so this was a bit of a
> > > surprise....
> >
> > call_rcu shouldn't be doing much, except for disabling irqs and linking
> > the object into the list. I have a patch somewhere to reduce the irq
> > disable overhead a bit, but it really shouldn't be doing a lot of work.
>
> Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and
> look at rcu/rcudata?  There will be a "ql=" number printed for each
> CPU, and if that number is too large, __call_rcu() does take what it
> considers to be corrective action, which can incur some overhead.
>
> If this is the problem, then increasing the value of the qhimark module
> parameter might help.
>
> If this is not the problem, I could make a patch that disables some of
> __call_rcu()'s grace-period acceleration code if you are willing to try
> it out.

Another thing that might help is to reduce the value of CONFIG_RCU_FANOUT
to something like 16.  If this does help, then there is a reasonably
straightforward change I can make to RCU.  (A simplified sketch of the
__call_rcu() queue-length check I described above is appended at the
bottom of this mail.)

							Thanx, Paul

> > Sometimes you find that touching the rcu head field needs to get a
> > cacheline exclusive, so a bit of work gets transferred there....
> >
> > But it may also be something going a bit wrong in RCU. I blew it up
> > once already, after the files_lock splitup that enabled all CPUs to
> > create and destroy files :)
>
> I would certainly like the opportunity to fix any bugs that might be
> in RCU... ;-)
>
> 							Thanx, Paul
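
To make the "ql="/qhimark point above concrete, here is a rough sketch.
This is not the kernel's actual rcutree.c; the structures, the qhimark
value, and force_quiescent_state_stub() are illustrative stand-ins.  It
only shows the shape of the fast path that Nick described (the real code
disables interrupts, links the callback, and bumps the per-CPU count) and
the point where an over-long queue triggers the more expensive corrective
work:

#include <stdio.h>

/* All names below are illustrative stand-ins, not the kernel's own. */

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct rcu_data {
	struct rcu_head *list;	/* per-CPU callback list */
	long qlen;		/* queue length: the "ql=" value in rcu/rcudata */
};

static long qhimark = 10000;	/* stands in for the real qhimark module parameter */

/* Stand-in for the grace-period-forcing work; just reports it here. */
static void force_quiescent_state_stub(void)
{
	printf("qlen over qhimark: forcing grace periods (the extra overhead)\n");
}

/*
 * Model of the call_rcu() fast path: link the callback onto the per-CPU
 * list and bump the count; only when qlen exceeds qhimark is the more
 * expensive corrective action taken.
 */
static void call_rcu_model(struct rcu_data *rdp, struct rcu_head *head,
			   void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = rdp->list;
	rdp->list = head;
	if (++rdp->qlen > qhimark)
		force_quiescent_state_stub();
}

static void free_cb(struct rcu_head *head)
{
	(void)head;	/* a real callback would free the enclosing dentry/inode */
}

int main(void)
{
	static struct rcu_head heads[10001];
	struct rcu_data rdp = { NULL, 0 };
	int i;

	/* Flood the per-CPU queue the way a mass unlink floods call_rcu(). */
	for (i = 0; i < 10001; i++)
		call_rcu_model(&rdp, &heads[i], free_cb);

	printf("queued %ld callbacks on this model CPU\n", rdp.qlen);
	return 0;
}

Whether the real __call_rcu() is hitting its corrective path under your
unlink workload is exactly what the "ql=" values in rcu/rcudata would show.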