public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Nick Piggin <npiggin@gmail.com>
Cc: Nick Piggin <npiggin@kernel.dk>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 00/46] rcu-walk and dcache scaling
Date: Wed, 8 Dec 2010 15:28:16 +1100	[thread overview]
Message-ID: <20101208042816.GA32766@dastard> (raw)
In-Reply-To: <20101208033212.GF29333@dastard>

On Wed, Dec 08, 2010 at 02:32:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2010 at 12:47:42PM +1100, Nick Piggin wrote:
> > On Wed, Dec 8, 2010 at 8:56 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Sat, Nov 27, 2010 at 09:15:58PM +1100, Nick Piggin wrote:
> > >>
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working
> > >>
> > >> Here is an new set of vfs patches for review, not that there was much interest
> > >> last time they were posted. It is structured like:
> > >>
> > >> * preparation patches
> > >> * introduce new locks to take over dcache_lock, then remove it
> > >> * cleaning up and reworking things for new locks
> > >> * rcu-walk path walking
> > >> * start on some fine grained locking steps
> > >
> > > Stress test doing:
> > >
> > >        single thread 50M inode create
> > >        single thread rm -rf
> > >        2-way 50M inode create
> > >        2-way rm -rf
> > >        4-way 50M inode create
> > >        4-way rm -rf
> > >        8-way 50M inode create
> > >        8-way rm -rf
> > >        8-way 250M inode create
> > >        8-way rm -rf
> > >
> > > Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test)
> > > with a CPU stuck spinning on here:
> > >
> > > [37372.084012] NMI backtrace for cpu 5
> > > [37372.084012] CPU 5
> > > [37372.084012] Modules linked in:
> > > [37372.084012]
> > > [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs
> > > [37372.084012] RIP: 0010:[<ffffffff810643c4>]  [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20
> > > [37372.084012] RSP: 0018:ffff880114643c98  EFLAGS: 00000213
> > > [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688
> > > [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14
> > > [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000
> > > [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14
> > > [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c
> > > [37372.084012] FS:  00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000
> > > [37372.084012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0
> > > [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890)
> > > [37372.084012] Stack:
> > > [37372.084012]  ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7
> > > [37372.084012]  0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800
> > > [37372.084012]  ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000
> > > [37372.084012] Call Trace:
> > > [37372.084012]  [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20
> > > [37372.084012]  [<ffffffff81167ae7>] shrink_dentry_list+0x47/0x370
> > > [37372.084012]  [<ffffffff81167f5e>] __shrink_dcache_sb+0x14e/0x1e0
> > > [37372.084012]  [<ffffffff81168456>] shrink_dcache_parent+0x276/0x2d0
> > > [37372.084012]  [<ffffffff81ad044e>] ? _raw_spin_lock+0xe/0x20
> > > [37372.084012]  [<ffffffff8115daa2>] dentry_unhash+0x42/0x80
> > > [37372.084012]  [<ffffffff8115db48>] vfs_rmdir+0x68/0x100
> > > [37372.084012]  [<ffffffff8115fd93>] do_rmdir+0x113/0x130
> > > [37372.084012]  [<ffffffff8114f5ad>] ? filp_close+0x5d/0x90
> > > [37372.084012]  [<ffffffff8115fde5>] sys_unlinkat+0x35/0x40
> > > [37372.084012]  [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
> > 
> > OK good, with any luck, that's the same bug.
> > 
> > Is this XFS?
> 
> Yes.
> 
> > Is there any concurrent activity happening on the same dentries?
> 
> Not from an application perspective.
> 
> > Ie. are the rm -rf threads running on the same directories,
> 
> No, each thread operating on a different directory.
> 
> > or is there any reclaim happening in the background?
> 
> IIRC, kswapd was consuming about 5-10% of a CPU during parallel
> unlink tests. Mainly reclaiming XFS inodes, I think, but there may
> be dentry cache reclaim going as well.

Turns out that the kswapd peaks are upwards of 50% of a CPU for a
few seconds, then idle for 10-15s. Typical perf top output of kswapd
while it is active during unlinks is:

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            17168.00 10.2% __call_rcu                  [kernel.kallsyms]
            13223.00  7.8% kmem_cache_free             [kernel.kallsyms]
            12917.00  7.6% down_write                  [kernel.kallsyms]
            12665.00  7.5% xfs_iunlock                 [kernel.kallsyms]
            10493.00  6.2% xfs_reclaim_inode_grab      [kernel.kallsyms]
             9314.00  5.5% __lookup_tag                [kernel.kallsyms]
             9040.00  5.4% radix_tree_delete           [kernel.kallsyms]
             8694.00  5.1% is_bad_inode                [kernel.kallsyms]
             7639.00  4.5% __ticket_spin_lock          [kernel.kallsyms]
             6821.00  4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
             5484.00  3.2% __d_drop                    [kernel.kallsyms]
             5114.00  3.0% xfs_reclaim_inode           [kernel.kallsyms]
             4626.00  2.7% __rcu_process_callbacks     [kernel.kallsyms]
             3556.00  2.1% up_write                    [kernel.kallsyms]
             3206.00  1.9% _cond_resched               [kernel.kallsyms]
             3129.00  1.9% xfs_qm_dqdetach             [kernel.kallsyms]
             2327.00  1.4% radix_tree_tag_clear        [kernel.kallsyms]
             2327.00  1.4% call_rcu_sched              [kernel.kallsyms]
             2262.00  1.3% __ticket_spin_unlock        [kernel.kallsyms]
             2215.00  1.3% xfs_ilock                   [kernel.kallsyms]
             2200.00  1.3% radix_tree_gang_lookup_tag  [kernel.kallsyms]
             1982.00  1.2% xfs_reclaim_inodes_ag       [kernel.kallsyms]
             1736.00  1.0% xfs_trans_unlocked_item     [kernel.kallsyms]
             1707.00  1.0% __ticket_spin_trylock       [kernel.kallsyms]
             1688.00  1.0% xfs_perag_get_tag           [kernel.kallsyms]
             1660.00  1.0% flat_send_IPI_mask          [kernel.kallsyms]
             1538.00  0.9% xfs_inode_item_destroy      [kernel.kallsyms]
             1312.00  0.8% __shrink_dcache_sb          [kernel.kallsyms]
              940.00  0.6% xfs_perag_put               [kernel.kallsyms]

So there is some dentry cache reclaim going on. 

FWIW, it appears there is quite a lot of RCU freeing overhead (~15%
more CPU time) in the work kswapd is doing during these unlinks, too.
I just had a look at kswapd when a 8-way create is running - it's running at
50-60% of a cpu for seconds at a time. I caught this while it was doing pure
XFS inode cache reclaim (~10s sample, kswapd reclaimed ~1M inodes):

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _________________

            27171.00  9.0% __call_rcu                  [kernel.kallsyms]
            21491.00  7.1% down_write                  [kernel.kallsyms]
            20916.00  6.9% xfs_reclaim_inode           [kernel.kallsyms]
            20313.00  6.7% radix_tree_delete           [kernel.kallsyms]
            15828.00  5.3% kmem_cache_free             [kernel.kallsyms]
            15819.00  5.2% xfs_idestroy_fork           [kernel.kallsyms]
            14893.00  4.9% is_bad_inode                [kernel.kallsyms]
            14666.00  4.9% _raw_spin_unlock_irqrestore [kernel.kallsyms]
            14191.00  4.7% xfs_reclaim_inode_grab      [kernel.kallsyms]
            14105.00  4.7% xfs_iunlock                 [kernel.kallsyms]
            10916.00  3.6% __ticket_spin_lock          [kernel.kallsyms]
            10125.00  3.4% xfs_iflush_cluster          [kernel.kallsyms]
             8221.00  2.7% xfs_qm_dqdetach             [kernel.kallsyms]
             7639.00  2.5% xfs_trans_unlocked_item     [kernel.kallsyms]
             7028.00  2.3% xfs_synchronize_times       [kernel.kallsyms]
             6974.00  2.3% up_write                    [kernel.kallsyms]
             5870.00  1.9% call_rcu_sched              [kernel.kallsyms]
             5634.00  1.9% _cond_resched               [kernel.kallsyms]

Which is showing a similar amount of RCU overhead as the unlink as above.
And this while it was doing dentry cache reclaim (~10s sample):

            35921.00 15.7% __d_drop                      [kernel.kallsyms]
            30056.00 13.1% __ticket_spin_trylock         [kernel.kallsyms]
            29066.00 12.7% __ticket_spin_lock            [kernel.kallsyms]
            19043.00  8.3% __call_rcu                    [kernel.kallsyms]
            10098.00  4.4% iput                          [kernel.kallsyms]
             7013.00  3.1% __shrink_dcache_sb            [kernel.kallsyms]
             6774.00  3.0% __percpu_counter_add          [kernel.kallsyms]
             6708.00  2.9% radix_tree_tag_set            [kernel.kallsyms]
             5362.00  2.3% xfs_inactive                  [kernel.kallsyms]
             5130.00  2.2% __ticket_spin_unlock          [kernel.kallsyms]
             4884.00  2.1% call_rcu_sched                [kernel.kallsyms]
             4621.00  2.0% dentry_lru_del                [kernel.kallsyms]
             3735.00  1.6% bit_waitqueue                 [kernel.kallsyms]
             3727.00  1.6% dentry_iput                   [kernel.kallsyms]
             3473.00  1.5% shrink_icache_memory          [kernel.kallsyms]
             3279.00  1.4% kfree                         [kernel.kallsyms]
             3101.00  1.4% xfs_perag_get                 [kernel.kallsyms]
             2516.00  1.1% kmem_cache_free               [kernel.kallsyms]
             2272.00  1.0% shrink_dentry_list            [kernel.kallsyms]

I've never really seen any signficant dentry cache reclaim overhead
in profiles of these workloads before, so this was a bit of a
surprise....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2010-12-08  4:28 UTC|newest]

Thread overview: 98+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-27 10:15 [PATCH 00/46] rcu-walk and dcache scaling Nick Piggin
2010-11-27  9:44 ` [PATCH 02/46] fs: d_validate fixes Nick Piggin
2010-12-08  1:53   ` Dave Chinner
2010-12-08  6:59     ` Nick Piggin
2010-12-09  0:50       ` Dave Chinner
2010-12-09  4:50         ` Nick Piggin
2010-11-27  9:44 ` [PATCH 03/46] kernel: kmem_ptr_validate considered harmful Nick Piggin
2010-11-27  9:44 ` [PATCH 04/46] fs: dcache documentation cleanup Nick Piggin
2010-11-27  9:44 ` [PATCH 05/46] fs: change d_delete semantics Nick Piggin
2010-11-27  9:44 ` [PATCH 06/46] cifs: dont overwrite dentry name in d_revalidate Nick Piggin
2010-11-27  9:44 ` [PATCH 07/46] jfs: " Nick Piggin
2010-11-27  9:44 ` [PATCH 08/46] fs: change d_compare for rcu-walk Nick Piggin
2010-11-27  9:44 ` [PATCH 09/46] fs: change d_hash " Nick Piggin
2010-11-27  9:44 ` [PATCH 10/46] hostfs: simplify locking Nick Piggin
2010-11-27  9:44 ` [PATCH 11/46] fs: dcache scale hash Nick Piggin
2010-12-09  6:09   ` Dave Chinner
2010-12-09  6:28     ` Nick Piggin
2010-12-09  8:17       ` Dave Chinner
2010-12-09 12:53         ` Nick Piggin
2010-12-09 23:42           ` Dave Chinner
2010-12-10  2:35             ` Nick Piggin
2010-12-10  9:01               ` Dave Chinner
2010-12-13  4:48                 ` Nick Piggin
2010-12-13  5:05                 ` Nick Piggin
2010-11-27  9:44 ` [PATCH 12/46] fs: dcache scale lru Nick Piggin
2010-12-09  7:22   ` Dave Chinner
2010-12-09 12:34     ` Nick Piggin
2010-11-27  9:44 ` [PATCH 13/46] fs: dcache scale dentry refcount Nick Piggin
2010-11-27  9:44 ` [PATCH 14/46] fs: dcache scale d_unhashed Nick Piggin
2010-11-27  9:44 ` [PATCH 15/46] fs: dcache scale subdirs Nick Piggin
2010-11-27  9:44 ` [PATCH 16/46] fs: scale inode alias list Nick Piggin
2010-11-27  9:44 ` [PATCH 17/46] fs: Use rename lock and RCU for multi-step operations Nick Piggin
2011-01-18 22:32   ` Yehuda Sadeh Weinraub
2011-01-18 22:42     ` Nick Piggin
2011-01-19 22:27       ` Yehuda Sadeh Weinraub
2011-01-19 22:32         ` Nick Piggin
2011-01-25 22:10           ` Yehuda Sadeh Weinraub
2011-01-27  5:18             ` Nick Piggin
2011-02-07 18:52               ` Jim Schutt
2011-02-07 21:04                 ` Yehuda Sadeh Weinraub
2011-02-07 21:31                   ` Jim Schutt
2011-02-07 22:25                   ` Jim Schutt
2011-02-14 17:57               ` Yehuda Sadeh Weinraub
2010-11-27  9:44 ` [PATCH 18/46] fs: increase d_name lock coverage Nick Piggin
2010-11-27  9:44 ` [PATCH 19/46] fs: dcache remove dcache_lock Nick Piggin
2010-11-27  9:44 ` [PATCH 20/46] fs: dcache avoid starvation in dcache multi-step operations Nick Piggin
2010-11-27  9:44 ` [PATCH 21/46] fs: dcache reduce dput locking Nick Piggin
2010-11-27  9:44 ` [PATCH 22/46] fs: dcache reduce locking in d_alloc Nick Piggin
2010-11-27  9:44 ` [PATCH 23/46] fs: dcache reduce dcache_inode_lock Nick Piggin
2010-11-27  9:44 ` [PATCH 24/46] fs: dcache rationalise dget variants Nick Piggin
2010-11-27  9:44 ` [PATCH 25/46] fs: dcache reduce d_parent locking Nick Piggin
2010-11-27  9:44 ` [PATCH 26/46] fs: dcache reduce prune_one_dentry locking Nick Piggin
2010-11-27  9:44 ` [PATCH 27/46] fs: reduce dcache_inode_lock width in lru scanning Nick Piggin
2010-11-27  9:44 ` [PATCH 28/46] fs: use RCU in shrink_dentry_list to reduce lock nesting Nick Piggin
2010-11-27  9:44 ` [PATCH 29/46] fs: consolidate dentry kill sequence Nick Piggin
2010-11-27  9:45 ` [PATCH 30/46] fs: icache RCU free inodes Nick Piggin
2010-11-27  9:45 ` [PATCH 31/46] fs: avoid inode RCU freeing for pseudo fs Nick Piggin
2010-11-27  9:45 ` [PATCH 32/46] kernel: optimise seqlock Nick Piggin
2010-11-27  9:45 ` [PATCH 33/46] fs: rcu-walk for path lookup Nick Piggin
2010-11-27  9:45 ` [PATCH 34/46] fs: fs_struct use seqlock Nick Piggin
2010-11-27  9:45 ` [PATCH 35/46] fs: dcache remove d_mounted Nick Piggin
2010-11-27  9:45 ` [PATCH 36/46] fs: dcache reduce branches in lookup path Nick Piggin
2010-11-27  9:45 ` [PATCH 37/46] fs: cache optimise dentry and inode for rcu-walk Nick Piggin
2010-11-27  9:45 ` [PATCH 38/46] fs: prefetch inode data in dcache lookup Nick Piggin
2010-11-27  9:45 ` [PATCH 39/46] fs: d_revalidate_rcu for rcu-walk Nick Piggin
2010-11-27  9:45 ` [PATCH 40/46] fs: provide rcu-walk aware permission i_ops Nick Piggin
2010-11-27  9:45 ` [PATCH 41/46] fs: provide simple rcu-walk ACL implementation Nick Piggin
2010-11-27  9:45 ` [PATCH 42/46] kernel: add bl_list Nick Piggin
2010-11-27  9:45 ` [PATCH 43/46] bit_spinlock: add required includes Nick Piggin
2010-11-27  9:45 ` [PATCH 44/46] fs: dcache per-bucket dcache hash locking Nick Piggin
2010-11-27  9:45 ` [PATCH 45/46] fs: dcache per-inode inode alias locking Nick Piggin
2010-11-27  9:45 ` [PATCH 46/46] fs: improve scalability of pseudo filesystems Nick Piggin
2010-11-27  9:56 ` [PATCH 01/46] Revert "fs: use RCU read side protection in d_validate" Nick Piggin
2010-12-08  1:16   ` Dave Chinner
2010-12-08  9:38     ` Nick Piggin
2010-12-09  0:44       ` Dave Chinner
2010-12-09  4:38         ` Nick Piggin
2010-12-09  5:16           ` Nick Piggin
2010-11-27 15:04 ` [PATCH 00/46] rcu-walk and dcache scaling Anca Emanuel
2010-11-28  3:28   ` Nick Piggin
2010-11-28  6:24     ` Sedat Dilek
2010-12-01 18:03 ` David Miller
2010-12-03 16:55   ` Nick Piggin
2010-12-07 11:25 ` Dave Chinner
2010-12-07 15:24   ` Nick Piggin
2010-12-07 15:49     ` Peter Zijlstra
2010-12-07 15:59       ` Nick Piggin
2010-12-07 16:23         ` Peter Zijlstra
2010-12-08  3:28     ` Nick Piggin
2010-12-07 21:56 ` Dave Chinner
2010-12-08  1:47   ` Nick Piggin
2010-12-08  3:32     ` Dave Chinner
2010-12-08  4:28       ` Dave Chinner [this message]
2010-12-08  7:09         ` Nick Piggin
2010-12-10 20:32           ` Paul E. McKenney
2010-12-12 14:54             ` Paul E. McKenney
  -- strict thread matches above, loose matches on Subject: below --
2010-11-27 19:20 Sedat Dilek
2010-11-27 20:53 ` Sedat Dilek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101208042816.GA32766@dastard \
    --to=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@gmail.com \
    --cc=npiggin@kernel.dk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox