From: Nick Piggin <npiggin@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, Nick Piggin <npiggin@kernel.dk>,
xfs@oss.sgi.com
Subject: Re: VFS scalability git tree
Date: Fri, 30 Jul 2010 00:03:57 +1000 [thread overview]
Message-ID: <20100729140357.GA7217@amd> (raw)
In-Reply-To: <20100728125717.GI655@dastard>
On Wed, Jul 28, 2010 at 10:57:17PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > >
> > > > > Branch vfs-scale-working
> > > >
> > > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > > run the same fs_mark parallel create/unlink workload to show
> > > > scalability as I ran here:
> > > >
> > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > >
> > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > > of a real disk (I don't have easy access to a good disk setup ATM, but
> > > I guess we're more interested in code above the block layer anyway).
> > >
> > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > > yours.
> > >
> > > I found that performance is a little unstable, so I sync and echo 3 >
> > > drop_caches between each run. When it starts reclaiming memory, things
> > > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > > of seconds in inode reclaim).
> >
> > So about this XFS livelock type thingy. It looks like this, and happens
> > periodically while running the above fs_mark benchmark requiring reclaim
> > of inodes:
> ....
>
> > Nothing much happening except 100% system time for seconds at a time
> > (length of time varies). This is on a ramdisk, so it isn't waiting
> > for IO.
> >
> > During this time, lots of things are contending on the lock:
> >
> > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed
> > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed
> > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion
> > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock
> > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed
> > |
> > --- __write_lock_failed
> > |
> > |--99.92%-- xfs_inode_ag_walk
> > | xfs_inode_ag_iterator
> > | xfs_reclaim_inode_shrink
> > | shrink_slab
> > | shrink_zone
> > | balance_pgdat
> > | kswapd
> > | kthread
> > | kernel_thread_helper
> > --0.08%-- [...]
> >
> > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock
> > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find
> > 1.16% fs_mark [kernel.kallsyms] [k] memcpy
> > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil
> > |
> > --- xfs_log_commit_cil
> > _xfs_trans_commit
> > |
> > |--60.00%-- xfs_remove
> > | xfs_vn_unlink
> > | vfs_unlink
> > | do_unlinkat
> > | sys_unlink
> >
> > I'm not sure if there was a long-running read locker in there causing
> > all the write lockers to fail, or if they are just running into one
> > another.
>
> The longest hold is in the inode cluster writeback
> (xfs_iflush_cluster), but if there is no IO then I don't see how
> that would be a problem.
No I wasn't suggesting there was, just that there could have
been one that I didn't notice in profiles (ie. because it
had taken read lock rather than spinning on it).
> I suspect that it might be caused by having several CPUs
> all trying to run the shrinker at the same time and them all
> starting at the same AG and therefore lockstepping and getting
> nothing done because they are all scanning the same inodes.
I think that is the most likely answer, yes.
> Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
> avoid this lockstepping. I've attached a patch below to do this
> - can you give it a try?
Cool yes I will. I could try it in combination with the batching
patch too. Thanks.
> > But anyway, I hacked the following patch which seemed to
> > improve that behaviour. I haven't run any throughput numbers on it yet,
> > but I could if you're interested (and it's not completely broken!)
>
> Batching is certainly something that I have been considering, but
> apart from the excessive scanning bug, the per-ag inode tree lookups
> hve not featured prominently in any profiling I've done, so it
> hasn't been a high priority.
>
> You patch looks like it will work fine, but I think it can be made a
> lot cleaner. I'll have a closer look at this once I get to the bottom of
> the dbench hang you are seeing....
Well I'll see if I can measure any efficiency or lock contention
improvements with it and report back. Might have to wait till the
weekend.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
WARNING: multiple messages have this Message-ID (diff)
From: Nick Piggin <npiggin@suse.de>
To: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <npiggin@kernel.dk>,
xfs@oss.sgi.com, linux-fsdevel@vger.kernel.org
Subject: Re: VFS scalability git tree
Date: Fri, 30 Jul 2010 00:03:57 +1000 [thread overview]
Message-ID: <20100729140357.GA7217@amd> (raw)
In-Reply-To: <20100728125717.GI655@dastard>
On Wed, Jul 28, 2010 at 10:57:17PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > >
> > > > > Branch vfs-scale-working
> > > >
> > > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > > run the same fs_mark parallel create/unlink workload to show
> > > > scalability as I ran here:
> > > >
> > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > >
> > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > > of a real disk (I don't have easy access to a good disk setup ATM, but
> > > I guess we're more interested in code above the block layer anyway).
> > >
> > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > > yours.
> > >
> > > I found that performance is a little unstable, so I sync and echo 3 >
> > > drop_caches between each run. When it starts reclaiming memory, things
> > > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > > of seconds in inode reclaim).
> >
> > So about this XFS livelock type thingy. It looks like this, and happens
> > periodically while running the above fs_mark benchmark requiring reclaim
> > of inodes:
> ....
>
> > Nothing much happening except 100% system time for seconds at a time
> > (length of time varies). This is on a ramdisk, so it isn't waiting
> > for IO.
> >
> > During this time, lots of things are contending on the lock:
> >
> > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed
> > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed
> > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion
> > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock
> > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed
> > |
> > --- __write_lock_failed
> > |
> > |--99.92%-- xfs_inode_ag_walk
> > | xfs_inode_ag_iterator
> > | xfs_reclaim_inode_shrink
> > | shrink_slab
> > | shrink_zone
> > | balance_pgdat
> > | kswapd
> > | kthread
> > | kernel_thread_helper
> > --0.08%-- [...]
> >
> > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock
> > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find
> > 1.16% fs_mark [kernel.kallsyms] [k] memcpy
> > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil
> > |
> > --- xfs_log_commit_cil
> > _xfs_trans_commit
> > |
> > |--60.00%-- xfs_remove
> > | xfs_vn_unlink
> > | vfs_unlink
> > | do_unlinkat
> > | sys_unlink
> >
> > I'm not sure if there was a long-running read locker in there causing
> > all the write lockers to fail, or if they are just running into one
> > another.
>
> The longest hold is in the inode cluster writeback
> (xfs_iflush_cluster), but if there is no IO then I don't see how
> that would be a problem.
No I wasn't suggesting there was, just that there could have
been one that I didn't notice in profiles (ie. because it
had taken read lock rather than spinning on it).
> I suspect that it might be caused by having several CPUs
> all trying to run the shrinker at the same time and them all
> starting at the same AG and therefore lockstepping and getting
> nothing done because they are all scanning the same inodes.
I think that is the most likely answer, yes.
> Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
> avoid this lockstepping. I've attached a patch below to do this
> - can you give it a try?
Cool yes I will. I could try it in combination with the batching
patch too. Thanks.
> > But anyway, I hacked the following patch which seemed to
> > improve that behaviour. I haven't run any throughput numbers on it yet,
> > but I could if you're interested (and it's not completely broken!)
>
> Batching is certainly something that I have been considering, but
> apart from the excessive scanning bug, the per-ag inode tree lookups
> hve not featured prominently in any profiling I've done, so it
> hasn't been a high priority.
>
> You patch looks like it will work fine, but I think it can be made a
> lot cleaner. I'll have a closer look at this once I get to the bottom of
> the dbench hang you are seeing....
Well I'll see if I can measure any efficiency or lock contention
improvements with it and report back. Might have to wait till the
weekend.
next prev parent reply other threads:[~2010-07-29 14:01 UTC|newest]
Thread overview: 76+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-07-22 19:01 VFS scalability git tree Nick Piggin
2010-07-22 19:01 ` Nick Piggin
2010-07-23 11:13 ` Dave Chinner
2010-07-23 11:13 ` Dave Chinner
2010-07-23 14:04 ` [PATCH 0/2] vfs scalability tree fixes Dave Chinner
2010-07-23 14:04 ` Dave Chinner
2010-07-23 16:09 ` Nick Piggin
2010-07-23 16:09 ` Nick Piggin
2010-07-23 14:04 ` [PATCH 1/2] xfs: fix shrinker build Dave Chinner
2010-07-23 14:04 ` Dave Chinner
2010-07-23 14:04 ` [PATCH 2/2] xfs: shrinker should use a per-filesystem scan count Dave Chinner
2010-07-23 14:04 ` Dave Chinner
2010-07-23 15:51 ` VFS scalability git tree Nick Piggin
2010-07-23 15:51 ` Nick Piggin
2010-07-24 0:21 ` Dave Chinner
2010-07-24 0:21 ` Dave Chinner
2010-07-23 11:17 ` Christoph Hellwig
2010-07-23 11:17 ` Christoph Hellwig
2010-07-23 15:42 ` Nick Piggin
2010-07-23 15:42 ` Nick Piggin
2010-07-23 13:55 ` Dave Chinner
2010-07-23 13:55 ` Dave Chinner
2010-07-23 16:16 ` Nick Piggin
2010-07-23 16:16 ` Nick Piggin
2010-07-27 7:05 ` Nick Piggin
2010-07-27 7:05 ` Nick Piggin
2010-07-27 8:06 ` Nick Piggin
2010-07-27 8:06 ` Nick Piggin
2010-07-27 11:36 ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
2010-07-27 13:30 ` Dave Chinner
2010-07-27 14:58 ` XFS hang in xlog_grant_log_space Dave Chinner
2010-07-28 13:17 ` Dave Chinner
2010-07-29 14:05 ` Nick Piggin
2010-07-29 22:56 ` Dave Chinner
2010-07-30 3:59 ` Nick Piggin
2010-07-28 12:57 ` VFS scalability git tree Dave Chinner
2010-07-28 12:57 ` Dave Chinner
2010-07-29 14:03 ` Nick Piggin [this message]
2010-07-29 14:03 ` Nick Piggin
2010-07-27 11:09 ` Nick Piggin
2010-07-27 11:09 ` Nick Piggin
2010-07-27 13:18 ` Dave Chinner
2010-07-27 13:18 ` Dave Chinner
2010-07-27 15:09 ` Nick Piggin
2010-07-27 15:09 ` Nick Piggin
2010-07-28 4:59 ` Dave Chinner
2010-07-28 4:59 ` Dave Chinner
2010-07-28 4:59 ` Dave Chinner
2010-07-23 15:35 ` Nick Piggin
2010-07-23 15:35 ` Nick Piggin
2010-07-24 8:43 ` KOSAKI Motohiro
2010-07-24 8:43 ` KOSAKI Motohiro
2010-07-24 8:44 ` [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead the return value of shrink_slab() KOSAKI Motohiro
2010-07-24 8:44 ` KOSAKI Motohiro
2010-07-24 8:44 ` KOSAKI Motohiro
2010-07-24 12:05 ` KOSAKI Motohiro
2010-07-24 12:05 ` KOSAKI Motohiro
2010-07-24 8:46 ` [PATCH 2/2] vmscan: change shrink_slab() return tyep with void KOSAKI Motohiro
2010-07-24 8:46 ` KOSAKI Motohiro
2010-07-24 8:46 ` KOSAKI Motohiro
2010-07-24 10:54 ` VFS scalability git tree KOSAKI Motohiro
2010-07-24 10:54 ` KOSAKI Motohiro
2010-07-26 5:41 ` Nick Piggin
2010-07-26 5:41 ` Nick Piggin
2010-07-28 10:24 ` Nick Piggin
2010-07-28 10:24 ` Nick Piggin
2010-07-30 9:12 ` Nick Piggin
2010-07-30 9:12 ` Nick Piggin
2010-08-03 0:27 ` john stultz
2010-08-03 0:27 ` john stultz
2010-08-03 0:27 ` john stultz
2010-08-03 5:44 ` Nick Piggin
2010-08-03 5:44 ` Nick Piggin
2010-08-03 5:44 ` Nick Piggin
2010-09-14 22:26 ` Christoph Hellwig
2010-09-14 23:02 ` Frank Mayhar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100729140357.GA7217@amd \
--to=npiggin@suse.de \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=npiggin@kernel.dk \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.