Re: VFS scalability git tree

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* Re: VFS scalability git tree
       [not found]   ` <20100727070538.GA2893@amd>
@ 2010-07-27  8:06     ` Nick Piggin
  2010-07-27 11:36       ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
  2010-07-28 12:57       ` VFS scalability git tree Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Nick Piggin @ 2010-07-27  8:06 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > 
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> > 
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> 
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
> 
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.
> 
> I found that performance is a little unstable, so I sync and echo 3 >
> drop_caches between each run. When it starts reclaiming memory, things
> get a bit more erratic (and XFS seemed to be almost livelocking for tens
> of seconds in inode reclaim).

So about this XFS livelock type thingy. It looks like this, and happens
periodically while running the above fs_mark benchmark requiring reclaim
of inodes:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
15  0   6900  31032    192 471852    0    0    28 183296 8520 46672  5 91  4  0
19  0   7044  22928    192 466712   96  144  1056 115586 8622 41695  3 96  1  0
19  0   7136  59884    192 471200  160   92  6768 34564  995  542  1 99 0  0
19  0   7244  17008    192 467860    0  104  2068 32953 1044  630  1 99 0  0
18  0   7244  43436    192 467324    0    0    12     0  817  405  0 100 0  0
18  0   7244  43684    192 467324    0    0     0     0  806  425  0 100 0  0
18  0   7244  43932    192 467324    0    0     0     0  808  403  0 100 0  0
18  0   7244  44924    192 467324    0    0     0     0  808  398  0 100 0  0
18  0   7244  45456    192 467324    0    0     0     0  809  409  0 100 0  0
18  0   7244  45472    192 467324    0    0     0     0  805  412  0 100 0  0
18  0   7244  46392    192 467324    0    0     0     0  807  401  0 100 0  0
18  0   7244  47012    192 467324    0    0     0     0  810  414  0 100 0  0
18  0   7244  47260    192 467324    0    0     0     0  806  396  0 100 0  0
18  0   7244  47752    192 467324    0    0     0     0  806  403  0 100 0  0
18  0   7244  48204    192 467324    0    0     0     0  810  409  0 100 0  0
18  0   7244  48608    192 467324    0    0     0     0  807  412  0 100 0  0
18  0   7244  48876    192 467324    0    0     0     0  805  406  0 100 0  0
18  0   7244  49000    192 467324    0    0     0     0  809  402  0 100 0  0
18  0   7244  49408    192 467324    0    0     0     0  807  396  0 100 0  0
18  0   7244  49908    192 467324    0    0     0     0  809  406  0 100 0  0
18  0   7244  50032    192 467324    0    0     0     0  805  404  0 100 0  0
18  0   7244  50032    192 467324    0    0     0     0  805  406  0 100 0  0
19  0   7244  73436    192 467324    0    0     0  6340  808  384  0 100 0  0
20  0   7244 490220    192 467324    0    0     0  8411  830  389  0 100 0  0
18  0   7244 620092    192 467324    0    0     0     4  809  435  0 100 0  0
18  0   7244 620344    192 467324    0    0     0     0  806  430  0 100 0  0
16  0   7244 682620    192 467324    0    0    44    80  890  326  0 100 0  0
12  0   7244 604464    192 479308   76    0 11716 73555 2242 14318  2 94 4  0
12  0   7244 556700    192 483488    0    0  4276 77680 6576 92285  1 97 2  0
17  0   7244 502508    192 485456    0    0  2092 98368 6308 91919  1 96 4  0
11  0   7244 416500    192 487116    0    0  1760 114844 7414 63025  2 96  2  0

Nothing much happening except 100% system time for seconds at a time
(length of time varies). This is on a ramdisk, so it isn't waiting
for IO.

During this time, lots of things are contending on the lock:

    60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
     4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
     3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
     3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
     3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
                   |
                   --- __write_lock_failed
                      |
                      |--99.92%-- xfs_inode_ag_walk
                      |          xfs_inode_ag_iterator
                      |          xfs_reclaim_inode_shrink
                      |          shrink_slab
                      |          shrink_zone
                      |          balance_pgdat
                      |          kswapd
                      |          kthread
                      |          kernel_thread_helper
                       --0.08%-- [...]

     3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
     1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
     1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
     0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
     0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
                   |
                   --- xfs_log_commit_cil
                       _xfs_trans_commit
                      |
                      |--60.00%-- xfs_remove
                      |          xfs_vn_unlink
                      |          vfs_unlink
                      |          do_unlinkat
                      |          sys_unlink

I'm not sure if there was a long-running read locker in there causing
all the write lockers to fail, or if they are just running into one
another. But anyway, I hacked the following patch which seemed to
improve that behaviour. I haven't run any throughput numbers on it yet,
but I could if you're interested (and it's not completely broken!)

Batch pag_ici_lock acquisition on the reclaim path, and also skip inodes
that appear to be busy to improve locking efficiency.

Index: source/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- source.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:12:11.000000000 +1000
+++ source/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:58:59.000000000 +1000
@@ -87,6 +87,91 @@ xfs_inode_ag_lookup(
 	return ip;
 }
 
+#define RECLAIM_BATCH_SIZE	32
+STATIC int
+xfs_inode_ag_walk_reclaim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip,
+					   struct xfs_perag *pag, int flags),
+	int			flags,
+	int			tag,
+	int			exclusive,
+	int			*nr_to_scan)
+{
+	uint32_t		first_index;
+	int			last_error = 0;
+	int			skipped;
+	xfs_inode_t		*batch[RECLAIM_BATCH_SIZE];
+	int			batchnr;
+	int			i;
+
+	BUG_ON(!exclusive);
+
+restart:
+	skipped = 0;
+	first_index = 0;
+next_batch:
+	batchnr = 0;
+	/* fill the batch */
+	write_lock(&pag->pag_ici_lock);
+	do {
+		xfs_inode_t	*ip;
+
+		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+		if (!ip)
+			break;	
+		if (!(flags & SYNC_WAIT) &&
+				(!xfs_iflock_free(ip) ||
+				__xfs_iflags_test(ip, XFS_IRECLAIM)))
+			continue;
+
+		/*
+		 * The radix tree lock here protects a thread in xfs_iget from
+		 * racing with us starting reclaim on the inode.  Once we have
+		 * the XFS_IRECLAIM flag set it will not touch us.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+			/* ignore as it is already under reclaim */
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		__xfs_iflags_set(ip, XFS_IRECLAIM);
+		spin_unlock(&ip->i_flags_lock);
+
+		batch[batchnr++] = ip;
+	} while ((*nr_to_scan)-- && batchnr < RECLAIM_BATCH_SIZE);
+	write_unlock(&pag->pag_ici_lock);
+
+	for (i = 0; i < batchnr; i++) {
+		int		error = 0;
+		xfs_inode_t	*ip = batch[i];
+
+		/* execute doesn't require pag->pag_ici_lock */
+		error = execute(ip, pag, flags);
+		if (error == EAGAIN) {
+			skipped++;
+			continue;
+		}
+		if (error)
+			last_error = error;
+
+		/* bail out if the filesystem is corrupted.  */
+		if (error == EFSCORRUPTED)
+			break;
+	}
+	if (batchnr == RECLAIM_BATCH_SIZE)
+		goto next_batch;
+
+	if (0 && skipped) {
+		delay(1);
+		goto restart;
+	}
+	return last_error;
+}
+
 STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
@@ -113,6 +198,7 @@ restart:
 			write_lock(&pag->pag_ici_lock);
 		else
 			read_lock(&pag->pag_ici_lock);
+
 		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 		if (!ip) {
 			if (exclusive)
@@ -198,8 +284,12 @@ xfs_inode_ag_iterator(
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
 	ag = 0;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+		if (tag == XFS_ICI_RECLAIM_TAG)
+			error = xfs_inode_ag_walk_reclaim(mp, pag, execute,
+						flags, tag, exclusive, &nr);
+		else
+			error = xfs_inode_ag_walk(mp, pag, execute,
+						flags, tag, exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -789,23 +879,6 @@ xfs_reclaim_inode(
 {
 	int	error = 0;
 
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode.  Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* ignore as it is already under reclaim */
-		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
-		return 0;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
Index: source/fs/xfs/xfs_inode.h
===================================================================
--- source.orig/fs/xfs/xfs_inode.h	2010-07-26 21:10:33.000000000 +1000
+++ source/fs/xfs/xfs_inode.h	2010-07-26 21:11:59.000000000 +1000
@@ -349,6 +349,11 @@ static inline int xfs_iflock_nowait(xfs_
 	return try_wait_for_completion(&ip->i_flush);
 }
 
+static inline int xfs_iflock_free(xfs_inode_t *ip)
+{
+	return completion_done(&ip->i_flush);
+}
+
 static inline void xfs_ifunlock(xfs_inode_t *ip)
 {
 	complete(&ip->i_flush);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree)
  2010-07-27  8:06     ` VFS scalability git tree Nick Piggin
@ 2010-07-27 11:36       ` Nick Piggin
  2010-07-27 13:30         ` Dave Chinner
  2010-07-28 12:57       ` VFS scalability git tree Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2010-07-27 11:36 UTC (permalink / raw)
  To: xfs

On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> > 
> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run. When it starts reclaiming memory, things
> > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > of seconds in inode reclaim).

On this same system, same setup (vanilla kernel with sha given below),
I have now twice reproduced a complete hang in XFS. I can give more
information, test patches or options etc if required.

setup.sh looks like this:
#!/bin/bash
modprobe rd rd_size=$[2*1024*1024]
dd if=/dev/zero of=/dev/ram0 bs=4K
mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt

The 'dd' is required to ensure rd driver does not allocate pages
during IO (which can lead to out of memory deadlocks). Running just
involves changing into mnt directory and

while true
do
  sync
  echo 3 > /proc/sys/vm/drop_caches
  ../dbench -c ../loadfiles/client.txt -t20 8
  rm -rf clients
done

And wait for it to hang (happend in < 5 minutes here)


Sysrq of blocked tasks looks like this:

Linux version 2.6.35-rc5-00176-gcd5b8f8 (npiggin@amd) (gcc version 4.4.4 (Debian 4.4.4-7) ) #348 SMP Mon Jul 26 22:20:32 EST 2010

brd: module loaded
Enabling EXPERIMENTAL delayed logging feature - use at your own risk.
XFS mounting filesystem ram0
Ending clean XFS mount for filesystem: ram0
SysRq : Show Blocked State
  task                        PC stack   pid father
flush-1:0     D 00000000fffff8fd     0  2799      2 0x00000000
 ffff8800701ff690 0000000000000046 ffff880000000000 ffff8800701fffd8
 ffff8800071531f0 00000000000122c0 0000000000004000 ffff8800701fffd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800071531f0
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0
 [<ffffffff8123275b>] xfs_iomap_write_allocate+0x25b/0x3c0
 [<ffffffff81232abe>] xfs_iomap+0x1fe/0x270
 [<ffffffff8124aaf7>] xfs_map_blocks+0x37/0x40
 [<ffffffff8107b8e1>] ? find_lock_page+0x21/0x70
 [<ffffffff8124bdca>] xfs_page_state_convert+0x35a/0x690
 [<ffffffff8107c38a>] ? find_or_create_page+0x3a/0xa0
 [<ffffffff8124c3a6>] xfs_vm_writepage+0x76/0x110
 [<ffffffff8108eb00>] ? __dec_zone_page_state+0x30/0x40
 [<ffffffff81082c32>] __writepage+0x12/0x40
 [<ffffffff81083367>] write_cache_pages+0x1c7/0x3d0
 [<ffffffff81082c20>] ? __writepage+0x0/0x40
 [<ffffffff8108358f>] generic_writepages+0x1f/0x30
 [<ffffffff8124c2fc>] xfs_vm_writepages+0x4c/0x60
 [<ffffffff810835bc>] do_writepages+0x1c/0x40
 [<ffffffff810d606e>] writeback_single_inode+0xce/0x3b0
 [<ffffffff810d6774>] writeback_sb_inodes+0x174/0x260
 [<ffffffff810d701f>] writeback_inodes_wb+0x8f/0x180
 [<ffffffff810d733b>] wb_writeback+0x22b/0x290
 [<ffffffff810d7536>] wb_do_writeback+0x196/0x1a0
 [<ffffffff810d7583>] bdi_writeback_task+0x43/0x120
 [<ffffffff81050f46>] ? bit_waitqueue+0x16/0xe0
 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0
 [<ffffffff8108fd6c>] bdi_start_fn+0x6c/0xe0
 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0
 [<ffffffff81050bee>] kthread+0x8e/0xa0
 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10
 [<ffffffff81050b60>] ? kthread+0x0/0xa0
 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10
xfssyncd/ram0 D 00000000fffff045     0  2807      2 0x00000000
 ffff880007635d00 0000000000000046 ffff880000000000 ffff880007635fd8
 ffff880007abd370 00000000000122c0 0000000000004000 ffff880007635fd8
 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abd370
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812572b7>] ? xfs_inode_ag_iterator+0x57/0xd0
 [<ffffffff81256c0a>] xfs_commit_dummy_trans+0x4a/0xe0
 [<ffffffff81257454>] xfs_sync_worker+0x74/0x80
 [<ffffffff81256b2a>] xfssyncd+0x13a/0x1d0
 [<ffffffff812569f0>] ? xfssyncd+0x0/0x1d0
 [<ffffffff81050bee>] kthread+0x8e/0xa0
 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10
 [<ffffffff81050b60>] ? kthread+0x0/0xa0
 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10
dbench        D 00000000ffffefc6     0  2975   2974 0x00000000
 ffff88005ecd1ae8 0000000000000082 ffff880000000000 ffff88005ecd1fd8
 ffff8800079ce250 00000000000122c0 0000000000004000 ffff88005ecd1fd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800079ce250
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff81247920>] xfs_create+0x170/0x5d0
 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140
 [<ffffffff810d0724>] ? mntput_no_expire+0x24/0xe0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000fffff045     0  2976   2974 0x00000000
 ffff880060c11b18 0000000000000086 ffff880000000000 ffff880060c11fd8
 ffff880007668630 00000000000122c0 0000000000004000 ffff880060c11fd8
 0000000000004000 00000000000122c0 ffffffff81793020 ffff880007668630
Call Trace:
 [<ffffffff81236320>] xlog_grant_log_space+0x280/0x3d0
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812412c8>] xfs_rename+0x138/0x630
 [<ffffffff810c036e>] ? exec_permission+0x3e/0x70
 [<ffffffff81253111>] xfs_vn_rename+0x61/0x70
 [<ffffffff810c1b4e>] vfs_rename+0x41e/0x480
 [<ffffffff810c3bd6>] sys_renameat+0x236/0x270
 [<ffffffff8122551d>] ? xfs_dir2_sf_getdents+0x21d/0x390
 [<ffffffff810c6800>] ? filldir+0x0/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810b8a4a>] ? fput+0x1aa/0x220
 [<ffffffff810c3c26>] sys_rename+0x16/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed4     0  2977   2974 0x00000000
 ffff88000873fa88 0000000000000082 ffff88000873fac8 ffff88000873ffd8
 ffff880007669710 00000000000122c0 0000000000004000 ffff88000873ffd8
 0000000000004000 00000000000122c0 ffff88007f2e1790 ffff880007669710
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8123dbf6>] ? xfs_icsb_disable_counter+0x16/0xa0
 [<ffffffff812445bb>] ? _xfs_trans_bjoin+0x4b/0x60
 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0
 [<ffffffff8123dcb6>] ? xfs_icsb_balance_counter_locked+0x36/0xc0
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0
 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320
 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0
 [<ffffffff8122fa86>] xfs_iunlink+0x46/0x160
 [<ffffffff81253d21>] ? xfs_mark_inode_dirty_sync+0x21/0x30
 [<ffffffff81253dcf>] ? xfs_ichgtime+0x9f/0xc0
 [<ffffffff81245677>] xfs_droplink+0x57/0x70
 [<ffffffff8124751a>] xfs_remove+0x28a/0x370
 [<ffffffff81253443>] xfs_vn_unlink+0x43/0x90
 [<ffffffff810c161b>] vfs_unlink+0x8b/0x110
 [<ffffffff810c0e20>] ? lookup_hash+0x30/0x40
 [<ffffffff810c3db3>] do_unlinkat+0x183/0x1c0
 [<ffffffff810bb3f1>] ? sys_newstat+0x31/0x50
 [<ffffffff810c3e01>] sys_unlink+0x11/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefa8     0  2978   2974 0x00000000
 ffff880040da7c38 0000000000000082 ffff880000000000 ffff880040da7fd8
 ffff880007668090 00000000000122c0 0000000000004000 ffff880040da7fd8
 0000000000004000 00000000000122c0 ffff88012ff78b90 ffff880007668090
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0
 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20
 [<ffffffff810cdb94>] notify_change+0x104/0x2e0
 [<ffffffff810db270>] utimes_common+0xd0/0x1a0
 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40
 [<ffffffff810db416>] do_utimes+0xd6/0xf0
 [<ffffffff810db5ae>] sys_utime+0x1e/0x70
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefd3     0  2979   2974 0x00000000
 ffff8800072c7ae8 0000000000000082 ffff8800072c7b78 ffff8800072c7fd8
 ffff880007669170 00000000000122c0 0000000000004000 ffff8800072c7fd8
 0000000000004000 00000000000122c0 ffff88012ff785f0 ffff880007669170
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0
 [<ffffffff81247920>] xfs_create+0x170/0x5d0
 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed0     0  2980   2974 0x00000000
 ffff88003dd91688 0000000000000082 0000000000000000 ffff88003dd91fd8
 ffff880007abc290 00000000000122c0 0000000000004000 ffff88003dd91fd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff880007abc290
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0
 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320
 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0
 [<ffffffff812292d9>] xfs_ialloc_read_agi+0x29/0x90
 [<ffffffff8122957b>] xfs_ialloc_ag_select+0x12b/0x260
 [<ffffffff8122abc7>] xfs_dialloc+0x3d7/0x860
 [<ffffffff8124acc8>] ? __xfs_get_blocks+0x1c8/0x210
 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0
 [<ffffffff810ddb9e>] ? unmap_underlying_metadata+0xe/0x50
 [<ffffffff8122ef4d>] xfs_ialloc+0x5d/0x690
 [<ffffffff8124a031>] ? kmem_zone_alloc+0x91/0xe0
 [<ffffffff8124570d>] xfs_dir_ialloc+0x7d/0x320
 [<ffffffff81236552>] ? xfs_log_reserve+0xe2/0xf0
 [<ffffffff81247b83>] xfs_create+0x3d3/0x5d0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed0     0  2981   2974 0x00000000
 ffff88005b79f618 0000000000000086 ffff88005b79f598 ffff88005b79ffd8
 ffff880007abcdd0 00000000000122c0 0000000000004000 ffff88005b79ffd8
 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abcdd0
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8122cd44>] ? xfs_iext_bno_to_ext+0x84/0x160
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff81244a40>] xfs_trans_get_buf+0xc0/0xe0
 [<ffffffff8121ac3f>] xfs_da_do_buf+0x3df/0x6d0
 [<ffffffff8121b0c5>] xfs_da_get_buf+0x25/0x30
 [<ffffffff81220926>] ? xfs_dir2_data_init+0x46/0xe0
 [<ffffffff81220926>] xfs_dir2_data_init+0x46/0xe0
 [<ffffffff8121e829>] xfs_dir2_sf_to_block+0xb9/0x5a0
 [<ffffffff8105106a>] ? wake_up_bit+0x2a/0x40
 [<ffffffff81226a78>] xfs_dir2_sf_addname+0x418/0x5c0
 [<ffffffff8122f3fb>] ? xfs_ialloc+0x50b/0x690
 [<ffffffff8121e61c>] xfs_dir_createname+0x14c/0x1a0
 [<ffffffff81247bf9>] xfs_create+0x449/0x5d0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefbf     0  2982   2974 0x00000000
 ffff88005b7f9c38 0000000000000082 ffff880000000000 ffff88005b7f9fd8
 ffff880007698150 00000000000122c0 0000000000004000 ffff88005b7f9fd8
 0000000000004000 00000000000122c0 ffff88012ff79130 ffff880007698150
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0
 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20
 [<ffffffff810cdb94>] notify_change+0x104/0x2e0
 [<ffffffff810db270>] utimes_common+0xd0/0x1a0
 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40
 [<ffffffff810db416>] do_utimes+0xd6/0xf0
 [<ffffffff810db5ae>] sys_utime+0x1e/0x70
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree)
  2010-07-27 11:36       ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
@ 2010-07-27 13:30         ` Dave Chinner
  2010-07-27 14:58           ` XFS hang in xlog_grant_log_space Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2010-07-27 13:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs

On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On this same system, same setup (vanilla kernel with sha given below),
> I have now twice reproduced a complete hang in XFS. I can give more
> information, test patches or options etc if required.
> 
> setup.sh looks like this:
> #!/bin/bash
> modprobe rd rd_size=$[2*1024*1024]
> dd if=/dev/zero of=/dev/ram0 bs=4K
> mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> 
> The 'dd' is required to ensure rd driver does not allocate pages
> during IO (which can lead to out of memory deadlocks). Running just
> involves changing into mnt directory and
> 
> while true
> do
>   sync
>   echo 3 > /proc/sys/vm/drop_caches
>   ../dbench -c ../loadfiles/client.txt -t20 8
>   rm -rf clients
> done
> 
> And wait for it to hang (happend in < 5 minutes here)
....
> Call Trace:
>  [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0

It's waiting on log space to be freed up. Either there's an
accounting problem (possible), or you've got an xfslogd/xfsaild
spinning and not making progress competing log IOs or pushing the
tail of the log. I'll see if I can reproduce it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-27 13:30         ` Dave Chinner
@ 2010-07-27 14:58           ` Dave Chinner
  2010-07-28 13:17             ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2010-07-27 14:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs

On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On this same system, same setup (vanilla kernel with sha given below),
> > I have now twice reproduced a complete hang in XFS. I can give more
> > information, test patches or options etc if required.
> > 
> > setup.sh looks like this:
> > #!/bin/bash
> > modprobe rd rd_size=$[2*1024*1024]
> > dd if=/dev/zero of=/dev/ram0 bs=4K
> > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> > 
> > The 'dd' is required to ensure rd driver does not allocate pages
> > during IO (which can lead to out of memory deadlocks). Running just
> > involves changing into mnt directory and
> > 
> > while true
> > do
> >   sync
> >   echo 3 > /proc/sys/vm/drop_caches
> >   ../dbench -c ../loadfiles/client.txt -t20 8
> >   rm -rf clients
> > done
> > 
> > And wait for it to hang (happend in < 5 minutes here)
> ....
> > Call Trace:
> >  [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
> 
> It's waiting on log space to be freed up. Either there's an
> accounting problem (possible), or you've got an xfslogd/xfsaild
> spinning and not making progress competing log IOs or pushing the
> tail of the log. I'll see if I can reproduce it.

Ok, I've just reproduced it. From some tracing:

touch-3340  [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

The key part here is this:

curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

This says the tail of the log is roughly 62MB behind the head. i.e
the log is full and we are waiting for tail pushing to write the
item holding the tail in place to disk so it can them be moved
forward. That's better than an accounting problem, at least.

So what is holding the tail in place? The first item on the AIL
appears to be:

xfsaild/ram0-2997  [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock

A stale buffer. Given that the next objects show this trace:

xfsaild/ram0-2997  [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock
xfsaild/ram0-2997  [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find
xfsaild/ram0-2997  [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf

we see the next item on the AIL is an inode but the trace is
followed by a release on the original buffer, than tells me the
inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the
inode buffer out. That results in xfs_inode_item_pushbuf() being
called, and that tries to lock the inode buffer to flush it.
xfs_buf_rele is called if the trylock on the buffer fails.

IOWs, this looks to be another problem with inode cluster freeing.

Ok, so we can't flush the buffer because it is locked. Why is it
locked? Well, that is unclear as yet. None of the blocked processes
should be holding an inode buffer locked, and a stale buffer should
be unlocked during transaction commit and not live longer than
the log IO that writes the transaction to disk. That is, it should
not get locked again before everything is freed up.

That's as much as I can get from post-mortem analysis - I need to
capture a trace that spans the lockup to catch what happens
to the buffer that we are hung on. That will have to wait until the
morning....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-27 14:58           ` XFS hang in xlog_grant_log_space Dave Chinner
@ 2010-07-28 13:17             ` Dave Chinner
  2010-07-29 14:05               ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2010-07-28 13:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs

On Wed, Jul 28, 2010 at 12:58:09AM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> > > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > > On this same system, same setup (vanilla kernel with sha given below),
> > > I have now twice reproduced a complete hang in XFS. I can give more
> > > information, test patches or options etc if required.
> > > 
> > > setup.sh looks like this:
> > > #!/bin/bash
> > > modprobe rd rd_size=$[2*1024*1024]
> > > dd if=/dev/zero of=/dev/ram0 bs=4K
> > > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> > > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> > > 
> > > The 'dd' is required to ensure rd driver does not allocate pages
> > > during IO (which can lead to out of memory deadlocks). Running just
> > > involves changing into mnt directory and
> > > 
> > > while true
> > > do
> > >   sync
> > >   echo 3 > /proc/sys/vm/drop_caches
> > >   ../dbench -c ../loadfiles/client.txt -t20 8
> > >   rm -rf clients
> > > done
> > > 
> > > And wait for it to hang (happend in < 5 minutes here)
> > ....
> > > Call Trace:
> > >  [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
> > 
> > It's waiting on log space to be freed up. Either there's an
> > accounting problem (possible), or you've got an xfslogd/xfsaild
> > spinning and not making progress competing log IOs or pushing the
> > tail of the log. I'll see if I can reproduce it.
> 
> Ok, I've just reproduced it. From some tracing:
> 
> touch-3340  [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532
> 
> The key part here is this:
> 
> curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532
> 
> This says the tail of the log is roughly 62MB behind the head. i.e
> the log is full and we are waiting for tail pushing to write the
> item holding the tail in place to disk so it can them be moved
> forward. That's better than an accounting problem, at least.
> 
> So what is holding the tail in place? The first item on the AIL
> appears to be:
> 
> xfsaild/ram0-2997  [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock
> 
> A stale buffer. Given that the next objects show this trace:
> 
> xfsaild/ram0-2997  [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock
> xfsaild/ram0-2997  [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find
> xfsaild/ram0-2997  [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf
> 
> we see the next item on the AIL is an inode but the trace is
> followed by a release on the original buffer, than tells me the
> inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the
> inode buffer out. That results in xfs_inode_item_pushbuf() being
> called, and that tries to lock the inode buffer to flush it.
> xfs_buf_rele is called if the trylock on the buffer fails.
> 
> IOWs, this looks to be another problem with inode cluster freeing.
> 
> Ok, so we can't flush the buffer because it is locked. Why is it
> locked? Well, that is unclear as yet. None of the blocked processes
> should be holding an inode buffer locked, and a stale buffer should
> be unlocked during transaction commit and not live longer than
> the log IO that writes the transaction to disk. That is, it should
> not get locked again before everything is freed up.
> 
> That's as much as I can get from post-mortem analysis - I need to
> capture a trace that spans the lockup to catch what happens
> to the buffer that we are hung on. That will have to wait until the
> morning....

Ok, so I got a trace of all the inode and buffer locking and log
item operations, and the reason the hang has occurred can bee seen
here:

dbench-3084  [007] 1877156.395784: xfs_buf_item_unlock_stale: dev 1:0 bno 0x80040 len 0x2000 hold 2 pincount 0 lock 0 flags |ASYNC|DONE|STALE|PAGE_CACHE recur 1 refcount 1 bliflags |STALE|INODE_ALLOC|STALE_INODE lidesc 0x(null) liflags IN_AIL

The key points in this trace is that when we are unlocking a stale
buffer during transaction commit - we don't actually unlock it. WHat
we do it:

        /*
         * If the buf item is marked stale, then don't do anything.  We'll
         * unlock the buffer and free the buf item when the buffer is unpinned
         * for the last time.
         */
        if (bip->bli_flags & XFS_BLI_STALE) {
                trace_xfs_buf_item_unlock_stale(bip);
                ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL);
                if (!aborted) {
                        atomic_dec(&bip->bli_refcount);
                        return;
                }
        }

But from the above trace it can be seen that the buffer pincount is zero.
Hence it will never get and unpin callback, and hence never get unlocked.
As a result, this process here is the one that is stuck on the buffer:

dbench        D 0000000000000007     0  3084      1 0x00000004
 ffff880104e07608 0000000000000086 ffff880104e075b8 0000000000014000
 ffff880104e07fd8 0000000000014000 ffff880104e07fd8 ffff880104c1d770
 0000000000014000 0000000000014000 ffff880104e07fd8 0000000000014000
Call Trace:
 [<ffffffff817e59cd>] schedule_timeout+0x1ed/0x2c0
 [<ffffffff810de9de>] ? ring_buffer_lock_reserve+0x9e/0x160
 [<ffffffff817e690e>] __down+0x7e/0xc0
 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290
 [<ffffffff810a05d0>] down+0x40/0x50
 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290
 [<ffffffff812f314d>] xfs_buf_lock+0x4d/0x110
 [<ffffffff812f41d5>] _xfs_buf_find+0x145/0x290
 [<ffffffff812f4380>] xfs_buf_get+0x60/0x1c0
 [<ffffffff812ea8f0>] xfs_trans_get_buf+0xe0/0x180
 [<ffffffff812ccdab>] xfs_ialloc_inode_init+0xcb/0x1c0
 [<ffffffff812cdaf9>] xfs_ialloc_ag_alloc+0x179/0x4a0
 [<ffffffff812cdeff>] xfs_dialloc+0xdf/0x870
 [<ffffffff8105ec88>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff812d2105>] xfs_ialloc+0x65/0x6b0
 [<ffffffff812eb032>] xfs_dir_ialloc+0x82/0x2d0
 [<ffffffff8128ce83>] ? ftrace_raw_event_xfs_lock_class+0xd3/0xe0
 [<ffffffff812ecad7>] ? xfs_create+0x1a7/0x690
 [<ffffffff812ecd37>] xfs_create+0x407/0x690
 [<ffffffff812f9f97>] xfs_vn_mknod+0xa7/0x1c0
 [<ffffffff812fa0e0>] xfs_vn_create+0x10/0x20
 [<ffffffff8114ca6c>] vfs_create+0xac/0xd0
 [<ffffffff8114d6ec>] do_last+0x51c/0x620
 [<ffffffff8114f6d8>] do_filp_open+0x228/0x640
 [<ffffffff812c1f18>] ? xfs_dir2_block_getdents+0x218/0x220
 [<ffffffff8115a62a>] ? alloc_fd+0x10a/0x150
 [<ffffffff8113f919>] do_sys_open+0x69/0x140
 [<ffffffff8113fa30>] sys_open+0x20/0x30
 [<ffffffff81035032>] system_call_fastpath+0x16/0x1b

It is trying to allocate a new inode chunk on disk, which happens to
be the one we just removed and staled. Now to find out why the pin
count on the buffer is wrong.

.....

Now it makes no sense - the buf item pin trace is there directly
directly before the above unlock stale trace. The buf item pinning
increments both the buffer pin count and the buf item refcount,
neith of which are reflected in the unlock stale trace. From the
trace analysis I did:


process
83		84		85		86		87
.....
								get
								trans_read
								stale
								format
								pin
								unlock

								committed
								unpin stale
								free

		trans_get
		init
		format
		pin
		item unlock

		get
		trans_read
		unlock

		get
		trans_read
		item unlock

**** pincount goes from 1 to 0 here without any unpin traces ****

		get
		trans_read
		unlock

		get
		trans_read
		item unlock

		(repeat 1x)

		get
		trans_read
		unlock

		get
		trans_read
		stale
		format
		pin

**** no committed/unlock/free trace for this transaction ****
**** hold count goes from 2 to 1 without any rele traces ****
**** is a new buffer allocated here without the staled buffer being committed? ****

		trans_get
		init
		format
		pin
		item unlock

		get
		trans_read
		unlock

		get
		trans_read
		item unlock

		(repeat 9x)

		get
		trans_read
		unlock

		get
		trans_read
		stale
		format
		pin

**** pin count does not increment! ****

		buf lock from xfs_buf_find => hangs

**** buffer that was found is locked, not pinned ****

Something very strange is happening, and to make matters worse I
cannot reproduce it with a debug kernel (ran for 3 hours without
failing). Hence it smells like a race condition somewhere.

I've reproduced it without delayed logging, so it is not directly
related to that functionality.

I've seen this warning:

Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117

Which indicates we failed to mark an inode stale when freeing an
inode cluster, but I think I've fixed that and the problem still
shows up. It's posible the last version didn't fix it, but....

Now I've got the ag iterator rotor patch in place as well and
possibly a different version of the cluster free fix to what I
previously tested and it's now been running for almost half an hour.
I can't say yet whether I've fixed the bug of just changed the
timing enough to avoid it. I'll leave this test running over night
and redo individual patch testing tomorrow.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-28 13:17             ` Dave Chinner
@ 2010-07-29 14:05               ` Nick Piggin
  2010-07-29 22:56                 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2010-07-29 14:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, xfs

On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote:
> Something very strange is happening, and to make matters worse I
> cannot reproduce it with a debug kernel (ran for 3 hours without
> failing). Hence it smells like a race condition somewhere.
> 
> I've reproduced it without delayed logging, so it is not directly
> related to that functionality.
> 
> I've seen this warning:
> 
> Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117
> 
> Which indicates we failed to mark an inode stale when freeing an
> inode cluster, but I think I've fixed that and the problem still
> shows up. It's posible the last version didn't fix it, but....

I've seen that one a couple of times too. Keeps coming back each
time you echo 3 > /proc/sys/vm/drop_caches :)


> Now I've got the ag iterator rotor patch in place as well and
> possibly a different version of the cluster free fix to what I
> previously tested and it's now been running for almost half an hour.
> I can't say yet whether I've fixed the bug of just changed the
> timing enough to avoid it. I'll leave this test running over night
> and redo individual patch testing tomorrow.

I reproduced it with fs_stress now too. Any patches I could test
for you just let me know.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-29 14:05               ` Nick Piggin
@ 2010-07-29 22:56                 ` Dave Chinner
  2010-07-30  3:59                   ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2010-07-29 22:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, xfs

On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote:
> On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote:
> > Something very strange is happening, and to make matters worse I
> > cannot reproduce it with a debug kernel (ran for 3 hours without
> > failing). Hence it smells like a race condition somewhere.
> > 
> > I've reproduced it without delayed logging, so it is not directly
> > related to that functionality.
> > 
> > I've seen this warning:
> > 
> > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117
> > 
> > Which indicates we failed to mark an inode stale when freeing an
> > inode cluster, but I think I've fixed that and the problem still
> > shows up. It's posible the last version didn't fix it, but....
> 
> I've seen that one a couple of times too. Keeps coming back each
> time you echo 3 > /proc/sys/vm/drop_caches :)

Yup - it's an unflushable inode that is pinning the tail of the log,
hence causing the log space hangs.

> > Now I've got the ag iterator rotor patch in place as well and
> > possibly a different version of the cluster free fix to what I
> > previously tested and it's now been running for almost half an hour.
> > I can't say yet whether I've fixed the bug of just changed the
> > timing enough to avoid it. I'll leave this test running over night
> > and redo individual patch testing tomorrow.
> 
> I reproduced it with fs_stress now too. Any patches I could test
> for you just let me know.

You should see them in a few minutes ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-29 22:56                 ` Dave Chinner
@ 2010-07-30  3:59                   ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2010-07-30  3:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, xfs

On Fri, Jul 30, 2010 at 08:56:58AM +1000, Dave Chinner wrote:
> On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote:
> > On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote:
> > > Something very strange is happening, and to make matters worse I
> > > cannot reproduce it with a debug kernel (ran for 3 hours without
> > > failing). Hence it smells like a race condition somewhere.
> > > 
> > > I've reproduced it without delayed logging, so it is not directly
> > > related to that functionality.
> > > 
> > > I've seen this warning:
> > > 
> > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117
> > > 
> > > Which indicates we failed to mark an inode stale when freeing an
> > > inode cluster, but I think I've fixed that and the problem still
> > > shows up. It's posible the last version didn't fix it, but....
> > 
> > I've seen that one a couple of times too. Keeps coming back each
> > time you echo 3 > /proc/sys/vm/drop_caches :)
> 
> Yup - it's an unflushable inode that is pinning the tail of the log,
> hence causing the log space hangs.
> 
> > > Now I've got the ag iterator rotor patch in place as well and
> > > possibly a different version of the cluster free fix to what I
> > > previously tested and it's now been running for almost half an hour.
> > > I can't say yet whether I've fixed the bug of just changed the
> > > timing enough to avoid it. I'll leave this test running over night
> > > and redo individual patch testing tomorrow.
> > 
> > I reproduced it with fs_stress now too. Any patches I could test
> > for you just let me know.
> 
> You should see them in a few minutes ;)

It's certainly not locking up like it used to... Thanks!

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VFS scalability git tree
  2010-07-27  8:06     ` VFS scalability git tree Nick Piggin
  2010-07-27 11:36       ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
@ 2010-07-28 12:57       ` Dave Chinner
  2010-07-29 14:03         ` Nick Piggin
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2010-07-28 12:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, xfs

On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> > 
> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run. When it starts reclaiming memory, things
> > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > of seconds in inode reclaim).
> 
> So about this XFS livelock type thingy. It looks like this, and happens
> periodically while running the above fs_mark benchmark requiring reclaim
> of inodes:
....

> Nothing much happening except 100% system time for seconds at a time
> (length of time varies). This is on a ramdisk, so it isn't waiting
> for IO.
> 
> During this time, lots of things are contending on the lock:
> 
>     60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
>      4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
>      3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
>      3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
>      3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
>                    |
>                    --- __write_lock_failed
>                       |
>                       |--99.92%-- xfs_inode_ag_walk
>                       |          xfs_inode_ag_iterator
>                       |          xfs_reclaim_inode_shrink
>                       |          shrink_slab
>                       |          shrink_zone
>                       |          balance_pgdat
>                       |          kswapd
>                       |          kthread
>                       |          kernel_thread_helper
>                        --0.08%-- [...]
> 
>      3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
>      1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
>      1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
>      0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>      0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
>                    |
>                    --- xfs_log_commit_cil
>                        _xfs_trans_commit
>                       |
>                       |--60.00%-- xfs_remove
>                       |          xfs_vn_unlink
>                       |          vfs_unlink
>                       |          do_unlinkat
>                       |          sys_unlink
> 
> I'm not sure if there was a long-running read locker in there causing
> all the write lockers to fail, or if they are just running into one
> another.

The longest hold is in the inode cluster writeback
(xfs_iflush_cluster), but if there is no IO then I don't see how
that would be a problem.

I suspect that it might be caused by having several CPUs
all trying to run the shrinker at the same time and them all
starting at the same AG and therefore lockstepping and getting
nothing done because they are all scanning the same inodes.

Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
avoid this lockstepping.  I've attached a patch below to do this
- can you give it a try?

> But anyway, I hacked the following patch which seemed to
> improve that behaviour. I haven't run any throughput numbers on it yet,
> but I could if you're interested (and it's not completely broken!)

Batching is certainly something that I have been considering, but
apart from the excessive scanning bug, the per-ag inode tree lookups
hve not featured prominently in any profiling I've done, so it
hasn't been a high priority.

You patch looks like it will work fine, but I think it can be made a
lot cleaner. I'll have a closer look at this once I get to the bottom of
the dbench hang you are seeing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: add an AG iterator start rotor

From: Dave Chinner <dchinner@redhat.com>

To avoid multiple CPUs from executing inode cache shrinkers on the same AG all
at the same time, make every shrinker call start on a different AG. This will
mostly prevent concurrent shrinker calls from competing for the serialising
pag_ici_lock and lock-stepping reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   11 ++++++++++-
 fs/xfs/xfs_mount.h          |    1 +
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index dfcbd98..5322105 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -181,11 +181,14 @@ xfs_inode_ag_iterator(
 	struct xfs_perag	*pag;
 	int			error = 0;
 	int			last_error = 0;
+	xfs_agnumber_t		start_ag;
 	xfs_agnumber_t		ag;
 	int			nr;
+	int			looped = 0;
 
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
-	ag = 0;
+	start_ag = atomic_inc_return(&mp->m_agiter_rotor) & mp->m_sb.sb_agcount;
+	ag = start_ag;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
 						exclusive, &nr);
@@ -197,6 +200,12 @@ xfs_inode_ag_iterator(
 		}
 		if (nr <= 0)
 			break;
+		if (ag >= mp->m_sb.sb_agcount) {
+			looped = 1;
+			ag = 0;
+		}
+		if (ag >= start_ag && looped)
+			break;
 	}
 	if (nr_to_scan)
 		*nr_to_scan = nr;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 622da21..fae61bb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -199,6 +199,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	atomic_t		m_agiter_rotor; /* ag iterator start rotor */
 } xfs_mount_t;
 
 /*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: VFS scalability git tree
  2010-07-28 12:57       ` VFS scalability git tree Dave Chinner
@ 2010-07-29 14:03         ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2010-07-29 14:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Nick Piggin, xfs

On Wed, Jul 28, 2010 at 10:57:17PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > > 
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > > 
> > > > > Branch vfs-scale-working
> > > > 
> > > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > > run the same fs_mark parallel create/unlink workload to show
> > > > scalability as I ran here:
> > > > 
> > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > > 
> > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > > of a real disk (I don't have easy access to a good disk setup ATM, but
> > > I guess we're more interested in code above the block layer anyway).
> > > 
> > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > > yours.
> > > 
> > > I found that performance is a little unstable, so I sync and echo 3 >
> > > drop_caches between each run. When it starts reclaiming memory, things
> > > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > > of seconds in inode reclaim).
> > 
> > So about this XFS livelock type thingy. It looks like this, and happens
> > periodically while running the above fs_mark benchmark requiring reclaim
> > of inodes:
> ....
> 
> > Nothing much happening except 100% system time for seconds at a time
> > (length of time varies). This is on a ramdisk, so it isn't waiting
> > for IO.
> > 
> > During this time, lots of things are contending on the lock:
> > 
> >     60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
> >      4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
> >      3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
> >      3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
> >      3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
> >                    |
> >                    --- __write_lock_failed
> >                       |
> >                       |--99.92%-- xfs_inode_ag_walk
> >                       |          xfs_inode_ag_iterator
> >                       |          xfs_reclaim_inode_shrink
> >                       |          shrink_slab
> >                       |          shrink_zone
> >                       |          balance_pgdat
> >                       |          kswapd
> >                       |          kthread
> >                       |          kernel_thread_helper
> >                        --0.08%-- [...]
> > 
> >      3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
> >      1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
> >      1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
> >      0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
> >      0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
> >                    |
> >                    --- xfs_log_commit_cil
> >                        _xfs_trans_commit
> >                       |
> >                       |--60.00%-- xfs_remove
> >                       |          xfs_vn_unlink
> >                       |          vfs_unlink
> >                       |          do_unlinkat
> >                       |          sys_unlink
> > 
> > I'm not sure if there was a long-running read locker in there causing
> > all the write lockers to fail, or if they are just running into one
> > another.
> 
> The longest hold is in the inode cluster writeback
> (xfs_iflush_cluster), but if there is no IO then I don't see how
> that would be a problem.

No I wasn't suggesting there was, just that there could have
been one that I didn't notice in profiles (ie. because it
had taken read lock rather than spinning on it).

 
> I suspect that it might be caused by having several CPUs
> all trying to run the shrinker at the same time and them all
> starting at the same AG and therefore lockstepping and getting
> nothing done because they are all scanning the same inodes.

I think that is the most likely answer, yes.


> Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
> avoid this lockstepping.  I've attached a patch below to do this
> - can you give it a try?

Cool yes I will. I could try it in combination with the batching
patch too. Thanks.

 
> > But anyway, I hacked the following patch which seemed to
> > improve that behaviour. I haven't run any throughput numbers on it yet,
> > but I could if you're interested (and it's not completely broken!)
> 
> Batching is certainly something that I have been considering, but
> apart from the excessive scanning bug, the per-ag inode tree lookups
> hve not featured prominently in any profiling I've done, so it
> hasn't been a high priority.
> 
> You patch looks like it will work fine, but I think it can be made a
> lot cleaner. I'll have a closer look at this once I get to the bottom of
> the dbench hang you are seeing....

Well I'll see if I can measure any efficiency or lock contention
improvements with it and report back. Might have to wait till the
weekend.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-07-29 14:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20100722190100.GA22269@amd>
     [not found] ` <20100723135514.GJ32635@dastard>
     [not found]   ` <20100727070538.GA2893@amd>
2010-07-27  8:06     ` VFS scalability git tree Nick Piggin
2010-07-27 11:36       ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
2010-07-27 13:30         ` Dave Chinner
2010-07-27 14:58           ` XFS hang in xlog_grant_log_space Dave Chinner
2010-07-28 13:17             ` Dave Chinner
2010-07-29 14:05               ` Nick Piggin
2010-07-29 22:56                 ` Dave Chinner
2010-07-30  3:59                   ` Nick Piggin
2010-07-28 12:57       ` VFS scalability git tree Dave Chinner
2010-07-29 14:03         ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox