Re: writeout stalls in current -git

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* Re: writeout stalls in current -git
       [not found]                 ` <64bb37e0711021222q7d12c825mc62d433c4fe19e8@mail.gmail.com>
@ 2007-11-02 20:43                   ` David Chinner
  2007-11-02 21:02                     ` Torsten Kaiser
  2007-11-04 11:19                     ` Torsten Kaiser
  0 siblings, 2 replies; 10+ messages in thread
From: David Chinner @ 2007-11-02 20:43 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, David Chinner, linux-fsdevel, xfs

On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [  630.000000] SysRq : Emergency Sync
> [  630.120000] Emergency Sync complete
> [  632.850000] SysRq : Show Blocked State
> [  632.850000]   task                        PC stack   pid father
> [  632.850000] pdflush       D ffff81000f091788     0   285      2
> [  632.850000]  ffff810005d4da80 0000000000000046 0000000000000800
> 0000007000000001
> [  632.850000]  ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
> ffffffff80819b00
> [  632.850000]  ffffffff80815f40 ffffffff80819b00 ffff810100316f98
> 0000000000000000
> [  632.850000] Call Trace:
> [  632.850000]  [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> [  632.850000]  [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> [  632.850000]  [<ffffffff805b16e7>] __down+0xa7/0x11e
> [  632.850000]  [<ffffffff8022da70>] default_wake_function+0x0/0x10
> [  632.850000]  [<ffffffff805b1365>] __down_failed+0x35/0x3a
> [  632.850000]  [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> [  632.850000]  [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> [  632.850000]  [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> [  632.850000]  [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> [  632.850000]  [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> [  632.850000]  [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> [  632.850000]  [<ffffffff8026b293>] write_cache_pages+0x123/0x330
> [  632.850000]  [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520

That's stalled waiting on the inode cluster buffer lock. That implies
that the inode lcuser is already being written out and the inode has
been redirtied during writeout.

Does the kernel you are testing have the "flush inodes in ascending
inode number order" patches applied? If so, can you remove that
patch and see if the problem goes away?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-02 20:43                   ` writeout stalls in current -git David Chinner
@ 2007-11-02 21:02                     ` Torsten Kaiser
  2007-11-04 11:19                     ` Torsten Kaiser
  1 sibling, 0 replies; 10+ messages in thread
From: Torsten Kaiser @ 2007-11-02 21:02 UTC (permalink / raw)
  To: David Chinner
  Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

On 11/2/07, David Chinner <dgc@sgi.com> wrote:
> On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> > [  630.000000] SysRq : Emergency Sync
> > [  630.120000] Emergency Sync complete
> > [  632.850000] SysRq : Show Blocked State
> > [  632.850000]   task                        PC stack   pid father
> > [  632.850000] pdflush       D ffff81000f091788     0   285      2
> > [  632.850000]  ffff810005d4da80 0000000000000046 0000000000000800
> > 0000007000000001
> > [  632.850000]  ffff81000fd52400 ffffffff8022d61c ffffffff80819b00
> > ffffffff80819b00
> > [  632.850000]  ffffffff80815f40 ffffffff80819b00 ffff810100316f98
> > 0000000000000000
> > [  632.850000] Call Trace:
> > [  632.850000]  [<ffffffff8022d61c>] task_rq_lock+0x4c/0x90
> > [  632.850000]  [<ffffffff8022c8ea>] __wake_up_common+0x5a/0x90
> > [  632.850000]  [<ffffffff805b16e7>] __down+0xa7/0x11e
> > [  632.850000]  [<ffffffff8022da70>] default_wake_function+0x0/0x10
> > [  632.850000]  [<ffffffff805b1365>] __down_failed+0x35/0x3a
> > [  632.850000]  [<ffffffff803752ce>] xfs_buf_lock+0x3e/0x40
> > [  632.850000]  [<ffffffff8037740e>] _xfs_buf_find+0x13e/0x240
> > [  632.850000]  [<ffffffff8037757f>] xfs_buf_get_flags+0x6f/0x190
> > [  632.850000]  [<ffffffff803776b2>] xfs_buf_read_flags+0x12/0xa0
> > [  632.850000]  [<ffffffff80368824>] xfs_trans_read_buf+0x64/0x340
> > [  632.850000]  [<ffffffff80352361>] xfs_itobp+0x81/0x1e0
> > [  632.850000]  [<ffffffff8026b293>] write_cache_pages+0x123/0x330
> > [  632.850000]  [<ffffffff80354d0e>] xfs_iflush+0xfe/0x520
>
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode lcuser is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

It's 2.6.23-mm1 with only some small fixes.

In it's broken-out directory I see:
git-xfs.patch

and

writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the
mm-patchset.

Torsten

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-02 20:43                   ` writeout stalls in current -git David Chinner
  2007-11-02 21:02                     ` Torsten Kaiser
@ 2007-11-04 11:19                     ` Torsten Kaiser
  2007-11-05  1:45                       ` David Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Torsten Kaiser @ 2007-11-04 11:19 UTC (permalink / raw)
  To: David Chinner
  Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

[-- Attachment #1: Type: text/plain, Size: 7465 bytes --]

On 11/2/07, David Chinner <dgc@sgi.com> wrote:
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode lcuser is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

I can now confirm, that I see this also with the current mainline-git-version
I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
plus the fix for the sg changes in ieee1394.
Bisecting would be troublesome, as the sg changes prevent mainline to
boot with my normal config / kill my network.

treogen ~ # vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
-> starting emerge
 1  0      0 3627072    332 157724    0    0    97    13   41  189  2  2 94  2
 0  0      0 3607240    332 163736    0    0   599    10  332  951  2  1 93  4
 0  0      0 3601920    332 167592    0    0   380     2  218  870  1  1 98  0
 0  0      0 3596356    332 171648    0    0   404    21  182  818  0  0 99  0
 0  0      0 3579328    332 180436    0    0   878    12  147  912  1  1 97  2
 0  0      0 3575376    332 182776    0    0   236     4  244  953  1  1 95  3
 2  1      0 3571792    332 185084    0    0   232     7  256 1003  2  1 95  2
 0  0      0 3564844    332 187364    0    0   228   605  246 1167  2  1 93  4
 0  0      0 3562128    332 189784    0    0   230     4  527 1238  2  1 93  4
 0  1      0 3558764    332 191964    0    0   216    24  438 1059  1  1 93  6
 0  0      0 3555120    332 193868    0    0   199    36  406  959  0  0 92  8
 0  0      0 3552008    332 195928    0    0   197    11  458 1023  1  1 90  8
 0  0      0 3548728    332 197660    0    0   183     7  496 1086  1  1 90  8
 0  0      0 3545560    332 199372    0    0   170     8  483 1017  1  1 90  9
 0  1      0 3542124    332 201256    0    0   190     1  544 1137  1  1 88 10
 1  0      0 3536924    332 203296    0    0   195     7  637 1209  2  1 89  8
 1  1      0 3485096    332 249184    0    0   101    16 10372 4537 13  3 76  8
 2  0      0 3442004    332 279728    0    0  1086    40  219 1349  7  3 87  4
-> emerge is done reading its package database
 1  0      0 3254796    332 448636    0    0     0    27  128 8360 24  6 70  0
 2  0      0 3143304    332 554016    0    0    47    33  213 4480 16 11 72  1
-> kernel unpacked
 1  0      0 3125700    332 560416    0    0     1    20  122 1675 24  1 75  0
 1  0      0 3117356    332 567968    0    0     0   674  157 2975 24  2 73  1
 2  0      0 3111636    332 573736    0    0     0  1143  151 1924 23  1 75  1
 2  0      0 3102836    332 581332    0    0     0   890  153 1330 24  1 75  0
 1  0      0 3097236    332 587360    0    0     0   656  194 1593 24  1 74  0
 1  0      0 3086824    332 595480    0    0     0   812  235 2657 25  1 74  0
-> tar.bz2 created, installing starts now
 0  0      0 3091612    332 601024    0    0    82   708  499 2397 17  4 78  1
 0  0      0 3086088    332 602180    0    0    69  2459  769 2237  3  4 88  6
 0  0      0 3085916    332 602236    0    0     2  1752  693  949  1  2 96  1
 0  0      0 3084544    332 603564    0    0    66  4057 1176 2850  3  6 91  0
 0  0      0 3078780    332 605572    0    0    98  3194 1169 3288  5  6 89  0
 0  0      0 3077940    332 605924    0    0    17  1139  823 1547  1  2 97  0
 0  0      0 3078268    332 605924    0    0     0   888  807 1329  0  1 99  0
-> first short stall
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 3077040    332 605924    0    0     0  1950  785 1495  0  2 89  8
 0  0      0 3076588    332 605896    0    0     2  3807  925 2046  1  4 95  0
 0  0      0 3076900    332 606052    0    0    11  2564  768 1471  1  3 95  1
 0  0      0 3071584    332 607928    0    0    87  2499 1108 3433  4  6 90  0
-> second longer stall
(emerge was not able to complete a single filemove until the 'resume' line)
 0  0      0 3071592    332 607928    0    0     0   693  692 1289  0  0 99  0
 0  0      0 3072584    332 607928    0    0     0   792  731 1507  0  1 99  0
 0  0      0 3072840    332 607928    0    0     0   806  707 1521  0  1 99  0
 0  0      0 3072724    332 607928    0    0     0   782  695 1372  0  0 99  0
 0  0      0 3072972    332 607928    0    0     0   677  612 1301  0  0 99  0
 0  0      0 3072772    332 607928    0    0     0   738  681 1352  1  1 99  0
 0  0      0 3073020    332 607928    0    0     0   785  708 1328  0  1 99  0
 0  0      0 3072896    332 607928    0    0     0   833  722 1383  0  0 99  0
-> emerge resumed
 0  0      0 3069476    332 607972    0    0     2  4885  812 2062  1  4 90  5
 1  0      0 3069648    332 608068    0    0     4  4658  833 2158  1  4 93  2
 0  0      0 3064972    332 610364    0    0   106  2494 1095 3620  5  7 88  0
 0  0      0 3057536    332 612444    0    0    86  2023 1012 3440  4  6 90  0
 1  0      0 3054572    332 612368    0    0   102  1526 1024 2277  6  5 87  2
-> emerge finished, but still >100Mb of dirty data accoring to /proc/meminfo
 0  0      0 3048548    332 615764    0    0   337   659  796 1000  3  1 96  0
 0  0      0 3092100    332 615860    0    0    15   616  606 1040  1  0 99  0
 0  0      0 3092148    332 615860    0    0     0   641  622 1085  0  0 99  0
 0  0      0 3092528    332 615860    0    0     0   766  654 1055  1  1 99  0
-> slow writeout until here, might be fixed with Peters patch to scale
the background threshold
 2  0      0 3090828    332 615860    0    0     0  1804  707 1215  0  2 98  0
 0  0      0 3091056    332 615864    0    0     0  3877  831 2047  1  4 94  1
 3  0      0 3090780    332 615864    0    0     0  2048  784 1154  1  2 97  1
 0  0      0 3091096    332 615864    0    0     0  2690  751 1538  0  3 96  1
 0  1      0 3091056    332 615864    0    0     0  2018  748  866  0  2 95  2
 2  0      0 3092960    332 615864    0    0     0  2076  719 1118  0  2 97  0
-> writeout "done", /proc/meminfo showed 0kb of dirty data remaining
 0  0      0 3093072    332 615864    0    0     0   645  646 1104  0  0 99  0
 0  0      0 3093532    332 615864    0    0     0   726  658 1223  0  1 99  0
 0  0      0 3093540    332 615864    0    0     0   801  699 1314  0  1 99  0
 0  0      0 3093580    332 615864    0    0     0   783  738 1350  0  1 99  0
 0  0      0 3093284    332 615920    0    0     6   746  655 1381  1  1 98  0
 0  0      0 3092872    332 615920    0    0     0   862  703 1391  1  1 98  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 3093224    332 615920    0    0     0   799  676 1394  0  0 99  0
 0  0      0 3093304    332 615920    0    0     0   835  672 1514  1  1 98  0
 0  0      0 3093476    332 615920    0    0     0   784  641 1404  1  1 98  0
 0  0      0 3093264    332 615920    0    0     0   722  626 1483  1  1 99  0
 0  0      0 3093476    332 615920    0    0     0     7  328  350  0  0 99  0
 0  0      0 3093628    332 615920    0    0     0    11  332  407  0  0 99  0
-> disks finally go idle

Torsten

.config for 2.6.24-rc1+git attached

[-- Attachment #2: config.gz --]
[-- Type: application/x-gzip, Size: 11732 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-04 11:19                     ` Torsten Kaiser
@ 2007-11-05  1:45                       ` David Chinner
  2007-11-05  7:01                         ` Torsten Kaiser
  2007-11-05 18:27                         ` Torsten Kaiser
  0 siblings, 2 replies; 10+ messages in thread
From: David Chinner @ 2007-11-05  1:45 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: David Chinner, Peter Zijlstra, Fengguang Wu, Maxim Levitsky,
	linux-kernel, Andrew Morton, linux-fsdevel, xfs

[-- Attachment #1: Type: text/plain, Size: 1083 bytes --]

On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> On 11/2/07, David Chinner <dgc@sgi.com> wrote:
> > That's stalled waiting on the inode cluster buffer lock. That implies
> > that the inode lcuser is already being written out and the inode has
> > been redirtied during writeout.
> >
> > Does the kernel you are testing have the "flush inodes in ascending
> > inode number order" patches applied? If so, can you remove that
> > patch and see if the problem goes away?
> 
> I can now confirm, that I see this also with the current mainline-git-version
> I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> plus the fix for the sg changes in ieee1394.

Ok, so it's probably a side effect of the writeback changes.

Attached are two patches (two because one was in a separate patchset as
a standalone change) that should prevent async writeback from blocking
on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
Can you see if this fixes the problem?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group



[-- Attachment #2: xfs-factor-inotobp --]
[-- Type: text/plain, Size: 9595 bytes --]

---
 fs/xfs/xfs_inode.c |  283 ++++++++++++++++++++++++-----------------------------
 1 file changed, 129 insertions(+), 154 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c	2007-09-12 15:41:22.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c	2007-09-13 08:57:06.395641940 +1000
@@ -124,6 +124,126 @@ xfs_inobp_check(
 #endif
 
 /*
+ * Simple wrapper for calling xfs_imap() that includes error
+ * and bounds checking
+ */
+STATIC int
+xfs_ino_to_imap(
+	xfs_mount_t	*mp,
+	xfs_trans_t	*tp,
+	xfs_ino_t	ino,
+	xfs_imap_t	*imap,
+	uint		imap_flags)
+{
+	int		error;
+
+	error = xfs_imap(mp, tp, ino, imap, imap_flags);
+	if (error) {
+		cmn_err(CE_WARN, "xfs_ino_to_imap: xfs_imap()  returned an "
+				"error %d on %s.  Returning error.",
+				error, mp->m_fsname);
+		return error;
+	}
+
+	/*
+	 * If the inode number maps to a block outside the bounds
+	 * of the file system then return NULL rather than calling
+	 * read_buf and panicing when we get an error from the
+	 * driver.
+	 */
+	if ((imap->im_blkno + imap->im_len) >
+	    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
+		xfs_fs_cmn_err(CE_ALERT, mp, "xfs_ino_to_imap: "
+			"(imap->im_blkno (0x%llx) + imap->im_len (0x%llx)) > "
+			" XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) (0x%llx)",
+			(unsigned long long) imap->im_blkno,
+			(unsigned long long) imap->im_len,
+			XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks));
+		return XFS_ERROR(EINVAL);
+	}
+	return 0;
+}
+
+/*
+ * Find the buffer associated with the given inode map
+ * We do basic validation checks on the buffer once it has been
+ * retrieved from disk.
+ */
+STATIC int
+xfs_imap_to_bp(
+	xfs_mount_t	*mp,
+	xfs_trans_t	*tp,
+	xfs_imap_t	*imap,
+	xfs_buf_t	**bpp,
+	uint		buf_flags,
+	uint		imap_flags)
+{
+	int		error;
+	int		i;
+	int		ni;
+	xfs_buf_t	*bp;
+
+	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
+				   (int)imap->im_len, XFS_BUF_LOCK, &bp);
+	if (error) {
+		cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf()returned "
+				"an error %d on %s.  Returning error.",
+				error, mp->m_fsname);
+		return error;
+	}
+
+	/*
+	 * Validate the magic number and version of every inode in the buffer
+	 * (if DEBUG kernel) or the first inode in the buffer, otherwise.
+	 */
+#ifdef DEBUG
+	ni = BBTOB(imap->im_len) >> mp->m_sb.sb_inodelog;
+#else	/* usual case */
+	ni = 1;
+#endif
+
+	for (i = 0; i < ni; i++) {
+		int		di_ok;
+		xfs_dinode_t	*dip;
+
+		dip = (xfs_dinode_t *)xfs_buf_offset(bp,
+					(i << mp->m_sb.sb_inodelog));
+		di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC &&
+			    XFS_DINODE_GOOD_VERSION(dip->di_core.di_version);
+		if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
+						XFS_ERRTAG_ITOBP_INOTOBP,
+						XFS_RANDOM_ITOBP_INOTOBP))) {
+			if (imap_flags & XFS_IMAP_BULKSTAT) {
+				xfs_trans_brelse(tp, bp);
+				return XFS_ERROR(EINVAL);
+			}
+			XFS_CORRUPTION_ERROR("xfs_imap_to_bp",
+						XFS_ERRLEVEL_HIGH, mp, dip);
+#ifdef DEBUG
+			cmn_err(CE_PANIC,
+					"Device %s - bad inode magic/vsn "
+					"daddr %lld #%d (magic=%x)",
+				XFS_BUFTARG_NAME(mp->m_ddev_targp),
+				(unsigned long long)imap->im_blkno, i,
+				be16_to_cpu(dip->di_core.di_magic));
+#endif
+			xfs_trans_brelse(tp, bp);
+			return XFS_ERROR(EFSCORRUPTED);
+		}
+	}
+
+	xfs_inobp_check(mp, bp);
+
+	/*
+	 * Mark the buffer as an inode buffer now that it looks good
+	 */
+	XFS_BUF_SET_VTYPE(bp, B_FS_INO);
+
+	*bpp = bp;
+	return 0;
+}
+
+/*
  * This routine is called to map an inode number within a file
  * system to the buffer containing the on-disk version of the
  * inode.  It returns a pointer to the buffer containing the
@@ -145,72 +265,19 @@ xfs_inotobp(
 	xfs_buf_t	**bpp,
 	int		*offset)
 {
-	int		di_ok;
 	xfs_imap_t	imap;
 	xfs_buf_t	*bp;
 	int		error;
-	xfs_dinode_t	*dip;
 
-	/*
-	 * Call the space management code to find the location of the
-	 * inode on disk.
-	 */
 	imap.im_blkno = 0;
-	error = xfs_imap(mp, tp, ino, &imap, XFS_IMAP_LOOKUP);
-	if (error != 0) {
-		cmn_err(CE_WARN,
-	"xfs_inotobp: xfs_imap()  returned an "
-	"error %d on %s.  Returning error.", error, mp->m_fsname);
+	error = xfs_ino_to_imap(mp, tp, ino, &imap, XFS_IMAP_LOOKUP);
+	if (error)
 		return error;
-	}
-
-	/*
-	 * If the inode number maps to a block outside the bounds of the
-	 * file system then return NULL rather than calling read_buf
-	 * and panicing when we get an error from the driver.
-	 */
-	if ((imap.im_blkno + imap.im_len) >
-	    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
-		cmn_err(CE_WARN,
-	"xfs_inotobp: inode number (%llu + %d) maps to a block outside the bounds "
-	"of the file system %s.  Returning EINVAL.",
-			(unsigned long long)imap.im_blkno,
-			imap.im_len, mp->m_fsname);
-		return XFS_ERROR(EINVAL);
-	}
-
-	/*
-	 * Read in the buffer.  If tp is NULL, xfs_trans_read_buf() will
-	 * default to just a read_buf() call.
-	 */
-	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap.im_blkno,
-				   (int)imap.im_len, XFS_BUF_LOCK, &bp);
 
-	if (error) {
-		cmn_err(CE_WARN,
-	"xfs_inotobp: xfs_trans_read_buf()  returned an "
-	"error %d on %s.  Returning error.", error, mp->m_fsname);
+	error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, 0);
+	if (error)
 		return error;
-	}
-	dip = (xfs_dinode_t *)xfs_buf_offset(bp, 0);
-	di_ok =
-		be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC &&
-		XFS_DINODE_GOOD_VERSION(dip->di_core.di_version);
-	if (unlikely(XFS_TEST_ERROR(!di_ok, mp, XFS_ERRTAG_ITOBP_INOTOBP,
-			XFS_RANDOM_ITOBP_INOTOBP))) {
-		XFS_CORRUPTION_ERROR("xfs_inotobp", XFS_ERRLEVEL_LOW, mp, dip);
-		xfs_trans_brelse(tp, bp);
-		cmn_err(CE_WARN,
-	"xfs_inotobp: XFS_TEST_ERROR()  returned an "
-	"error on %s.  Returning EFSCORRUPTED.",  mp->m_fsname);
-		return XFS_ERROR(EFSCORRUPTED);
-	}
-
-	xfs_inobp_check(mp, bp);
 
-	/*
-	 * Set *dipp to point to the on-disk inode in the buffer.
-	 */
 	*dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset);
 	*bpp = bp;
 	*offset = imap.im_boffset;
@@ -251,41 +318,15 @@ xfs_itobp(
 	xfs_imap_t	imap;
 	xfs_buf_t	*bp;
 	int		error;
-	int		i;
-	int		ni;
 
 	if (ip->i_blkno == (xfs_daddr_t)0) {
-		/*
-		 * Call the space management code to find the location of the
-		 * inode on disk.
-		 */
 		imap.im_blkno = bno;
-		if ((error = xfs_imap(mp, tp, ip->i_ino, &imap,
-					XFS_IMAP_LOOKUP | imap_flags)))
+		error = xfs_ino_to_imap(mp, tp, ip->i_ino, &imap,
+					XFS_IMAP_LOOKUP | imap_flags);
+		if (error)
 			return error;
 
 		/*
-		 * If the inode number maps to a block outside the bounds
-		 * of the file system then return NULL rather than calling
-		 * read_buf and panicing when we get an error from the
-		 * driver.
-		 */
-		if ((imap.im_blkno + imap.im_len) >
-		    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
-#ifdef DEBUG
-			xfs_fs_cmn_err(CE_ALERT, mp, "xfs_itobp: "
-					"(imap.im_blkno (0x%llx) "
-					"+ imap.im_len (0x%llx)) > "
-					" XFS_FSB_TO_BB(mp, "
-					"mp->m_sb.sb_dblocks) (0x%llx)",
-					(unsigned long long) imap.im_blkno,
-					(unsigned long long) imap.im_len,
-					XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks));
-#endif /* DEBUG */
-			return XFS_ERROR(EINVAL);
-		}
-
-		/*
 		 * Fill in the fields in the inode that will be used to
 		 * map the inode to its buffer from now on.
 		 */
@@ -303,76 +344,10 @@ xfs_itobp(
 	}
 	ASSERT(bno == 0 || bno == imap.im_blkno);
 
-	/*
-	 * Read in the buffer.  If tp is NULL, xfs_trans_read_buf() will
-	 * default to just a read_buf() call.
-	 */
-	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap.im_blkno,
-				   (int)imap.im_len, XFS_BUF_LOCK, &bp);
-	if (error) {
-#ifdef DEBUG
-		xfs_fs_cmn_err(CE_ALERT, mp, "xfs_itobp: "
-				"xfs_trans_read_buf() returned error %d, "
-				"imap.im_blkno 0x%llx, imap.im_len 0x%llx",
-				error, (unsigned long long) imap.im_blkno,
-				(unsigned long long) imap.im_len);
-#endif /* DEBUG */
+	error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, imap_flags);
+	if (error)
 		return error;
-	}
-
-	/*
-	 * Validate the magic number and version of every inode in the buffer
-	 * (if DEBUG kernel) or the first inode in the buffer, otherwise.
-	 * No validation is done here in userspace (xfs_repair).
-	 */
-#if !defined(__KERNEL__)
-	ni = 0;
-#elif defined(DEBUG)
-	ni = BBTOB(imap.im_len) >> mp->m_sb.sb_inodelog;
-#else	/* usual case */
-	ni = 1;
-#endif
-
-	for (i = 0; i < ni; i++) {
-		int		di_ok;
-		xfs_dinode_t	*dip;
-
-		dip = (xfs_dinode_t *)xfs_buf_offset(bp,
-					(i << mp->m_sb.sb_inodelog));
-		di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC &&
-			    XFS_DINODE_GOOD_VERSION(dip->di_core.di_version);
-		if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
-						XFS_ERRTAG_ITOBP_INOTOBP,
-						XFS_RANDOM_ITOBP_INOTOBP))) {
-			if (imap_flags & XFS_IMAP_BULKSTAT) {
-				xfs_trans_brelse(tp, bp);
-				return XFS_ERROR(EINVAL);
-			}
-#ifdef DEBUG
-			cmn_err(CE_ALERT,
-					"Device %s - bad inode magic/vsn "
-					"daddr %lld #%d (magic=%x)",
-				XFS_BUFTARG_NAME(mp->m_ddev_targp),
-				(unsigned long long)imap.im_blkno, i,
-				be16_to_cpu(dip->di_core.di_magic));
-#endif
-			XFS_CORRUPTION_ERROR("xfs_itobp", XFS_ERRLEVEL_HIGH,
-					     mp, dip);
-			xfs_trans_brelse(tp, bp);
-			return XFS_ERROR(EFSCORRUPTED);
-		}
-	}
-
-	xfs_inobp_check(mp, bp);
 
-	/*
-	 * Mark the buffer as an inode buffer now that it looks good
-	 */
-	XFS_BUF_SET_VTYPE(bp, B_FS_INO);
-
-	/*
-	 * Set *dipp to point to the on-disk inode in the buffer.
-	 */
 	*dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset);
 	*bpp = bp;
 	return 0;

[-- Attachment #3: xfs-iflush-blocking-fix --]
[-- Type: text/plain, Size: 6403 bytes --]

---
 fs/xfs/linux-2.6/xfs_super.c |    3 +-
 fs/xfs/linux-2.6/xfs_vnode.h |    5 ---
 fs/xfs/xfs_inode.c           |   33 ++++++++++++++++---------
 fs/xfs/xfs_inode.h           |    7 +++--
 fs/xfs/xfs_vnodeops.c        |   55 +++++++++----------------------------------
 5 files changed, 41 insertions(+), 62 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c	2007-11-05 10:17:36.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c	2007-11-05 10:33:49.590268027 +1100
@@ -306,14 +306,15 @@ xfs_inotobp(
  * 0 for the disk block address.
  */
 int
-xfs_itobp(
+xfs_itobp_flags(
 	xfs_mount_t	*mp,
 	xfs_trans_t	*tp,
 	xfs_inode_t	*ip,
 	xfs_dinode_t	**dipp,
 	xfs_buf_t	**bpp,
 	xfs_daddr_t	bno,
-	uint		imap_flags)
+	uint		imap_flags,
+	uint		buf_flags)
 {
 	xfs_imap_t	imap;
 	xfs_buf_t	*bp;
@@ -344,10 +345,17 @@ xfs_itobp(
 	}
 	ASSERT(bno == 0 || bno == imap.im_blkno);
 
-	error = xfs_imap_to_bp(mp, tp, &imap, &bp, XFS_BUF_LOCK, imap_flags);
+	error = xfs_imap_to_bp(mp, tp, &imap, &bp, buf_flags, imap_flags);
 	if (error)
 		return error;
 
+	if (!bp) {
+		ASSERT(buf_flags & XFS_BUF_TRYLOCK);
+		ASSERT(tp == NULL);
+		*bpp = NULL;
+		return EAGAIN;
+	}
+
 	*dipp = (xfs_dinode_t *)xfs_buf_offset(bp, imap.im_boffset);
 	*bpp = bp;
 	return 0;
@@ -3068,15 +3076,6 @@ xfs_iflush(
 	}
 
 	/*
-	 * Get the buffer containing the on-disk inode.
-	 */
-	error = xfs_itobp(mp, NULL, ip, &dip, &bp, 0, 0);
-	if (error) {
-		xfs_ifunlock(ip);
-		return error;
-	}
-
-	/*
 	 * Decide how buffer will be flushed out.  This is done before
 	 * the call to xfs_iflush_int because this field is zeroed by it.
 	 */
@@ -3125,6 +3124,16 @@ xfs_iflush(
 	}
 
 	/*
+	 * Get the buffer containing the on-disk inode.
+	 */
+	error = xfs_itobp_flags(mp, NULL, ip, &dip, &bp, 0, 0,
+			(flags == INT_ASYNC) ? XFS_BUF_TRYLOCK : XFS_BUF_LOCK);
+	if (error ||!bp) {
+		xfs_ifunlock(ip);
+		return error;
+	}
+
+	/*
 	 * First flush out the inode that xfs_iflush was called with.
 	 */
 	error = xfs_iflush_int(ip, bp);
Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.h	2007-11-02 13:44:46.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.h	2007-11-05 10:25:44.885153248 +1100
@@ -488,9 +488,12 @@ int		xfs_finish_reclaim_all(struct xfs_m
 /*
  * xfs_inode.c prototypes.
  */
-int		xfs_itobp(struct xfs_mount *, struct xfs_trans *,
+int		xfs_itobp_flags(struct xfs_mount *, struct xfs_trans *,
 			  xfs_inode_t *, struct xfs_dinode **, struct xfs_buf **,
-			  xfs_daddr_t, uint);
+			  xfs_daddr_t, uint, uint);
+#define xfs_itobp(mp, tp, ip, dipp, bpp, bno, iflags)	\
+	xfs_itobp_flags(mp, tp, ip, dipp, bpp, bno, iflags, XFS_BUF_LOCK)
+
 int		xfs_iread(struct xfs_mount *, struct xfs_trans *, xfs_ino_t,
 			  xfs_inode_t **, xfs_daddr_t, uint);
 int		xfs_iread_extents(struct xfs_trans *, xfs_inode_t *, int);
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c	2007-11-02 13:44:50.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c	2007-11-05 10:39:05.969204451 +1100
@@ -840,7 +840,8 @@ xfs_fs_write_inode(
 	struct inode		*inode,
 	int			sync)
 {
-	int			error = 0, flags = FLUSH_INODE;
+	int			error = 0;
+	int			flags = 0;
 
 	xfs_itrace_entry(XFS_I(inode));
 	if (sync) {
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vnode.h
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_vnode.h	2007-10-02 16:01:47.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_vnode.h	2007-11-05 10:40:49.103817818 +1100
@@ -73,12 +73,9 @@ typedef enum bhv_vrwlock {
 #define IO_INVIS	0x00020		/* don't update inode timestamps */
 
 /*
- * Flags for vop_iflush call
+ * Flags for xfs_inode_flush
  */
 #define FLUSH_SYNC		1	/* wait for flush to complete	*/
-#define FLUSH_INODE		2	/* flush the inode itself	*/
-#define FLUSH_LOG		4	/* force the last log entry for
-					 * this inode out to disk	*/
 
 /*
  * Flush/Invalidate options for vop_toss/flush/flushinval_pages.
Index: 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_vnodeops.c	2007-11-05 10:02:05.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_vnodeops.c	2007-11-05 10:37:53.398623943 +1100
@@ -3556,29 +3556,6 @@ xfs_inode_flush(
 	    ((iip == NULL) || !(iip->ili_format.ilf_fields & XFS_ILOG_ALL)))
 		return 0;
 
-	if (flags & FLUSH_LOG) {
-		if (iip && iip->ili_last_lsn) {
-			xlog_t		*log = mp->m_log;
-			xfs_lsn_t	sync_lsn;
-			int		s, log_flags = XFS_LOG_FORCE;
-
-			s = GRANT_LOCK(log);
-			sync_lsn = log->l_last_sync_lsn;
-			GRANT_UNLOCK(log, s);
-
-			if ((XFS_LSN_CMP(iip->ili_last_lsn, sync_lsn) > 0)) {
-				if (flags & FLUSH_SYNC)
-					log_flags |= XFS_LOG_SYNC;
-				error = xfs_log_force(mp, iip->ili_last_lsn, log_flags);
-				if (error)
-					return error;
-			}
-
-			if (ip->i_update_core == 0)
-				return 0;
-		}
-	}
-
 	/*
 	 * We make this non-blocking if the inode is contended,
 	 * return EAGAIN to indicate to the caller that they
@@ -3586,30 +3563,22 @@ xfs_inode_flush(
 	 * blocking on inodes inside another operation right
 	 * now, they get caught later by xfs_sync.
 	 */
-	if (flags & FLUSH_INODE) {
-		int	flush_flags;
-
-		if (flags & FLUSH_SYNC) {
-			xfs_ilock(ip, XFS_ILOCK_SHARED);
-			xfs_iflock(ip);
-		} else if (xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
-			if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip)) {
-				xfs_iunlock(ip, XFS_ILOCK_SHARED);
-				return EAGAIN;
-			}
-		} else {
+	if (flags & FLUSH_SYNC) {
+		xfs_ilock(ip, XFS_ILOCK_SHARED);
+		xfs_iflock(ip);
+	} else if (xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
+		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip)) {
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
 			return EAGAIN;
 		}
-
-		if (flags & FLUSH_SYNC)
-			flush_flags = XFS_IFLUSH_SYNC;
-		else
-			flush_flags = XFS_IFLUSH_ASYNC;
-
-		error = xfs_iflush(ip, flush_flags);
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	} else {
+		return EAGAIN;
 	}
 
+	error = xfs_iflush(ip, (flags & FLUSH_SYNC) ? XFS_IFLUSH_SYNC
+						    : XFS_IFLUSH_ASYNC);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+
 	return error;
 }
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-05  1:45                       ` David Chinner
@ 2007-11-05  7:01                         ` Torsten Kaiser
  2007-11-05 18:27                         ` Torsten Kaiser
  1 sibling, 0 replies; 10+ messages in thread
From: Torsten Kaiser @ 2007-11-05  7:01 UTC (permalink / raw)
  To: David Chinner
  Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

On 11/5/07, David Chinner <dgc@sgi.com> wrote:
> On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> > I can now confirm, that I see this also with the current mainline-git-version
> > I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> > plus the fix for the sg changes in ieee1394.
>
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Applied both patches against the kernel mentioned above.
This blows up at boot:
[   80.807589] Filesystem "dm-0": Disabling barriers, not supported by
the underlying device
[   80.820241] XFS mounting filesystem dm-0
[   80.913144] ------------[ cut here ]------------
[   80.914932] kernel BUG at drivers/md/raid5.c:143!
[   80.916751] invalid opcode: 0000 [1] SMP
[   80.918338] CPU 3
[   80.919142] Modules linked in:
[   80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3
[   80.922628] RIP: 0010:[<ffffffff804b6ee4>]  [<ffffffff804b6ee4>]
__release_stripe+0x164/0x170
[   80.925935] RSP: 0018:ffff8100060e7dd0  EFLAGS: 00010002
[   80.927987] RAX: 0000000000000000 RBX: ffff81010141c288 RCX: 0000000000000000
[   80.930738] RDX: 0000000000000000 RSI: ffff81010141c288 RDI: ffff810004fb3200
[   80.933488] RBP: ffff810004fb3200 R08: 0000000000000000 R09: 0000000000000005
[   80.936240] R10: 0000000000000e00 R11: ffffe200038465e8 R12: ffff81010141c298
[   80.938990] R13: 0000000000000286 R14: ffff810004fb3330 R15: 0000000000000000
[   80.941741] FS:  000000000060c870(0000) GS:ffff810100313700(0000)
knlGS:0000000000000000
[   80.944861] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   80.947080] CR2: 00007fff7b295000 CR3: 0000000101842000 CR4: 00000000000006e0
[   80.949830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   80.952580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   80.955332] Process md1_raid5 (pid: 974, threadinfo
ffff8100060e6000, task ffff81000645c730)
[   80.958584] Stack:  ffff81010141c288 00000000000001f4
ffff810004fb3200 ffffffff804b6f2d
[   80.961761]  00000000000001f4 ffff81010141c288 ffffffff804c8bd0
0000000000000000
[   80.964681]  ffff8100060e7ee8 ffffffff804bd094 ffff81000645c730
ffff8100060e7e70
[   80.967518] Call Trace:
[   80.968558]  [<ffffffff804b6f2d>] release_stripe+0x3d/0x60
[   80.970677]  [<ffffffff804c8bd0>] md_thread+0x0/0x100
[   80.972629]  [<ffffffff804bd094>] raid5d+0x344/0x450
[   80.974549]  [<ffffffff8023df10>] process_timeout+0x0/0x10
[   80.976668]  [<ffffffff805ae1ca>] schedule_timeout+0x5a/0xd0
[   80.978855]  [<ffffffff804c8bd0>] md_thread+0x0/0x100
[   80.980807]  [<ffffffff804c8c00>] md_thread+0x30/0x100
[   80.982794]  [<ffffffff80249f20>] autoremove_wake_function+0x0/0x30
[   80.985214]  [<ffffffff804c8bd0>] md_thread+0x0/0x100
[   80.987167]  [<ffffffff80249b3b>] kthread+0x4b/0x80
[   80.989054]  [<ffffffff8020c9c8>] child_rip+0xa/0x12
[   80.990972]  [<ffffffff80249af0>] kthread+0x0/0x80
[   80.992824]  [<ffffffff8020c9be>] child_rip+0x0/0x12
[   80.994743]
[   80.995588]
[   80.995588] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 48 83 ec 28
48 89 5c 24
[   80.999307] RIP  [<ffffffff804b6ee4>] __release_stripe+0x164/0x170
[   81.001711]  RSP <ffff8100060e7dd0>

Switching back to unpatched 2.6.23-mm1 boots sucessfull...

Torsten

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-05  1:45                       ` David Chinner
  2007-11-05  7:01                         ` Torsten Kaiser
@ 2007-11-05 18:27                         ` Torsten Kaiser
  2007-11-06  4:25                           ` David Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Torsten Kaiser @ 2007-11-05 18:27 UTC (permalink / raw)
  To: David Chinner
  Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

On 11/5/07, David Chinner <dgc@sgi.com> wrote:
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch
Applying your two patches ontop of that does not fix the stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
 1  0      0 3512188    332 192644    0    0   185    12  368  735 10  3 85  1
-> emerge starts to remove the kernel source files
 3  0      0 3506624    332 192836    0    0    15  9825 2458 8307  7 12 81  0
 0  0      0 3507212    332 192836    0    0     0   554  630 1233  0  1 99  0
 0  0      0 3507292    332 192836    0    0     0   537  580 1328  0  1 99  0
 0  0      0 3507168    332 192836    0    0     0   633  626 1380  0  1 99  0
 0  0      0 3507116    332 192836    0    0     0  1510  768 2030  1  2 97  0
 0  0      0 3507596    332 192836    0    0     0   524  540 1544  0  0 99  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 3507540    332 192836    0    0     0   489  551 1293  0  0 99  0
 0  0      0 3507528    332 192836    0    0     0   527  510 1432  1  1 99  0
 0  0      0 3508052    332 192840    0    0     0  2088  910 2964  2  3 95  0
 0  0      0 3507888    332 192840    0    0     0   442  565 1383  1  1 99  0
 0  0      0 3508704    332 192840    0    0     0   497  529 1479  0  0 99  0
 0  0      0 3508704    332 192840    0    0     0   594  595 1458  0  0 99  0
 0  0      0 3511492    332 192840    0    0     0  2381 1028 2941  2  3 95  0
 0  0      0 3510684    332 192840    0    0     0   699  600 1390  0  0 99  0
 0  0      0 3511636    332 192840    0    0     0   741  661 1641  0  0 100  0
 0  0      0 3524020    332 192840    0    0     0  2452 1080 3910  2  3 95  0
 0  0      0 3524040    332 192844    0    0     0   530  617 1297  0  0 99  0
 0  0      0 3524128    332 192844    0    0     0   812  674 1667  0  1 99  0
 0  0      0 3527000    332 193672    0    0   339   721  754 1681  3  2 93  1
-> emerge is finished, no dirty or writeback data in /proc/meminfo
 0  0      0 3571056    332 194768    0    0   111   639  632 1344  0  1 99  0
 0  0      0 3571260    332 194768    0    0     0   757  688 1405  1  0 99  0
 0  0      0 3571156    332 194768    0    0     0   753  641 1361  0  0 99  0
 0  0      0 3571404    332 194768    0    0     0   766  653 1389  0  0 99  0
 1  0      0 3571136    332 194768    0    0     6   764  669 1488  0  0 99  0
 0  0      0 3571668    332 194824    0    0     0   764  657 1482  0  0 99  0
 0  0      0 3571848    332 194824    0    0     0   673  659 1406  0  0 99  0
 0  0      0 3571908    332 195052    0    0    22   753  638 1500  0  1 99  0
 0  0      0 3573052    332 195052    0    0     0   765  631 1482  0  1 99  0
 0  0      0 3574144    332 195052    0    0     0   771  640 1497  0  0 99  0
 0  0      0 3573468    332 195052    0    0     0   458  485 1251  0  0 99  0
 0  0      0 3574184    332 195052    0    0     0   427  474 1192  0  0 100  0
 0  0      0 3575092    332 195052    0    0     0   461  482 1235  0  0 99  0
 0  0      0 3576368    332 195056    0    0     0   582  556 1310  0  0 99  0
 0  0      0 3579300    332 195056    0    0     0   695  571 1402  0  0 99  0
 0  0      0 3580376    332 195056    0    0     0   417  568  906  0  0 99  0
 0  0      0 3581212    332 195056    0    0     0   421  559  977  0  1 99  0
 0  0      0 3583780    332 195060    0    0     0   494  555 1080  0  1 99  0
 0  0      0 3584352    332 195060    0    0     0    99  347  559  0  0 99  0
 0  0      0 3585232    332 195060    0    0     0    11  301  621  0  0 99  0
-> disks go idle.

So these patches do not seem to be the source of these excessive disk writes...

Torsten

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-05 18:27                         ` Torsten Kaiser
@ 2007-11-06  4:25                           ` David Chinner
  2007-11-06  7:10                             ` Torsten Kaiser
  2007-11-06 19:01                             ` Peter Zijlstra
  0 siblings, 2 replies; 10+ messages in thread
From: David Chinner @ 2007-11-06  4:25 UTC (permalink / raw)
  To: Torsten Kaiser
  Cc: David Chinner, Peter Zijlstra, Fengguang Wu, Maxim Levitsky,
	linux-kernel, Andrew Morton, linux-fsdevel, xfs

On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> On 11/5/07, David Chinner <dgc@sgi.com> wrote:
> > Ok, so it's probably a side effect of the writeback changes.
> >
> > Attached are two patches (two because one was in a separate patchset as
> > a standalone change) that should prevent async writeback from blocking
> > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > Can you see if this fixes the problem?
> 
> Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch
> Applying your two patches ontop of that does not fix the stalls.

So you are having RAID5 problems as well?

I'm struggling to understand what possible changed in XFS or writeback that
would lead to stalls like this, esp. as you appear to be removing files when
the stalls occur. Rather than vmstat, can you use something like iostat to
show how busy your disks are?  i.e. are we seeing RMW cycles in the raid5 or
some such issue.

OOC, what is the 'xfs_info <mtpt>' output for your filesystem? 

> vmstat 10 output from unmerging (uninstalling) a kernel:
>  1  0      0 3512188    332 192644    0    0   185    12  368  735 10  3 85  1
> -> emerge starts to remove the kernel source files
>  3  0      0 3506624    332 192836    0    0    15  9825 2458 8307  7 12 81  0
>  0  0      0 3507212    332 192836    0    0     0   554  630 1233  0  1 99  0
>  0  0      0 3507292    332 192836    0    0     0   537  580 1328  0  1 99  0
>  0  0      0 3507168    332 192836    0    0     0   633  626 1380  0  1 99  0
>  0  0      0 3507116    332 192836    0    0     0  1510  768 2030  1  2 97  0
>  0  0      0 3507596    332 192836    0    0     0   524  540 1544  0  0 99  0
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  0      0 3507540    332 192836    0    0     0   489  551 1293  0  0 99  0
>  0  0      0 3507528    332 192836    0    0     0   527  510 1432  1  1 99  0
>  0  0      0 3508052    332 192840    0    0     0  2088  910 2964  2  3 95  0
>  0  0      0 3507888    332 192840    0    0     0   442  565 1383  1  1 99  0
>  0  0      0 3508704    332 192840    0    0     0   497  529 1479  0  0 99  0
>  0  0      0 3508704    332 192840    0    0     0   594  595 1458  0  0 99  0
>  0  0      0 3511492    332 192840    0    0     0  2381 1028 2941  2  3 95  0
>  0  0      0 3510684    332 192840    0    0     0   699  600 1390  0  0 99  0
>  0  0      0 3511636    332 192840    0    0     0   741  661 1641  0  0 100  0
>  0  0      0 3524020    332 192840    0    0     0  2452 1080 3910  2  3 95  0
>  0  0      0 3524040    332 192844    0    0     0   530  617 1297  0  0 99  0
>  0  0      0 3524128    332 192844    0    0     0   812  674 1667  0  1 99  0
>  0  0      0 3527000    332 193672    0    0   339   721  754 1681  3  2 93  1
> -> emerge is finished, no dirty or writeback data in /proc/meminfo

At this point, can you run a "sync" and see how long that takes to
complete? The only thing I can think that woul dbe written out after
this point is inodes, but even then it seems to go on for a long,
long time and it really doesn't seem like XFS is holding up the
inode writes.

Another option is to use blktrace/blkparse to determine which process is
issuing this I/O.

>  0  0      0 3583780    332 195060    0    0     0   494  555 1080  0  1 99  0
>  0  0      0 3584352    332 195060    0    0     0    99  347  559  0  0 99  0
>  0  0      0 3585232    332 195060    0    0     0    11  301  621  0  0 99  0
> -> disks go idle.
> 
> So these patches do not seem to be the source of these excessive disk writes...

Well, the patches I posted should prevent blocking in the places that it
was seen, so if that does not stop the slowdowns then either the writeback
code is not feeding us inodes fast enough or the block device below is
having some kind of problem....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-06  4:25                           ` David Chinner
@ 2007-11-06  7:10                             ` Torsten Kaiser
  2007-11-06 19:01                             ` Peter Zijlstra
  1 sibling, 0 replies; 10+ messages in thread
From: Torsten Kaiser @ 2007-11-06  7:10 UTC (permalink / raw)
  To: David Chinner
  Cc: Peter Zijlstra, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

On 11/6/07, David Chinner <dgc@sgi.com> wrote:
> On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> > On 11/5/07, David Chinner <dgc@sgi.com> wrote:
> > > Ok, so it's probably a side effect of the writeback changes.
> > >
> > > Attached are two patches (two because one was in a separate patchset as
> > > a standalone change) that should prevent async writeback from blocking
> > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > > Can you see if this fixes the problem?
> >
> > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the missapplied raid5-patch
> > Applying your two patches ontop of that does not fix the stalls.
>
> So you are having RAID5 problems as well?

The first 2.6.24-rc1-git-kernel that I patched with your patches did
not boot for me. (Oops send in one of my previous mails) But given
that the stacktrace was not xfs related and I had seen this patch on
the lkml, I tried to fix this Oops this way.
I did not have troubles with the RAID5 otherwise.

> I'm struggling to understand what possible changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur. Rather than vmstat, can you use something like iostat to
> show how busy your disks are?  i.e. are we seeing RMW cycles in the raid5 or
> some such issue.

Will do this this evening.

> OOC, what is the 'xfs_info <mtpt>' output for your filesystem?

meta-data=/dev/mapper/root       isize=256    agcount=32, agsize=4731132 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=151396224, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0


> > vmstat 10 output from unmerging (uninstalling) a kernel:
> >  1  0      0 3512188    332 192644    0    0   185    12  368  735 10  3 85  1
> > -> emerge starts to remove the kernel source files
> >  3  0      0 3506624    332 192836    0    0    15  9825 2458 8307  7 12 81  0
> >  0  0      0 3507212    332 192836    0    0     0   554  630 1233  0  1 99  0
> >  0  0      0 3507292    332 192836    0    0     0   537  580 1328  0  1 99  0
> >  0  0      0 3507168    332 192836    0    0     0   633  626 1380  0  1 99  0
> >  0  0      0 3507116    332 192836    0    0     0  1510  768 2030  1  2 97  0
> >  0  0      0 3507596    332 192836    0    0     0   524  540 1544  0  0 99  0
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >  0  0      0 3507540    332 192836    0    0     0   489  551 1293  0  0 99  0
> >  0  0      0 3507528    332 192836    0    0     0   527  510 1432  1  1 99  0
> >  0  0      0 3508052    332 192840    0    0     0  2088  910 2964  2  3 95  0
> >  0  0      0 3507888    332 192840    0    0     0   442  565 1383  1  1 99  0
> >  0  0      0 3508704    332 192840    0    0     0   497  529 1479  0  0 99  0
> >  0  0      0 3508704    332 192840    0    0     0   594  595 1458  0  0 99  0
> >  0  0      0 3511492    332 192840    0    0     0  2381 1028 2941  2  3 95  0
> >  0  0      0 3510684    332 192840    0    0     0   699  600 1390  0  0 99  0
> >  0  0      0 3511636    332 192840    0    0     0   741  661 1641  0  0 100  0
> >  0  0      0 3524020    332 192840    0    0     0  2452 1080 3910  2  3 95  0
> >  0  0      0 3524040    332 192844    0    0     0   530  617 1297  0  0 99  0
> >  0  0      0 3524128    332 192844    0    0     0   812  674 1667  0  1 99  0
> >  0  0      0 3527000    332 193672    0    0   339   721  754 1681  3  2 93  1
> > -> emerge is finished, no dirty or writeback data in /proc/meminfo
>
> At this point, can you run a "sync" and see how long that takes to
> complete?

Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.

The sync did not stop this writeout, but returned immediately.

> The only thing I can think that woul dbe written out after
> this point is inodes, but even then it seems to go on for a long,
> long time and it really doesn't seem like XFS is holding up the
> inode writes.

Yes, I completly agree that this is much to long. Thats why I included
the after-emerge-finished parts of the logs. But I still partly
suspect xfs, because the xfssyncd shows up when I hip SysRq+W.

> Another option is to use blktrace/blkparse to determine which process is
> issuing this I/O.
>
> >  0  0      0 3583780    332 195060    0    0     0   494  555 1080  0  1 99  0
> >  0  0      0 3584352    332 195060    0    0     0    99  347  559  0  0 99  0
> >  0  0      0 3585232    332 195060    0    0     0    11  301  621  0  0 99  0
> > -> disks go idle.
> >
> > So these patches do not seem to be the source of these excessive disk writes...
>
> Well, the patches I posted should prevent blocking in the places that it
> was seen, so if that does not stop the slowdowns then either the writeback
> code is not feeding us inodes fast enough or the block device below is
> having some kind of problem....

I don't think its the block device, because reading/writing larger
files do not seem to be troubled. It looks much more like an inode
problem. For example both installing and uninstalling kernel source
trees show these stalls, but during uninstalling this is much more
noticeable.

But I agree that this might not be xfs specific, as this showed up at
the same time as other people started reporting about the 100% iowait
bug. Could be that this is the same bug and the differences between
reiserfs and xfs might explain the iowait vs. idle. Or that I don't
see the 100% iowait is something else on my system...

Torsten

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-06  4:25                           ` David Chinner
  2007-11-06  7:10                             ` Torsten Kaiser
@ 2007-11-06 19:01                             ` Peter Zijlstra
  2007-11-06 20:26                               ` Torsten Kaiser
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2007-11-06 19:01 UTC (permalink / raw)
  To: David Chinner
  Cc: Torsten Kaiser, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:

> I'm struggling to understand what possible changed in XFS or writeback that
> would lead to stalls like this, esp. as you appear to be removing files when
> the stalls occur. 

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have
balance_dirty_pages() call near? For example modifying meta data in
unlink?

Such a situation could lead to an excess of dirty pages and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: writeout stalls in current -git
  2007-11-06 19:01                             ` Peter Zijlstra
@ 2007-11-06 20:26                               ` Torsten Kaiser
  0 siblings, 0 replies; 10+ messages in thread
From: Torsten Kaiser @ 2007-11-06 20:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Chinner, Fengguang Wu, Maxim Levitsky, linux-kernel,
	Andrew Morton, linux-fsdevel, xfs

On 11/6/07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
>
> > I'm struggling to understand what possible changed in XFS or writeback that
> > would lead to stalls like this, esp. as you appear to be removing files when
> > the stalls occur.
>
> Just a crazy idea,..
>
> Could there be a set_page_dirty() that doesn't have
> balance_dirty_pages() call near? For example modifying meta data in
> unlink?
>
> Such a situation could lead to an excess of dirty pages and the next
> call to balance_dirty_pages() would appear to stall, as it would
> desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken.
In the unmerge testcase I see most of the time only <200kb of dirty
data in /proc/meminfo.

The system has 4Gb of RAM so I'm not sure if it should ever be valid
to stall even the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting-patch reverted...

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-11-06 20:26 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <200710221505.35397.maximlevitsky@gmail.com>
     [not found] ` <20071022131045.GA5357@mail.ustc.edu.cn>
     [not found]   ` <393060478.03650@ustc.edu.cn>
     [not found]     ` <64bb37e0710310822r5ca6b793p8fd97db2f72a8655@mail.gmail.com>
     [not found]       ` <393903856.06449@ustc.edu.cn>
     [not found]         ` <64bb37e0711011120i63cdfe3ci18995d57b6649a8@mail.gmail.com>
     [not found]           ` <64bb37e0711011200n228e708eg255640388f83da22@mail.gmail.com>
     [not found]             ` <E1InmAI-0003ME-2i@localhost>
     [not found]               ` <1193998532.27652.343.camel@twins>
     [not found]                 ` <64bb37e0711021222q7d12c825mc62d433c4fe19e8@mail.gmail.com>
2007-11-02 20:43                   ` writeout stalls in current -git David Chinner
2007-11-02 21:02                     ` Torsten Kaiser
2007-11-04 11:19                     ` Torsten Kaiser
2007-11-05  1:45                       ` David Chinner
2007-11-05  7:01                         ` Torsten Kaiser
2007-11-05 18:27                         ` Torsten Kaiser
2007-11-06  4:25                           ` David Chinner
2007-11-06  7:10                             ` Torsten Kaiser
2007-11-06 19:01                             ` Peter Zijlstra
2007-11-06 20:26                               ` Torsten Kaiser

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox