From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@infradead.org>,
Chris Mason <chris.mason@oracle.com>
Subject: Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
Date: Sun, 9 Oct 2011 08:27:36 +0800
Message-ID: <20111009002736.GA4575@localhost>
In-Reply-To: <20111008134927.GA30910@localhost>
On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > >
> > > The test results don't look good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > >
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list). Basically I noticed many writeback_wait traces that
> > > never appear w/o this patch. In the btrfs cases that see larger
> > > regressions, I see large fluctuations in the writeout bandwidth and
> > > long disk idle periods. It's still a bit puzzling how all of this
> > > happens...
> >
> > Sorry, I find that part of the regressions (about 2-3%) was caused by a
> > recent change to my test scripts. Here are fairer comparisons, and they
> > show regressions only in btrfs and xfs:
> >
> > 3.1.0-rc8-ioless6a+ 3.1.0-rc8-ioless6-requeue+
> > ------------------------ ------------------------
> > 37.34 +0.8% 37.65 thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> > 44.44 +3.4% 45.96 thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> > 41.70 +1.0% 42.14 thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> > 46.45 -0.3% 46.32 thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> > 56.60 -0.3% 56.41 thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> > 54.14 +0.9% 54.63 thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> > 30.66 -0.7% 30.44 thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> > 35.24 +1.6% 35.82 thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> > 43.58 +0.5% 43.80 thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> > 50.42 -0.6% 50.14 thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> > 56.23 -1.0% 55.64 thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> > 58.12 -0.5% 57.84 thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> > 45.37 +1.4% 46.03 thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> > 43.71 +2.2% 44.69 thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> > 35.58 +0.5% 35.77 thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> > 56.39 +1.4% 57.16 thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> > 51.26 +1.5% 52.04 thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> > 787.25 +0.7% 792.47 TOTAL
> >
> > 3.1.0-rc8-ioless6a+ 3.1.0-rc8-ioless6-requeue+
> > ------------------------ ------------------------
> > 44.53 -18.6% 36.23 thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> > 55.89 -0.4% 55.64 thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> > 51.11 +0.5% 51.35 thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> > 41.76 -4.8% 39.77 thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> > 48.34 -0.3% 48.18 thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> > 52.36 -0.2% 52.26 thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> > 31.07 -1.1% 30.74 thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> > 55.44 -0.6% 55.09 thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> > 47.59 -31.2% 32.74 thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> > 428.07 -6.1% 401.99 TOTAL
> >
> > 3.1.0-rc8-ioless6a+ 3.1.0-rc8-ioless6-requeue+
> > ------------------------ ------------------------
> > 58.23 -82.6% 10.13 thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> > 58.43 -80.3% 11.54 thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> > 58.53 -79.9% 11.76 thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> > 56.55 -31.7% 38.63 thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> > 56.11 -30.1% 39.25 thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> > 56.21 -18.3% 45.93 thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> > 344.06 -54.3% 157.24 TOTAL
> >
> > I'm now bisecting the patches to find out the root cause.
>
> Current findings are that with only the first patch applied, or with the
> second patch reduced to the one below, the btrfs performance is restored:
And the reduced patch below is also OK:
3.1.0-rc8-ioless6a+ 3.1.0-rc8-ioless6-requeue4+
------------------------ ------------------------
58.23 -0.4% 57.98 thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
58.43 -2.2% 57.13 thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
58.53 -1.2% 57.83 thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
37.34 -0.7% 37.07 thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
44.44 +0.2% 44.52 thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
41.70 +0.0% 41.72 thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
46.45 -0.7% 46.10 thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
56.60 -0.8% 56.15 thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
54.14 +0.3% 54.33 thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
44.53 -7.3% 41.29 thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
55.89 +0.9% 56.39 thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
51.11 +1.0% 51.60 thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
56.55 -1.0% 55.97 thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
56.11 -1.5% 55.28 thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
56.21 -1.9% 55.16 thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
30.66 -2.7% 29.82 thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
35.24 -0.7% 35.00 thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
43.58 -2.1% 42.65 thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
50.42 -2.4% 49.21 thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
56.23 -2.2% 55.00 thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
58.12 -1.8% 57.08 thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
41.76 -5.1% 39.61 thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
48.34 -2.6% 47.06 thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
52.36 -3.3% 50.64 thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
45.37 +0.7% 45.70 thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
43.71 +0.7% 44.00 thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
35.58 +0.7% 35.82 thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
56.39 -1.1% 55.77 thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
51.26 -0.6% 50.94 thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
31.07 -13.3% 26.94 thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
55.44 +0.5% 55.72 thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
47.59 +1.6% 48.33 thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
1559.39 -1.4% 1537.83 TOTAL
Subject: writeback: Replace some redirty_tail() calls with requeue_io()
Date: Thu, 8 Sep 2011 01:46:42 +0200
From: Jan Kara <jack@suse.cz>
Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). That is an unnecessarily long delay in
some cases, and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback: when ->write_inode is called for an inode with
a locked ilock, it just redirties the inode and returns EAGAIN. That currently
causes writeback_single_inode() to redirty_tail() the inode. Since a contended
ilock is common with XFS while extending files, the result can be that inode
writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback(), we can
call requeue_io() in the cases where a quick retry is required, without fear
of raising CPU consumption too much.
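
To make the difference concrete, here is a rough sketch of the two helpers,
modeled on their fs/fs-writeback.c implementations of this era (illustrative
only, not the verbatim kernel code):

	/*
	 * redirty_tail() keeps b_dirty sorted by dirtied_when, refreshing
	 * the stamp when needed -- so the inode may not become eligible
	 * for writeback again until dirty_expire_centisecs (30s by
	 * default) has passed.
	 */
	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail = wb_inode(wb->b_dirty.next);

			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;	/* up to ~30s delay */
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}

	/*
	 * requeue_io() just parks the inode on b_more_io, which the
	 * current writeback pass revisits almost immediately.
	 */
	static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
	{
		list_move(&inode->i_wb_list, &wb->b_more_io);
	}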
CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-10-08 21:51:00.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
long nr_to_write = wbc->nr_to_write;
unsigned dirty;
int ret;
+ bool inode_written = false;
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
int err = write_inode(inode, wbc);
+ if (!err)
+ inode_written = true;
if (ret == 0)
ret = err;
}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
* Filesystems can dirty the inode during writeback
* operations, such as delayed allocation during
* submission or metadata updates after data IO
- * completion.
+ * completion. Also inode could have been dirtied by
+ * some process aggressively touching metadata.
+ * Finally, filesystem could just fail to write the
+ * inode for some reason. We have to distinguish the
+ * last case from the previous ones - in the last case
+ * we want to give the inode quick retry, in the
+ * other cases we want to put it back to the dirty list
+ * to avoid livelocking of writeback.
*/
- redirty_tail(inode, wb);
+ if (inode_written)
+ redirty_tail(inode, wb);
+ else
+ requeue_io(inode, wb);
} else {
/*
* The inode is clean. At this point we either have
@@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
wrote++;
if (wbc.pages_skipped) {
/*
- * writeback is not making progress due to locked
- * buffers. Skip this inode for now.
+ * Writeback is not making progress due to unavailable
+ * fs locks or similar condition. Retry in next round.
*/
- redirty_tail(inode, wb);
+ requeue_io(inode, wb);
}
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
@@ -632,12 +645,7 @@ static long __writeback_inodes_wb(struct
struct super_block *sb = inode->i_sb;
if (!grab_super_passive(sb)) {
- /*
- * grab_super_passive() may fail consistently due to
- * s_umount being grabbed by someone else. Don't use
- * requeue_io() to avoid busy retrying the inode/sb.
- */
- redirty_tail(inode, wb);
+ requeue_io(inode, wb);
continue;
}
wrote += writeback_sb_inodes(sb, wb, work);
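
[ Note for reference: the "more robust busyloop prevention" the changelog
relies on is the wb_writeback() change from PATCH 1/2 of this series. In
rough outline (a paraphrase, not the literal hunk), when a pass writes
nothing but b_more_io is non-empty, we block on the inode under writeback
instead of busy-retrying -- which is also where the writeback_wait trace
events mentioned earlier in the thread come from:

	/* end of the wb_writeback() loop body, paraphrased from PATCH 1/2 */
	if (progress)
		continue;
	if (list_empty(&wb->b_more_io))
		break;
	/*
	 * Nothing was written, but some inode wants a retry: wait for
	 * its in-flight writeback to finish rather than spinning on it.
	 */
	inode = wb_inode(wb->b_more_io.prev);
	trace_writeback_wait(wb->bdi, work);
	spin_lock(&inode->i_lock);
	inode_wait_for_writeback(inode, wb);
	spin_unlock(&inode->i_lock);
]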