From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	Chris Mason <chris.mason@oracle.com>
Subject: Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
Date: Sun, 9 Oct 2011 08:27:36 +0800
Message-ID: <20111009002736.GA4575@localhost>
In-Reply-To: <20111008134927.GA30910@localhost>

On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > > 
> > > The test results don't look good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > > 
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list). Basically I noticed many writeback_wait traces that
> > > never appear w/o this patch. In the btrfs cases that see larger
> > > regressions, I see large fluctuations in the writeout bandwidth and
> > > long disk idle periods. It's still a bit puzzling how all of this
> > > happens...
> > 
> > Sorry, I find that part of the regressions (about 2-3%) was caused by
> > a recent change to my test scripts. Here are fairer comparisons, and
> > they show regressions only in btrfs and xfs:
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> >                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> >                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> >                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> >                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> >                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> >                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> >                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> >                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> >                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> >                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> >                   787.25        +0.7%       792.47  TOTAL
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> >                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> >                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> >                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> >                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> >                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> >                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> >                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> >                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> >                   428.07        -6.1%       401.99  TOTAL
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> >                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> >                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> >                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> >                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> >                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> >                   344.06       -54.3%       157.24  TOTAL
> > 
> > I'm now bisecting the patches to find out the root cause.
> 
> The current findings are that when applying only the first patch, or
> reducing the second patch to the one below, the btrfs regressions
> disappear:

And the reduced patch below is also OK:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue4+
------------------------  ------------------------
                   58.23        -0.4%        57.98  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -2.2%        57.13  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -1.2%        57.83  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   37.34        -0.7%        37.07  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        +0.2%        44.52  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +0.0%        41.72  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        -0.7%        46.10  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.8%        56.15  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.3%        54.33  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.53        -7.3%        41.29  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        +0.9%        56.39  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        +1.0%        51.60  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        -1.0%        55.97  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        -1.5%        55.28  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        -1.9%        55.16  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   30.66        -2.7%        29.82  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        -0.7%        35.00  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        -2.1%        42.65  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -2.4%        49.21  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -1.8%        57.08  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        -5.1%        39.61  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -2.6%        47.06  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -3.3%        50.64  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   45.37        +0.7%        45.70  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +0.7%        44.00  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.7%        35.82  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        -1.1%        55.77  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        -0.6%        50.94  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   31.07       -13.3%        26.94  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        +0.5%        55.72  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59        +1.6%        48.33  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1559.39        -1.4%      1537.83  TOTAL

Subject: writeback: Replace some redirty_tail() calls with requeue_io()
Date: Thu, 8 Sep 2011 01:46:42 +0200

From: Jan Kara <jack@suse.cz>

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is set to). This is an unnecessarily long
delay in some cases, and in other cases it is a really bad thing. In
particular, XFS tries to be nice to writeback: when ->write_inode is called
for an inode with a locked ilock, it just redirties the inode and returns
EAGAIN. That currently causes writeback_single_inode() to redirty_tail() the
inode. Since a contended ilock is common with XFS while extending files, the
result can be that inode writeout is put off for a really long time.
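
[For context, the nonblocking XFS ->write_inode path described above looked
roughly like the condensed sketch below. This is a paraphrase of the XFS
code of that era, not the literal source; the exact helper calls, flags and
error handling may differ from the tree:

	STATIC int
	xfs_fs_write_inode(struct inode *inode, struct writeback_control *wbc)
	{
		struct xfs_inode	*ip = XFS_I(inode);
		int			error = EAGAIN;

		/* WB_SYNC_ALL handling omitted from this sketch. */
		if (wbc->sync_mode != WB_SYNC_ALL) {
			/*
			 * Non-blocking writeback: if the ilock is held by a
			 * concurrent operation (common while extending a
			 * file), don't wait for it.
			 */
			if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
				goto out;
			error = xfs_iflush(ip, SYNC_TRYLOCK);
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
		}
	out:
		/*
		 * On failure, redirty the inode so a later pass retries it;
		 * the caller sees the EAGAIN the changelog refers to.
		 */
		if (error)
			xfs_mark_inode_dirty_sync(ip);
		return -error;
	}
]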

Now that we have more robust busyloop prevention in wb_writeback(), we can
call requeue_io() in the cases where a quick retry is required, without fear
of raising CPU consumption too much.
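
[The practical difference between the two calls: redirty_tail() moves the
inode back to the b_dirty list and may refresh its dirtied_when stamp, so
the inode is not considered again until it re-expires, while requeue_io()
moves it to b_more_io for a quick retry in an upcoming writeback round. A
sketch of both helpers, essentially as they appear in fs/fs-writeback.c
around v3.1:

	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		if (!list_empty(&wb->b_dirty)) {
			struct inode *tail;

			/*
			 * Refresh dirtied_when if keeping it would break the
			 * time ordering of b_dirty; this is what can delay
			 * the inode for another dirty_expire_centisecs.
			 */
			tail = wb_inode(wb->b_dirty.next);
			if (time_before(inode->dirtied_when, tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}

	static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
	{
		/*
		 * Quick retry: no dirtied_when reset, just park the inode
		 * on b_more_io to be revisited shortly.
		 */
		assert_spin_locked(&wb->list_lock);
		list_move(&inode->i_wb_list, &wb->b_more_io);
	}
]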

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-08 21:51:00.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -632,12 +645,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
